
Sentiment Analysis with Python

This will print the top five results, as pictured below.

The top five results printed to screen

Data Cleaning and Tokenization

We’ll build on the data cleaning and tokenization techniques covered in “Getting Started with Natural Language Processing in Python”. Data cleaning is the process of removing noise from our dataset to help increase the accuracy of our prediction model. In the tokenization step, we convert each review into a list of words (tokens).
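To make the two steps concrete, here is a minimal sketch that uses only Python's standard library. It is not the NLTK-based approach this chapter goes on to use: a regular expression stands in for a real tokenizer, `SAMPLE_STOPWORDS` is a tiny hypothetical stopword set, and the function name `clean_and_tokenize` and the sample review text are made up for illustration. It also assumes reviews may contain leftover HTML line breaks.

```python
import re

# Hypothetical, deliberately tiny stopword set for illustration only.
# A real project would use a much larger list, such as NLTK's stopwords corpus.
SAMPLE_STOPWORDS = {"a", "an", "and", "the", "is", "it", "of", "to", "was", "this"}


def clean_and_tokenize(review):
    """Return a list of lowercase word tokens with noise removed."""
    # Cleaning: drop HTML tags such as <br /> (assumed noise) and lowercase the text.
    text = re.sub(r"<[^>]+>", " ", review).lower()
    # Tokenization: keep runs of letters, allowing one inner apostrophe (e.g. "don't").
    # Punctuation and digits are discarded as a side effect.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text)
    # Cleaning: remove stopwords, which carry little sentiment on their own.
    return [token for token in tokens if token not in SAMPLE_STOPWORDS]


tokens = clean_and_tokenize("This movie was GREAT!<br />The plot is a joy to follow.")
print(tokens)  # ['movie', 'great', 'plot', 'joy', 'follow']
```

A regex tokenizer like this is crude compared with a trained tokenizer, but it shows the shape of the output the later steps expect: one list of cleaned tokens per review.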

To help with data cleaning and tokenization, we need to download some NLTK resources:

import nltk

# Download the NLTK resources used for cleaning and tokenization
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')

We use these resources for the following:

  • Punkt: a pre-trained tokenizer for the English language.
  • WordNet: a large lexical database of English. It’s needed by the WordNetLemmatizer class to lemmatize words.
  • averaged_perceptron_tagger: a resource used for tagging words with their parts of speech.
  • stopwords: a corpus containing 2,400 stopwords for 11 languages.

Let’s create a function to help clean and tokenize each movie review:
