This will print the top five results, as pictured below.

https://learnable-static.s3.amazonaws.com/premium/reeedr/books/sentiment-analysis-with-python/images/sentiment-analysis-1.png

We’ll build on the data cleaning and tokenization techniques covered in “Getting Started with Natural Language Processing in Python”. Data cleaning removes noise from the dataset, which helps improve the accuracy of our prediction model. In the tokenization step, we convert each review into a list of words (tokens).

To help with data cleaning and tokenization, we need to download some NLTK resources:

import nltk

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')

We use these resources for the following:

punkt: a pre-trained tokenizer for the English language.
wordnet: a large lexical database of English (WordNet). It’s needed by the WordNetLemmatizer class to lemmatize sentences.
averaged_perceptron_tagger: a resource used for tagging words with their parts of speech.
stopwords: a corpus containing 2,400 stopwords for 11 languages.

Let’s create a function to help clean and tokenize each movie review:
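A minimal sketch of such a function is shown below. It's an illustration under stated assumptions, not necessarily the exact function used in this book: the name `clean_and_tokenize` and its optional `tokenize`, `stop_words`, and `lemmatize` parameters are hypothetical. By default it falls back to NLTK's `word_tokenize`, the English stopword list, and `WordNetLemmatizer`. For brevity it skips part-of-speech tagging, so every word is lemmatized as a noun.

```python
import string


def clean_and_tokenize(review, tokenize=None, stop_words=None, lemmatize=None):
    """Lowercase a review, split it into tokens, drop stopwords and
    punctuation, and lemmatize what remains.

    Hypothetical helper: the three optional parameters are assumptions
    added so the NLTK defaults can be swapped out.
    """
    # Fall back to NLTK only for the pieces the caller did not supply.
    # These defaults need the 'punkt', 'stopwords' and 'wordnet' resources.
    if tokenize is None:
        from nltk.tokenize import word_tokenize
        tokenize = word_tokenize
    if stop_words is None:
        from nltk.corpus import stopwords
        stop_words = set(stopwords.words("english"))
    if lemmatize is None:
        from nltk.stem import WordNetLemmatizer
        lemmatize = WordNetLemmatizer().lemmatize

    tokens = []
    for token in tokenize(review.lower()):
        # Skip tokens made up entirely of punctuation characters.
        if all(char in string.punctuation for char in token):
            continue
        # Skip common words that carry little sentiment.
        if token in stop_words:
            continue
        tokens.append(lemmatize(token))
    return tokens
```

With the NLTK resources above downloaded, calling `clean_and_tokenize(review_text)` on a review string returns its cleaned list of tokens. Passing your own `tokenize`, `stop_words`, or `lemmatize` lets you try the same pipeline without NLTK.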

