Getting Started with Natural Language Processing in Python


Originally published at:

A significant portion of the data generated today is unstructured. Unstructured data includes social media comments, browsing history, and customer feedback. Have you ever found yourself with a pile of textual data to analyze and no idea how to proceed?

The objective of this tutorial is to enable you to analyze textual data in Python through the concepts of Natural Language Processing (NLP). You will learn how to tokenize your text into smaller chunks, normalize words to their root forms, and then remove any noise in your documents to prepare them for further analysis.

Let's get started!


In this tutorial, we will use Python's nltk library to perform all NLP operations on the text. At the time of writing this tutorial, we used version 3.4 of nltk. To install the library, you can use the pip command on the terminal:

pip install nltk==3.4

To check which version of nltk is installed on your system, import the library in the Python interpreter and print its version:

import nltk
print(nltk.__version__)

To perform certain actions within nltk in this tutorial, you may have to download specific resources. We will describe each resource as and when required.

However, if you would like to avoid downloading individual resources later in the tutorial and grab them now in one go, run the following command:

python -m nltk.downloader all

Step 1: Convert into Tokens

A computer system cannot find meaning in natural language by itself. The first step in processing natural language is to convert the original text into tokens. A token is a sequence of contiguous characters that carries some meaning. It is up to you to decide how to break a sentence into tokens. For instance, an easy method is to split a sentence on whitespace to break it into individual words.
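
As a minimal sketch of the whitespace approach described above, the snippet below uses only the Python standard library. The function names and the sample sentence are made up for illustration, and the regular expression is just one simple way to separate punctuation from words; it is not how nltk's tokenizers work internally.

```python
import re


def whitespace_tokenize(text):
    # Split on any run of whitespace; punctuation stays attached to words.
    return text.split()


def punctuation_aware_tokenize(text):
    # Match either a run of word characters or a single
    # non-word, non-space character (a punctuation mark).
    return re.findall(r"\w+|[^\w\s]", text)


sentence = "Hi, this is a nice hotel."

print(whitespace_tokenize(sentence))
# ['Hi,', 'this', 'is', 'a', 'nice', 'hotel.']

print(punctuation_aware_tokenize(sentence))
# ['Hi', ',', 'this', 'is', 'a', 'nice', 'hotel', '.']
```

Note how plain whitespace splitting leaves "Hi," and "hotel." with punctuation attached, which is one reason you may prefer a tokenizer that treats punctuation as separate tokens.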



Step 1 describes syntactical recognition, right?

Does Step 2 apply to all words or just to verbs? For verbs you are describing verb tense, right?

For Step 3, punctuation is sometimes useful, correct? I suppose it depends on the extent to which we want to understand the words. Some words might be just noise or might be important. The word “the” might be considered noise but there is a difference between “a suspect” (usually a general comment) and “the suspect” (usually specific).



Hi Samuel,

For step 1, I would say that syntactical recognition can be achieved through tokenization.

In step 2, the example definitely works around verbs and nouns, though you can modify it to get the base form of any word.

In step 3, you should consider the objective of your analysis: you may want to retain stop words and punctuation depending on the use case. Further, as you have pointed out, you may want to group words together too - “a suspect” could be a separate entity from “the suspect”!
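
As a rough sketch of why this matters, the snippet below filters stop words from two short phrases. The stop-word set is a tiny hand-made list assumed purely for illustration (it is not nltk's stop-word list), and the function name is hypothetical.

```python
# Hypothetical, deliberately tiny stop-word set for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "was"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    # Keep only the tokens whose lowercase form is not a stop word.
    return [token for token in tokens if token.lower() not in stop_words]


general = "a suspect was seen".split()
specific = "the suspect was seen".split()

print(remove_stop_words(general))   # ['suspect', 'seen']
print(remove_stop_words(specific))  # ['suspect', 'seen']
```

After filtering, both phrases reduce to the same tokens, so the general/specific distinction is lost. If that distinction matters for your analysis, keep the articles or treat “the suspect” as a single unit.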



I think that if someone is attempting to parse words grammatically to understand what is said as best as possible then verbs can be really complicated. Verbs can have a tense (time), number (singular or plural) and a type of the subject.



True, it does pose a challenge. While writing this article, what I had in mind was something like social media comments where the idea is to parse the text and try to assess its meaning.



Good intro, but it would’ve been nice if the example showed how you can use NLP to determine predictive intent (i.e. what the customer wants: balance status, order status, making a payment, etc.) when they call up and don’t want to deal with the typical “press 1 if you want x, press 2 if you want y, press 3 if you want z” menu until they drill down to an option that can help them. Most of the time people just press 0 to connect with a human, and that slows down the entire call center.




Covering advanced topics would have made this article difficult to comprehend for beginners in NLP. The intent from text can be predicted as a next step by analyzing the word associations in language.



It is not a matter of being advanced; it is more a matter of being complex. Natural language is complicated. People do not communicate using simple combinations of letters and words. NLP is by definition not simple. This article is actually not about NLP; more specifically, it has very minimal or no recognition of grammar. Recognition of the fundamentals of grammar is not advanced NLP.