A significant portion of the data that’s generated today is unstructured. Unstructured data includes social media comments, browsing history, and customer feedback. Have you found yourself in a situation with a bunch of textual data to analyze, and no idea how to proceed? Natural language processing in Python can help.
The objective of this tutorial is to enable you to analyze textual data in Python through the concepts of natural language processing (NLP). You’ll first learn how to tokenize your text into smaller chunks, normalize words to their root forms, and then remove any noise in your documents to prepare them for further analysis.
Let’s get started!
Key Takeaways
- Natural Language Processing (NLP) in Python involves tokenizing text into smaller chunks, normalizing words to their root forms, and cleaning documents to prepare them for further analysis. Python’s nltk library is used to perform these operations.
- Two techniques used to convert words to their base forms are stemming and lemmatization. Stemming is a simple algorithm that removes affixes from a word, while lemmatization normalizes a word based on the context and vocabulary of the text.
- Data cleaning in NLP involves removing punctuation and stop words (commonly used words like “I”, “a”, and “the”) that add little meaning to text when analyzing it.
- After cleaning the text, the frequency of words can be found using the FreqDist class of NLTK. This can be useful for finding commonly occurring terms in a text.
Prerequisites
In this tutorial, we’ll use Python’s nltk library to perform all NLP operations on the text. At the time of writing this tutorial, we’re using version 3.4 of nltk. To install the library, you can use the pip command in the terminal:
pip install nltk==3.4
To check which version of nltk you have on your system, you can import the library into the Python interpreter and check the version:
import nltk
print(nltk.__version__)
To perform certain actions within nltk in this tutorial, you may have to download specific resources. We’ll describe each resource as and when required.
However, if you’d like to avoid downloading individual resources later in the tutorial and grab them now in one go, run the following command:
python -m nltk.downloader all
Step 1: Convert Text into Tokens
A computer system can’t find meaning in natural language by itself. The first step in processing natural language is to convert the original text into tokens. A token is a sequence of contiguous characters with some meaning. It’s up to you to decide how to break a sentence into tokens. For instance, an easy method is to split a sentence by whitespace to break it into individual words.
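To see why a plain whitespace split is often too crude, here’s a quick check using Python’s built-in str.split() (a short aside of our own, not part of NLTK):
print("Hi, this is a nice hotel.".split())
The punctuation stays glued to the neighboring words, giving tokens like 'Hi,' and 'hotel.'. A dedicated tokenizer handles this for you.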
In the NLTK library, you can use the word_tokenize() function to convert a string to tokens. However, you’ll first need to download the punkt resource. Run the following command in the Python interpreter:
nltk.download('punkt')
Next, you need to import word_tokenize from nltk.tokenize to use it:
from nltk.tokenize import word_tokenize
print(word_tokenize("Hi, this is a nice hotel."))
The output of the code is as follows:
['Hi', ',', 'this', 'is', 'a', 'nice', 'hotel', '.']
You’ll notice that word_tokenize doesn’t simply split a string based on whitespace, but also separates punctuation into tokens. It’s up to you whether you’d like to retain the punctuation marks in the analysis.
Step 2: Convert Words to Their Base Forms
When you’re processing natural language, you’ll often notice that there are various grammatical forms of the same word. For instance, “go”, “going” and “gone” are forms of the same verb, “go”.
While the needs of your project may require you to retain words in their various grammatical forms, let’s discuss a way to convert the various grammatical forms of the same word into its base form. There are two techniques that you can use to convert a word to its base form.
The first technique is stemming. Stemming is a simple algorithm that removes affixes from a word. There are various stemming algorithms available for use in NLTK. We’ll use the Porter algorithm in this tutorial.
We first import PorterStemmer from nltk.stem.porter. Next, we create an instance of the stemmer and assign it to the stemmer variable, then use its .stem() method to find the base form of a word:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
print(stemmer.stem("going"))
The output of the code above is go. If you run the stemmer for the other forms of “go” described above, you’ll notice that the stemmer returns the same base form, “go”. However, as stemming is only a simple algorithm based on removing word affixes, it fails when the words are less commonly used in language.
For example, when you try the stemmer on the word “constitutes”, it gives an unintuitive result:
print(stemmer.stem("constitutes"))
You’ll notice the output is “constitut”.
This issue is solved by moving on to a more complex approach to finding the base form of a word in a given context. The process is called lemmatization. Lemmatization normalizes a word based on the context and vocabulary of the text. In NLTK, you can lemmatize sentences using the WordNetLemmatizer class.
First, you need to download the wordnet resource from the NLTK downloader in the Python interpreter:
nltk.download('wordnet')
Once it’s downloaded, you need to import the WordNetLemmatizer class and initialize it:
from nltk.stem.wordnet import WordNetLemmatizer
lem = WordNetLemmatizer()
To use the lemmatizer, use the .lemmatize() method. It takes two arguments: the word and its context. In our example, we’ll use “v” for the context. Let’s explore the context further after looking at the output of the .lemmatize() method:
print(lem.lemmatize('constitutes', 'v'))
You’ll notice that the .lemmatize() method correctly converts the word “constitutes” to its base form, “constitute”. You’ll also notice that lemmatization takes longer than stemming, as the algorithm is more complex.
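As a quick illustration of why the context argument matters (the extra words below are our own examples, not from the tutorial text), the lemmatizer can also resolve irregular verb forms when told to treat them as verbs:
print(lem.lemmatize('went', 'v'))
print(lem.lemmatize('going', 'v'))
Both calls should print go, which a simple affix-stripping stemmer can’t work out for an irregular form like “went”.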
Let’s check how to determine the second argument of the .lemmatize() method programmatically. NLTK has a pos_tag() function that helps in determining the context of a word in a sentence. However, you first need to download the averaged_perceptron_tagger resource through the NLTK downloader:
nltk.download('averaged_perceptron_tagger')
Next, import the pos_tag() function and run it on a sentence:
from nltk.tag import pos_tag
sample = "Hi, this is a nice hotel."
print(pos_tag(word_tokenize(sample)))
You’ll notice that the output is a list of pairs. Each pair consists of a token and its tag, which signifies the context of a token in the overall text. Notice that the tag for a punctuation mark is itself:
[('Hi', 'NNP'),
(',', ','),
('this', 'DT'),
('is', 'VBZ'),
('a', 'DT'),
('nice', 'JJ'),
('hotel', 'NN'),
('.', '.')]
How do you decode the context of each token? You can find a full list of all tags and their corresponding meanings on the Web. Notice that the tags of all nouns begin with “N”, and the tags of all verbs begin with “V”. We can use this information in the second argument of our .lemmatize() method:
def lemmatize_tokens(tokens):
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = []
    for word, tag in pos_tag(tokens):
        # Map the part-of-speech tag to the context argument that .lemmatize() expects
        if tag.startswith('NN'):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'
        lemmatized_tokens.append(lemmatizer.lemmatize(word, pos))
    return lemmatized_tokens
sample = "Legal authority constitutes all magistrates."
print(lemmatize_tokens(word_tokenize(sample)))
The output of the code above is as follows:
['Legal', 'authority', 'constitute', 'all', 'magistrate', '.']
This output is as expected, with “constitutes” and “magistrates” converted to “constitute” and “magistrate” respectively.
Step 3: Data Cleaning
The next step in preparing data is to clean the data and remove anything that doesn’t add meaning to your analysis. Broadly, we’ll look at removing punctuation and stop words from your analysis.
Removing punctuation is a fairly easy task. The punctuation constant of Python’s built-in string module contains all the punctuation marks in English:
import string
print(string.punctuation)
The output of this code snippet is as follows:
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
To remove punctuation from a list of tokens, you can keep only the tokens that don’t appear in string.punctuation:
tokens = word_tokenize("Hi, this is a nice hotel.")
cleaned_tokens = [token for token in tokens if token not in string.punctuation]
print(cleaned_tokens)
Next, we’ll focus on removing stop words. Stop words are commonly used words in a language, such as “I”, “a” and “the”, which add little meaning to text when analyzing it. So we’ll remove stop words from our analysis. First, download the stopwords resource from the NLTK downloader:
nltk.download('stopwords')
Once your download is complete, import stopwords from nltk.corpus and use the .words() method with “english” as the argument. This returns a list of 179 stop words in the English language:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
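If you’d like a quick look at what the list contains, you can print its length and the first few entries (the exact contents depend on the version of the corpus you downloaded):
print(len(stop_words))   # 179 at the time of writing
print(stop_words[:5])    # e.g. ['i', 'me', 'my', 'myself', 'we']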
We can combine the lemmatization example with the concepts discussed in this section to create the following function, clean_data(). Additionally, before checking whether a word is part of the stop words list, we convert it to lowercase. This way, we still capture a stop word if it occurs at the start of a sentence and is capitalized:
def clean_data(tokens, stop_words=()):
    lemmatizer = WordNetLemmatizer()
    cleaned_tokens = []
    for token, tag in pos_tag(tokens):
        if tag.startswith('NN'):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'
        token = lemmatizer.lemmatize(token, pos)
        # Keep the token only if it isn't punctuation or a stop word
        if token not in string.punctuation and token.lower() not in stop_words:
            cleaned_tokens.append(token)
    return cleaned_tokens
sample = "The quick brown fox jumps over the lazy dog."
stop_words = stopwords.words('english')
clean_data(word_tokenize(sample), stop_words)
The output of the example is as follows:
['quick', 'brown', 'fox', 'jump', 'lazy', 'dog']
As you can see, the punctuation and stop words have been removed.
Word Frequency Distribution
Now that you’re familiar with the basic cleaning techniques in NLP, let’s try to find the frequency of words in text. For this exercise, we’ll use the text of the fairy tale The Mouse, The Bird and The Sausage, which is freely available on Project Gutenberg. We’ll store the text of this fairy tale in a string, text.
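How you load the story into the text variable is up to you. Here’s one minimal sketch, assuming you’ve saved the tale to a local file called story.txt (a hypothetical filename):
# Assumes the fairy tale has been saved locally as story.txt (hypothetical path)
with open('story.txt', encoding='utf-8') as f:
    text = f.read()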
First, we tokenize text and then clean it using the clean_data() function that we defined above:
tokens = word_tokenize(text)
cleaned_tokens = clean_data(tokens, stop_words = stop_words)
To find the frequency distribution of words in your text, you can use the FreqDist class of NLTK. Initialize the class with the tokens as an argument. Then use the .most_common() method to find the commonly occurring terms. Let’s try to find the top ten terms in this case:
from nltk import FreqDist
freq_dist = FreqDist(cleaned_tokens)
freq_dist.most_common(10)
Here are the ten most commonly occurring terms in this fairy tale:
[('bird', 15),
('sausage', 11),
('mouse', 8),
('wood', 7),
('time', 6),
('long', 5),
('make', 5),
('fly', 4),
('fetch', 4),
('water', 4)]
Unsurprisingly, the three most common terms are the three main characters in the fairy tale.
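If you’d prefer a visual overview rather than a list, FreqDist also provides a .plot() method, which draws the frequency counts of the most common tokens (it requires matplotlib to be installed):
freq_dist.plot(10)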
The raw frequency of words may not be very informative on its own when analyzing text. Typically, the next step in NLP is to generate a statistic called TF-IDF (term frequency-inverse document frequency), which signifies the importance of a word in a list of documents.
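TF-IDF itself is beyond the scope of this tutorial, but as a rough sketch of what that next step can look like, here’s a minimal example using scikit-learn’s TfidfVectorizer (a separate library; the toy documents below are made up purely for illustration):
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny invented documents, for illustration only
documents = [
    "the bird fetched wood",
    "the mouse fetched water",
    "the sausage cooked the food",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

# Each row is a document, each column a term, each value that term's TF-IDF weight
print(vectorizer.get_feature_names_out())  # use get_feature_names() on older scikit-learn versions
print(tfidf_matrix.toarray())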
Conclusion
In this tutorial, we’ve taken a first look at natural language processing in Python. We converted text to tokens, converted words to their base forms and, finally, cleaned the text to remove any part that didn’t add meaning to the analysis.
Although we’ve looked at simple NLP tasks in this tutorial, there are many more techniques to explore. We might, for example, want to perform topic modelling on textual data, where the objective is to find a common topic that a text might be talking about. A more complex task in NLP is the implementation of a sentiment analysis model to determine the feeling behind any text.
Have any comments or questions? Feel free to hit me up on Twitter.
Frequently Asked Questions (FAQs) on Natural Language Processing with Python
What are the key differences between Natural Language Processing (NLP) and Natural Language Understanding (NLU)?
Natural Language Processing (NLP) and Natural Language Understanding (NLU) are two subfields of artificial intelligence that often get confused. NLP is a broader concept that encompasses all the methods used to interact with computers using natural language. This includes both understanding and generating human language. On the other hand, NLU is a subset of NLP that specifically deals with the comprehension aspect. It involves the use of algorithms to understand and interpret human language in a valuable way.
How can I improve the accuracy of my NLP models in Python?
Improving the accuracy of NLP models involves several strategies. Firstly, you can use more training data. The more data your model has to learn from, the better it will perform. Secondly, consider using different NLP techniques. For instance, if you’re using Bag of Words (BoW), you might want to try Term Frequency-Inverse Document Frequency (TF-IDF) or Word2Vec. Lastly, fine-tuning your model’s parameters can also lead to significant improvements.
What are some common applications of NLP in real-world scenarios?
NLP has a wide range of applications in the real world. These include language translation, sentiment analysis, chatbots, voice assistants like Siri and Alexa, text summarization, and spam detection in emails.
How does tokenization work in NLP?
Tokenization is the process of breaking down text into individual words or tokens. This is a crucial step in NLP as it allows the model to understand and analyze the text. In Python, you can use the NLTK library’s word_tokenize function to perform tokenization.
What is the role of stop words in NLP?
Stop words are common words that are often filtered out during the preprocessing stage in NLP because they don’t carry much meaningful information. Examples include “is”, “the”, “and”, etc. Removing these words can help improve the performance of your NLP model.
How can I handle multiple languages in NLP?
Handling multiple languages in NLP can be challenging due to the differences in grammar, syntax, and vocabulary. However, Python’s NLTK library provides support for several languages. You can also use language detection libraries like langdetect to identify the language of the text before processing it.
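As a small sketch (langdetect is a separate package you’d need to install, and the sample sentence below is our own):
from langdetect import detect
from nltk.corpus import stopwords

print(detect("Dies ist ein kurzer deutscher Satz."))  # prints a language code such as 'de'
print(stopwords.words('german')[:5])  # NLTK also ships stop word lists for other languages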
What are stemming and lemmatization in NLP?
Stemming and lemmatization are techniques used to reduce words to their base or root form. The main difference between them is that stemming can often create non-existent words, while lemmatization reduces words to their linguistically correct base form.
How can I use NLP for sentiment analysis?
Sentiment analysis involves determining the sentiment expressed in a piece of text. This can be done using various NLP techniques. For instance, you can use the TextBlob library in Python to easily perform sentiment analysis.
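As a hedged illustration (TextBlob is a separate library, installed with pip install textblob, and the sample sentence is our own):
from textblob import TextBlob

blob = TextBlob("The hotel was lovely and the staff were wonderful.")
# sentiment has a polarity between -1 (negative) and 1 (positive), and a subjectivity between 0 and 1
print(blob.sentiment)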
What are n-grams in NLP?
N-grams are contiguous sequences of n items in a given sample of text or speech. They are used in NLP to predict the next item in a sequence. For example, in a bigram (n=2), you consider pairs of words for analysis or prediction.
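NLTK can generate n-grams from a list of tokens directly. A minimal sketch, reusing the sample sentence from earlier in this tutorial:
from nltk import ngrams
from nltk.tokenize import word_tokenize

tokens = word_tokenize("Hi, this is a nice hotel.")
print(list(ngrams(tokens, 2)))  # pairs of adjacent tokens (bigrams)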
How can I use NLP for text classification?
Text classification involves categorizing text into predefined classes. This can be done using various NLP techniques and machine learning algorithms. For instance, you can use the Bag of Words or TF-IDF for feature extraction, and then feed these features into a machine learning model for classification.
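To make that concrete, here’s a minimal, hypothetical sketch using scikit-learn (the tiny training set is invented for illustration and far too small for a real model):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data, invented purely for illustration
train_texts = ["great hotel, loved it", "terrible room, never again", "lovely staff", "awful service"]
train_labels = ["positive", "negative", "positive", "negative"]

# TF-IDF features fed into a Naive Bayes classifier
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)
print(model.predict(["the staff were great"]))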
Shaumik is a data analyst by day, and a comic book enthusiast by night (or maybe, he's Batman?) Shaumik has been writing tutorials and creating screencasts for over five years. When not working, he's busy automating mundane daily tasks through meticulously written scripts!