Originally published at: https://www.sitepoint.com/natural-language-processing-python/
A significant portion of the data that is generated today is unstructured. Unstructured data includes social media comments, browsing history and customer feedback. Have you found yourself in a situation with a bunch of textual data to analyze, and no idea how to proceed?
The objective of this tutorial is to enable you to analyze textual data in Python through the concepts of Natural Language Processing (NLP). You will learn how to tokenize your text into smaller chunks, normalize words to their root forms, and then remove any noise in your documents to prepare them for further analysis.
Let's get started!
In this tutorial, we will use Python's nltk library to perform all NLP operations on the text. At the time of writing this tutorial, we used version 3.4 of nltk. To install the library, you can use the pip command on the terminal:
pip install nltk==3.4
To check which version of nltk you have in the system, you can import the library into the Python interpreter and check the version:
import nltk
print(nltk.__version__)
To perform certain actions within nltk in this tutorial, you may have to download specific resources. We will describe each resource as and when required.
However, if you would like to avoid downloading individual resources later in the tutorial and grab them now in one go, run the following command:
python -m nltk.downloader all
Step 1: Convert into Tokens
A computer system cannot find meaning in natural language by itself. The first step in processing natural language is to convert the original text into tokens. A token is a sequence of contiguous characters that carries some meaning. It is up to you to decide how to break a sentence into tokens. For instance, an easy method is to split a sentence by whitespace to break it into individual words.
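As a minimal sketch of the whitespace approach, the snippet below splits a sample sentence with Python's built-in str.split(). The sentence is made up for illustration. The second half shows one possible alternative, a simple regular expression (an assumption, not part of the original tutorial) that separates punctuation from words, to illustrate that the choice of tokenization rule changes the tokens you get.

```python
import re

# Hypothetical sample sentence, used only for illustration.
sentence = "Hi, this is a nice hotel."

# Simplest approach: split on runs of whitespace.
# Punctuation stays attached to the neighbouring word.
whitespace_tokens = sentence.split()
print(whitespace_tokens)
# ['Hi,', 'this', 'is', 'a', 'nice', 'hotel.']

# One possible alternative: treat each run of word characters and
# each punctuation mark as a separate token.
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(regex_tokens)
# ['Hi', ',', 'this', 'is', 'a', 'nice', 'hotel', '.']
```

Note how the whitespace split leaves "Hi," and "hotel." with punctuation attached, which is often undesirable when you later count or compare words.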