How to Analyze Tweet Sentiments with PHP Machine Learning

This article was peer reviewed by Wern Ancheta. Thanks to all of SitePoint’s peer reviewers for making SitePoint content the best it can be!

As of late, it seems everyone and their proverbial grandma is talking about Machine Learning. Your social media feeds are inundated with posts about ML, Python, TensorFlow, Spark, Scala, Go and so on; and if you are anything like me, you might be wondering, what about PHP?

Yes, what about Machine Learning and PHP? Fortunately, someone was crazy enough not only to ask that question, but also to develop a generic machine learning library that we can use in our next project. In this post we are going to take a look at PHP-ML – a machine learning library for PHP – and we’ll write a sentiment analysis class that we can later reuse for our own chat or tweet bot. The main goals of this post are:

Explore the general concepts around Machine learning and Sentiment Analysis
Review the capabilities and shortcomings of PHP-ML
Define the problem we are going to work on
Prove that trying to do Machine learning in PHP isn’t a completely crazy goal (optional)

A robot elephpant

What is Machine Learning?

Machine learning is a subset of Artificial Intelligence that focuses on giving “computers the ability to learn without being explicitly programmed”. This is achieved by using generic algorithms that can “learn” from a particular set of data.

For example, one common usage of machine learning is classification. Classification algorithms are used to put data into different groups or categories. Some examples of classification applications are:

Email spam filters
Market segmentation
Fraud detection

Machine learning is something of an umbrella term that covers many generic algorithms for different tasks. These algorithms fall into two main types, based on how they learn: supervised learning and unsupervised learning.

Supervised Learning

In supervised learning, we train our algorithm using labelled data in the form of an input object (vector) and a desired output value; the algorithm analyzes the training data and produces what is referred to as an inferred function which we can apply to a new, unlabelled dataset.

For the remainder of this post we will focus on supervised learning, just because it’s easier to see and validate the relationship between input and output. Keep in mind that both approaches are equally important and interesting; one could argue that unsupervised learning is more useful because it removes the need for labelled data.

Unsupervised Learning

This type of learning on the other hand works with unlabelled data from the get-go. We don’t know the desired output values of the dataset and we are letting the algorithm draw inferences from datasets; unsupervised learning is especially handy when doing exploratory data analysis to find hidden patterns in the data.

PHP-ML

Meet PHP-ML, a library that claims to be a fresh approach to Machine Learning in PHP. The library implements algorithms, neural networks, and tools to do data pre-processing, cross validation, and feature extraction.

I’ll be the first to admit PHP is an unusual choice for machine learning, as the language’s strengths are not that well suited for Machine Learning applications. That said, not every machine learning application needs to process petabytes of data and do massive calculations – for simple applications, we should be able to get away with using PHP and PHP-ML.

The best use case that I can see for this library right now is the implementation of a classifier, be it something like a spam filter or even sentiment analysis. We are going to define a classification problem and build a solution step by step to see how we can use PHP-ML in our projects.

The Problem

To exemplify the process of implementing PHP-ML and adding some machine learning to our applications, I wanted to find a fun problem to tackle, and what better way to showcase a classifier than building a tweet sentiment analysis class?

One of the key requirements needed to build successful machine learning projects is a decent starting dataset. Datasets are critical since they will allow us to train our classifier against already classified examples. As there has recently been significant noise in the media around airlines, what better dataset to use than tweets from customers to airlines?

Fortunately, a dataset of tweets is already available to us thanks to Kaggle. The Twitter US Airline Sentiment dataset can be downloaded from their site.

The Solution

Let’s begin by taking a look at the dataset we will be working on. The raw dataset has the following columns:

tweet_id
airline_sentiment
airline_sentiment_confidence
negativereason
negativereason_confidence
airline
airline_sentiment_gold
name
negativereason_gold
retweet_count
text
tweet_coord
tweet_created
tweet_location
user_timezone

And the data looks like the following example (the airline_sentiment_gold and negativereason_gold columns are omitted here):

tweet_id	airline_sentiment	airline_sentiment_confidence	negativereason	negativereason_confidence	airline	name	retweet_count	text	tweet_coord	tweet_created	tweet_location	user_timezone
570306133677760513	neutral	1.0			Virgin America	cairdin	0	@VirginAmerica What @dhepburn said.		2015-02-24 11:35:52 -0800		Eastern Time (US & Canada)
570301130888122368	positive	0.3486		0.0	Virgin America	jnardino	0	@VirginAmerica plus you’ve added commercials to the experience… tacky.		2015-02-24 11:15:59 -0800		Pacific Time (US & Canada)
570301083672813571	neutral	0.6837			Virgin America	yvonnalynn	0	@VirginAmerica I didn’t today… Must mean I need to take another trip!		2015-02-24 11:15:48 -0800	Lets Play	Central Time (US & Canada)
570301031407624196	negative	1.0	Bad Flight	0.7033	Virgin America	jnardino	0	“@VirginAmerica it’s really aggressive to blast obnoxious “”entertainment”” in your guests’ faces & they have little recourse”		2015-02-24 11:15:36 -0800		Pacific Time (US & Canada)
570300817074462722	negative	1.0	Can’t Tell	1.0	Virgin America	jnardino	0	@VirginAmerica and it’s a really big bad thing about it		2015-02-24 11:14:45 -0800		Pacific Time (US & Canada)
570300767074181121	negative	1.0	Can’t Tell	0.6842	Virgin America	jnardino	0	“@VirginAmerica seriously would pay $30 a flight for seats that didn’t have this playing. it’s really the only bad thing about flying VA”		2015-02-24 11:14:33 -0800		Pacific Time (US & Canada)
570300616901320704	positive	0.6745		0.0	Virgin America	cjmcginnis	0	“@VirginAmerica yes, nearly every time I fly VX this “ear worm” won’t go away :)”		2015-02-24 11:13:57 -0800	San Francisco CA	Pacific Time (US & Canada)
570300248553349120	neutral	0.634			Virgin America	pilot	0	“@VirginAmerica Really missed a prime opportunity for Men Without Hats parody, there. https://t.co/mWpG7grEZP”		2015-02-24 11:12:29 -0800	Los Angeles	Pacific Time (US & Canada)

The file contains 14,640 tweets, so it’s a decent dataset for us to work with. It also has far more columns than we need for our example; for practical purposes we only care about the following two:

text
airline_sentiment

Here, text will become our feature and airline_sentiment will become our target. The rest of the columns can be discarded, as they will not be used in our exercise. Let’s start by creating the project and initializing Composer with the following file:

{
    "name": "amacgregor/phpml-exercise",
    "description": "Example implementation of a Tweet sentiment analysis with PHP-ML",
    "type": "project",
    "require": {
        "php-ai/php-ml": "^0.4.1"
    },
    "license": "Apache-2.0",
    "authors": [
        {
            "name": "Allan MacGregor",
            "email": "amacgregor@allanmacgregor.com"
        }
    ],
    "autoload": {
        "psr-4": {"PhpmlExercise\\": "src/"}
    },
    "minimum-stability": "dev"
}

composer install

If you need an introduction to Composer, see here.

To make sure we are set up correctly, let’s create a quick script that will load our Tweets.csv data file and make sure it has the data we need. Copy the following code as reviewDataset.php in the root of our project:

<?php
namespace PhpmlExercise;

require __DIR__ . '/vendor/autoload.php';

use Phpml\Dataset\CsvDataset;

$dataset = new CsvDataset('datasets/raw/Tweets.csv',1);

foreach ($dataset->getSamples() as $sample) {
    print_r($sample);
}

Now, run the script with php reviewDataset.php, and let’s review the output:

Array( [0] => 569587371693355008 )
Array( [0] => 569587242672398336 )
Array( [0] => 569587188687634433 )
Array( [0] => 569587140490866689 )

Now that doesn’t look useful, does it? Let’s take a look at the CsvDataset class to get a better idea of what’s happening internally:

<?php 

    public function __construct(string $filepath, int $features, bool $headingRow = true)
    {
        if (!file_exists($filepath)) {
            throw FileException::missingFile(basename($filepath));
        }

        if (false === $handle = fopen($filepath, 'rb')) {
            throw FileException::cantOpenFile(basename($filepath));
        }

        if ($headingRow) {
            $data = fgetcsv($handle, 1000, ',');
            $this->columnNames = array_slice($data, 0, $features);
        } else {
            $this->columnNames = range(0, $features - 1);
        }

        while (($data = fgetcsv($handle, 1000, ',')) !== false) {
            $this->samples[] = array_slice($data, 0, $features);
            $this->targets[] = $data[$features];
        }
        fclose($handle);
    }

The CsvDataset constructor takes 3 arguments:

A file-path to the source CSV
An integer that specifies the number of features in our file
A boolean to indicate if the first row is header

If we look a little closer we can see that the class is mapping out the CSV file into two internal arrays: samples and targets. Samples contains all the features provided by the file and targets contains the known values (negative, positive, or neutral).

Based on the above, our CSV file needs to have the following format, with the feature columns first and the target last:

| feature_1 | feature_2 | feature_n | target |
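To see why the raw file gave us tweet IDs as samples, here is a minimal sketch of the same slicing logic applied to a single row. The splitRow() helper is hypothetical (it is not part of PHP-ML) and the rows are shortened for illustration.

```php
<?php

/**
 * Split one CSV row into its feature columns and its target column,
 * mirroring the array_slice() logic in the constructor shown above.
 * splitRow() is a hypothetical helper, not part of PHP-ML.
 */
function splitRow(array $row, int $features): array
{
    return [
        'sample' => array_slice($row, 0, $features),
        'target' => $row[$features],
    ];
}

// Raw file with $features = 1: the "sample" is just the tweet ID
// and the "target" is whatever sits in the second column.
$raw = splitRow(['570306133677760513', 'neutral', '1.0'], 1);

// Cleaned row (text first, sentiment second): this is the shape we want.
$clean = splitRow(['@VirginAmerica What @dhepburn said.', 'neutral'], 1);

print_r($raw);
print_r($clean);
```

With one feature, everything hinges on the column order: the first column becomes the sample and the second becomes the target.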

We will need to generate a clean dataset with only the columns we need to continue working. Let’s call this script generateCleanDataset.php:

<?php
namespace PhpmlExercise;

require __DIR__ . '/vendor/autoload.php';

use Phpml\Exception\FileException;

$sourceFilepath         = __DIR__ . '/datasets/raw/Tweets.csv';
$destinationFilepath    = __DIR__ . '/datasets/clean_tweets.csv';

$rows = [];

$rows = getRows($sourceFilepath, $rows);
writeRows($destinationFilepath, $rows);


/**
 * @param $filepath
 * @param $rows
 * @return array
 */
function getRows($filepath, $rows)
{
    $handle = checkFilePermissions($filepath);

    while (($data = fgetcsv($handle, 1000, ',')) !== false) {
        $rows[] = [$data[10], $data[1]];
    }
    fclose($handle);
    return $rows;
}

/**
 * @param $filepath
 * @param string $mode
 * @return bool|resource
 * @throws FileException
 */
function checkFilePermissions($filepath, $mode = 'rb')
{
    // Only a file opened for reading has to exist already; 'wb' creates it.
    if ($mode[0] === 'r' && !file_exists($filepath)) {
        throw FileException::missingFile(basename($filepath));
    }

    if (false === $handle = fopen($filepath, $mode)) {
        throw FileException::cantOpenFile(basename($filepath));
    }
    return $handle;
}

/**
 * @param $filepath
 * @param $rows
 * @internal param $list
 */
function writeRows($filepath, $rows)
{
    $handle = checkFilePermissions($filepath, 'wb');

    foreach ($rows as $row) {
        fputcsv($handle, $row);
    }

    fclose($handle);
}

Nothing too complex, just enough to do the job. Let’s execute it with php generateCleanDataset.php.
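The script picks its columns by position (indexes 10 and 1), which breaks silently if the column order ever changes. As an optional variation, here is a small sketch that looks the columns up by name from the header row instead. The selectColumns() helper is hypothetical, and the header below is shortened for illustration.

```php
<?php

/**
 * Pick columns by name instead of by position.
 * selectColumns() is a hypothetical helper for illustration only.
 *
 * @param string[] $header the first row of the CSV
 * @param string[] $row    a data row
 * @param string[] $wanted column names, in the order we want them
 */
function selectColumns(array $header, array $row, array $wanted): array
{
    $positions = array_flip($header); // column name => index

    $selected = [];
    foreach ($wanted as $name) {
        if (!isset($positions[$name])) {
            throw new InvalidArgumentException("Unknown column: $name");
        }
        // Fall back to an empty string if a row is shorter than the header.
        $selected[] = $row[$positions[$name]] ?? '';
    }
    return $selected;
}

// Shortened header and row, for illustration only.
$header = ['tweet_id', 'airline_sentiment', 'text'];
$row    = ['570306133677760513', 'neutral', '@VirginAmerica What @dhepburn said.'];

$cleanRow = selectColumns($header, $row, ['text', 'airline_sentiment']);
print_r($cleanRow);
```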

Now, let’s go ahead and point our reviewDataset.php script at the clean dataset (datasets/clean_tweets.csv) and run it again:

Array
(
    [0] => @AmericanAir That will be the third time I have been called by 800-433-7300 an hung on before anyone speaks. What do I do now???
)
Array
(
    [0] => @AmericanAir How clueless is AA. Been waiting to hear for 2.5 weeks about a refund from a Cancelled Flightled flight &amp; been on hold now for 1hr 49min
)

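BAM! This is data we can work with! So far, we have been creating simple scripts to manipulate the data. Next, we are going to create a new class in src/Classification/SentimentAnalysis.php (the directory name has to match the Classification segment of the namespace for PSR-4 autoloading to work on case-sensitive filesystems).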

<?php
namespace PhpmlExercise\Classification;

/**
 * Class SentimentAnalysis
 * @package PhpmlExercise\Classification
 */
class SentimentAnalysis { 
    public function train() {}
    public function predict() {}
}

Our SentimentAnalysis class will need two functions:

A train function, which will take our dataset training samples and labels and some optional parameters.
A predict function, which will take an unlabelled dataset and assign a set of labels based on the training data.

In the root of the project, create a script called classifyTweets.php. We will use this script to instantiate and test our sentiment analysis class. Here is the template that we will use:

<?php

namespace PhpmlExercise;
use PhpmlExercise\Classification\SentimentAnalysis;

require __DIR__ . '/vendor/autoload.php';

// Step 1: Load the Dataset

// Step 2: Prepare the Dataset

// Step 3: Generate the training/testing Dataset

// Step 4: Train the classifier 

// Step 5: Test the classifier accuracy

Step 1: Load the Dataset

We already have the basic code that we can use for loading a CSV into a dataset object from our earlier examples. We are going to use the same code with a few tweaks:

<?php
...
use Phpml\Dataset\CsvDataset;
...
$dataset = new CsvDataset('datasets/clean_tweets.csv',1);

$samples = [];
foreach ($dataset->getSamples() as $sample) {
    $samples[] = $sample[0];
}

This generates a flat array with only the features – in this case the tweet text – which we are going to use to train our classifier.

Step 2: Prepare the Dataset

Now, having the raw text and passing that to a classifier wouldn’t be useful or accurate since every tweet is essentially different. Fortunately, there are ways of dealing with text when trying to apply classification or machine learning algorithms. For this example, we are going to make use of the following two classes:

Token Count Vectorizer: This will transform a collection of text samples to a vector of token counts. Essentially, every word in our tweets is assigned a unique index, and each sample’s vector records how many times each word occurs in that sample.
Tf-idf Transformer: short for term frequency–inverse document frequency, is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.

Let’s start with our text vectorizer:

<?php
...
use Phpml\FeatureExtraction\TokenCountVectorizer;
use Phpml\Tokenization\WordTokenizer;

...
$vectorizer = new TokenCountVectorizer(new WordTokenizer());

$vectorizer->fit($samples);
$vectorizer->transform($samples);
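To make the transformation more concrete, here is a pure PHP sketch of what token counting does to two short samples. It is a simplified illustration of the idea, not PHP-ML’s implementation; the countTokens() function and its tokenizing rule are assumptions made for this example.

```php
<?php

/**
 * Build a vocabulary from all samples and turn each sample into a vector of
 * token counts. A simplified illustration, not PHP-ML's implementation.
 */
function countTokens(array $samples): array
{
    $tokenize = function (string $text): array {
        // Lowercase, then split on anything that is not a letter or digit.
        return preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    };

    // First pass: assign every distinct token an index.
    $vocabulary = [];
    foreach ($samples as $sample) {
        foreach ($tokenize($sample) as $token) {
            if (!isset($vocabulary[$token])) {
                $vocabulary[$token] = count($vocabulary);
            }
        }
    }

    // Second pass: count occurrences of each token per sample.
    $vectors = [];
    foreach ($samples as $sample) {
        $vector = array_fill(0, count($vocabulary), 0);
        foreach ($tokenize($sample) as $token) {
            $vector[$vocabulary[$token]]++;
        }
        $vectors[] = $vector;
    }

    return [$vocabulary, $vectors];
}

[$vocabulary, $vectors] = countTokens(['great flight great crew', 'late flight']);

print_r($vocabulary); // great => 0, flight => 1, crew => 2, late => 3
print_r($vectors);    // [2, 1, 1, 0] and [0, 1, 0, 1]
```

Every sample ends up with the same length (one slot per word in the vocabulary), which is what lets a classifier compare tweets of different lengths.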

Next, apply the Tf-idf Transformer:

<?php
...

use Phpml\FeatureExtraction\TfIdfTransformer;
...
$tfIdfTransformer = new TfIdfTransformer();

$tfIdfTransformer->fit($samples);
$tfIdfTransformer->transform($samples);
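As a rough illustration of the re-weighting step, the sketch below applies tf-idf to two small count vectors. It assumes the common idf = log(N / df) formulation; the tfIdf() function is hypothetical and the library’s exact weighting may differ.

```php
<?php

/**
 * Re-weight token-count vectors with tf-idf, using the common
 * idf = log(N / df) formulation (an assumption for this illustration).
 */
function tfIdf(array $vectors): array
{
    $documentCount = count($vectors);
    if ($documentCount === 0) {
        return [];
    }

    // Document frequency: in how many samples does each token appear?
    $df = array_fill(0, count($vectors[0]), 0);
    foreach ($vectors as $vector) {
        foreach ($vector as $index => $count) {
            if ($count > 0) {
                $df[$index]++;
            }
        }
    }

    // Scale each count by the inverse document frequency of its token.
    $weighted = [];
    foreach ($vectors as $vector) {
        $row = [];
        foreach ($vector as $index => $count) {
            $idf   = $df[$index] > 0 ? log($documentCount / $df[$index]) : 0.0;
            $row[] = $count * $idf;
        }
        $weighted[] = $row;
    }
    return $weighted;
}

// Columns: great, flight, crew, late
$weighted = tfIdf([[2, 1, 1, 0], [0, 1, 0, 1]]);
print_r($weighted);
```

Note how "flight", which appears in both samples, is weighted down to zero, while words that appear in only one sample keep a positive weight.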

Our samples array is now in a format that can easily be understood by our classifier. We are not done yet, though: we need to label each sample with its corresponding sentiment.

Step 3: Generate the Training Dataset

Fortunately, PHP-ML has this need already covered and the code is quite simple:

<?php
...
use Phpml\Dataset\ArrayDataset;
...
$dataset = new ArrayDataset($samples, $dataset->getTargets());

We could go ahead and use this dataset and train our classifier. We are missing a testing dataset to use as validation, however, so we are going to “cheat” a little bit and split our original dataset into two: a training dataset and a much smaller dataset that will be used for testing the accuracy of our model.

<?php
...
use Phpml\CrossValidation\StratifiedRandomSplit;
...
$randomSplit = new StratifiedRandomSplit($dataset, 0.1);

$trainingSamples = $randomSplit->getTrainSamples();
$trainingLabels  = $randomSplit->getTrainLabels();

$testSamples = $randomSplit->getTestSamples();
$testLabels  = $randomSplit->getTestLabels();

Holding back part of the data for testing like this is a simple form of cross-validation. The term comes from statistics and can be defined as follows:

Cross-validation, sometimes called rotation estimation, is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice. — Wikipedia.com
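If you are curious what "stratified" means in practice, the following sketch splits a dataset so that each label keeps roughly the same share in the training and test sets. It is a simplified take on the idea, written from scratch; the stratifiedSplit() function, its array keys and the fixed seed are assumptions for this example.

```php
<?php

/**
 * Split samples into training and test sets while keeping each label's share
 * roughly the same in both. A simplified sketch of the idea behind a
 * stratified split; stratifiedSplit() is a hypothetical helper.
 */
function stratifiedSplit(array $samples, array $labels, float $testSize, int $seed = 42): array
{
    mt_srand($seed); // fixed seed so the split is reproducible

    // Group sample indexes by label.
    $byLabel = [];
    foreach ($labels as $index => $label) {
        $byLabel[$label][] = $index;
    }

    $split = ['trainSamples' => [], 'trainLabels' => [], 'testSamples' => [], 'testLabels' => []];

    foreach ($byLabel as $indexes) {
        shuffle($indexes);
        $testCount = (int) round(count($indexes) * $testSize);

        // The first $testCount shuffled items of each label go to the test set.
        foreach ($indexes as $position => $index) {
            $target = $position < $testCount ? 'test' : 'train';
            $split[$target . 'Samples'][] = $samples[$index];
            $split[$target . 'Labels'][]  = $labels[$index];
        }
    }
    return $split;
}

// 10 negative and 10 positive samples, with 20% held back for testing.
$samples = range(1, 20);
$labels  = array_merge(array_fill(0, 10, 'negative'), array_fill(0, 10, 'positive'));
$split   = stratifiedSplit($samples, $labels, 0.2);

print_r(array_count_values($split['testLabels']));
```

With a plain random split, a small test set could easily end up with almost no positive tweets; splitting per label avoids that.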

Step 4: Train the Classifier

Finally, we are ready to go back and implement our SentimentAnalysis class. If you haven’t noticed by now, a huge part of machine learning is about gathering and manipulating the data; the actual implementation of the Machine learning models tends to be a lot less involved.

To implement our sentiment analysis class, we have three classification algorithms available:

Support Vector Classification
KNearestNeighbors
NaiveBayes

For this exercise we are going to use the simplest of them all, the NaiveBayes classifier, so let’s go ahead and update our class to implement the train method:

<?php

namespace PhpmlExercise\Classification;
use Phpml\Classification\NaiveBayes;

class SentimentAnalysis
{
    protected $classifier;

    public function __construct()
    {
        $this->classifier = new NaiveBayes();
    }
    public function train($samples, $labels)
    {
        $this->classifier->train($samples, $labels);
    }
}

As you can see, we are letting PHP-ML do all the heavy lifting for us. We are just creating a nice little abstraction for our project. But how do we know if our classifier is actually training and working? Time to use our testSamples and testLabels.
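For a rough sense of what a naive Bayes classifier does with our count vectors, here is a tiny multinomial variant with add-one smoothing. This is a generic textbook-style sketch and an assumption on my part; TinyNaiveBayes is a hypothetical class and not how PHP-ML implements its NaiveBayes classifier.

```php
<?php

/**
 * A tiny multinomial naive Bayes classifier over token-count vectors, with
 * add-one (Laplace) smoothing. Illustrative only; TinyNaiveBayes is a
 * hypothetical class and is not PHP-ML's implementation.
 */
class TinyNaiveBayes
{
    private $classCounts = []; // label => number of training samples
    private $tokenTotals = []; // label => summed count vector
    private $sampleCount = 0;

    public function train(array $samples, array $labels): void
    {
        foreach ($samples as $i => $vector) {
            $label = $labels[$i];
            $this->classCounts[$label] = ($this->classCounts[$label] ?? 0) + 1;
            foreach ($vector as $index => $count) {
                $this->tokenTotals[$label][$index] = ($this->tokenTotals[$label][$index] ?? 0) + $count;
            }
            $this->sampleCount++;
        }
    }

    public function predict(array $vector): string
    {
        $bestLabel = '';
        $bestScore = -INF;

        foreach ($this->classCounts as $label => $classCount) {
            $totals      = $this->tokenTotals[$label];
            $denominator = array_sum($totals) + count($totals);

            // log P(label) + sum over tokens of count * log P(token | label)
            $score = log($classCount / $this->sampleCount);
            foreach ($vector as $index => $count) {
                $score += $count * log((($totals[$index] ?? 0) + 1) / $denominator);
            }

            if ($score > $bestScore) {
                $bestScore = $score;
                $bestLabel = (string) $label;
            }
        }
        return $bestLabel;
    }
}

// Columns: great, late, crew
$model = new TinyNaiveBayes();
$model->train(
    [[2, 0, 1], [1, 0, 0], [0, 2, 0], [0, 1, 1]],
    ['positive', 'positive', 'negative', 'negative']
);

$first  = $model->predict([1, 0, 0]); // a sample containing only "great"
$second = $model->predict([0, 1, 0]); // a sample containing only "late"
```

The classifier simply asks which label makes the observed words most likely, which is why the quality of the vectorized data matters so much.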

Step 5: Test the Classifier’s Accuracy

Before we can proceed with testing our classifier, we do have to implement the prediction method:

<?php
...
class SentimentAnalysis
{
...
    public function predict($samples)
    {
        return $this->classifier->predict($samples);
    }
}

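And again, PHP-ML is doing all the heavy lifting for us. Let’s update our classifyTweets.php script accordingly, instantiating the class and training it before asking for predictions:

<?php
...
$classifier = new SentimentAnalysis();
$classifier->train($trainingSamples, $trainingLabels);

$predictedLabels = $classifier->predict($testSamples);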

Finally, we need a way to test the accuracy of our trained model; thankfully PHP-ML has that covered too, and they have several metrics classes. In our case, we are interested in the accuracy of the model. Let’s take a look at the code:

<?php
...
use Phpml\Metric\Accuracy;
...
echo 'Accuracy: '.Accuracy::score($testLabels, $predictedLabels);

We should see something along the lines of the following, meaning roughly 74% of the test tweets were labelled correctly:

Accuracy: 0.73651877133106
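The score is a fraction between 0 and 1: the number of correct predictions divided by the total number of predictions. A minimal sketch of that calculation, using a hypothetical accuracy() function and made-up labels, looks like this:

```php
<?php

/**
 * Fraction of predictions that match the actual labels.
 * accuracy() is a hypothetical stand-in written for illustration.
 */
function accuracy(array $actual, array $predicted): float
{
    if (count($actual) !== count($predicted) || count($actual) === 0) {
        throw new InvalidArgumentException('Both arrays need the same, non-zero size.');
    }

    $correct = 0;
    foreach ($actual as $index => $label) {
        if ($label === $predicted[$index]) {
            $correct++;
        }
    }
    return $correct / count($actual);
}

// Made-up labels: three of the four predictions are right.
$score = accuracy(
    ['negative', 'negative', 'positive', 'neutral'],
    ['negative', 'positive', 'positive', 'neutral']
);

printf("Accuracy: %.2f (%.0f%%)\n", $score, $score * 100);
```

This prints "Accuracy: 0.75 (75%)", so multiplying the score by 100 is all it takes to report it as a percentage.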

Conclusion

This article fell a bit on the long side, so let’s do a recap of what we’ve learned so far:

Having a good dataset from the start is critical for implementing machine learning algorithms.
The difference between supervised learning and unsupervised learning.
The meaning and use of cross-validation in machine learning.
That vectorization and transformation are essential to prepare text datasets for machine learning.
How to implement a Twitter sentiment analysis by using PHP-ML’s NaiveBayes classifier.

This post also served as an introduction to the PHP-ML library and hopefully gave you a good idea of what the library can do and how it can be embedded in your own projects.

Finally, this post is by no means comprehensive and there is plenty to learn, improve and experiment with; here are some ideas to get you started on how to improve things further:

Replace the NaiveBayes algorithm with the Support Vector Classification algorithm.
If you tried running against the full dataset (more than 14,000 rows), you’d probably notice how memory intensive the process can be. Try implementing model persistence so the model doesn’t have to be trained on each run.
Move the dataset generation to its own helper class.
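As a starting point for the persistence idea, here is a minimal sketch that writes a model object to disk and reads it back with PHP’s built-in serialize() and unserialize(). The saveModel() and loadModel() helpers are hypothetical, and the approach assumes your trained classifier object survives serialization. PHP-ML also ships a ModelManager class intended for this job, so check the documentation for the version you have installed before rolling your own.

```php
<?php

/**
 * Save any serializable model object to disk and load it back.
 * saveModel() and loadModel() are hypothetical helpers; the approach assumes
 * the classifier object can be passed through serialize()/unserialize().
 */
function saveModel($model, string $path): void
{
    if (file_put_contents($path, serialize($model)) === false) {
        throw new RuntimeException("Could not write model to $path");
    }
}

function loadModel(string $path)
{
    $contents = @file_get_contents($path);
    if ($contents === false) {
        throw new RuntimeException("Could not read model from $path");
    }
    return unserialize($contents);
}

// Stand-in for a trained classifier, used here only to exercise the helpers.
$model = new ArrayObject(['labels' => ['negative', 'neutral', 'positive']]);

$path = tempnam(sys_get_temp_dir(), 'model');
saveModel($model, $path);
$restored = loadModel($path);
unlink($path);
```

Only unserialize files you created yourself, since unserializing untrusted data is a known security risk in PHP.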

I hope you found this article useful. If you have some application ideas regarding PHP-ML or any questions, don’t hesitate to drop them below into the comments area!

Frequently Asked Questions (FAQs) on PHP Machine Learning for Tweet Sentiment Analysis

How Can I Improve the Accuracy of My Sentiment Analysis?

Improving the accuracy of sentiment analysis involves several strategies. First, ensure that your training data is as clean and relevant as possible. This means removing noise such as stop words, punctuation, and URLs. Second, consider using a more sophisticated algorithm. While the Naive Bayes classifier is a good starting point, other algorithms such as Support Vector Machines (SVM) or deep learning models may provide better results. Lastly, consider using a larger dataset for training. In general, the more representative data your model has to learn from, the more accurate it tends to be.
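The cleaning step can be sketched in a few lines of plain PHP. The cleanTweet() function, its deliberately short stop-word list and the example tweet are all assumptions for illustration; a real project would want a fuller stop-word list.

```php
<?php

/**
 * Normalise a tweet before vectorizing it: strip URLs and punctuation,
 * lowercase the text, and drop a few stop words.
 * cleanTweet() and the short stop-word list are illustrative assumptions.
 */
function cleanTweet(string $tweet): string
{
    $stopWords = ['a', 'an', 'the', 'to', 'is', 'and', 'of'];

    $text = strtolower($tweet);
    $text = preg_replace('~https?://\S+~', ' ', $text);  // remove URLs
    $text = preg_replace('/[^a-z0-9@#\s]/', ' ', $text); // remove punctuation, keep @mentions and #tags

    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    $words = array_diff($words, $stopWords);

    return implode(' ', $words);
}

// Made-up tweet for illustration.
$cleaned = cleanTweet('@VirginAmerica The flight is LATE again!!! https://example.com/delay');

echo $cleaned, "\n"; // @virginamerica flight late again
```

Running each tweet through a function like this before vectorizing keeps the vocabulary smaller and focused on words that carry sentiment.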

Can I Use Other Languages Besides PHP for Sentiment Analysis?

Yes, you can use other programming languages for sentiment analysis. Python, for example, is a popular choice due to its extensive machine learning libraries such as NLTK, TextBlob, and scikit-learn. However, PHP can also be used effectively for sentiment analysis, especially if you’re already comfortable with the language or if your project is built on a PHP framework.

How Can I Handle Sarcasm and Irony in Sentiment Analysis?

Handling sarcasm and irony in sentiment analysis is a challenging task. These linguistic features often involve saying something but meaning the opposite, which can be difficult for a machine learning model to understand. One approach is to use a more sophisticated model that can understand context, such as a deep learning model. Another approach is to use a specialized sarcasm detection model, which can be trained on a dataset of sarcastic comments.

How Can I Use Sentiment Analysis for Other Social Media Platforms?

The principles of sentiment analysis can be applied to any text data, including posts from other social media platforms. The main difference would be in how you collect the data. Each social media platform has its own API for accessing user posts, so you would need to familiarize yourself with the API of the platform you’re interested in.

Can I Use Sentiment Analysis for Languages Other Than English?

Yes, sentiment analysis can be used for any language. However, the effectiveness of the analysis will depend on the quality of your training data. If you’re working with a language other than English, you’ll need a dataset in that language to train your model. Some machine learning libraries also support multiple languages out of the box.

How Can I Visualize the Results of My Sentiment Analysis?

There are many ways to visualize sentiment analysis results. One common method is to use a bar chart to show the number of positive, negative, and neutral tweets. Another method is to use a word cloud to visualize the most frequently used words in your data. PHP has several libraries for creating these visualizations, such as pChart and GD.
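Before reaching for a charting library, a quick text-only chart is often enough to sanity-check the label distribution. The sketch below uses a hypothetical sentimentBars() function and made-up predictions.

```php
<?php

/**
 * Render label counts as a plain-text bar chart.
 * sentimentBars() is a hypothetical helper; a real project would more likely
 * hand the counts to a charting library.
 */
function sentimentBars(array $labels): array
{
    $counts = array_count_values($labels);
    ksort($counts); // alphabetical order: negative, neutral, positive

    $lines = [];
    foreach ($counts as $label => $count) {
        // Left-align the label in 8 characters, then one '#' per tweet.
        $lines[] = sprintf('%-8s %s %d', $label, str_repeat('#', $count), $count);
    }
    return $lines;
}

// Made-up predictions for illustration.
$lines = sentimentBars(['negative', 'positive', 'negative', 'neutral', 'negative']);
echo implode("\n", $lines), "\n";
```

The same array_count_values() result is what you would pass to a bar chart in a graphical library.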

How Can I Use Sentiment Analysis in a Real-World Application?

Sentiment analysis has many real-world applications. Businesses can use it to monitor customer opinions about their products or services, politicians can use it to gauge public opinion on policy issues, and researchers can use it to study social trends. The possibilities are endless.

How Can I Handle Emojis in Sentiment Analysis?

Emojis can carry significant sentiment information, so it’s important to include them in your analysis. One approach is to replace each emoji with its textual description before feeding the data into your model. There are libraries available that can help with this, such as Emojione for PHP.
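The replacement approach can be sketched without any library by using a small lookup table. The describeEmoji() function and its three-entry mapping are assumptions for illustration; a dedicated library would cover the full emoji set.

```php
<?php

/**
 * Replace emoji with short textual descriptions so a tokenizer can keep them.
 * The mapping below is a tiny hand-written sample, not a complete list.
 */
function describeEmoji(string $text): string
{
    $map = [
        "\u{1F600}" => ' grinning_face ',
        "\u{1F620}" => ' angry_face ',
        "\u{1F44D}" => ' thumbs_up ',
    ];

    // Swap each emoji, then collapse the extra whitespace we introduced.
    $replaced = strtr($text, $map);
    return trim(preg_replace('/\s+/', ' ', $replaced));
}

$described = describeEmoji("Delayed again \u{1F620}");

echo $described, "\n"; // Delayed again angry_face
```

Once emoji are plain words, they flow through the vectorizer like any other token and can contribute to the predicted sentiment.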

How Can I Deal with Spelling Mistakes in Sentiment Analysis?

Spelling mistakes can be a challenge in sentiment analysis. One approach is to use a spell checker to correct mistakes before feeding the data into your model. Another approach is to use a model that can handle spelling mistakes, such as a deep learning model.

How Can I Keep My Sentiment Analysis Model Up-to-Date?

Keeping your sentiment analysis model up-to-date involves regularly retraining it on new data. This ensures that your model stays current with changes in language use and sentiment. You can automate this process by setting up a schedule for retraining your model.