By Bruno Skvorc

Fighting Recruiter Spam with PHP – Proof of Concept

Ever since I moved off of Google services (due to quality, not privacy concerns), I’d been looking for the perfect email service. Having tried several, and having been with FastMail for a while now, I came to the realization that there’s no such thing. The biggest concern I have with modern email providers is that they are all quite bad at spam control.

I don’t mean the “Nigerian prince” type of spam, which is mostly blocked successfully (unless you’re using FastMail – they can’t even recognize those) but stuff that I’m really, really not interested in getting. Case in point, recruiter spam.

Illustration of blocked email

In this tutorial, we’ll get started with building a custom email processor which can read individual emails, run them through some predefined rules, and act on them. The end result will be very similar to what many providers offer out of the box, but it’ll lay the groundwork for more advanced aspects in future posts. Example uses of our app:

  • when recruiter-type keywords are detected, reply to the email with a template response and delete it. This is possible to some extent with rules that most email providers offer, but those aren’t very detailed, and usually don’t support variables.
  • when companies keep sending you emails even after you unsubscribe or report them for spam (e.g. Ello), the engine should remember these and purge them automatically in the future. Some providers (e.g. FastMail) won’t stop a sender from getting into your inbox even after hundreds of spam reports.

This way, we can keep the provider we’re used to, but also make some manual improvements their team just didn’t know how to make.

In this post, we’ll focus on the first use case.

Bootstrapping

Feel free to use your own environment if you have one set up – we’ll be using our Homestead Improved box as usual to get a pre-made environment going in a matter of minutes. To follow along, execute the following:

git clone https://github.com/swader/homestead_improved hi_mail
cd hi_mail
./bin/folderfix.sh
vagrant up; vagrant ssh
mkdir -p Code/Project/public

You should already have homestead.app in your /etc/hosts file if you’ve used Homestead Improved before. If not, add it as per instructions. The default site included with the box points to ~/Code/Project, which is good enough for us.

Once inside the box, we’ll create an index.php file in ~/Code/Project/public with some demo code:

<?php
phpinfo();

This screen immediately tells us what we need to know: is php-imap installed, or not?

screenshot of php-imap shown on the phpinfo screen

Sure enough, it comes pre-installed with the Homestead Improved box. If you’re missing php-imap, please follow the instructions to get it installed – we’ll need it before moving on (on Ubuntu, sudo apt-get install php7.0-imap should do the trick).
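If you’d rather not scan the whole phpinfo() page, the same check can be done in code. This is a minimal sketch that only relies on PHP’s built-in extension_loaded() function; the messages it prints are our own.

```php
<?php
// Report whether the IMAP extension is available to this PHP runtime.
$hasImap = extension_loaded('imap');

echo $hasImap
    ? "php-imap is loaded\n"
    : "php-imap is missing; install it before continuing\n";
```

Keep in mind that the command line and the web server can use different PHP configurations, so it’s worth running the check in the same context the app will run in.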

As a final bootstrapping step, let’s install the package we’ll use to interact with our IMAP inbox: tedivm/fetch.

composer require tedivm/fetch

Of course, we need to modify our index.php file to include Composer’s autoloader as well:

<?php

require_once '../vendor/autoload.php';

Reading IMAP Inboxes

I have both a Gmail account and a FastMail account. The examples below will be using both, so that we can show the differences between the two inboxes, and apply tweaks as necessary to make our project provider-agnostic.

In PHP, built-in imap functions work like native file functions – you create a handle, and then pass it around into other functions. The API is old (really, really old!), so it only exists in this procedural form. This is why, whenever we can, we’ll use the Fetch library we installed previously.
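To make the handle-passing style concrete, here is a rough sketch of what the procedural API expects. The mailbox string format ({host:port/flags}folder) is what imap_open() takes; buildMailboxString() is a hypothetical helper of ours, and the host and credentials are placeholders. The calls that need a real server are left commented out so the snippet runs anywhere.

```php
<?php
/**
 * Build the mailbox specification string that imap_open() expects,
 * e.g. "{imap.example.com:993/imap/ssl}INBOX".
 * buildMailboxString() is a hypothetical helper, not part of PHP or Fetch.
 */
function buildMailboxString(string $host, int $port = 993, string $folder = 'INBOX', bool $ssl = true): string
{
    $flags = '/imap' . ($ssl ? '/ssl' : '');

    return '{' . $host . ':' . $port . $flags . '}' . $folder;
}

$mailbox = buildMailboxString('imap.example.com');
echo $mailbox, "\n"; // {imap.example.com:993/imap/ssl}INBOX

// With real credentials, the handle gets passed around like a file handle:
// $handle = imap_open($mailbox, 'user@example.com', 'app-password');
// $unseen = imap_search($handle, 'UNSEEN'); // array of message numbers, or false
// imap_close($handle);
```

Fetch hides this string building and the handle behind its Server class, which is the main reason we lean on it.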

Gmail – Basic Fetching

Let’s start with baby steps and log into our Gmail account. First, under account settings and in the Forwarding and POP/IMAP tab, make sure Enable IMAP is activated.

Enabled IMAP in Gmail

<?php

require_once '../vendor/autoload.php';

use Fetch\Server;

$server = new Server('imap.googlemail.com', 993);
$server->setAuthentication('account@gmail.com', 'password');


$messages = $server->getMessages();
/** @var $message \Fetch\Message */
foreach ($messages as $i => $message) {
    echo "Subject for {$i}: {$message->getSubject()}\n<br>";
}

Following the Fetch docs, we attempt to initiate a connection with our Gmail account. Unfortunately, if you have 2FA (two-factor authentication) activated, you’ll see the following error:

Exception requesting an application specific password for Gmail

This is easily rectified. We can go to our Google account’s app passwords page and generate one (select “Other” from the menu, give it an arbitrary name, and copy the password into the code). Now if we test things…

Gmail email subjects displayed on screen

Excellent – we got our Gmail emails. Now let’s hook into FastMail.

FastMail – Basic Fetching

Similar to Gmail, FastMail also supports app passwords, but it requires them whether or not you use 2FA. Create one here.

Generating an app password for Fastmail

The values for FastMail are as follows:

$server = new Server('imap.fastmail.com', 993);
$server->setAuthentication('username@fastmail.com', 'password');

$messages = $server->getMessages();
/** @var $message \Fetch\Message */
foreach ($messages as $i => $message) {
    echo "Subject for {$i}: {$message->getSubject()}\n<br>";
}

Note that the messages get fetched as a flat, non-nested list, so even though your inbox might show a number like 100, the real number of emails retrieved might be much higher, because replies that the provider groups into one conversation for context are each fetched as a separate message.

Both Gmail and FastMail default to the Inbox folder, which is exactly what we want.

Targeted Emails

Depending on the number of emails in your inboxes, you may have noticed a huge performance issue – it took forever to fetch them! Obviously, this can’t work if we want to process incoming emails in a timely manner. After all, our goal is to process all emails that come in, and deal with them if possible.

Unfortunately, since the email specification was developed back in the stone age of the internet, PHP’s imap extension gives us no way to get push notifications when a new email arrives (IMAP does define an IDLE command for this, but the extension doesn’t expose it). There is another way, though.

IMAP supports searching, and this can include flag statuses. As per the docs, passing in “UNSEEN” should return all unread messages:

$messages = $server->search('UNSEEN');

A single unseen message from the Gmail inbox

Sure enough, our email stating that we’ve successfully created an app password for this very app we’re building is still unread, sitting in the inbox. Success, and the call was quite fast, too!
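Search keys can also be combined to narrow things down further. The sketch below builds a criteria string that pairs UNSEEN with the SINCE key; unseenSince() is a hypothetical helper, and we’re assuming the string is handed to $server->search() (or imap_search()) unchanged.

```php
<?php
/**
 * Combine IMAP search keys into a single criteria string.
 * unseenSince() is a hypothetical helper; the result is meant to be
 * passed to $server->search() or imap_search().
 */
function unseenSince(DateTimeInterface $since): string
{
    // IMAP dates use the day-month-year form, e.g. 01-Jan-2016.
    return 'UNSEEN SINCE "' . $since->format('d-M-Y') . '"';
}

$criteria = unseenSince(new DateTimeImmutable('2016-01-01'));
echo $criteria, "\n"; // UNSEEN SINCE "01-Jan-2016"

// $messages = $server->search($criteria);
```

This can help on inboxes with a large backlog of old unread mail that we don’t want the engine to touch.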

Scanning the Emails

Now that we know how to retrieve unread messages, it’s time to analyze them and perform some actions on them if they trigger our rule checks.

Let’s use the first example – getting rid of recruiter spam. Recruiter emails come in different shapes and sizes, so there is no absolute way to identify them all. Instead, we can rely on several indicators, each given a numeric value, and flag an email once the sum of the matched values reaches a given threshold. For example, if 100 is the threshold required to mark an email as recruiter spam, we can produce the following table:

Rule      Value                         Pts
contains  finding IT opportunities      100
contains  PHP specialists?              80
contains  startups?                     10
contains  saw your profile on GitHub    50
contains  explore-group.com             100
from      @explorerec.com               100
contains  new position                  20
contains  urgent(ly)? need              30
contains  huge plus                     15
contains  full-stack developer          30
contains  interviews?                   20
contains  CV                            60
contains  skills                        10
contains  candidates?                   20

I’d like to thank all my Twitter followers who sent in some recruiter spam email examples and helped build the above table.

The explore-group ones refer to a team of brutally persistent spammers, so their emails will automatically trigger a recruiter spam alert.

Values are regular expressions – this allows us to do partial matches, which is particularly useful in recognizing sender domains, or strings that may vary slightly, but are essentially the same, like “PHP specialist” and “PHP specialists”.

For the sake of performance, it makes sense to check the rules in the order of their point value, descending. If a single matching rule is worth 100 points on its own, there’s no need to check the rest.
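Before wiring this up to real messages, the arithmetic can be tried out on a plain string. This is only a sketch of the threshold idea: scoreText() and the 'pattern' key are names we made up for the illustration, and the sample sentence is invented.

```php
<?php
/**
 * Sum the points of every rule whose pattern matches the text, stopping
 * as soon as the threshold is reached. scoreText() is a hypothetical
 * helper that works on plain strings rather than on Fetch messages.
 */
function scoreText(array $rules, string $text, int $threshold = 100): int
{
    // Highest-value rules first, so strong hits end the loop early.
    usort($rules, function (array $a, array $b): int {
        return $b['points'] <=> $a['points'];
    });

    $sum = 0;
    foreach ($rules as $rule) {
        if (preg_match('/' . $rule['pattern'] . '/i', $text) === 1) {
            $sum += $rule['points'];
        }
        if ($sum >= $threshold) {
            break;
        }
    }

    return $sum;
}

$sampleRules = [
    ['pattern' => 'startups?', 'points' => 10],
    ['pattern' => 'CV', 'points' => 60],
    ['pattern' => 'urgent(ly)? need', 'points' => 30],
    ['pattern' => 'interviews?', 'points' => 20],
];

$text = 'We urgently need your CV for interviews at several startups.';
$score = scoreText($sampleRules, $text);

echo $score, "\n"; // 60 + 30 + 20 = 110; the "startups?" rule is never checked
```

Note that no single rule here is worth 100, yet the message still crosses the threshold because the weaker signals add up.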

Let’s see the code for this. Please forgive the code’s spaghetti nature – as this is just a proof of concept, it’ll be OOP-ed and packaged up in a followup article.

<?php

require_once '../vendor/autoload.php';

use Fetch\Message;
use Fetch\Server;

$inboxes = [
    'primary@gmail.com' => [
        'username' => 'primary@gmail.com',
        'password' => 'password',
        'aliases' => ['onealias@gmail.com', 'anotheralias@gmail.com'],
        'smtp' => 'smtp.googlemail.com',
        'smtp_port' => '587', // read by the mailer setup when sending replies
        'imap' => 'imap.googlemail.com',
        'starttls' => true
    ],
    'primary@mydomain.com' => [
        'username' => 'someusername',
        'password' => 'password',
        'aliases' => ['alias@mydomain.com'],
        'smtp' => 'smtp.fastmail.com',
        'smtp_port' => '587',
        'imap' => 'imap.fastmail.com',
        'starttls' => true
    ]
];

$rules = [
    ['contains' => 'finding IT opportunities', 'points' => 100],
    ['contains' => 'PHP specialists?', 'points' => 80],
    ['contains' => 'startups?', 'points' => 10],
    ['contains' => 'saw your profile on GitHub', 'points' => 50],
    ['contains' => 'explore-group\.com', 'points' => 100],
    ['from' => '@explorerec\.com', 'points' => 100],
    ['contains' => 'new position', 'points' => 20],
    ['contains' => 'urgent(ly)? need', 'points' => 30],
    ['contains' => 'huge plus', 'points' => 15],
    ['contains' => 'full-stack developer', 'points' => 30],
    ['contains' => 'interviews?', 'points' => 20],
    ['contains' => 'CV', 'points' => 60],
    ['contains' => 'skills', 'points' => 10],
    ['contains' => 'candidates?', 'points' => 20],
];

$points = [];
foreach ($rules as $key => &$rule) {
    $points[$key] = $rule['points'];
    if (isset($rule['contains'])) {
        $rule['contains'] = '/' . $rule['contains'] . '/i';
    }
    if (isset($rule['from'])) {
        $rule['from'] = '/' . $rule['from'] . '/i';
    }
}
unset($rule); // break the reference left behind by the by-reference foreach
array_multisort($points, SORT_DESC, $rules);

$unreadMessages = [];
foreach ($inboxes as $id => $inbox) {
    $server = new Server($inbox['imap'], 993);
    $server->setAuthentication($inbox['username'], $inbox['password']);
    $unreadMessages[$id] = $server->search('UNSEEN');
}

foreach ($unreadMessages as $id => $messages) {
    echo "Now processing: ".$id. "<br>";
    /**
     * @var Message $message
     */
    foreach ($messages as $i => $message) {

        $verdict = isRecruiterSpam($rules, $message) ? 'probably' : 'probably not';
        echo "Subject for {$i}: {$message->getSubject()} is {$verdict} recruiter spam.\n<br>";
    }
}

function isRecruiterSpam($rules, Message $message)
{
    $sum = 0;
    foreach ($rules as $rule) {
        if (isset($rule['contains'])) {
            if (preg_match($rule['contains'], $message->getSubject())
                || preg_match($rule['contains'], $message->getHtmlBody())
            ) {
                $sum += $rule['points'];
            }
        } else {
            if (isset($rule['from'])) {
                if (preg_match($rule['from'], $message->getOverview()->from)
                ) {
                    $sum += $rule['points'];
                }
            }
        }
        if ($sum > 99) {
            return true;
        }
    }

    return false;
}

First, we define our inboxes and all the necessary configuration values. Then, we turn the rule strings into regexes by adding delimiters, and sort the rules array by the value of the points key, descending. Next, we extract all the unseen messages from all our accounts, and then iterate through them.

At this point, we call the isRecruiterSpam function on each, which in turn grabs the from field, the subject, and the HTML body, and runs the checks on them. After every rule, we check if the $sum has reached 100 points, and if so, we return true – we’re fairly certain the message is recruiter spam at that point. Otherwise, we keep summing up, and finally return false if all the rules are checked and the result is still under 100.
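Since the rules are regular expressions, a typo in one of them would only surface as a warning deep inside the scan. A small guard at the point where delimiters are added can catch that early. This is a sketch, not what the script above does: compileRule() is a hypothetical helper, and it uses ~ as the delimiter (an assumption) so that patterns containing slashes need no escaping.

```php
<?php
/**
 * Wrap a raw pattern in delimiters and check that it compiles.
 * compileRule() is a hypothetical helper; "~" is used as the delimiter
 * so that patterns containing "/" (URLs, for example) need no escaping.
 */
function compileRule(string $pattern): string
{
    $regex = '~' . $pattern . '~i';

    // preg_match() returns false (and raises a warning) for an invalid pattern.
    if (@preg_match($regex, '') === false) {
        throw new InvalidArgumentException("Invalid rule pattern: {$pattern}");
    }

    return $regex;
}

$compiled = compileRule('explore-group\.com');
echo $compiled, "\n"; // ~explore-group\.com~i

try {
    compileRule('urgent(ly need'); // unbalanced parenthesis
} catch (InvalidArgumentException $e) {
    echo $e->getMessage(), "\n";
}
```

Failing loudly when the rules are loaded is cheaper than silently skipping a broken rule for every message.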

No messages have been detected as recruiter spam

In my initial test, no messages were flagged as recruiter spam. Let’s try forwarding a past one over from the other email account, and see what happens.

Two messages have been flagged as recruiter spam!

Success! Our engine has successfully recognized recruiter spam! Now, let’s see about replying.

Sending the replies

To reply to a message, we’ll need to pull in another package. Let’s make it SwiftMailer, as it’s the de facto battle-tested standard for sending emails from PHP. The code below uses the newInstance() factory methods, which exist in the 5.x releases but were removed in 6.0, so we pin the version:

composer require "swiftmailer/swiftmailer:^5.4"

We won’t be going through the very basics of SwiftMailer here; that’s documented elsewhere.

Let’s think about what needs to be done now:

  1. A message, once read and identified as recruiter spam, needs to be marked as read, otherwise it’ll keep getting picked up in subsequent searches.
  2. When replying, the reply should come from the email address it was sent to.
  3. Ideally, an auto-replied message should be placed into another folder on the server. This is useful for periodic checking and identification of false positives.

With the requirements defined, let’s see what the code might look like.

foreach ($unreadMessages as $id => $messages) {
    echo "Now processing: " . $id . "<br>";
    if (!empty($messages)) {
        $mailer = Swift_Mailer::newInstance(
            Swift_SmtpTransport::newInstance(
                $inboxes[$id]['smtp'], $inboxes[$id]['smtp_port'],
                !empty($inboxes[$id]['starttls']) ? 'tls' : null
            )
                ->setUsername($inboxes[$id]['username'])
                ->setPassword($inboxes[$id]['password'])
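                ->setStreamOptions(
                    [
                        // Certificate checks are relaxed here for local testing only;
                        // remove these two options before using this with real mail.
                        'ssl' => [
                            'allow_self_signed' => true,
                            'verify_peer' => false,
                        ],
                    ]
                )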

        );
    } else {
        continue;
    }
    /**
     * @var Message $message
     */
    foreach ($messages as $i => $message) {

        if (isRecruiterSpam($rules, $message)) {

            $message->setFlag(Message::FLAG_SEEN);

            $potentialSender = $message->getAddresses('to')[0]['address'];
            $sender = (in_array($potentialSender, $inboxes[$id]['aliases']))
                ? $potentialSender : $inboxes[$id]['aliases'][0];

            $reply = Swift_Message::newInstance('Re: ' . $message->getSubject())
                ->setFrom($sender)
                ->setTo($message->getAddresses('from')['address'])
                ->setBody(
                    file_get_contents('../templates/recruiter.html'),
                    'text/html'
                );

            $result = $mailer->send($reply);
        }

    }
}

In a nutshell:

  • if any of the inbox keys under $unreadMessages has a non-empty array (meaning it has unread messages to check), we initiate a Mailer – this is for performance reasons. If we have many inboxes, we don’t want to build a Mailer instance for the ones with nothing to process.
  • we iterate through the unread messages with the Mailer prepared, mark each one identified as recruiter spam as read, and build an email reply. For recipient we select the original sender, and for sender we select the first address in the list of the original email’s recipients if it’s among the aliases defined for this inbox, or otherwise the first alias from the list. This is because some recruiters are so lazy they’ll mass mail a thousand people, put themselves as the recipient, and put the “victims” in BCC.
  • finally, we grab the contents of a prepared template email to inject as the email’s body, and send.
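The sender selection is easy to get subtly wrong, so it may be worth pulling out into a function that can be tested without a mail server. The sketch below is a slight variation on the loop above: it looks at every recipient instead of only the first, and compares addresses case-insensitively. pickSender() is a hypothetical helper, and $recipients mimics the array shape that $message->getAddresses('to') returns.

```php
<?php
/**
 * Choose the From address for an auto-reply.
 * pickSender() is a hypothetical helper; $recipients mimics the shape
 * returned by $message->getAddresses('to').
 */
function pickSender(array $recipients, array $aliases): string
{
    foreach ($recipients as $recipient) {
        foreach ($aliases as $alias) {
            // Address casing varies between senders, so compare loosely.
            if (strcasecmp($recipient['address'], $alias) === 0) {
                return $alias;
            }
        }
    }

    // None of our addresses is listed (we were probably in BCC):
    // fall back to the first configured alias.
    return $aliases[0];
}

$aliases = ['onealias@gmail.com', 'anotheralias@gmail.com'];

$viaTo = pickSender([['address' => 'AnotherAlias@gmail.com', 'name' => 'Bruno']], $aliases);
$viaBcc = pickSender([['address' => 'recruiter@example.com', 'name' => '']], $aliases);

echo $viaTo, "\n";  // anotheralias@gmail.com
echo $viaBcc, "\n"; // onealias@gmail.com
```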

At this point, we need to actually write the email template:

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>Recruiter reply</title>
</head>
<body>
<p>Hello!</p>
<p>Please do not be alarmed by the quick reply - it's automated.</p>
<p>Based on your email's contents, you sound like a recruiter. As such, before getting back in touch with me, please read <a href="https://www.linkedin.com/pulse/20140516082146-67624539-dear-recruiters?trk=prof-post">this</a>.</p>
<p>In case we misidentified your intentions and your email, we apologize - our bot is still young and going through some growing pains. If that's the case and you have a genuine non-recruiter concern you'd like to discuss, please feel free to reply directly to this email.</p>
<p>Kind regards,<br>Bruno</p>
</body>
</html>

Sure enough, our reply comes back as planned.

An automatic reply to recruiter spam
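The template above is static, but one of the reasons for building our own engine was support for variables. A minimal way to get there is plain string replacement. In this sketch, renderTemplate() and the {{name}} placeholder syntax are our own inventions for the illustration, and the values are escaped because they would come from an untrusted email.

```php
<?php
/**
 * Replace {{placeholders}} in a template with HTML-escaped values.
 * renderTemplate() and the {{name}} syntax are assumptions made for
 * this illustration; they are not part of SwiftMailer or Fetch.
 */
function renderTemplate(string $template, array $vars): string
{
    $replacements = [];
    foreach ($vars as $name => $value) {
        $replacements['{{' . $name . '}}'] = htmlspecialchars((string) $value, ENT_QUOTES, 'UTF-8');
    }

    return strtr($template, $replacements);
}

$html = renderTemplate(
    '<p>Hello {{name}}!</p><p>This is an automated reply to "{{subject}}".</p>',
    ['name' => 'Jane', 'subject' => 'PHP <specialists> needed']
);

echo $html, "\n";
```

The result could be passed to setBody() in place of the raw file_get_contents() output.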

Whitelisting

But what happens if the recruiter replies to our reply? You might think it’ll get an auto-reply again, since email chains contain quoted emails from before. Not so! It’s the email providers who connect emails into chains – the actual reply contains nothing but our own content, so there’s no reason to worry about someone getting stuck in a reply loop with our automatic engine.

Still, we can take care of this edge case to be on the safe side, by putting some extra content at the bottom of our template, e.g.:

...

<p>Kind regards,<br>Bruno</p>
<p style="text-align: right"><em>sent via our-little-app</em></p>
</body>
</html>

We can target this text in our rules like so:

Rule      Value                     Pts
contains  sent via our-little-app   -1000

… but then we’d have to remove the early-detection mechanism which triggers a positive identification as soon as 100 points are reached, and it’d be clumsy to keep these whitelist rules in the same set as the blacklist ones.

It’s better if we just make a brand new detection mechanism for skipping blacklist scans entirely. It’ll be a separate loop that also checks body, subject, and headers, but the performance impact will be negligible because Fetch caches the message’s properties for repeat calls (a fetched message’s content cannot change – if it changes, it’s a new message – only flags can change).

We’ll make a new isWhitelisted function, and call it before we check the other rules.

        if (!isWhitelisted($whitelistRules, $message)
            && isRecruiterSpam($rules, $message)
        ) {

// ...

function isWhitelisted($rules, Message $message) {
    foreach ($rules as $rule) {
        if (isset($rule['contains'])) {
            if (preg_match($rule['contains'], $message->getSubject())
                || preg_match($rule['contains'], $message->getHtmlBody())
            ) {
                return true;
            }
        } else {
            if (isset($rule['from'])) {
                if (preg_match($rule['from'], $message->getOverview()->from)
                ) {
                    return true;
                }
            }
        }
    }
    return false;
}

You’ll notice it’s almost identical to isRecruiterSpam, only there are no points.
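Because the two functions share their matching logic, that part can be extracted into one helper. The sketch below does this on plain arrays so it runs without Fetch: ruleMatches(), matchesAnyRule() and the subject/body/from array shape are all hypothetical, and with Fetch the array would be filled from getSubject(), getHtmlBody() and getOverview()->from.

```php
<?php
/**
 * Check one rule against a message given as a plain array with
 * 'subject', 'body' and 'from' keys. Both helpers and the array shape
 * are hypothetical stand-ins for the Fetch-based functions.
 */
function ruleMatches(array $rule, array $message): bool
{
    if (isset($rule['contains'])) {
        return preg_match($rule['contains'], $message['subject']) === 1
            || preg_match($rule['contains'], $message['body']) === 1;
    }
    if (isset($rule['from'])) {
        return preg_match($rule['from'], $message['from']) === 1;
    }

    return false;
}

function matchesAnyRule(array $rules, array $message): bool
{
    foreach ($rules as $rule) {
        if (ruleMatches($rule, $message)) {
            return true;
        }
    }

    return false;
}

$whitelist = [['contains' => '/sent via our-little-app/i']];

$reply = [
    'subject' => 'Re: Re: Great opportunity',
    'body' => '<p>Sorry!</p><em>sent via our-little-app</em>',
    'from' => 'rec@explorerec.com',
];
$fresh = [
    'subject' => 'Great opportunity',
    'body' => '<p>We saw your profile on GitHub</p>',
    'from' => 'rec@explorerec.com',
];

var_dump(matchesAnyRule($whitelist, $reply)); // bool(true)
var_dump(matchesAnyRule($whitelist, $fresh)); // bool(false)
```

A points-based check could reuse ruleMatches() in the same way and only add the summing on top.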

Naturally, we also need the $whitelistRules array, which at this point is fairly small:

$whitelistRules = [
    ['contains' => '/sent via our-little-app/i'],
];

With this, we not only made sure that emails that contain our auto-reply are ignored, but we can also easily let through emails from people/domains we know and trust. Coupled with the Contacts API many providers offer, the power we now wield over our inboxes is truly immense.
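As an illustration of the trusted-domain idea, whitelist rules can be generated from a simple list. This is a sketch under assumptions: trustedDomainRules() is a hypothetical helper, the domains are placeholders, and the lookahead at the end of the pattern is our own addition to keep example.com from also matching example.com.evil.net.

```php
<?php
/**
 * Turn a list of trusted sender domains into 'from' whitelist rules.
 * trustedDomainRules() is a hypothetical helper; the domains used
 * below are placeholders.
 */
function trustedDomainRules(array $domains): array
{
    $rules = [];
    foreach ($domains as $domain) {
        // preg_quote() escapes the dots; the lookahead rejects longer
        // host names that merely start with the trusted domain.
        $rules[] = ['from' => '/@' . preg_quote($domain, '/') . '(?![\w.-])/i'];
    }

    return $rules;
}

$trusted = trustedDomainRules(['example.com', 'mydomain.com']);

echo $trusted[0]['from'], "\n"; // /@example\.com(?![\w.-])/i
```

The resulting entries have the same shape as $whitelistRules, so they could be merged into it with array_merge(). Keep in mind that the From header is easy to forge, so this is a convenience rather than a security measure.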

Folders

The final requirement for our proof of concept was moving the messages we auto-replied to into a separate folder. The IMAP API has folder support, and Fetch has implemented it rather smoothly, so it’s only a matter of a couple of lines of code.

First, we’ll make a folder named “autoreplied” on each inbox if it doesn’t exist.

    $server = new Server($inbox['imap'], 993);
    $server->setAuthentication($inbox['username'], $inbox['password']);

    if (!$server->hasMailBox('autoreplied')) {
        $server->createMailBox('autoreplied');
    }

    $unreadMessages[$id] = $server->search('UNSEEN');

Then, after a reply has been sent, we’ll move (copy and delete) the original message into this folder.

$result = $mailer->send($reply);
if ($result) {
    $message->moveToMailBox('autoreplied');
}

Sure enough, it moves the message into the newly created folder:

Message we autoreplied to has been marked as read and moved into a separate folder

Conclusion

In this tutorial, we touched on reading our inbox for unread messages, running them through some rules, and performing some actions on them based on what the rules said. We now have a way to filter our inboxes programmatically!

By now, you’ve probably noticed several problems with this implementation, some of which may be:

  • there’s no way to dynamically define rules or whitelists. You must change the code. A database would come in handy, and a log-in system with a CRUD interface of sorts.
  • there’s no caching whatsoever, so every call is quite slow.
  • as we come up with more rules and conditions, the isRecruiterSpam function will become more and more complex, finally reaching the point of total chaos. This is something we need to fix if we want a flexible, scalable, dynamic system – especially if we want to identify more types of emails than just recruiter spam!
  • adding more functionality to this app is tedious at best – we’re breaking every SOLID principle with this code, and need to refactor. Ideally, we want the app to be usable by multiple users at once. Not only that, but we also want to share some training data between users, for better spam protection.

We’ll deal with all of these problems in subsequent articles – now that we know everything we’ll need and our proofs of concept work, we can clean up the code and turn it all into something worth the effort.

In a followup, we’ll turn our spaghetti script experiment into a multi-user app by properly designing and structuring it. We’ll also power it up with a cronjob, and start building a proper rule engine. Stay tuned!