Monitoring File Integrity

Martin Psinas

Ask yourself how you might address the following circumstances when managing a website:

  • A file is unintentionally added, modified or deleted
  • A file is maliciously added, modified or deleted
  • A file becomes corrupted

More importantly, would you even know if one of these circumstances occurred? If your answer is no, then keep reading. In this guide I will demonstrate how to create a profile of your file structure which can be used to monitor the integrity of your files.

The best way to determine whether or not a file has been altered is to hash its contents. PHP has several hashing functions available, but for this project I’ve decided to use the hash_file() function. It provides a wide range of different hashing algorithms which will make my code easy to modify at a later time should I decide to make a change.

Hashing is used in a wide variety of applications, everything from password protection to DNA sequencing. A hashing algorithm works by transforming data of any size into a fixed-size, repeatable cryptographic string. These algorithms are designed so that even a slight modification to the data should produce a very different result. When two or more different pieces of data produce the same result string, it’s referred to as a “collision.” The strength of each hashing algorithm can be measured by both its speed and the probability of collisions.
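To make the “slight modification” point concrete, here is a minimal sketch using PHP’s hash() function, which hashes a string rather than a file. The two sample strings are made up for illustration.

```php
<?php
// two inputs that differ by a single character (illustrative values)
$original = "The quick brown fox";
$modified = "The quick brown fix";

$hashA = hash("sha1", $original);
$hashB = hash("sha1", $modified);

// both digests have the same fixed length, regardless of the input
echo strlen($hashA) . " " . strlen($hashB) . PHP_EOL;

// hashing the same data again always gives the same digest
var_dump($hashA === hash("sha1", $original));

// but the one-character change produces a different digest
var_dump($hashA === $hashB);

// count how many of the 40 hex positions differ between the two
$differing = 0;
for ($i = 0; $i < 40; $i++) {
    if ($hashA[$i] !== $hashB[$i]) {
        $differing++;
    }
}
echo $differing . " of 40 characters differ" . PHP_EOL;
```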

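In my examples I will be using the SHA-1 algorithm because it’s fast, the probability of accidental collisions is low and it has been widely used and well tested. Be aware, though, that practical collision attacks against SHA-1 have been demonstrated, so if you need to guard against deliberate tampering, a stronger algorithm such as SHA-256 is a safer choice; hash_file() accepts it as "sha256", and its 64-character output needs a wider column than a 40-character SHA-1 hash. Of course, you’re welcome to research other algorithms and use any one you like.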

Once the file’s hash has been obtained, it can be stored for later comparison. If hashing the file later doesn’t return the same hash string as before then we know the file has somehow been changed.
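The store-then-compare idea can be sketched in a few lines. This example uses a throwaway temporary file in place of a real site file, and a plain variable in place of the database, both purely for illustration.

```php
<?php
// create a throwaway file to stand in for a real site file
$file = tempnam(sys_get_temp_dir(), "fim");
file_put_contents($file, "hello world");

// hash taken while the file is known to be good; this is what we would store
$stored = hash_file("sha1", $file);

// re-hashing the untouched file gives the same value
$unchangedBefore = (hash_file("sha1", $file) === $stored);

// simulate a modification, then hash again
file_put_contents($file, "hello world!");
$changedAfter = (hash_file("sha1", $file) !== $stored);

var_dump($unchangedBefore); // bool(true)
var_dump($changedAfter);    // bool(true)

// clean up the temporary file
unlink($file);
```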

Database

To begin, we first need to lay out a basic table to store the hashes of our files. I will be using the following schema:

CREATE TABLE integrity_hashes (
    file_path VARCHAR(200) NOT NULL,
    file_hash CHAR(40) NOT NULL,
    PRIMARY KEY (file_path)
);

file_path stores the location of a file on the server and serves as our primary key, since two files cannot occupy the same location in the file system and the value will therefore always be unique. I have specified its maximum length as 200 characters, which should allow for some lengthy file paths. file_hash stores the hash value of a file, which for SHA-1 is a 40-character hexadecimal string.

Collecting Files

The next step is to build a profile of the file structure. We define the path where we want to start collecting files and recursively iterate through each directory until we’ve covered the entire branch of the file system, optionally excluding certain directories or file extensions. As we traverse the file tree we collect the hash of each file; these hashes are then either stored in the database or compared against it.

PHP offers several ways to navigate the file tree; for simplicity, I’ll be using the RecursiveDirectoryIterator class.

<?php
define("PATH", "/var/www/");
$files = array();

// extensions to fetch, an empty array will return all extensions
$ext = array("php");

// directories to ignore, an empty array will check all directories
$skip = array("logs", "logs/traffic");

// build profile
$dir = new RecursiveDirectoryIterator(PATH);
$iter = new RecursiveIteratorIterator($dir);
while ($iter->valid()) {
    // skip unwanted directories
    if (!$iter->isDot() && !in_array($iter->getSubPath(), $skip)) {
        // get specific file extensions
        if (!empty($ext)) {
            // PHP 5.3.6+: if (in_array($iter->getExtension(), $ext)) {
            if (in_array(pathinfo($iter->key(), PATHINFO_EXTENSION), $ext)) {
                $files[$iter->key()] = hash_file("sha1", $iter->key());
            }
        }
        else {
            // ignore file extensions
            $files[$iter->key()] = hash_file("sha1", $iter->key());
        }
    }
    $iter->next();
}

Notice how I referenced the logs folder twice in the $skip array, once for logs itself and once for its sub-directory logs/traffic. The check compares each file’s sub-path against the list exactly, so ignoring a directory does not automatically ignore its sub-directories; each one has to be listed, which can be useful or annoying depending on your needs.
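If you would rather have a skipped directory take all of its sub-directories with it, a prefix match is one option. The following is a minimal sketch; is_skipped() is a hypothetical helper name, not part of PHP or of the code above.

```php
<?php
// Hypothetical helper: true if $subPath is one of the $skip directories
// or lies anywhere beneath one of them.
function is_skipped($subPath, array $skip)
{
    // normalise Windows separators so "logs\traffic" matches "logs/traffic"
    $subPath = str_replace("\\", "/", $subPath);
    foreach ($skip as $dir) {
        $dir = rtrim($dir, "/");
        // exact match, or the sub-path starts with "<dir>/"
        if ($subPath === $dir || strpos($subPath, $dir . "/") === 0) {
            return true;
        }
    }
    return false;
}

$skip = array("logs");
var_dump(is_skipped("logs", $skip));         // bool(true)
var_dump(is_skipped("logs/traffic", $skip)); // bool(true)
var_dump(is_skipped("logstash", $skip));     // bool(false)
var_dump(is_skipped("", $skip));             // bool(false)
```

In the loop, !in_array($iter->getSubPath(), $skip) could then be swapped for !is_skipped($iter->getSubPath(), $skip). Matching on "logs/" rather than "logs" keeps an unrelated directory such as logstash from being skipped by accident.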

The RecursiveDirectoryIterator class gives us access to several methods:

  • valid() checks whether the iterator’s current position is valid, that is, whether there is still an entry left to process
  • isDot() determines if the current entry is “.” or “..”
  • getSubPath() returns the path of the directory containing the current entry, relative to the starting directory
  • key() returns the full path and file name
  • next() advances the iterator to the next entry

There are several more methods available, but the ones listed above are all we need for the task at hand. One worth mentioning is getExtension(), added in PHP 5.3.6, which returns the file extension. If your version of PHP supports it, you can use it to filter out unwanted entries instead of pathinfo() as I did.
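As a quick illustration of the two approaches, the sketch below compares getExtension() with pathinfo() on sample paths. The paths are made up, and the lower-casing step is my own suggestion rather than something the profile code does.

```php
<?php
// SplFileInfo only inspects the path string here, so the file need not exist
$info = new SplFileInfo("/var/www/sub/hello.php");

$viaMethod   = $info->getExtension();
$viaPathinfo = pathinfo("/var/www/sub/hello.php", PATHINFO_EXTENSION);

var_dump($viaMethod);   // string(3) "php"
var_dump($viaPathinfo); // string(3) "php"

// both approaches preserve case, so "PHP" would not match array("php")
// unless the extension is lower-cased first
$upper = new SplFileInfo("/var/www/INDEX.PHP");
$matches = in_array(strtolower($upper->getExtension()), array("php"));
var_dump($matches);     // bool(true)
```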

When executed, the code should populate the $files array with results similar to the following:

Array
(
    [/var/www/test.php] => b6b7c28e513dac784925665b54088045cf9cbcd3
    [/var/www/sub/hello.php] => a5d5b61aa8a61b7d9d765e1daf971a9a578f1cfa
    [/var/www/sub/world.php] => da39a3ee5e6b4b0d3255bfef95601890afd80709
)

Once we have the profile built, updating the database is easy peasy lemon squeezy.

<?php
$db = new PDO("mysql:host=" . DB_HOST . ";dbname=" . DB_NAME,
    DB_USER, DB_PASSWORD);

// clear old records
$db->query("TRUNCATE integrity_hashes");

// insert updated records
$sql = "INSERT INTO integrity_hashes (file_path, file_hash) VALUES (:path, :hash)";
$sth = $db->prepare($sql);
$sth->bindParam(":path", $path);
$sth->bindParam(":hash", $hash);
foreach ($files as $path => $hash) {
    $sth->execute();
}
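One thing to keep in mind with the approach above is that TRUNCATE in MySQL cannot be rolled back, so a failure part-way through the inserts would leave the table only partially filled. A possible alternative, sketched here under the assumption that the table uses a transactional engine such as InnoDB, is to use DELETE inside a transaction. save_profile() is a hypothetical helper name.

```php
<?php
// Hypothetical helper: replace the stored profile in a single transaction.
// Assumes $db is a connected PDO handle and that the table uses a
// transactional storage engine (for example InnoDB).
function save_profile(PDO $db, array $files)
{
    // make PDO throw exceptions so that failures reach the catch block
    $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    $db->beginTransaction();
    try {
        // DELETE, unlike TRUNCATE, can be rolled back
        $db->exec("DELETE FROM integrity_hashes");

        $sth = $db->prepare(
            "INSERT INTO integrity_hashes (file_path, file_hash) VALUES (:path, :hash)"
        );
        foreach ($files as $path => $hash) {
            $sth->execute(array(":path" => $path, ":hash" => $hash));
        }
        $db->commit();
    } catch (Exception $e) {
        // restore the previous profile and let the caller handle the error
        $db->rollBack();
        throw $e;
    }
    // number of records written
    return count($files);
}
```

With this in place, the update step becomes a single call such as save_profile($db, $files), and the old profile survives intact if anything goes wrong.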

Checking For Discrepancies

You now know how to build a fresh profile of the directory structure and how to update records in the database. The next step is to put it together into some sort of real-world application, like a cron job with e-mail notification, an administrative interface or whatever else you prefer.

If you just want to gather a list of files that have changed and you don’t care how they changed, then the simplest approach is to pull the data from the database into an array similar to $files and then use PHP’s array_diff_assoc() function to weed out the riffraff.

<?php
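// non-specific check for discrepancies
// start with an empty result so $diffs is always defined
$diffs = array();
if (!empty($files)) {
    $result = $db->query("SELECT * FROM integrity_hashes")->fetchAll();
    if (!empty($result)) {
        // rebuild the stored profile as path => hash, like $files
        $tmp = array();
        foreach ($result as $value) {
            $tmp[$value["file_path"]] = $value["file_hash"];
        }
        $diffs = array_diff_assoc($files, $tmp);
        unset($tmp);
    }
}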

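In this example, $diffs will be populated with any files that have been added or altered, or it will be an empty array if none were found. Because the comparison runs from $files to the stored records, a file that has been deleted will not show up in the result. Unlike array_diff(), array_diff_assoc() includes the keys in the comparison, which is important to us when two files have identical contents (two empty files, for example) and therefore the same hash value.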
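The difference between the two functions is easy to see with a small, made-up profile in which a second empty file has appeared. The paths here are illustrative only.

```php
<?php
// da39a3ee... is the SHA-1 of empty input, so every empty file shares it
$empty = "da39a3ee5e6b4b0d3255bfef95601890afd80709";

// stored profile: one empty file
$stored = array("/var/www/a.php" => $empty);

// current profile: a second empty file has appeared
$current = array(
    "/var/www/a.php" => $empty,
    "/var/www/b.php" => $empty,
);

// array_diff() compares values only, so the new file goes unnoticed
$byValue = array_diff($current, $stored);
print_r($byValue);

// array_diff_assoc() compares key and value, so b.php is reported
$byKeyAndValue = array_diff_assoc($current, $stored);
print_r($byKeyAndValue);
```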

If you want to take things a step further, you can throw in some simple logic to determine exactly how a file has been affected, whether it has been deleted, altered or added.

<?php
// specific check for discrepancies
if (!empty($files)) {
    $result = $db->query("SELECT * FROM integrity_hashes")->fetchAll();
    if (!empty($result)) {
        $diffs = array();
        $tmp = array();
        foreach ($result as $value) {
            if (!array_key_exists($value["file_path"], $files)) {
                $diffs["del"][$value["file_path"]] = $value["file_hash"];
                $tmp[$value["file_path"]] = $value["file_hash"];
            }
            else {
                // strict comparison, so two numeric-looking hashes are never compared as numbers
                if ($files[$value["file_path"]] !== $value["file_hash"]) {
                    $diffs["alt"][$value["file_path"]] = $files[$value["file_path"]];
                    $tmp[$value["file_path"]] = $files[$value["file_path"]];
                }
                else {
                    // unchanged
                    $tmp[$value["file_path"]] = $value["file_hash"];
                }
            }
        }
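        // any path in $files that the database has no record of is new;
        // comparing counts alone would miss an add that coincides with a delete
        $added = array_diff_key($files, $tmp);
        if (!empty($added)) {
            $diffs["add"] = $added;
        }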
        unset($tmp);
    }
}

As we loop through the results from the database, we make several checks. First, array_key_exists() is used to check if the file path from our database is present in $files; if not, the file must have been deleted. Second, if the file exists but the hash values do not match, the file must have been altered; otherwise it is unchanged. We record each path we check in a temporary array named $tmp, and finally, any entries in $files that never made it into $tmp are files the database has no record of, so we know they have been added.

When completed, $diffs will either be an empty array or it will contain any discrepancies found in the form of a multi-dimensional array which might appear as follows:

Array
(
    [alt] => Array
        (
            [/var/www/test.php] => eae71874e2277a5bc77176db14ac14bf28465ec3
            [/var/www/sub/hello.php] => a5d5b61aa8a61b7d9d765e1daf971a9a578f1cfa
        )

    [add] => Array
        (
            [/var/www/sub/world.php] => da39a3ee5e6b4b0d3255bfef95601890afd80709
        )

)
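The same three-way classification can also be expressed with PHP’s array set functions instead of an explicit loop. This is a minimal sketch of that alternative; compare_profiles() is a hypothetical helper name, and the short hash values in the usage lines are placeholders rather than real SHA-1 strings.

```php
<?php
// Hypothetical helper: classify the differences between the stored profile
// ($stored) and a freshly built one ($current), both path => hash arrays.
function compare_profiles(array $stored, array $current)
{
    $diffs = array();

    // paths only in the database were deleted
    $deleted = array_diff_key($stored, $current);
    // paths only on disk were added
    $added = array_diff_key($current, $stored);
    // paths in both, whose hash no longer matches, were altered
    $altered = array_diff_assoc(array_intersect_key($current, $stored), $stored);

    if (!empty($deleted)) {
        $diffs["del"] = $deleted;
    }
    if (!empty($altered)) {
        $diffs["alt"] = $altered;
    }
    if (!empty($added)) {
        $diffs["add"] = $added;
    }
    return $diffs;
}

// placeholder profiles: c.php was deleted, b.php altered, d.php added
$stored  = array("/a.php" => "1111", "/b.php" => "2222", "/c.php" => "3333");
$current = array("/a.php" => "1111", "/b.php" => "9999", "/d.php" => "4444");
print_r(compare_profiles($stored, $current));
```

As in the loop version, the "alt" entries carry the new hash from the current profile, while the "del" entries carry the hash that was stored.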

To display the results in a more user-friendly format, for an administrative interface or the like, you could for example loop through the results and output them in a bulleted list.

<?php
// display discrepancies
if (!empty($diffs)) {
    echo "<p>The following discrepancies were found:</p>";
    echo "<ul>";
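    foreach ($diffs as $status => $affected) {
        if (is_array($affected) && !empty($affected)) {
            // nest the file list inside its status item so the markup stays valid
            echo "<li>" . htmlspecialchars($status);
            echo "<ol>";
            foreach ($affected as $path => $hash) {
                // escape paths, since file names may contain HTML characters
                echo "<li>" . htmlspecialchars($path) . "</li>";
            }
            echo "</ol>";
            echo "</li>";
        }
    }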
    echo "</ul>";
}
else {
    echo "<p>File structure is intact.</p>";
}

At this point you can provide a link which triggers an action to update the database with the new file structure, in which case you might opt to store $files in a session variable. Or, if you don’t approve of the discrepancies, you can address them however you see fit.
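For the cron job route mentioned earlier, the same $diffs array could be turned into a plain-text message instead of HTML. The sketch below assumes the del/alt/add structure shown above; build_report() is a hypothetical helper name, and the e-mail address in the final comment is a placeholder.

```php
<?php
// Hypothetical helper: turn $diffs into a plain-text report for an e-mail body.
function build_report(array $diffs)
{
    if (empty($diffs)) {
        return "File structure is intact.";
    }
    // human-readable labels for the status keys used earlier
    $labels = array("del" => "Deleted", "alt" => "Altered", "add" => "Added");
    $lines = array("The following discrepancies were found:");
    foreach ($diffs as $status => $affected) {
        $label = isset($labels[$status]) ? $labels[$status] : $status;
        foreach (array_keys($affected) as $path) {
            $lines[] = $label . ": " . $path;
        }
    }
    return implode("\n", $lines);
}

$diffs = array(
    "add" => array("/var/www/sub/world.php" => "da39a3ee5e6b4b0d3255bfef95601890afd80709"),
);
echo build_report($diffs) . PHP_EOL;

// in a cron job the report might be sent with PHP's mail() function, which
// needs a working mail transport on the server:
// mail("admin@example.com", "File integrity report", build_report($diffs));
```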

Summary

Hopefully this guide has given you a better understanding of monitoring file integrity. Having something like this in place on your website is an invaluable security measure and you can be comfortable knowing that your files remain exactly as you intended. Of course, don’t forget to keep regular backups. You know… just in case.
