Monitoring File Integrity

Ask yourself how you might address the following circumstances when managing a website:

  • A file is unintentionally added, modified or deleted
  • A file is maliciously added, modified or deleted
  • A file becomes corrupted

More importantly, would you even know if one of these circumstances occurred? If your answer is no, then keep reading. In this guide I will demonstrate how to create a profile of your file structure which can be used to monitor the integrity of your files.

The best way to determine whether or not a file has been altered is to hash its contents. PHP has several hashing functions available, but for this project I’ve decided to use the hash_file() function. It provides a wide range of different hashing algorithms which will make my code easy to modify at a later time should I decide to make a change.

Hashing is used in a wide variety of applications, everything from password protection to DNA sequencing. A hashing algorithm works by transforming data of any size into a fixed-size, repeatable string. Such algorithms are designed so that even a slight modification to the data produces a very different result. When two or more different pieces of data produce the same result string, it’s referred to as a “collision.” The strength of a hashing algorithm can be judged by both its speed and the probability of collisions.

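In my examples I will be using the SHA-1 algorithm because it’s fast, the probability of accidental collisions is low and it has been widely used and well tested. Be aware that deliberate SHA-1 collisions have been demonstrated in practice, so if you need to resist a determined attacker, a stronger algorithm such as SHA-256 is the safer choice; hash_file() accepts "sha256" as well. Of course, you’re welcome to research other algorithms and use any one you like.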

Once the file’s hash has been obtained, it can be stored for later comparison. If hashing the file later doesn’t return the same hash string as before then we know the file has somehow been changed.
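To make the idea concrete, here is a minimal sketch of hashing a file before and after a change. The temporary file and its contents are stand-ins I made up for illustration; only hash_file() itself comes from the approach described above.

```php
<?php
// Sketch: hash a file, change one character, and hash it again.
// The temporary file is a stand-in for a real file on your server.
$demoFile = tempnam(sys_get_temp_dir(), "fim");

file_put_contents($demoFile, "Hello, world!");
$before = hash_file("sha1", $demoFile);

// a one-character change produces a different hash
file_put_contents($demoFile, "Hello, world?");
$after = hash_file("sha1", $demoFile);

// restoring the original contents restores the original hash
file_put_contents($demoFile, "Hello, world!");
$restored = hash_file("sha1", $demoFile);

unlink($demoFile);

var_dump($before === $after);    // bool(false)
var_dump($before === $restored); // bool(true)
```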

Database

To begin, we first need to lay out a basic table to store the hashes of our files. I will be using the following schema:

CREATE TABLE integrity_hashes (
    file_path VARCHAR(200) NOT NULL,
    file_hash CHAR(40) NOT NULL,
    PRIMARY KEY (file_path)
);

file_path stores the location of a file on the server. Because two files cannot occupy the same location in the file system, the value is always unique, which makes it a natural primary key. I have specified its maximum length as 200 characters, which should allow for some lengthy file paths. file_hash stores the hash value of a file, which for SHA-1 is a 40-character hexadecimal string.
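The width of file_hash is tied to the algorithm, so if you switch algorithms the column has to change with it. As a rough sketch (the list of algorithms is just my selection for illustration), you can check the hexadecimal length each one produces:

```php
<?php
// Sketch: the hexadecimal length of a hash depends on the algorithm,
// so the width of file_hash has to match whatever you pass to hash_file().
$algorithms = array("md5", "sha1", "sha256");
$lengths = array();
foreach ($algorithms as $algo) {
    // hash() works on strings; hash_file() returns the same format for files
    $lengths[$algo] = strlen(hash($algo, ""));
}
print_r($lengths);
// md5 => 32, sha1 => 40, sha256 => 64
```

With SHA-256, for instance, the column would need to be CHAR(64) rather than CHAR(40).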

Collecting Files

The next step is to build a profile of the file structure. We define the path where we want to start collecting files and recursively iterate through each directory until we’ve covered the entire branch of the file system, optionally excluding certain directories or file extensions. As we traverse the file tree we collect the hashes, which are then either stored in the database or used for comparison.

PHP offers several ways to navigate the file tree; for simplicity, I’ll be using the RecursiveDirectoryIterator class.

<?php
define("PATH", "/var/www/");
$files = array();

// extensions to fetch, an empty array will return all extensions
$ext = array("php");

// directories to ignore, an empty array will check all directories
$skip = array("logs", "logs/traffic");

// build profile
$dir = new RecursiveDirectoryIterator(PATH);
$iter = new RecursiveIteratorIterator($dir);
while ($iter->valid()) {
    // skip unwanted directories
    if (!$iter->isDot() && !in_array($iter->getSubPath(), $skip)) {
        // get specific file extensions
        if (!empty($ext)) {
            // PHP 5.3.6+: if (in_array($iter->getExtension(), $ext)) {
            if (in_array(pathinfo($iter->key(), PATHINFO_EXTENSION), $ext)) {
                $files[$iter->key()] = hash_file("sha1", $iter->key());
            }
        }
        else {
            // ignore file extensions
            $files[$iter->key()] = hash_file("sha1", $iter->key());
        }
    }
    $iter->next();
}

Notice how I referenced the logs folder twice in the $skip array. The check compares the current sub-path against each entry exactly, so ignoring a directory does not automatically ignore its sub-directories; each one has to be listed. This can be useful or annoying depending on your needs.
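If you would rather have a single entry cover everything beneath it, one option is a small prefix check. The isSkipped() helper below is a hypothetical name of my own, not part of the code above, and it is only a sketch of the idea.

```php
<?php
// Hypothetical helper (not part of the original code): treat a directory
// as skipped when it is listed in $skip or lies anywhere beneath a listed one.
function isSkipped($subPath, array $skip)
{
    // normalise Windows separators so "logs\traffic" matches "logs"
    $subPath = str_replace("\\", "/", $subPath);
    foreach ($skip as $dir) {
        $dir = rtrim(str_replace("\\", "/", $dir), "/");
        if ($dir === "") {
            continue;
        }
        if ($subPath === $dir || strpos($subPath, $dir . "/") === 0) {
            return true;
        }
    }
    return false;
}

var_dump(isSkipped("logs", array("logs")));         // bool(true)
var_dump(isSkipped("logs/traffic", array("logs"))); // bool(true)
var_dump(isSkipped("logstash", array("logs")));     // bool(false)
```

In the loop, the call would take the place of the in_array() test, for example !isSkipped($iter->getSubPath(), $skip).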

The RecursiveDirectoryIterator class gives us access to several methods:

  • valid() checks whether the iterator’s current position is valid, that is, whether there are entries left to process
  • isDot() determines if the current entry is “.” or “..”
  • getSubPath() returns the current directory’s path relative to the starting directory
  • key() returns the full path and file name
  • next() advances the iterator to the next entry

More methods are available, but the ones listed above are all we need for the task at hand. One worth mentioning is getExtension(), added in PHP 5.3.6, which returns the file extension. If your version of PHP supports it, you can use it to filter out unwanted entries instead of calling pathinfo() as I did.

When executed, the code should populate the $files array with results similar to the following:

Array
(
    [/var/www/test.php] => b6b7c28e513dac784925665b54088045cf9cbcd3
    [/var/www/sub/hello.php] => a5d5b61aa8a61b7d9d765e1daf971a9a578f1cfa
    [/var/www/sub/world.php] => da39a3ee5e6b4b0d3255bfef95601890afd80709
)
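The same profile can also be built with a foreach loop. The sketch below is one possible variant, not the only way to do it: collectHashes() is a hypothetical function name, and it assumes PHP 5.3.6 or later for getExtension().

```php
<?php
// Sketch of an alternative collector (collectHashes() is a hypothetical name).
// SKIP_DOTS drops "." and ".." so no isDot() check is needed.
function collectHashes($root, array $ext = array(), array $skip = array())
{
    $files = array();
    $dir = new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS);
    $iter = new RecursiveIteratorIterator($dir);
    foreach ($iter as $path => $info) {
        // only regular files can be hashed
        if (!$info->isFile()) {
            continue;
        }
        // same exact-match directory test as the original loop
        if (in_array($iter->getSubPath(), $skip)) {
            continue;
        }
        // $info is an SplFileInfo; getExtension() needs PHP 5.3.6 or later
        if (!empty($ext) && !in_array($info->getExtension(), $ext)) {
            continue;
        }
        $files[$path] = hash_file("sha1", $path);
    }
    return $files;
}

// Example call, mirroring the settings used earlier:
// $files = collectHashes("/var/www/", array("php"), array("logs", "logs/traffic"));
```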

Once we have the profile built, updating the database is easy peasy lemon squeezy.

<?php
$db = new PDO("mysql:host=" . DB_HOST . ";dbname=" . DB_NAME,
    DB_USER, DB_PASSWORD);

// clear old records
$db->query("TRUNCATE integrity_hashes");

// insert updated records
$sql = "INSERT INTO integrity_hashes (file_path, file_hash) VALUES (:path, :hash)";
$sth = $db->prepare($sql);
$sth->bindParam(":path", $path);
$sth->bindParam(":hash", $hash);
foreach ($files as $path => $hash) {
    $sth->execute();
}
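One thing to keep in mind is that in MySQL a TRUNCATE cannot be rolled back, so if the script dies halfway through the inserts the table is left with a partial profile. A possible alternative, sketched below, is to use DELETE inside a transaction. The saveProfile() name is hypothetical, and the sketch assumes a transactional storage engine such as InnoDB.

```php
<?php
// Sketch: replace the stored profile atomically (saveProfile() is a
// hypothetical name). Assumes a transactional storage engine such as InnoDB.
function saveProfile(PDO $db, array $files)
{
    $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    $db->beginTransaction();
    try {
        // DELETE, unlike TRUNCATE, can be rolled back
        $db->exec("DELETE FROM integrity_hashes");
        $sth = $db->prepare(
            "INSERT INTO integrity_hashes (file_path, file_hash) VALUES (:path, :hash)"
        );
        foreach ($files as $path => $hash) {
            $sth->execute(array(":path" => $path, ":hash" => $hash));
        }
        $db->commit();
    } catch (Exception $e) {
        // leave the previous profile untouched if anything fails
        $db->rollBack();
        throw $e;
    }
}
```

Calling saveProfile($db, $files) would then take the place of the TRUNCATE and insert loop.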

Checking For Discrepancies

You now know how to build a fresh profile of the directory structure and how to update records in the database. The next step is to put it together into some sort of real world application like a cron job with e-mail notification, administrative interface or whatever else you prefer.

If you just want to gather a list of files that have changed and you don’t care how they changed, then the simplest approach is to pull the data from the database into an array similar to $files and then use PHP’s array_diff_assoc() function to weed out the riffraff.

<?php
// non-specific check for discrepancies
$diffs = array();
if (!empty($files)) {
    $result = $db->query("SELECT * FROM integrity_hashes")->fetchAll();
    if (!empty($result)) {
        $tmp = array();
        foreach ($result as $value) {
            $tmp[$value["file_path"]] = $value["file_hash"];
        }
        // added or altered files, plus stored paths that no longer exist on disk
        $diffs = array_diff_assoc($files, $tmp) + array_diff_assoc($tmp, $files);
        unset($tmp);
    }
}

In this example, $diffs will be populated with any discrepancies found, or it will be an empty array if the file structure is intact. Unlike array_diff(), array_diff_assoc() compares keys as well as values, which is important to us whenever two files share the same hash value, such as two empty files.
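The difference between the two functions is easy to see with a pair of empty files. The paths below are made up for illustration; this is only a sketch of the comparison, not part of the monitoring code.

```php
<?php
// Sketch: why keys matter. Two empty files share the SHA-1 of empty input.
$emptyHash = sha1(""); // da39a3ee5e6b4b0d3255bfef95601890afd80709

$stored  = array("/var/www/a.php" => $emptyHash);
$current = array(
    "/var/www/a.php" => $emptyHash,
    "/var/www/b.php" => $emptyHash, // newly added empty file
);

// array_diff() compares values only, so the new file goes unnoticed
$byValue = array_diff($current, $stored);

// array_diff_assoc() compares key and value pairs, so it is reported
$byKeyAndValue = array_diff_assoc($current, $stored);

var_dump(count($byValue));           // int(0)
print_r(array_keys($byKeyAndValue)); // lists /var/www/b.php
```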

If you want to take things a step further, you can throw in some simple logic to determine exactly how a file has been affected, whether it has been deleted, altered or added.

<?php
// specific check for discrepancies
$diffs = array();
if (!empty($files)) {
    $result = $db->query("SELECT * FROM integrity_hashes")->fetchAll();
    if (!empty($result)) {
        $tmp = array();
        foreach ($result as $value) {
            // remember every path the database knows about
            $tmp[$value["file_path"]] = $value["file_hash"];
            if (!array_key_exists($value["file_path"], $files)) {
                // in the database but no longer on disk
                $diffs["del"][$value["file_path"]] = $value["file_hash"];
            }
            elseif ($files[$value["file_path"]] !== $value["file_hash"]) {
                // on disk, but with a different hash
                $diffs["alt"][$value["file_path"]] = $files[$value["file_path"]];
            }
        }
        // any path on disk that the database has never seen was added
        $added = array_diff_key($files, $tmp);
        if (!empty($added)) {
            $diffs["add"] = $added;
        }
        unset($tmp);
    }
}

As we loop through the results from the database, we make several checks. First, array_key_exists() is used to check if the file path from our database is present in $files; if not, the file must have been deleted. Second, if the file exists but the hash values do not match, the file must have been altered; otherwise it is unchanged. Every path from the database is recorded in a temporary array named $tmp, and finally, any path in $files that is missing from $tmp must belong to a file that has been added. Comparing keys rather than counting entries matters here: if one file is deleted and another is added, the counts are equal even though the file structure has changed.

When completed, $diffs will either be an empty array or it will contain any discrepancies found in the form of a multi-dimensional array which might appear as follows:

Array
(
    [alt] => Array
        (
            [/var/www/test.php] => eae71874e2277a5bc77176db14ac14bf28465ec3
            [/var/www/sub/hello.php] => a5d5b61aa8a61b7d9d765e1daf971a9a578f1cfa
        )

    [add] => Array
        (
            [/var/www/sub/world.php] => da39a3ee5e6b4b0d3255bfef95601890afd80709
        )

)

To display the results in a more user-friendly format, for an administrative interface or the like, you could for example loop through the results and output them in a bulleted list.

<?php
// display discrepancies
if (!empty($diffs)) {
    echo "<p>The following discrepancies were found:</p>";
    echo "<ul>";
    foreach ($diffs as $status => $affected) {
        if (is_array($affected) && !empty($affected)) {
            // nest the file list inside the status item to keep the HTML valid
            echo "<li>" . htmlspecialchars($status);
            echo "<ol>";
            foreach ($affected as $path => $hash) {
                // escape paths, since file names may contain HTML characters
                echo "<li>" . htmlspecialchars($path) . "</li>";
            }
            echo "</ol>";
            echo "</li>";
        }
    }
    echo "</ul>";
}
else {
    echo "<p>File structure is intact.</p>";
}

At this point you have two options. If the changes are expected, you can provide a link which triggers an action to update the database with the new file structure, in which case you might opt to store $files in a session variable. If you don’t approve of the discrepancies, you can address them however you see fit.
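For the cron job with e-mail notification mentioned earlier, a plain-text report is usually more practical than HTML. The following is only a sketch: buildReport(), its labels and the e-mail address are hypothetical names I chose for illustration, and it expects $diffs in the multi-dimensional form shown above.

```php
<?php
// Sketch: turn the multi-dimensional $diffs array into a plain-text report.
// buildReport() and the labels are hypothetical names for illustration.
function buildReport(array $diffs)
{
    $labels = array("add" => "Added", "alt" => "Altered", "del" => "Deleted");
    $lines = array();
    foreach ($labels as $status => $label) {
        if (empty($diffs[$status])) {
            continue;
        }
        $lines[] = $label . ":";
        foreach ($diffs[$status] as $path => $hash) {
            $lines[] = "  " . $path;
        }
    }
    return implode("\n", $lines);
}

$report = buildReport(array(
    "alt" => array("/var/www/test.php" => "eae71874e2277a5bc77176db14ac14bf28465ec3"),
    "add" => array("/var/www/sub/world.php" => "da39a3ee5e6b4b0d3255bfef95601890afd80709"),
));
echo $report . "\n";

// In a cron script you might then send it, for example (placeholder address):
// if ($report !== "") { mail("admin@example.com", "File integrity report", $report); }
```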

Summary

Hopefully this guide has given you a better understanding of monitoring file integrity. Having something like this in place on your website is an invaluable security measure and you can be comfortable knowing that your files remain exactly as you intended. Of course, don’t forget to keep regular backups. You know… just in case.

Image via Semisatch / Shutterstock

  • http://www.technewsdaddy.com john

    Nicely done!
    If you need this kind of functionality in a scalable way (i.e. for big setups), it’s best to use FAM http://en.wikipedia.org/wiki/File_Alteration_Monitor

  • August Trometer

    A while back, I created an open source class to take care of all the heavy lifting for file watching. You can monitor individual files, or entire folders.

    You can check it out here: https://github.com/august/HashWatch

  • http://mywebs.hubpages.com/ Anthony Goodley

    While you cover several PHP functions I’m unfamiliar with, you explain everything in sufficient detail I’m able to get the gist of it and follow the logic flow. Great well written article. I plan to add this functionality to a PHP script I’m developing.

  • http://www.d-mueller.de David Müller

    If you have to do some sophisticated stuff and need to prove that even the admin itself has not made any manipulations, you might want to take a look at the concept of trusted timestamps: http://www.d-mueller.de/blog/dealing-with-trusted-timestamps-in-php-rfc-3161/

  • Mastodont

    Very detailed article, thank you. The only thing I try to comment is database as storage. The array with results can be safely serialized and saved as file.

  • Ingus

    Umm, what’s wrong with using some sort of version control system like CVS/SVN/GIT?

  • http://www.ajohnstone.com Andrew Johnstone

    Whilst it may be interesting to do this with PHP, the approach is not only inefficient, it’s simply the wrong tool for the job.

    1. Firstly there is a PHP Pecl extension called FAM. (http://php.net/manual/en/ref.fam.php).
    2. Use inotify, FAM most likely uses inotify. “Inotify is a Linux kernel feature that monitors file systems and immediately alerts an attentive application to relevant events, such as a delete, read, write, and even an unmount operation.”. (Windows equivalent can be found here http://msdn.microsoft.com/en-us/library/chzww271(v=vs.80).aspx)
    3. Version control systems. They perform exactly the same function as described above and provide a history and diffs of each modification.
    4. Use tripwire if you are concerned about the integrity and security of your site. “Open Source Tripwire® software is a security and data integrity tool useful for monitoring and alerting on specific file change(s) on a range of systems.”. Simply change the policy file and you can monitor changes. RKhunter also provides a similar feature with simple checks on hashes.

    • http://www.psinas.com Martin Psinas

      Andrew, whilst I appreciate your pointing out alternative solutions which is extremely helpful, I believe your “inefficient” and “wrong tool” remarks are both highly circumstantial and misleading.

    • Paul

      Andrew,

      As far as I know none of those options you presented work on shared hosting, but Martin’s will. I’d recommend one of those you mentioned for dedicated or VPS, though.

  • http://homeandgarden13.info Lesley Hirt

    Really enjoyed this blog post. Really looking forward to reading more. Really cool.

  • Dan

    This is great – just what I need for my purposes. I’m site admin for a yet-to-be profitable car club on a very tight budget. Our choice of webhost is naturally on the cheap side, which has resulted in two problems:
    1. We have been frequently hacked over the past couple of months
    2. I only have basic cpanel and ftp access
    I’ve been thinking of writing a php+mysql solution, and thought I’d check Google to make sure I wasn’t reinventing the wheel. Thanks to Martin, I have a huge head start.

  • Tom

    I found that your comparison fails if you delete a file and add a file, because then the file counts are equal. I’m no programmer, but this modification of your comparison algorithm seems to work for me. I changed your $tmp array name to $control.

    if(!empty( $result )) {
        $diffs = array();
        $control = array();
        foreach( $result as $value ) {
            // recreate control array
            $control[$value["file_path"]] = $value["file_hash"];
            //
            // record altered files
            if( array_key_exists( $value["file_path"], $files )
                && $files[$value["file_path"]] != $value["file_hash"] ) {
                $diffs["altered"][$value["file_path"]] = $files[$value["file_path"]];
                //$tmp[$value["file_path"]] = $files[$value["file_path"]];
            }
        }
        $diffs["added"] = array_diff_key( $files, $control );
        $diffs["deleted"] = array_diff_key( $control, $files );
        unset( $control );
    }

  • John

    Why deal with all that setup when there are ready-made solutions? http://www.guardio.net