Extract Objects from an Access Database with PHP, Part 1

This entry is part 2 of 2 in the series Extract Objects from an Access Database with PHP

Extract Objects from an Access Database with PHP

A simple search of the Web reveals that there are many programmers who need to work with legacy databases, such as Microsoft Access, but are stumped when confronted with the task of extracting files from Access’s OLE Object field. Part of the problem is the dearth of readily available information about how objects are stored in the OLE field.

Worse still for the PHP programmer is that every instance of queries posted to forums and support websites are from non-PHP programmers – .Net seems to be the most common, and there are also questions from Java and Delphi programmers.

One thing all of these developers have in common though is the need to extract information from these troublesome OLE fields in order to migrate from a legacy database. But not one article specifically targets for PHP programmers. Nor do any of them address “packages”, only specific data types such as JPEG, PNG, GIF, etc.

In this article we’ll see how PHP can be used to extract objects from two OLE types: packages and Acrobat PDF documents. This is the first part of a two-part series in which we’ll look at OLE packages, which, as we’ll see, can be identified as either “Package” or “Packager Shell Object”. Values expressed in this article are based on an Access 2000 database.

OLE Container

The storage of a blob (binary large object) in a database is never a simple matter, and Microsoft’s Access database is no exception. Let’s summarize some key aspects of the OLE container used when an object is inserted to an OLE field in Access:

  • No external file would have been inserted to an Access OLE field without first being wrapped in a container. For our purposes, this container can be considered to be a header in front of the object we are interested in and a trailer after it.
  • The header length is not fixed. Even for the OLE package type the header length can be different depending on the object that is embedded in the container.
  • The data is not in a human-readable form, making it harder to visually find crucial items such as the original name of the embedded object, or its name, or the start of the object component.
  • The trailer can be ignored. Our focus here will be on the header and object components.

One other point: it’s not just the object component that’s encoded, but the header is too. Before we can make any sense of any of the data, the header needs to be decoded first.

Encoded Header and Data

The test database is defined as simply as possible, with the embedded files contained in the image field of Table1:

ole01-1

Throughout this worked example, we’ll be using an Apache logo GIF that I stored in the OLE field. Extracting the field contents without decoding, then inspecting the output in a hexadecimal viewer reveals, well, gibberish:

ole01-2

After converting the extracted header from hexadecimal to decimal encoding using PHP’s hex2dec() function we have something more useful:

ole01-3

It looks uninviting, but it’s clear that we can now make some sense of the data. To help simplify matters the first 20 bytes can be ignored.

The next block of characters tells us that the OLE object is of type Packager Shell Object. An OLE package is a general purpose container for those file types not recognized by the database when the files were inserted.

In the words of Professor Farnsworth, “Good news, everyone!” If we look closer at the ASCII column of the hex dump we can see the name of the original file that is embedded in the container – apache_02.gif. This will be useful later when we try to locate the end of the header and the start of the data that we need to extract. Even better, the filename always starts at byte 84 of a Packager Shell Object header and at byte 70 of a Package header.

Before moving on to how we extract the GIF from the package, here’s a little teaser: In both of these two hex dumps, there’s the sequence 0A0600 underlined in yellow. For now, accept that 0A0600 is important. We’ll see why in a moment.

Extracting the Object

Before we can extract the embedded object, we need to know where it starts and ends in the data. Careful inspection of the hexadecimal representation reveals that the character sequence for the filename occurs three times – once at byte 84 (or 70), then again twice as part of the full path.

ole01-4

At least, that how it seems in this example. Just to complicate matters, it’s possible that the full path may appear only once. Also, Access may have changed the case of the file path, or changed each level of the file path to old-fashioned 8:3 format (meaning that names longer than 8 character will be truncated, and characters 7 and 8 replaced by ~1).Fortunately there will always be an instance of the full path somewhere around the location of the second full path in this example. “Somewhere around the location?” Just as the header length is not fixed, so too are the locations of the full path. We need to work around these niggles.

“More good news, everyone!” The final instance of the filename is the second last important item of interest in the header. It is null-terminated (00), after which is the hexadecimal sequence 0A0600 we saw earlier. This the original size in bytes of the embedded object. It’s stored in little-endian format, so this needs to be reversed to big-endian before we can use it: 0A0600 becomes 00060A, which means 1,546 bytes. This group of three bytes gives a maximum allowable embedded file size of 16 MB.

After this sequence is the object we need. It starts with the character sequence 474946383961. Converting this to ASCII yields “GIF89a”. It doesn’t show in the ASCII column because Access does not store data in the OLE header in consistent two-character blocks. That’s why the first instance of the full path displays nicely up to “DocumentsWri” then goes awry. In this example, the character after “Wri” ought to be “t” as in “Writing” but Access inserted a null character into the sequence. Why? Yeah, Microsoft, why! Note that “ting” (74646E67) immediately follows the null. Another niggle to code around.

So let’s recap. What do we have now? We have the original name of the embedded file, the size in bytes of the original file, and we have the start location of the data we need to extract. We thus have all we need to extract the embedded file from a legacy Access database, and then save it to disc using its original name and extension.

Putting Theory into Practice

What follows is the PHP I used to extract from my test database the object details we discussed above. It’s no-frills, procedural PHP which should be easy for anyone to understand the steps involved in extracting an object from an OLE package.

You’ll notice that some of the logic has multiplied certain numbers by a factor of two. For example, the offset for the embedded file name is defined as 168 rather than the 84 bytes mentioned above. That’s because some parts of the code work on the raw data from the Access database’s OLE field, which is encoded in hexadecimal format. That is, each single byte is stored as two hexadecimal characters.

<?php
if (!function_exists("hex2bin")) {
   function hex2bin($hexStr) {
      $hexStrLen = strlen($hexStr);
      $binStr = "";
      $i = 0;
      while ($i < $hexStrLen) {
         $a = substr($hexStr, $i, 2);
         $c = pack("H*", $a);
         $binStr .= $c;
         $i += 2;
      }
      return $binStr;
   }
}

$dbName = "db1.mdb";
$db = new PDO("odbc:DRIVER={Microsoft Access Driver (*.mdb)}; DBQ=$dbName; Uid=; Pwd=;");

$sql = "SELECT * FROM Table1 WHERE id = 2";
foreach ($db->query($sql) as $row) {
   $objName = "";

   switch (getOLEType($row["image"])) {
      case "Packager Shell Object":
          list($objName, $objData) = extractPackagerShellObject($row["image"]);
          break;
      case "Package":
          list($objName, $objData) = extractPackage($row["image"]);
          break;
      default:
          throw new Exception("Unknown OLE type");
   }
   if ($objName != "") {
      file_put_contents($objName, $objData);
   }
}

function flipEndian($data, $size) {
   $str = "";
   for ($i = $size - 2; $i >= 0; $i -= 2) {
      $str .= substr($data, $i, 2);
   }
   return $str;
}

function findNullPos($str) {
   // must start on a two-character boundary
   return floor((strpos($str, "00") + 1) / 2) * 2;
}

function getOLEType($data) {
   // fixed position of OLE type
   $offset = 40;

   $tmp = substr($data, $offset, 255);
   $nullPos = findNullPos($tmp);
   $tmp = substr($tmp, 0, $nullPos);
   $type = hex2bin($tmp);

   return $type;
}

function extractPackagerShellObject($data) {
   $headerBlock = 500; // usable header size
   $offset = 168; // location of name

   // find name
   $tmp = substr($data, $offset, 255);
   $nullPos = findNullPos($tmp);
   $name = substr($tmp, 0, $nullPos);
   $pos = $offset + strlen($name);

   // find data
   $path1 = strpos($data, $name, $pos); // 1st full path
   $pos = $path1 + strlen($name);
   $path2 = strpos($data, $name, $pos); // 2nd full path
   // check if only one full path
   if ($path2 > $pos) {   
      $pos = $path2 + strlen($name);
   }
   $oleSizePos = $pos + 2;
   $oleObjSize = flipEndian(substr($data, $oleSizePos, 8), 8);
   $oleHeaderEnd = $oleSizePos + 8;
   $objName = hex2bin(substr($tmp, 0, $nullPos));

   // extract object
   $data = substr($data, $oleHeaderEnd, hexdec($oleObjSize) * 2);
   $objData = hex2bin($data);

   return array($objName, $objData);
}

function extractPackage($data) {
   $headerBlock = 500; // usable header size
   $offset = 140; // location of name

   // find name
   $tmp = substr($data, $offset, 255);
   $nullPos = findNullPos($tmp);
   $name = substr($tmp, 0, $nullPos);
   $pos = $offset + strlen($name);

   // find data
   $path1 = strpos($data, $name, $pos); // 1st full path
   $pos = $path1 + strlen($name);
   $path2 = strpos($data, $name, $pos); // 2nd full path
   // check if only one full path
   if ($path2 > $pos) {   
      $pos = $path2 + strlen($name);
   }
   $oleSizePos = $pos + 2;
   $oleObjSize = flipEndian(substr($data, $oleSizePos, 8), 8);
   $oleHeaderEnd = $oleSizePos + 8;
   $objName = hex2bin(substr($tmp, 0, $nullPos));

   // extract object
   $data = substr($data, $oleHeaderEnd, hexdec($oleObjSize) * 2);
   $objData = hex2bin($data);

   return array($objName, $objData);
}

Note that the bespoke hex2bin() function is only required if your PHP installation does not have its own (it was added in PHP 5.4). Also, the switch statement makes the logic extensible, as we’ll discover in the next part of this article.

The two package functions look very similar. These will be revisited in the next part of this article, when the code will be refactored because of the similar nature of the logic required for these and other object types.

I have tested this PHP with the following file types: BMP, GIF, JPEG, PNG, DOC, XLS, and PPT, all of which were stored as packages. The logic presented above should allow any file stored in an OLE package to be rescued from a legacy database, not just the GIF used in this example.

Summary

In this article we have covered the essential elements of extracting package objects from OLE fields in a Microsoft Access database using PHP. Having learned the basic structure of an OLE object, and how to extract the file contained therein, the method presented above can be applied to other OLE types to ensure that as much data as possible can be retrieved from a legacy Access database.

In the second part of this series we’ll look at the more complex issue of extracting known object types, focusing in particular on extracting Acrobat PDF documents from a legacy Access database. In the meantime, feel free to clone the GitHub repository with this article’s code to play and explore.

Image via Fotolia

Extract Objects from an Access Database with PHP

<< Extract Objects from an Access Database with PHP, Part 2

Free book: Jump Start HTML5 Basics

Grab a free copy of one our latest ebooks! Packed with hints and tips on HTML5's most powerful new features.

No Reader comments

Comments on this post are closed.