Extract Objects from an Access Database with PHP, Part 1

Share this article

Key Takeaways

  • PHP can be used to extract objects from legacy databases, such as Microsoft Access, specifically from OLE fields, which are often troublesome for developers to migrate from.
  • The extraction process involves understanding the structure of the OLE container, decoding the header and data, and identifying the start and end of the embedded object in the data.
  • The article provides a detailed PHP code example for extracting objects from OLE packages, which includes finding the original name of the embedded file, the size in bytes of the original file, and the start location of the data.
  • The method presented can be applied to other OLE types to ensure that as much data as possible can be retrieved from a legacy Access database. The second part of the series will focus on extracting known object types, particularly Acrobat PDF documents.
A simple search of the Web reveals that there are many programmers who need to work with legacy databases, such as Microsoft Access, but are stumped when confronted with the task of extracting files from Access’s OLE Object field. Part of the problem is the dearth of readily available information about how objects are stored in the OLE field. Worse still for the PHP programmer is that every instance of queries posted to forums and support websites are from non-PHP programmers – .Net seems to be the most common, and there are also questions from Java and Delphi programmers. One thing all of these developers have in common though is the need to extract information from these troublesome OLE fields in order to migrate from a legacy database. But not one article specifically targets for PHP programmers. Nor do any of them address “packages”, only specific data types such as JPEG, PNG, GIF, etc. In this article we’ll see how PHP can be used to extract objects from two OLE types: packages and Acrobat PDF documents. This is the first part of a two-part series in which we’ll look at OLE packages, which, as we’ll see, can be identified as either “Package” or “Packager Shell Object”. Values expressed in this article are based on an Access 2000 database.

OLE Container

The storage of a blob (binary large object) in a database is never a simple matter, and Microsoft’s Access database is no exception. Let’s summarize some key aspects of the OLE container used when an object is inserted to an OLE field in Access:
  • No external file would have been inserted to an Access OLE field without first being wrapped in a container. For our purposes, this container can be considered to be a header in front of the object we are interested in and a trailer after it.
  • The header length is not fixed. Even for the OLE package type the header length can be different depending on the object that is embedded in the container.
  • The data is not in a human-readable form, making it harder to visually find crucial items such as the original name of the embedded object, or its name, or the start of the object component.
  • The trailer can be ignored. Our focus here will be on the header and object components.
One other point: it’s not just the object component that’s encoded, but the header is too. Before we can make any sense of any of the data, the header needs to be decoded first.

Encoded Header and Data

The test database is defined as simply as possible, with the embedded files contained in the image field of Table1:

ole01-1

Throughout this worked example, we’ll be using an Apache logo GIF that I stored in the OLE field. Extracting the field contents without decoding, then inspecting the output in a hexadecimal viewer reveals, well, gibberish:

ole01-2

After converting the extracted header from hexadecimal to decimal encoding using PHP’s hex2dec() function we have something more useful:

ole01-3

It looks uninviting, but it’s clear that we can now make some sense of the data. To help simplify matters the first 20 bytes can be ignored. The next block of characters tells us that the OLE object is of type Packager Shell Object. An OLE package is a general purpose container for those file types not recognized by the database when the files were inserted. In the words of Professor Farnsworth, “Good news, everyone!” If we look closer at the ASCII column of the hex dump we can see the name of the original file that is embedded in the container – apache_02.gif. This will be useful later when we try to locate the end of the header and the start of the data that we need to extract. Even better, the filename always starts at byte 84 of a Packager Shell Object header and at byte 70 of a Package header. Before moving on to how we extract the GIF from the package, here’s a little teaser: In both of these two hex dumps, there’s the sequence 0A0600 underlined in yellow. For now, accept that 0A0600 is important. We’ll see why in a moment.

Extracting the Object

Before we can extract the embedded object, we need to know where it starts and ends in the data. Careful inspection of the hexadecimal representation reveals that the character sequence for the filename occurs three times – once at byte 84 (or 70), then again twice as part of the full path.

ole01-4

At least, that how it seems in this example. Just to complicate matters, it’s possible that the full path may appear only once. Also, Access may have changed the case of the file path, or changed each level of the file path to old-fashioned 8:3 format (meaning that names longer than 8 character will be truncated, and characters 7 and 8 replaced by ~1).Fortunately there will always be an instance of the full path somewhere around the location of the second full path in this example. “Somewhere around the location?” Just as the header length is not fixed, so too are the locations of the full path. We need to work around these niggles. “More good news, everyone!” The final instance of the filename is the second last important item of interest in the header. It is null-terminated (00), after which is the hexadecimal sequence 0A0600 we saw earlier. This the original size in bytes of the embedded object. It’s stored in little-endian format, so this needs to be reversed to big-endian before we can use it: 0A0600 becomes 00060A, which means 1,546 bytes. This group of three bytes gives a maximum allowable embedded file size of 16 MB. After this sequence is the object we need. It starts with the character sequence 474946383961. Converting this to ASCII yields “GIF89a”. It doesn’t show in the ASCII column because Access does not store data in the OLE header in consistent two-character blocks. That’s why the first instance of the full path displays nicely up to “DocumentsWri” then goes awry. In this example, the character after “Wri” ought to be “t” as in “Writing” but Access inserted a null character into the sequence. Why? Yeah, Microsoft, why! Note that “ting” (74646E67) immediately follows the null. Another niggle to code around. So let’s recap. What do we have now? We have the original name of the embedded file, the size in bytes of the original file, and we have the start location of the data we need to extract. We thus have all we need to extract the embedded file from a legacy Access database, and then save it to disc using its original name and extension.

Putting Theory into Practice

What follows is the PHP I used to extract from my test database the object details we discussed above. It’s no-frills, procedural PHP which should be easy for anyone to understand the steps involved in extracting an object from an OLE package. You’ll notice that some of the logic has multiplied certain numbers by a factor of two. For example, the offset for the embedded file name is defined as 168 rather than the 84 bytes mentioned above. That’s because some parts of the code work on the raw data from the Access database’s OLE field, which is encoded in hexadecimal format. That is, each single byte is stored as two hexadecimal characters.
<?php
if (!function_exists("hex2bin")) {
   function hex2bin($hexStr) {
      $hexStrLen = strlen($hexStr);
      $binStr = "";
      $i = 0;
      while ($i < $hexStrLen) {
         $a = substr($hexStr, $i, 2);
         $c = pack("H*", $a);
         $binStr .= $c;
         $i += 2;
      }
      return $binStr;
   }
}

$dbName = "db1.mdb";
$db = new PDO("odbc:DRIVER={Microsoft Access Driver (*.mdb)}; DBQ=$dbName; Uid=; Pwd=;");

$sql = "SELECT * FROM Table1 WHERE id = 2";
foreach ($db->query($sql) as $row) {
   $objName = "";

   switch (getOLEType($row["image"])) {
      case "Packager Shell Object":
          list($objName, $objData) = extractPackagerShellObject($row["image"]);
          break;
      case "Package":
          list($objName, $objData) = extractPackage($row["image"]);
          break;
      default:
          throw new Exception("Unknown OLE type");
   }
   if ($objName != "") {
      file_put_contents($objName, $objData);
   }
}

function flipEndian($data, $size) {
   $str = "";
   for ($i = $size - 2; $i >= 0; $i -= 2) {
      $str .= substr($data, $i, 2);
   }
   return $str;
}

function findNullPos($str) {
   // must start on a two-character boundary
   return floor((strpos($str, "00") + 1) / 2) * 2;
}

function getOLEType($data) {
   // fixed position of OLE type
   $offset = 40;

   $tmp = substr($data, $offset, 255);
   $nullPos = findNullPos($tmp);
   $tmp = substr($tmp, 0, $nullPos);
   $type = hex2bin($tmp);

   return $type;
}

function extractPackagerShellObject($data) {
   $headerBlock = 500; // usable header size
   $offset = 168; // location of name

   // find name
   $tmp = substr($data, $offset, 255);
   $nullPos = findNullPos($tmp);
   $name = substr($tmp, 0, $nullPos);
   $pos = $offset + strlen($name);

   // find data
   $path1 = strpos($data, $name, $pos); // 1st full path
   $pos = $path1 + strlen($name);
   $path2 = strpos($data, $name, $pos); // 2nd full path
   // check if only one full path
   if ($path2 > $pos) {   
      $pos = $path2 + strlen($name);
   }
   $oleSizePos = $pos + 2;
   $oleObjSize = flipEndian(substr($data, $oleSizePos, 8), 8);
   $oleHeaderEnd = $oleSizePos + 8;
   $objName = hex2bin(substr($tmp, 0, $nullPos));

   // extract object
   $data = substr($data, $oleHeaderEnd, hexdec($oleObjSize) * 2);
   $objData = hex2bin($data);

   return array($objName, $objData);
}

function extractPackage($data) {
   $headerBlock = 500; // usable header size
   $offset = 140; // location of name

   // find name
   $tmp = substr($data, $offset, 255);
   $nullPos = findNullPos($tmp);
   $name = substr($tmp, 0, $nullPos);
   $pos = $offset + strlen($name);

   // find data
   $path1 = strpos($data, $name, $pos); // 1st full path
   $pos = $path1 + strlen($name);
   $path2 = strpos($data, $name, $pos); // 2nd full path
   // check if only one full path
   if ($path2 > $pos) {   
      $pos = $path2 + strlen($name);
   }
   $oleSizePos = $pos + 2;
   $oleObjSize = flipEndian(substr($data, $oleSizePos, 8), 8);
   $oleHeaderEnd = $oleSizePos + 8;
   $objName = hex2bin(substr($tmp, 0, $nullPos));

   // extract object
   $data = substr($data, $oleHeaderEnd, hexdec($oleObjSize) * 2);
   $objData = hex2bin($data);

   return array($objName, $objData);
}
Note that the bespoke hex2bin()
function is only required if your PHP installation does not have its own (it was added in PHP 5.4). Also, the switch statement makes the logic extensible, as we’ll discover in the next part of this article. The two package functions look very similar. These will be revisited in the next part of this article, when the code will be refactored because of the similar nature of the logic required for these and other object types. I have tested this PHP with the following file types: BMP, GIF, JPEG, PNG, DOC, XLS, and PPT, all of which were stored as packages. The logic presented above should allow any file stored in an OLE package to be rescued from a legacy database, not just the GIF used in this example.

Summary

In this article we have covered the essential elements of extracting package objects from OLE fields in a Microsoft Access database using PHP. Having learned the basic structure of an OLE object, and how to extract the file contained therein, the method presented above can be applied to other OLE types to ensure that as much data as possible can be retrieved from a legacy Access database. In the second part of this series we’ll look at the more complex issue of extracting known object types, focusing in particular on extracting Acrobat PDF documents from a legacy Access database. In the meantime, feel free to clone the GitHub repository with this article’s code to play and explore. Image via Fotolia

Frequently Asked Questions (FAQs) about Extracting OLE Objects from an Access Database Using PHP

What is an OLE object in MS Access?

An OLE (Object Linking and Embedding) object in MS Access is a way to store files from other applications like Word, Excel, PDFs, images, etc. It allows you to create objects in one application and then link or embed them in another application. This means you can insert an Excel spreadsheet into an Access database, and it will retain all its original properties and functionalities.

How does PHP interact with an Access database?

PHP can interact with an Access database using a database driver. This driver allows PHP to communicate with the database, send queries, and retrieve results. The most common driver used for this purpose is the Microsoft Access Driver (*.mdb, *.accdb).

Why would I want to extract OLE objects from an Access database using PHP?

There could be several reasons for this. For instance, you might want to display the content of the OLE object on a web page, or you might want to process the data in some way. PHP provides a convenient and flexible way to accomplish these tasks.

What are the prerequisites for extracting OLE objects from an Access database using PHP?

To extract OLE objects from an Access database using PHP, you need to have PHP installed on your server, along with the appropriate database driver. You also need to have access to the database from which you want to extract the OLE objects.

Can I extract any type of OLE object from an Access database using PHP?

Yes, you can extract any type of OLE object from an Access database using PHP, as long as the database driver supports that type of object. This includes Word documents, Excel spreadsheets, PDFs, images, and more.

What is the difference between linking and embedding an OLE object in Access?

When you link an OLE object in Access, the object remains in its original file and location, and a link is created to it from the Access database. When you embed an OLE object, a copy of the object is inserted into the database. The main difference is that changes to a linked object are reflected in the Access database, while changes to an embedded object are not.

How can I handle errors when extracting OLE objects from an Access database using PHP?

PHP provides several ways to handle errors. One common method is to use try-catch blocks. If an error occurs within the try block, the catch block is executed, allowing you to handle the error gracefully.

Can I extract multiple OLE objects at once from an Access database using PHP?

Yes, you can extract multiple OLE objects at once from an Access database using PHP. You would typically do this by executing a query that selects multiple records from the database, and then looping through the results to extract each OLE object.

Can I modify an OLE object in an Access database using PHP?

Yes, you can modify an OLE object in an Access database using PHP. However, this requires a good understanding of both the structure of the OLE object and the way that Access stores OLE objects in its database.

Can I use PHP to insert OLE objects into an Access database?

Yes, you can use PHP to insert OLE objects into an Access database. This involves creating a new record in the database and then storing the OLE object in the appropriate field.

David FrancisDavid Francis
View Author

David is a web developer based in England. He is an experienced programmer having developed software for various platforms including 8-bit CPUs, corporate mainframes, and most recently the Web. His preference is for simplicity and efficiency, avoiding where possible software that's complex, bloated, or closed.

Intermediate
Share this article
Read Next
Get the freshest news and resources for developers, designers and digital creators in your inbox each week