File_Archive

I arrived back from Toronto yesterday, although I was too tired to do anything despite getting home at about 10am! But I’m now ready to continue posting regularly.

I am getting ready to release a major new version (1.2) of one of my products. One of its features automatically generates zip, tar, gzip and bzip2 archives of your product so they can be distributed to your customers. You upload the files, and the application generates the archive.

When I was writing this section of code, I spent quite a long time looking for classes to help with the archive generation. I am a strong believer in reusing code that is made available in libraries rather than writing it myself. It saves a lot of time! I looked in a lot of places and tried several classes – they worked, but not fully. For example, there were issues with corrupt zip files, or directories not being maintained. I eventually settled on a PEAR class called File_Archive, which is extremely good!

File_Archive: File_Archive will let you manipulate easily the tar, gz, tgz, bz2, tbz, zip, ar (or deb) files

Not only does it do everything you might need to do with archives, but it is actively developed. I have contributed several bug reports and they are always responded to and fixed quickly – often within hours of reporting. When I started using it, the docs available weren’t that great, but they have now been greatly improved and updated.

Before actually doing any code writing, it is important to decide what kind of archive you are going to generate. The documentation provides a very good description of what to look out for:

The choice of the file format is important if you want an efficient generation. Let’s see what are the possibilities:

  • Tar
    Pros: generation very efficient, constant memory usage, no need to cache
    Cons: no compression (but anyway images or video can hardly be compressed), not as widely used as Zip
  • Tgz, Tbz
    Pros: very high compression ratio, constant memory usage
    Cons: can’t be cached, needs a lot of CPU at each generation
  • Zip
    Pros: intermediate result can be cached, compressed, you can choose the compression level, widely used
    Cons: compression ratio lower than for Tgz/Tbz

From http://poocl.la-grotte.org/example.php

File_Archive includes several features which will be useful for handling files and their archives (hence the name!). These examples have been nabbed from the File_Archive docs to illustrate the library.

Readers

The readers allow you to read files, directories and also archives of files/directories. The generated list is held within an object so it can be used later when archiving or serving to users. For example, if you wanted to read an archive, and hence uncompress it:

$source = File_Archive::read('path/to/dir/archive.tar');

Writers

Writers do the actual writing of archives to disk or to memory. There is also a function which can send a created archive as an attachment in an e-mail. You have to call a reader before the writer so that there is content to build the archive from. For example, this will send all the files in path/to/dir by e-mail:

File_Archive::extract(
    File_Archive::read('path/to/dir', '', 0, 0),
    File_Archive::toMail('to@example.com',
        array('Subject' => 'path/to/dir directory', 'From' => 'example@example.com'),
        'body'));

Caching

File_Archive 1.4 introduced the possibility to use a cache to store intermediate result of a zip compression. It uses the Cache_Lite PEAR package to do so.

A zip file is made of compressed files, one after the others. So if you generate an archive that contains files A, B and C and then another archive that contains A and C, you will compress twice the files A and C. The use of the cache will allow to save the compressed version of files A, B and C on the first compression, to use them again in the second compression.

On my machine (a thinkpad T42P with default factory equipment), generating a 200MB zip archive takes around 30s of CPU without the cache, 32s of CPU with an empty cache and 2s of CPU if all the files to compress are already in cache.

From http://poocl.la-grotte.org/tutorial/cache.php
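
To make the idea concrete, here is a minimal sketch of per-file compression caching in plain PHP. It is not how File_Archive or Cache_Lite implement it: the compressWithCache() function, the in-memory array used as a cache and the md5 content key are all assumptions made for illustration, and it needs PHP 7 with the zlib extension.

```php
<?php
// Hypothetical illustration only: this is not File_Archive's or Cache_Lite's API.

/**
 * Compress $data with DEFLATE, reusing an earlier result when the same
 * content has already been compressed. $misses counts real compressions.
 */
function compressWithCache(string $data, array &$cache, int &$misses): string
{
    $key = md5($data); // cache key derived from the file content
    if (!isset($cache[$key])) {
        $cache[$key] = gzdeflate($data, 9); // the expensive step
        $misses++;
    }
    return $cache[$key];
}

// Stand-ins for the files A, B and C from the description above.
$files = array(
    'A' => str_repeat('alpha ', 1000),
    'B' => str_repeat('bravo ', 1000),
    'C' => str_repeat('charlie ', 1000),
);

$cache  = array();
$misses = 0;

// First archive: A, B and C are all compressed and stored in the cache.
foreach (array('A', 'B', 'C') as $name) {
    compressWithCache($files[$name], $cache, $misses);
}
$missesAfterFirst = $misses;  // 3

// Second archive: A and C are served from the cache, nothing is recompressed.
foreach (array('A', 'C') as $name) {
    compressWithCache($files[$name], $cache, $misses);
}
$missesAfterSecond = $misses; // still 3
```

The second pass does no compression work at all, which is in line with the large drop in CPU time quoted above when every file is already in the cache.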

Useful Function

Here is a function which I used in my product to quickly generate any kind of supported archive from an existing directory (and all of its contents).

It takes two arguments: the type of archive you want to create (string $type) and the name of the directory you want to archive (string $directory). The archive is then created with the same name as the directory. $type can be one of tgz, tbz, tar, zip, gz, gzip, bz2, bzip2 or any composition of them (for example tar.gz or tar.bz2). The case of this parameter is not important.

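<?php
require_once 'File/Archive.php';

// Archive $directory (and everything in it) as "<directory>.<type>".
function archive($type, $directory)
{
    if (file_exists($directory))
    {
        // Read the directory, keeping its name as the root inside the archive.
        $archive = File_Archive::read($directory, $directory);
        // Write the archive to disk, e.g. "mydir.zip" for archive('zip', 'mydir').
        $archive->extract(File_Archive::toArchive($directory.'.'.$type, File_Archive::toFiles()));
    }
    else
    {
        // The original referenced an undefined $location here.
        trigger_error('Directory '.$directory.' does not exist', E_USER_ERROR);
    }
}
?>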
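
The naming rule described above (the directory name plus the archive type) can be sketched on its own. The archiveFileName() helper below is hypothetical and not part of File_Archive; trimming trailing slashes and lower-casing the type are this sketch's own choices, made so that the resulting file name is predictable.

```php
<?php
// Hypothetical helper, not part of File_Archive: builds "<directory>.<type>".
function archiveFileName(string $directory, string $type): string
{
    // Drop trailing slashes so "docs/" and "docs" give the same name.
    $base = rtrim($directory, '/\\');
    // Lower-casing is this sketch's choice; it only normalises the file name.
    $extension = strtolower(ltrim($type, '.'));
    return $base . '.' . $extension;
}

$zipName = archiveFileName('downloads/product', 'zip');     // downloads/product.zip
$tgzName = archiveFileName('downloads/product/', 'TAR.GZ'); // downloads/product.tar.gz
```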

So hopefully you’ll find this useful in your own programs.

  • Dr Livingston

    > I am a strong believer in reusing code that is made available in libraries rather than writing it myself. It saves a lot of time!

    Well.. Well… Well…

    > I looked in a lot of places and tried several classes – they worked, but not fully. For example, there were issues with corrupt zip files, or directories not being maintained.

    By the time you spent looking for a solution, you could have started to script your own solution, and probably got a fair bit of it done.

    I’m all for re-use myself, but only in the sense of script which I’ve developed myself. With the amount of script out there that suffers from lack of unit testing, lack of documentation, and a lack of reliable feedback, as a developer, you just cannot depend on it.

    It’s not safe, not from a developer’s point of view, and certainly not from a business point of view. After reading this blog, for one I wouldn’t use your software.

    I couldn’t depend on it. Sorry

  • http://aplosmedia.com/ Eric.Coleman

    I don’t like the API design that File_Archive has… I think it’s terrible and confusing.

    Oh well... to each his own.

    – Ric, E

  • fsteinel

    How about support for file-level packing, e.g. foo will be foo.cmp and gzcompressed? Useful for small files without the zip/tar-header overhead.

  • http://www.realityedge.com.au mrsmiley

    Dr Livingston, are you saying you wouldn’t purchase any software without knowing that every library it used came from a source that lived up to your development standards?

  • chrisb

    Dr Livingston, Writing compression code well is a fairly difficult thing to do.. Granted I don’t know exactly how good the author is, but I doubt there are many people that could put together a quality compression library anywhere close to the time it takes to research available options…

  • WebDevGuy

    I know this is Off-Topic but it would be GREAT if someone would write a brief tutorial about PEAR’s File_PDF. I find VERY little about it – not even enough to get started. Yet I have seen presentations about how great it is.

  • http://boyohazard.net Octal

    > By the time you spent looking for a solution, you could have started to script your own solution, and probably got a fair bit of it done.

    Not anymore. Thanks to this blog I now know exactly where to go to find what I need.

    > With the amount of script out there that suffers from lack of unit testing, lack of documentation, and a lack of reliable feedback, as a developer, you just cannot depend on it.

    I find the PEAR archive to be reliable enough for my needs. I also think the following is a good source of feedback:

    > Not only does it do everything you might need to do with archives, but it is actively developed. I have contributed several bug reports and they are always responded to and fixed quickly – often within hours of reporting. When I started using it, the docs available weren’t that great, but they have now been greatly improved and updated.

    > After reading this blog, for one I wouldn’t use your software.

    Seeing as we’re judging people on a post rather than testimonials, customer referrals etc; I wouldn’t hire you as a software developer. The time you spend developing your own solution probably equates to more time to develop which means more cost for me.

  • OfficeOfTheLaw

    Dr Livingston,
    Are you suggesting that people should not use existing libraries and instead waste their client’s time reinventing the wheel?

    Whether or not File_Archive is designed “properly” is not the issue… it works, and does the job for people who may otherwise spend a good 8 to 12 hours writing their own file archive wrapper with unit tests.

    Sure, it’s nice to write your own, but unfortunately clients tend to be pushy on schedules and time constraints, and all they care about is that the product works. Not to excuse sloppy code, but File_Archive is also far from sloppy.

  • Robert Douglass

    The documentation for the File_Archive package is horrible and incomplete. Thanks for the tutorial; I doubt that any but the most dedicated developers will have success using any of the package’s features beyond the toy code in the tutorial and here.

  • Carter

    I doubt there are many people that could put together a quality compression library anywhere close to the time. By the time you spent looking for a solution, you could have started to script your own solution, and probably got a fair bit of it done.

    Nice blog. I read some of your articles and they are really nice.