Binaries Belong in the Database Too


One often-repeated piece of advice one sees across the internet is “Don’t store binary files in the database. Save a pointer to the file and keep it on the disk.” The ostensible reason for this is that file systems are much more efficient at storing and retrieving files than databases. That is a valid argument, especially if one is dealing with very large files. But I think there are a number of advantages to storing files in the database that can easily outweigh this issue, especially for dynamic, data-driven sites.

The biggest advantage is that databases generally support transactions and are ACID compliant. This means that if there is a problem uploading the file, one can easily cancel the entire update and roll back to the previous state, unlike a file system, where one must manually clean up any mess one makes. Furthermore, when using a database as the backing store for your files, you need not worry about file name collisions.
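The rollback behaviour is easy to demonstrate. Below is a minimal Python/SQLite sketch (the table, the size check, and the helper name are all illustrative, not from any particular application): if anything goes wrong partway through saving an upload, the INSERT is rolled back and there is nothing to clean up.

```python
import sqlite3

# Hypothetical attachments table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE attachments (id INTEGER PRIMARY KEY, name TEXT, data BLOB)")

def save_attachment(conn, name, data):
    try:
        # The connection as a context manager wraps a transaction:
        # commit on success, rollback if an exception escapes the block.
        with conn:
            conn.execute("INSERT INTO attachments (name, data) VALUES (?, ?)",
                         (name, data))
            if len(data) > 1_000_000:  # e.g. a validation step that fails mid-save
                raise ValueError("file too large")
    except ValueError:
        pass  # the INSERT was rolled back; no orphaned file to delete

save_attachment(conn, "ok.bin", b"\x00" * 10)          # commits
save_attachment(conn, "big.bin", b"\x00" * 2_000_000)  # rolls back
count = conn.execute("SELECT COUNT(*) FROM attachments").fetchone()[0]
print(count)  # 1 -- only the valid upload survived
```

Contrast that with a filesystem, where a failed upload leaves a half-written file behind that some cleanup routine must find and delete.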

Backup is another huge advantage. If your application is already storing its content in a database, storing binary attachments there as well makes a lot of sense. Your backup routine becomes much simpler, as you need only back up a database rather than worrying about some arbitrary folders on the disk. Your restore routine is likewise greatly simplified: you restore just a database rather than a database plus some bit of the file system. This comes in especially handy when you, as a developer, need to pull down a “current” copy of the site in order to do some maintenance work.

From a security standpoint there are huge advantages as well. If your application need not write files, you can ban it from writing to disk at all using file system permissions. The inability to write malicious files to disk makes many exploits impossible. Furthermore, since a server-side language handles the files, keeping unauthorized users from raiding them becomes much easier, as every file request passes through your normal security checks. In addition, because no one needs to upload files, one can often disable FTP access entirely. And FTP is about as insecure a service as exists today.

Finally, depending on the database platform, one can do some very neat tricks with database-based file storage. In SQL Server, one can do full-text searches on HTML or Word documents out of the box. Installing the PDF IFilter allows the same sort of searches on PDFs, and there are other IFilters available for many sorts of files. Users love full-text search.

As I stated above, there is clearly some overhead associated with using a database to store your binary files. The most significant cost is that database connections are expensive, and pulling hundreds of KB out of a database requires that the connection stay open much longer. But this is easily overcome with caching, either to a temporary folder on the disk or, better yet, into memory.
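To illustrate the in-memory option, here is a small Python/SQLite sketch (the schema and sizes are hypothetical) that fronts the blob fetch with an LRU cache, so the expensive database round trip happens only once per file:

```python
import sqlite3
from functools import lru_cache

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (id INTEGER PRIMARY KEY, data BLOB)")
conn.execute("INSERT INTO files (id, data) VALUES (1, ?)", (b"hello",))

db_hits = {"count": 0}  # instrumentation, just to show the cache working

@lru_cache(maxsize=128)  # hot files stay in memory
def get_file(file_id):
    db_hits["count"] += 1
    row = conn.execute("SELECT data FROM files WHERE id = ?", (file_id,)).fetchone()
    return row[0] if row else None

for _ in range(3):
    get_file(1)
print(db_hits["count"])  # 1 -- the second and third requests never touched the DB
```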

So, next time someone tells you that you are crazy for storing files in an RDBMS, tell them they are crazy to store the files on a disk.


  • Antony

    Great article. I’ve been concerned by the fact that I prefer storing image binaries in the database.

    The reason I prefer this is that it’s easy to create associations and relationships with other database entries.

    Is there any good article on optimizing performance for DB file storage, such as using caching and so on? Anyone?

  • dbevfat

    Well, one of the problems is performance – serving static content is much faster than fetching it from DB first and serving it. The proposed solution to use a temporary folder as a cache is somewhat silly. Why use a database at all, then? Caching in memory? Well, at least for bigger files this is not an option.

    The second problem is more rare – manipulating files. You can’t easily fwrite a file that resides in a database.

    All-in-all, this is a good technique if you’re aware of the drawbacks. Hardware is getting cheaper and cheaper, so I guess server resources aren’t that important.

  • http://timvw.madoka.be timvw

    Furthermore, when using a database as the backing store for your files, you need not worry about file name collisions.

    In both systems you have to worry about the ‘identification’ of entities. In a filesystem this is done via the filename and in a sql dbms this is done via the primary key.

    Backup is another huge advantage. If your application is already storing its content in a database, storing binary attachments there as well makes a lot of sense. Your backup routine becomes much simpler, as you need only back up a database rather than worrying about some arbitrary folders on the disk.

    It makes me wonder why most SQL DBMS vendors are proud to announce that the process of making a backup is as simple as copying some files… (To me it seems as if they’re saying that backing up files is easier than backing up a DBMS :P)

    Your restore routine is likewise greatly simplified: you restore just a database rather than a database plus some bit of the file system. This comes in especially handy when you, as a developer, need to pull down a “current” copy of the site in order to do some maintenance work.

    I find that restoring files is much easier than restoring a database schema. Just do a web search and see how many people have problems restoring a simple MySQL dump file…

    If your application need not write files, you can easily ban it from writing to disk at all using file system permissions.

    That’s true for both a filesystem and a DBMS, so I really don’t see why a DBMS would be more secure than a filesystem.

    In addition, because no one needs to upload files, one can oftentimes disable FTP access entirely.

    But you would have to open up whatever your dbms accepts for input..

    And FTP is about as insecure a service as exists today.

    I agree that FTP is broken. Probably that’s the reason we have scp (and others) these days…

    Finally, depending on the database platform, one can do some very neat tricks with database-based file storage. In SQL Server, one can do full-text searches on HTML or Word documents out of the box. Installing the PDF IFilter allows the same sort of searches on PDFs, and there are other IFilters available for many sorts of files. Users love full-text search.

    And you can do the same tricks with regular files.

    As I stated above, there is clearly some overhead associated with using a database to store your binary files. The most significant cost is that database connections are expensive, and pulling hundreds of KB out of a database requires that the connection stay open much longer. But this is easily overcome with caching, either to a temporary folder on the disk or, better yet, into memory.

    So you would give up the ‘security’ advantages (not having to write to files) of the dbms for performance??

    So, next time someone tells you that you are crazy for storing files in an RDBMS, tell them they are crazy to store the files on a disk.

    Conclusion: You only mentioned one advantage, namely support for transactions.

    There are situations where I find it acceptable to store binary data in a DBMS too. I expect we’re evolving toward a situation where, for common cases, we’ll have a filesystem that can be queried/backed up/… as if it were a SQL DBMS.

  • Mark

    Am I surprised to find this sort of article in a M$ based forum ? No !

    It’s crazy to store binary files in a database. I’ve seen a number of commercial M$ applications do this where there is no check whatsoever on file sizes, hence DB becomes full and sssslllloooowwww.

    Keep your files externally in filestore, with a database pointer.
    There are plenty of Open Source tools available which can index & search M$ Word & PDF files. Oh and why not run it on a secure system like Linux ? Writing to the file system is much safer :-)

  • chrisb

    Mark: as with everything in programming, there will always be situations where performance can be hurt. I think the article(?) should be taken more as a ‘don’t be afraid to do xxxx just because it’s generally not done’.

    Personally though, I tend to avoid storing files in the database – transactions can be done in a number of ways without needing to go to the database just ‘cos it’s easier…

    One more thing that was omitted is the monetary cost associated with storing large amounts of binary data in the database – storing a lot of extra non-relational data in there immediately hits your application’s scalability [caching etc. could be used, but that outweighs the benefits for ease of implementing other features] and forces your customer to buy more expensive hardware and SQL licenses for no real benefit to them.

    Again, there will be situations where it makes total sense to store stuff in there; but for every good situation there will be a half dozen bad ones, so make sure it’s used appropriately…

  • wwb_99

    @timvw

    In both systems you have to worry about the ‘identification’ of entities. In a filesystem this is done via the filename and in a sql dbms this is done via the primary key.

    True, but in most cases the database or application logic will end up handling this issue for the developer. 99% of the time the file is either going into a table with its own key, or into an attachments table referencing keys from somewhere else. Whereas one will often have to gin up one’s own naming scheme for the filesystem.
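    The attachments-table pattern described here can be sketched as follows (table and column names are made up for illustration); the point is that the database assigns the identifier, so no naming scheme is needed:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE attachments (
    id INTEGER PRIMARY KEY,                -- identity handed out by the database
    post_id INTEGER REFERENCES posts(id),  -- ties the file to its owning row
    name TEXT,
    data BLOB
);
""")
post_id = conn.execute("INSERT INTO posts (title) VALUES ('hello')").lastrowid
att_id = conn.execute(
    "INSERT INTO attachments (post_id, name, data) VALUES (?, ?, ?)",
    (post_id, "photo.jpg", b"\x89fakebytes"),
).lastrowid
print(att_id)  # the primary key is the file's identity; no filename collisions possible
```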

    It makes me wonder why most sql dbms vendors are proud to announce that the process of making a backup is as simple as copying some files… (To me it seems as if they’re saying that backing up files is easier than backup a dbms :P)

    But that backup is still one file rather than multiple files in multiple folders. Never mind properly transferring the permissions on said files and folders over. There is a lot more that can go wrong when restoring multiple entities.

    I find that restoring files is much easier than restoring a database schema. Just do a web search and see how many people have problems restoring a simple MySQL dump file…

    A lot of people have a lot of trouble moving a lot of applications, mostly because making apps transportable is tricky if you do not keep it in mind from day one of design. These same folks would probably also have issues copying folders of randomly named files, fixing configurations and setting write permissions on said folders.

    But you would have to open up whatever your dbms accepts for input..

    Not really. One usually uploads files using the application itself, which is already accessible from the internet (or intranet) and which already must have access to appropriate parts of the database.

    So you would give up the ’security’ advantages (not having to write to files) of the dbms for performance??

    I, myself, generally use memory rather than disk. But disk is a viable option: caching can be managed in such a manner that the app still can never write to itself, maintaining the security advantages.

    There are situations where I find it acceptable to store binary data in a DBMS too. I expect we’re evolving toward a situation where, for common cases, we’ll have a filesystem that can be queried/backed up/… as if it were a SQL DBMS.

    This is definitely coming, see next generation filesystems from Apple & MS.

    @Mark

    It’s crazy to store binary files in a database. I’ve seen a number of commercial M$ applications do this where there is no check whatsoever on file sizes, hence DB becomes full and sssslllloooowwww.

    Actually, the size of a database should not really hurt query speed unless said database is improperly designed. SELECTing based on indices, or better yet primary keys, should be a sub-second operation no matter how many rows are in the database. Now, a huge database will definitely slow down full-text queries, but those still scale very well.

    There are plenty of Open Source tools available which can index & search M$ Word & PDF files. Oh and why not run it on a secure system like Linux ? Writing to the file system is much safer :-)

    Last time I checked, Windows 2003/IIS6 had far fewer reported security vulnerabilities than *nix/Apache. On the file-system level, NTFS has never been “cracked.” Not that Linux is a file system, but that is another debate for another time.

    Here is a more drawn out explanation of why writing to the file system is not safer:

    1) malicious user finds image upload form, and forces it to accept a php script by faking some headers.
    2) malicious user then executes php script, which happens to be running in the same context of the application, revealing all kinds of goodies (like that unencrypted database connection information most PHP apps keep as a global variable).
    3) you are pwned.

    If those uploads were going into a database, then the worst the user could do would be to serve other users a corrupt JPG, as the stuff in the database is not executed by the host application.

    @timvw & mark:

    What tools, exactly, can you use to index PDFs and Word documents? How does one interface them with the website and the rest of the data it contains? What kind of setup or plumbing do these tools require?

    @chrisb

    Good point on shifting some file-size and performance requirements to the database server and the potential cost increases one can see there.

    @all

    One thing just hit me: we all say storing files in a file system is more efficient, but are there any good tests to see just how big the performance penalty is?

  • chrisb

    Depends what you feel a good test of performance is… the easiest that springs to mind is dynamic images served from a database vs. directly from IIS…

    Another would be to enable kernel caching for other static files – you’ll likely see a fairly sizable performance gain even against a cached ASP.NET-served version..

  • LinhGB

    The biggest advantage is that databases generally support transactions and are ACID compliant. This means that if there is a problem uploading the file, one can easily cancel the entire update and roll back to the previous state, unlike a file system, where one must manually clean up any mess one makes.

    Dealing with a file system is the same. I think it’s only because you’re thinking from an MS Windows point of view. This is purely server-administration stuff, and applications aren’t supposed to know or care about it.

    Furthermore, when using a database as the backing store for your files, you need not worry about file name collisions.

    Filenames stored on a filesystem by a web app are usually hashed. Hashed filenames and original filenames are both stored in the database. Problem solved. You can also store such file metadata in the filesystem itself (some filesystems support this).
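    The hashed-filename scheme described here takes only a few lines; this Python sketch (the helper is hypothetical, not from any particular framework) derives the on-disk name from a content hash, which also sidesteps collisions between identical original names:

```python
import hashlib
import os

def stored_name(original_name, data):
    """On-disk name for an upload: content hash plus the original extension."""
    digest = hashlib.sha256(data).hexdigest()
    ext = os.path.splitext(original_name)[1]
    return digest + ext

# Two different uploads both named "photo.jpg" get distinct on-disk names.
a = stored_name("photo.jpg", b"first image bytes")
b = stored_name("photo.jpg", b"second image bytes")
print(a != b)  # True
```

The original name and the hash would both be stored in the database row, as described above.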

    Mark has addressed the other points. I’ll question this:

    As I stated above, there is clearly some overhead associated with using a database to store your binary files. The most significant cost is that database connections are expensive, and pulling hundreds of KB out of a database requires that the connection stay open much longer. But this is easily overcome with caching, either to a temporary folder on the disk or, better yet, into memory.

    Only hundreds of KBs? Think bigger, please. Even for web apps, it is not unreasonable to expect that people will upload, or will demand the ability to upload, files of multiple MBs. This is a big problem that not even the MS Longhorn attitude of “let’s throw GBs of RAM and multi-core CPUs at the performance problem” can address on web servers. Your server will get hammered to death if just a couple hundred people access the file.

    As for caching, you might as well use the filesystem to store the files then. Caches can expire, and your application will have to “clean up the mess,” which you mentioned earlier as one of the supposed cons of using a filesystem. File caching at the web server level or kernel level is far, far better than doing it with the overhead associated with a database.

    For me, it’s about using the right tool for the right job. Filesystems are better designed to deal with files. It’s as simple as that.

    I expect we’re evolving toward a situation where, for common cases, we’ll have a filesystem that can be queried/backed up/… as if it were a SQL DBMS.

    Linux already has this in a KDE wrapper. You can “mount” a SQL DB on your system. I forget the exact details, as I don’t care much about it, but it’s in a recent Linux Journal issue. Google will find more info.

  • pete

    There are advantages to storing binaries in the database, and in some situations it can work well. Specifically, it is valuable in creating a dedicated asset management system with some kind of version control – where your database for the binaries is highly tuned, or even separated from the database managing the metadata. In this way, we can start to see binaries as an attribute of an Asset object rather than metadata+binary. But we are all used to achieving this effect by now with a combination of metadata in the database and binary on the filesystem.

    Some of the reasons and scenarios where I avoid databasing binaries are:

    1. I often want other applications to have access to the same image store without using an API into the database.
    2. I’m often building on top of an existing collection of images from another system (like asset management).
    3. Using the filesystem, you can export a static/published version of your site with scp/rsync easily in scenarios where you don’t want to use the database dynamically. Specifically, one of the sites I run has millions of hits per day and would be significantly more expensive if I fetched images from the database or had to have enough front-end servers with enough memory to make the caching effective.
    4. On an n-tier setup, having to transfer large binaries between servers across a database connection is inefficient. As developers, we usually try to reduce the weight of data transferred from the database (think of all those times you don’t ‘select *’ but select only the fields you require).
    5. When I want to distribute my images in front of my application, thinking here of edge-caching etc., where I may distribute my static content closer to the user (reverse proxy on my own network or geographically distributed edge caches). Doing this saves me money, as the bandwidth is bought very cheaply in bulk, since it only needs to serve static content. Of course, this could still be supported by exporting from the database to the static locations.

    In summary, I think if you are in a non-specialized setup, and you are happy that the connection between your front-end/app servers and the database containing the binaries will scale to support the increased load, then designing your application to take advantage of your database in the way the article advocates is worth serious consideration.

  • wwb_99

    @linhGB

    Valid points, but I think you (and many folks) overstate the performance issues of pulling files out of a database. I have had a server get slashdotted which pulls files out of a database, and it handled it like a champ. I have another application that gets slammed hard for three days a year. It also stores a lot of images in the database, and it takes it like a champ. They definitely don’t get “hammered to death” with a couple of hundred people accessing the sites.

    Dealing with a file system is the same. I think it’s only because you’re thinking from a MS Windows point of view. This is purely server administration stuff and applications aren’t supposed to know or care about it.

    If your app requires a file system that supports transactions, then I would bet it does care about this. I have seen a couple of people wave the linux flag and claim that transactions & full-text search can be handled with file-system functions, but I have yet to see anyone post an example of a reasonably elegant way to handle such things.

  • Tim

    I had a situation once where the website I built had to be replicated out to several servers. Tossing everything into a database, then using MSDE on the outside servers, allowed me to set up replication which copied new content out to the servers automatically – including images, PDFs, etc., all within a few seconds of updates. No need for file copying or worrying about filesystem permissions. Admittedly it wasn’t a big site (MSDE will do up to about 2GB of database), but it worked great for me and kept me as a single developer working on task instead of doing daily chores.

  • Ian Qvist

    I would like to see some tests on the performance issues with database binaries.
    I’ve run into the file-identifier problem a couple of times: I created a forum where people can upload their profile pictures, and I had to group the pictures and rename files to an ID counter. This could be done a lot more easily with a database storing all the profile pictures.

  • http://www.ajohnstone.com Andrew-J2000

    Hi,

    After reading this blog entry, I thought I would shed some light on my experience of storing files in databases. I’m sure many of you have known this to be a taboo practice, and I would certainly agree, depending on the database. A project I worked on for MTV Networks Europe/International required a completely shared-nothing architecture. This meant that MTV’s hosting & operations imposed that I store files in the database, and I expressed my hesitation. To read more about some of the issues that have not been discussed, see below…

    Large Binary Data and Blob’s

  • wwb_99

    @Andrew-J2000

    I think a big part of your issues revolved around MySQL. SQL Server handles blobs a lot better.

    In any case, memcaching is one method I use a lot to help take load off of the DB.

  • thrashtad

    @LihnGB:

    File caching at the web server level or kernel level is far, far better than doing it with the overhead associated with a database.

    Who says you can’t do both? You really think that because it’s stored in a DBMS you don’t access it like any other served file? Your webapp would output something like:

    http://blah.com/imgs/img.jpg?ID=123

    where img.jpg is really a PHP/ASP/Whateva file that pulls the image from the DB, fakes some headers, and outputs the binary data.

    Apache, IIS, etc. can cache this output, as well they should. You take a single performance hit when the cache is nonexistent or has expired, and the requests that follow don’t touch the DB at all.
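    A rough Python sketch of such a handler (the table, IDs, and header values are illustrative): it fetches the blob by ID and emits a Cache-Control header so the web server or a downstream proxy can cache the response:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE imgs (id INTEGER PRIMARY KEY, mime TEXT, data BLOB)")
conn.execute("INSERT INTO imgs VALUES (123, 'image/jpeg', ?)", (b"\xff\xd8fakejpeg",))

def serve_image(img_id):
    """Return (headers, body), roughly what the 'img.jpg?ID=123' script would emit."""
    row = conn.execute("SELECT mime, data FROM imgs WHERE id = ?", (img_id,)).fetchone()
    if row is None:
        return {"Status": "404 Not Found"}, b""
    mime, data = row
    headers = {
        "Content-Type": mime,
        "Cache-Control": "public, max-age=3600",  # lets the front end cache the hit
    }
    return headers, data

headers, body = serve_image(123)
print(headers["Content-Type"])  # image/jpeg
```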

    You were talking about performance issues why?

    @pete:

    1. I often want other applications to have access to the same image store without using an API into the database

    See above. Brain-dead simple there.

    IMHO, the benefits to security (having random web users actually altering my filesystem is NOT COOL!!) that come with not having binary data splattered onto your platters directly, plus the ease of searching, serving, and updating (and of writing secure code to do such tasks), far outweigh any perceived performance drawbacks.

    I would also like to point to this study:
    http://www.cis.ksu.edu/~rpalani/MSReport/ComparisonDbVsServer.htm

    As you can see, both methods (done correctly) scale perfectly well. Yes, uploading huge 100MB files takes about 50 seconds longer, but that is a non-issue, as the bottleneck there will almost always be the network and not the DB or the FS. Uploading is also not going to be your major operation, as it’s not done very often. Notice that downloading has the lead going to the DB until the FS catches up at 100MB.

    100MB is an awfully large file for the types of data we’re talking about here (images, documents, etc.). At 10MB and below, the performance hits are negligible.

    I personally am a fan of a metadata table + blob table for situations that warrant that functionality. Granted, if you’re serving billions of files to millions of users a day, then storing them in a DBMS is stupidity beyond belief. But even in that case, you’re going to be rolling your own custom filesystem architecture, akin to Google’s, which is way beyond the scope of this discussion.

    In 80% of other use cases, both methods work equally well.