Image Storage - File System Versus Database

cpradio · January 3, 2015, 4:12pm

Continuing the discussion from MySQLI Security:

Why do you want to store images in your database versus on the filesystem or on a CDN?

RyanReese · January 3, 2015, 4:16pm

Not sure what the standard protocol is. I assumed storing in the database was the norm based off all the threads I’ve read on here.

So you recommend that when I submit my form, I upload it to a directory (perhaps “uploads”)?

Is there a way to rename images that are set for upload? If so, that’s perfect.

cpradio · January 3, 2015, 4:21pm

Yes. move_uploaded_file

RyanReese · January 3, 2015, 4:22pm

Perfect. Thanks.

felgall · January 3, 2015, 9:26pm

It depends on a number of factors including how many images there are and how many people are on the team working on the system.

The book “SQL AntiPatterns” includes not including the images in the database as an antipattern.

If you have lots of images or are using transaction processing in the database then the images need to be loaded into the database. You also need to keep them in the database if there is a team working on the system instead of just you.

If you only have a few hundred images and are not using transaction processing and you are the only one developing the system then you can get away with keeping them outside of the database at the cost of maintaining those relationships independently of the database.

You also need to make sure the images get backed up as well as the database. The situation mentioned at the start of that chapter of the book involved a company that lost millions of image files because they were stored outside the database and only the database was being backed up.

RyanReese · January 3, 2015, 11:36pm

I will be the only one working on the system. The images might be 200 MAX. Probably won’t get above 100 though. They will be 150x150 JPGs.

Michael_Morris · January 4, 2015, 12:41am

Did not know that - however I smell ivory tower nonsense.

The main reason to have the images outside the database is it’s a static file fetch for the webserver, which will always be considerably faster than any database transaction.

That said, I can agree that the authoritative copy of the image should lie in the database, then a copy be placed in the file system by the code that updates images. For a first layer of failsafe protection use a little mod-rewrite magic to have a small PHP script that can make an attempt fetch any images missing in the filesystem from the database, then write a copy into the filesystem. A second failsafe layer would be to run a cron PHP script every night and checks the file_mtime of all the images against the record insertion time in the database, and replaces any images that, for whatever reason, are older than their db copy.

felgall · January 4, 2015, 12:57am

To make that statement it is obvious that you have only ever worked on tiny to small applications and have never worked on a medium to large one (ones involving hundreds of thousands to hundreds of millions of images or more).

If you have a team of several dozen people including database administrators as a separate group and the images are a part of transactions where the image changes need to be rolled back with the rest of the transaction then placing the images in the database is the only sensible option - they can be easily extracted on the fly as required using the appropriate SQL command.

The only reasons that so many people do not put images in the database are:

they don’t know the correct SQL commands to do it
the appropriate database options are usually not enabled on smaller server setups (where the entire database is running on a single server)
they are not using transactions in their database processing or
the number of images is tiny enough to maintain independently of the database (just a few hundred).

In Ryan’s case (4) applies and so it is not necessary to store the images in the database

TheRedDevil · January 4, 2015, 2:07am

I would disagree and say storing the actual image in the database is the “anti-pattern”. But again it depends on how you address the question and what the use case is. I.e. are the images internal, are they accessed perhaps a few times a day, or hundreds of thousands of times per day.

Storing content using BLOB and accessing it again is quite expensive, especially with MySQL, it is cheaper if you use Oracle, but still expensive. And that is just talking about the server recourses required to access it.

In the end it become a price game, if you can save tens of thousands of dollars by reducing your database clusters and instead save the images in static clusters while receiving faster access times. Doing this makes it also easier creating hot cache for most accessed images, using Memcache and custom Nginx module.

The security is still obtained by keeping the full set of the images in the database, but only by references not actually the image. This way you also keep the ability of transactions and can rollback etc. Though it is more messy, since you need cleanup scripts to deleted images that should be removed.

Note. You can also get around this by storing the image in the database, and then creating hot cache copies with the most accessed images, though in this case you carry the increased cost on the database clusters.

There is never a clear cut answer on which solution is the best, both has advantages and disadvantages, though I would go as far as saying, unless you know exactly why you store the images in the database, then it is not the best approach.

Michael_Morris · January 4, 2015, 2:15am

At the last company I worked with we had to deal with 8.5 million images. They where not kept in the database. That said, that system was hideously poorly designed and maintained so it isn’t a rubric of sane design, but I have dealt with this scale. To imply otherwise is condescending.

But clarifying on my original post - what I mean by ivory tower nonsense is the notion of serving images out of a database. I’m going to guess the book doesn’t address real world performance issues. Extracting on the fly is a MAJOR performance hit compared to using the filesystem.

If you’d read the rest of my post you’d note that I’m of a mind to keep the authoritative copy of the image in the database along with the rest of the data. That way it can be ACID compliant along with the rest of the data. This has the problem of doubling the storage space required as the image is in two places at once, but beyond that it’s just a caching approach.

felgall · January 4, 2015, 3:28am

So what book did you publish in order to contradict what it says in the book “SQL Antipatterns” where it lists dozens of things that can go wrong when the image is NOT stored in the database. (The book does also mention a small number of instances where it does make sense to keep the image outside of the database - but they generally only apply to very low volumes of images - say a few tens of thousands of images at most).

I agree it is a price game - the hundreds of thousands it would cost to replicate the database functionality for the images so that rolling back a transaction automatically restores all of the images back the way they were before the transaction started would definitely prevent a company from considering the idea of not storing them in the database with the rest of the data that belongs to the same transaction.

Anyway, you wouldn’t use mySQL or Oracle for a large database as they are too inefficient with larger databases…

In that case I think we agree. Obviously once the images need to be accessed frequently enough to cause performance issues then caching copies of those frequently accessed images as files outside of the database is the solution. Doing so would not impact on the integrity of the data in the database and so transaction processing that includes images would still work and as the external images are cached copies rather than originals you don’t have any issues regarding outdated or orphaned images (beyond that implied by caching of any data in the first place).

Anyway, caching of any data has no impact on how the authoritative copy of that data is stored in the first place and we appear to be in agreement that for a medium sized system such as the one you worked on with 8.5 million images that having the original of the image in the database resolves many of the problems that can occur when trying to administer that volume of data.

===========
Just to clarify why storing the images outside the database is an antipattern - the book “SQL Antipatterns” lists the following issues:

Assuming that images can’t be stored in the database because the person is simply unaware of the commands that make it easy to do so
Files don’t obey DELETE - your database transaction deleting data from various tables will not also delete any images in files
Files don’t obey transaction isolation - the files will update immediately without waiting for a COMMIT
Files don’t obey ROLLBACK - rolling back a delete transaction will not undelete the corresponding image files
Files don’t obey database backup tools - running a backup of the database will not back up the files
Files don’t obey access privileges - if the particular data is only accessible to some users that doesn’t prevent other users accessing image files.
Files are not an SQL data type - renaming the file will break the connection to the rest of the data

Legitimate uses for the antipattern:

the database is much leaner
backing up the database is much faster
ad-hoc image editing is easier

Some databases allow you to get the best of both - For example Oracle BFILE data type and SQL Server FILESTREAM data type both allow external files to be treated as if they were a part of the database (at least in part).

SpacePhoenix · January 4, 2015, 6:12am

That’s a blog post from a few years ago by someone who does (or at least did), work with millions of images

TheRedDevil · January 4, 2015, 6:26pm

Honestly? So to be able to have my own opinion on a subject, I actually need to published a book on the subject? Or I guess, its also enough to have read about it in some book as well like you keep referring to?.. I at a loss of words, so I guess Ill leave it at that.

Back to the subject.

You keep referring to the book, but at the same time it seems you have forgotten the authors example (There is other of us that own the book as well).

The chapter starts with a “database catastrophe”, ending up with restoring it from a backup. They then find out they lost the images as they was stored in the file system and no one thought about backing them up. The example also state that the total number of images we talk about is 11,000.

If we break this down, the first reason this happened was poor management of backups. Secondly, having 11,000 images in a database is peanuts, at this amount you would have no problems storing and serving them from the database (but that is beyond the point).

Lets update the example, and say they had 100 million images instead in the database. The main problem with the example in this case, is the time it would actually take to restore the database in this scenario, the site would be down a lot longer, than if the database and files was backed up separately.

If we look at the other points that is mentioned (those you have listed up), can just as easily be blamed on bad application architecture.

In difference from the way you approach this, the author is straight forward and does not claim storing the images in the database is the ultimate solution for the problem.

Storing the images in a NoSQL solution can be a good approach, but while a lot of them claim to have atomic/transaction capabilities, very few are ACID compatible. Which would void the reason for the antipattern in the first place.
Those that are ACID compatible like FoundationDB have other limitations like the max size of the value, which makes it unsuitable for storing images.

In the end, if you have a good architecture, the antipattern is void, due to you would not be affected by the issues it mention. As long as the architecture is based on this from the start, the extra cost for the client during development, will be pure profit after a few months, due to the monthly savings on hardware.

felgall · January 4, 2015, 11:04pm

I didn’t make that claim either - if you go back to my original post on this you will see that I started off saying It depends. In post 14 I said that there is no reason for Ryan (who started this thread) to store the images in the database as none of the reasons for doing so apply.

That chapter in the book unfortunately starts with one of the worst examples that the author used since that problem in this case wasn’t actually caused by the antipattern the chapter is dealing with but is as you suggest a management problem. As you say, 11000 images is peanuts - note that all of the quantities I have mentioned where I suggest storing the images in the database involve much larger numbers of images.

The reasons for storing images in the database which can’t be easily worked around are transaction processing and security neither of which can be as easily handled when the images are not included in the database itself.

As Micheal pointed out (and which I agreed with) - storing the originals of the images in the database and then providing cached copies resolves performance issues while still retaining most of the benefits that letting the database handle this data provides.

The other thing you haven’t mentioned is how many teams that your company has working on projects that involve images and databases and how many are on the database administration team. This particular antipattern only becomes a significant issue in situations where there are millions of images and dozens if not hundreds of programming staff involved - it does have benefits even with low volumes of images and a single programmer but the benefits of not placing the images in the database are normally great enough to not make it worth storing the images in the database for small systems with only a few thousand images.

Most of the people visiting this forum will not come across a situation where this particular antipattern is going to have any noticeable effect but to start out as you did claiming that the opposite is the antipattern is just plain wrong.

TheRedDevil · January 5, 2015, 1:47am

It starts to become ridiculous when you require that someone who disagree with your view, start mention projects and clients where the topic at hand has come into play. While your own word does not require any justification.

It is the kind of tactic normally applied, when you cannot address the points itself and instead try to attack the person behind them.

It makes me question if you have in fact ever worked in the scenarios you describe yourself, or if you have you most probably did not have a say on the architectural/hardware side of the question.

How many developers that are working on the project, should have nothing to say in regard to which solution is chosen. Only one of the teams will be working on the actual module that implement the handling of the image/documents. Everyone else will be using this module when a image/document need to be updated, and does not really need to know how it work, or even why it work.

This part is exactly the same no matter if it is decided that the project should store the images/documents in a database or on a file system. There should be one module handling this, which every other part of the project uses.

The benefit of storing the images/documents in a database (assuming it is actually ACID compliant if we talk about noSQL) is that it makes the application logic easier to implement, but at a much higher hardware cost. This cost does not scale well, but instead start going straight up after a while as storage requirements grow.

This is the single reason why I believe static content does not need to be stored in the database (cost), since the issues addressed can all be solved by proper architecture. It will not be ACID compliant, but you can easily get very close (including having the ability to easily rollback a deleted image days after it was deleted, without doing a full restore), and at a much lower hardware cost.

In the end, as we agreed earlier this is a “cost game”. When you store and serve massive amounts of data, just the smallest changes to your hardware setup, can net tens of thousands of dollars a month or even more. This means sometimes “as secure” is secure enough and worth the risk considering the savings.

Note. This is my last reply in this thread, as you obviously do not want to discuss this at a technical level.

animedreamz73 · January 5, 2015, 8:55am

Neither of your posts are valid and this thread is off topic. This is the php section not the sql section. So a large debate and argument on filesystem vs database was a waste of both of your time and didn’t provide any useful info to the topic, Both solutions have there validity and use. When I owned a website with millions of images and with thousands of active members a day the server and database load was less stressed by the images being in the filesystem vs the default of being stored in the database.

Amit_Yaron · January 5, 2015, 9:26am

It may be very convenient if your PHP server reads the image using

file_get_contents('php://input')

and if you don’t want to create file names for each image.
If you store the image in a DB, you can create an output image using the GD function 'imagecreatefromstring() '.

Michael_Morris · January 6, 2015, 8:09pm

But that’s the performance hit. You do realize that any use of the GD library requires placing an uncompressed copy of the image in memory yes? A 68K JPEG file can be over half a megabyte uncompressed. It adds up fast. Just the startup of the PHP engine is a performance hit compared to simply loading the image.

As far as database backup problems with images are concerned - on the occasions I’ve stored images in a database they’ve always been off on their own table even if that table had a 1 to 1 mapping with another table. The reason for this is that, in the event a recovery becomes necessary the other tables can be restored first, then the image table. If the system is setup in a fault tolerant manner the site can be partially up while the image table rebuilds.

Michael_Morris · January 6, 2015, 8:18pm

The vast majority of Fellgall’s objections to images in filesystems only are due to problems with transactional integrity, or ACID compliance. And to be blunt, the vast majority of database driven websites out there have no such compliance. Heck, I’m sure a large fraction still run MyISAM tables which have no transactional ability. Any site using NoSQL databases will have no ACID compliance (I have yet to see a transaction complete NoSQL solution). And the thing is, most websites don’t need such compliance.

Not to be mean but if 90% of the web’s databases crash out and are lost forever, who cares? (most of it was porn anyway). It’s not a big loss. ACID compliance comes into play with things like hospital records, company budgets, and other big ticket if it fails a few million dollars or more go up in smoke. On a database like that, if the images are part of the mission critical data (say, X-Ray photographs in patient records) then it is imperative they be subjected to the same standards as the rest of the data.

Most of the time this isn’t the case.

animedreamz73 · January 7, 2015, 12:38am

The thing is he’s not asking about all of that. It’s easier to store them in the filesystem and store a link to them then to store the image in the database. It’s good practice to back up both your database and filesystem to try and prevent losses of any data or functionality. I had members adding new images daily and I backed up my file system and database daily, with the database being down twice a day.