File System VS DB

Hello all,

It’s been a long time since I have posted here, but couldn’t think of a better venue to post this.
Compared to file systems Databases are known to be significantly more effective in managing data, but I am trying to find examples where this is not the case.

I came across a post that addressed by conducting a series of very interesting tests.
Although as the writer states, there wasn’t a lot of science behind the environmental conditions of these tests, I still think they are valuable and are certainly worth considering.

http://revjim.net/2003/04/23/speed-concerns-database-vs-filesystem/

Challenge: Let’s say you would have a page cached through a content delivery Network. All of the static contents of the page are being cached, images, JS scripts, CSS assets, and so on.

On the same page you have an area that is populated by a Json call that requires a call to an external content provider. This service pulls data from the provider, stores it in a local database, and serves it to the frontend once Json calls it.

The obvious advantage of the Content Delivery Network is that it’s serving static localized content, distributed across a global network, providing faster local access. The whole service aspect of this is creating significant latency, and for various reasons, this piece of data cannot be cached by the content delivery network.

Idea: Since apparently getting one single piece of data is lose to 10 times faster thought the file system rather than through the database, would it not be feasible to store the set of values in independent files in the file system (ex. shared between multiple SSD drives), that would hold the ID of the parent value in the name of the file, map/call them through the Query string, Session or Cookie? This would mean that instead of making a call to the service, the page would only include a value from a static file, that could even be cached.

There are a couple of ways we could generated these files. 1. They could be generated “per user request”. Each time the page is requested by one user, it would generate the file and store it in the SD drives. A second user requesting the same page would not experience the same level of latency, as the file would have already been generated. In addition to this, and to take this a step further, the Content Delivery Network could also cache this value. 2. Have a Job generate files for all dynamic pages and store them in SSD drives.

Question: Is it correct to assume that this would be faster than using LOCAL Memcache?

Your inputs are more than welcome!

[FONT=verdana]Hi Pedro,

I won’t try to give you a detaile answer. To be honest, I’m not an expert in all the technical aspects of this. There are other people here who can give a better insight than I can.

That said, I do have an application that uses the file system to cache data, in preference to running a database query. This appears to work well, although I haven’t done any scientific tests to determine if it is the best solution.

In summary, the application uses a large database table that contains fairly static data. It’s typically updated once or twice per day at most, and the updates are rarely time-critical (that is, it doesn’t matter if a visitor sees data that is slightly out of date). Every request for a particular web page on a given day always runs the same query. It searches the table for records with a date field which is greater than or equal to the current date. The query is not particularly fast, plus I need to do some extensive formatting on the query results before showing them to the visitor.

So, rather than running the query on every page load, I run it once every six hours. I then apply the formatting to the query results, and write the formatted data out to a text file. Then, when a new page request arrives, I simply retrieve that text file and merge it with the rest of the requested page.

The result is that I only need to run the time-consuming database query four times per day, rather than on every page request.

In the exceptional cases where an update to the table is time-critical, the administrator can manually delete the text file. This forces the application to create a new one, with the most up-to-date data.

I don’t know if this is of any help to you, but I hope it will at least give you some ideas.

Mike[/FONT]

Really appreciate your input Mikl. Your approach makes total sense to me, even more so because the data you are handling is not time-critical. I guess my question is around taking the same approach to time sensitive data, prune to constant change. But on top of that question, is there is a fundamental assumption that it is more efficient, within the same local environment, to load the content of one single file as opposed to make a call to a DB and fetch the same content in a significant database.

[FONT=verdana]I don’t know.

My own scenario was based on the assumption that it is more efficient to load the contents of a single file rather than making many calls to the database (once call per visitor). But I don’t know if even that assumption valid.

I’ll look forward to other people’s input.

Mike[/FONT]

For caching purposes I would refer to something like memcached or APC if you have access to it, simply because it stores everything in RAM, which is always faster than hard disk access – even if you have SSDs.

^^^^ ding ding ding, don’t reinvent the wheel. Memcache is pretty everloving awesome when you need it.

Also, remember that there are two hard problems in computer science – cache invalidation and off by one errors.

Thanks ScallioXTX and www_b99. I do know that for most operations Memcache is significantly more efficient. I am interested however in cases where its not, and if anyone has tested this.

Below is an interesting example from the article I linked too;

[LEFT]Getting the filenames of the 10 most recent entries from the entire system.[/LEFT][INDENT]FILESYSTEM

[INDENT]
TIME: 1.7814919948578TIME: 1.7425200939178TIME: 1.8071219921112TIME: 1.6778069734573TIME: 1.6711789369583TIME: 1.7414019107819TIME: 1.6959699392319TIME: 1.6531630754471TIME: 1.7546479701996TIME: 1.6758890151978TOT TIME: 17.201191902161AVG TIME: 1.5637447183782[/INDENT]DATABASE

[INDENT]
TIME: 0.0039100646972656TIME: 0.001039981842041TIME: 0.00095093250274658TIME: 0.00096702575683594TIME: 0.00095295906066895TIME: 0.00098395347595215TIME: 0.0009620189666748TIME: 0.0009760856628418TIME: 0.00094294548034668TIME: 0.00095808506011963TOT TIME: 0.012644052505493AVG TIME: 0.0011494593186812[/INDENT][/INDENT][LEFT]Getting the filenames of the 10 most recent files in a single directory.[/LEFT][INDENT]FILESYSTEM

[INDENT]
TIME: 0.055459976196289TIME: 0.053847074508667TIME: 0.044721961021423TIME: 0.043873071670532TIME: 0.043742060661316TIME: 0.043787956237793TIME: 0.043717980384827TIME: 0.04374098777771TIME: 0.043833017349243TIME: 0.04370105266571TOT TIME: 0.46042513847351AVG TIME: 0.041856830770319[/INDENT]DATABASE

[INDENT]
TIME: 0.0095839500427246TIME: 0.0055500268936157TIME: 0.005547046661377TIME: 0.0055389404296875TIME: 0.0056079626083374TIME: 0.00553297996521TIME: 0.005499005317688TIME: 0.0055099725723267TIME: 0.0053470134735107TIME: 0.0053049325942993TOT TIME: 0.059021830558777AVG TIME: 0.0053656209598888[/INDENT][/INDENT][LEFT]Getting one item.[/LEFT][INDENT]FILESYSTEM

[INDENT]
TIME: 0.00032293796539307TIME: 0.00021898746490479TIME: 0.00017297267913818TIME: 0.00016999244689941TIME: 0.00027298927307129TIME: 0.00017201900482178TIME: 0.00016689300537109TIME: 0.00016403198242188TIME: 0.0001760721206665TIME: 0.00017201900482178TOT TIME: 0.0020089149475098AVG TIME: 0.0001826286315918[/INDENT]DATABASE

[INDENT]
TIME: 0.0042519569396973TIME: 0.0011199712753296TIME: 0.0010420083999634TIME: 0.0010360479354858TIME: 0.0010439157485962TIME: 0.0010349750518799TIME: 0.001041054725647TIME: 0.0010310411453247TIME: 0.0010330677032471TIME: 0.0064520835876465TOT TIME: 0.019086122512817AVG TIME: 0.0017351020466198[/INDENT][/INDENT][LEFT]The database was 1360 times faster than the filesystem when looking for the 10 most recent items in the entire system. The database was 7.8 times faster when looking for the 10 most recent items in a single directory. However, the filesystem was 9.5 times faster at getting a single file.[/LEFT]

Stats are great. Now show me a time when being faster at getting the 10 most recent items from the file system is going to matter in a real-world, http-based application.