File system cache best practices:

Hi everybody, I’m writing a file cache class for my custom app. I’ve seen some examples online where the filenames for the cached data are hashed with md5. This is of course a security measure in case someone gets hold of your cached data they don’t have an easy time making sense of it… My question is, if my cached data is outside of the public HTML directory, is there any benefit in me doing this?

Also I’ve seen how people debate between using serialize or json encode to store arrays and it seems that the preferred option is to json encode… does anyone have any arguments against this?

Many thanks for your input

filenames for the cached data are hashed with md5. This is of course a security measure

Of course it is not. MD5 is used just to map the cache contents to filenames, making sure that the same content always produces the same hash.
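
To illustrate the point about MD5 as a naming scheme rather than a security measure, here is a minimal sketch in Python (PHP's `md5()` plays the same role). The function name `cache_filename`, the `.cache` suffix and the sample key are assumptions made up for this example. It hashes a cache key, which is the usual variant of the idea, and shows that the result is a deterministic, fixed-length name with no characters the filesystem could object to.

```python
import hashlib

def cache_filename(key: str, suffix: str = ".cache") -> str:
    """Map an arbitrary cache key to a fixed-length, filesystem-safe name.

    The hash is used purely as a naming scheme, not as protection:
    the same key always yields the same filename.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return digest + suffix

# A hypothetical key containing characters that are awkward in filenames.
name = cache_filename("permissions/user:42?groups=all")
print(name)
```

The cached data itself is stored unchanged, so anyone who can read the file can read its contents. The hash only hides the key.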

Also I’ve seen how people debate between using serialize or json

It’s quite common for people to waste their precious lives on such a pointless debate over a topic of no importance.

I’m writing a file cache class for my custom app.

I am 100% sure you don’t really need it and it will do you more harm than good.

You mean it is not a good idea to cache database results in the file system?

I mean it’s a very bad idea.

A database has a query cache of its own, to start with.
After firing the same query for the second time you’ll get results immediately, from the query cache. So you are adding just a useless layer to the application.

Alright, how about this… I’m building a permissions system where, to gather the permissions for a specific user, I need to do 4 database operations. This is because access groups can inherit any number of access groups and users can have specific permissions. Do you still think I would not benefit from a filesystem or even a database cache, given how expensive these operations will be?

Yes of course.
You have to understand that a database is not a distant warehouse where each round trip takes several hours of logistics. It’s a very intelligent storage system, and most likely you will get your 4 results straight from memory without even touching the hard disk.

There is a strange phenomenon where laymen tend to super-optimize parts of their applications that will never ever become bottlenecks. Frankly, you are wasting your time solving an imaginary problem. Let me advise you to concentrate on the real problems.

Maybe try this method:

https://www.anetizer.com/eureka-for-making-a-web-page-load-faster

Thank you for the answers. What I’m currently building is a large CMS application that is going to juggle lots of data, so the idea of caching comes to mind in all its forms. However, @colshrapnel’s idea that there is no need to cache data from the database is very interesting. I wonder if that will hold true when there is too much going on. I have worked with CMSs in the past that seemed to go overboard with cache operations, even bloating the database with serialized data…
@John_Betong the link you posted looks like a full-page cache. With very large and dynamic applications it can be tedious and problematic to determine when to clear one, but it is definitely a good option.
At this point I was thinking of storing in the file system arrays of data that result from expensive database or programmatic operations, so that the app can always find shortcuts to its processes. I actually think this would be critical in a very large CMS app that is prone to grow out of control.
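
For readers wondering what such a file cache might look like, here is a rough sketch in Python. It is not code from this thread. The class name `FileCache`, the `remember` method, the 300-second default lifetime and the sample sitemap data are all assumptions chosen for illustration. It stores JSON-encodable results in files named after the MD5 of the cache key and treats a file older than the lifetime as a miss.

```python
import hashlib
import json
import os
import tempfile
import time

class FileCache:
    """A tiny file-backed cache for JSON-encodable values (illustrative only)."""

    def __init__(self, directory: str, ttl_seconds: float = 300.0):
        self.directory = directory
        self.ttl_seconds = ttl_seconds
        os.makedirs(directory, exist_ok=True)

    def _path(self, key: str) -> str:
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return os.path.join(self.directory, digest + ".json")

    def get(self, key: str):
        """Return the cached value, or None if it is missing, stale or unreadable."""
        path = self._path(key)
        try:
            if time.time() - os.path.getmtime(path) > self.ttl_seconds:
                return None
            with open(path, "r", encoding="utf-8") as handle:
                return json.load(handle)
        except (OSError, ValueError):
            return None

    def set(self, key: str, value) -> None:
        """Write to a temporary file first, then rename it into place."""
        fd, tmp_path = tempfile.mkstemp(dir=self.directory)
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(value, handle)
        os.replace(tmp_path, self._path(key))

    def remember(self, key: str, compute):
        """Return the cached value, computing and storing it on a miss.

        Note: a cached None is indistinguishable from a miss in this sketch.
        """
        cached = self.get(key)
        if cached is not None:
            return cached
        value = compute()
        self.set(key, value)
        return value

# Usage: the expensive function runs once, the second call is served from disk.
calls = []

def expensive_sitemap():
    calls.append(1)
    return ["/", "/news", "/news/archive"]

cache = FileCache(tempfile.mkdtemp())
first = cache.remember("sitemap", expensive_sitemap)
second = cache.remember("sitemap", expensive_sitemap)
```

The hard part, as the replies in this thread point out, is not the storage but deciding when a cached entry must be invalidated. A fixed lifetime is the simplest possible answer and is often not good enough.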

As you are in the process of developing, I think it would be wise of you to spend some time thinking about your database design. Databases have some “shortcuts” of their own.

For example, if I have a table “users” with the fields “id”, “name”, “gender” and “age”, with “id” as the only index, a query like
SELECT * FROM users WHERE name LIKE 'realDonaldTrump'
would cost more than it would if there were an index on the name field.

Not that it is a good idea to make every field an index (it is a poor idea) but for a field that is going to be used in queries a lot, sure.

Don’t hesitate to use EXPLAIN queries to help you better design your database and write better queries.
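
The effect of an index can be checked rather than guessed. The sketch below uses Python's built-in `sqlite3` module as a stand-in database, which is an assumption: the thread is most likely about MySQL, where the command is `EXPLAIN` and the output looks different. The index name `idx_users_name` and the generated rows are invented for the example, and the query uses `=` instead of `LIKE`, because whether `LIKE` can use an index depends on the engine and collation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, gender TEXT, age INTEGER)"
)
conn.executemany(
    "INSERT INTO users (name, gender, age) VALUES (?, ?, ?)",
    [("user%d" % i, "x", 20 + i % 50) for i in range(1000)],
)

def plan(sql, params=()):
    """Return the human-readable part of SQLite's query plan for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column holds the detail text

query = "SELECT * FROM users WHERE name = ?"

before = plan(query, ("user500",))   # no index on name: a full table scan
conn.execute("CREATE INDEX idx_users_name ON users (name)")
after = plan(query, ("user500",))    # with the index: a search using idx_users_name

print("before:", before)
print("after: ", after)
```

The exact wording of the plan varies between SQLite versions, but the first plan reports a scan of `users` and the second reports a search that names the new index.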

Surely, a cache is a good thing.
But you are doing it upside down.

I remember an interview with a YouTube guy who said something like “the development was quite easy - spot the performance problem, find the bottleneck, fix it, spot another performance problem… and so on”.

So you can tell that you have neither a problem spot nor a bottleneck, but are already trying to fix it. Sounds a bit illogical, eh?

I would be tempted to complete the custom app first and then test for any bottlenecks. There are many free applications which test web-page load time.

Maybe read this article…

http://wiki.c2.com/?PrematureOptimization

So best practice is no practice, I’m hearing you guys, thanks for the advice and nice article.

While I completely agree that you should never optimize for something before it is actually a problem, I feel that claiming that a database will never become the bottleneck is a bit far-fetched.

I have worked on several systems over the years where this has been the case, and in these situations you have two options, either you add more hardware, or you cache.

@Andres_Vaquero instead of focusing on a solution to cache query results right now, if you believe your software will receive a lot of traffic you would benefit more from updating your database layer to be able to handle a master / slave cluster, i.e. that it is able to decide where to send the query.

With regard to a query cache, it is sometimes used, but in that case it is for data that costs a lot to compile together. It is normally never used for simple queries that just pull information from a table or two. Normally, before a query cache is considered, a cache solution for the entire content or part of the content is implemented.

a database will never become the bottleneck

Actually nowhere did I say anything like that.
I was given a practical example of a permissions system, which could scarcely be a problem, given that it could be properly designed to get you the required data by means of a primary key-based lookup. So I stuck with this example.
With a little tweaking, all the data required could be stored in the index, and thus served straight from the memory, working effectively as though it is cached, but without the problem of cache invalidation.
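
The idea of serving a lookup entirely from an index can be sketched as follows. This uses Python's `sqlite3` as a stand-in, and the table `user_permissions`, the index `idx_user_perm` and the sample rows are hypothetical, not taken from the poster's design. The point shown is that when an index contains every column the query needs, the engine can answer from the index alone and never touches the table rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_permissions ("
    " id INTEGER PRIMARY KEY,"
    " user_id INTEGER NOT NULL,"
    " permission TEXT NOT NULL)"
)
# The index holds both the lookup column and the column being selected.
conn.execute("CREATE INDEX idx_user_perm ON user_permissions (user_id, permission)")
conn.executemany(
    "INSERT INTO user_permissions (user_id, permission) VALUES (?, ?)",
    [(1, "edit"), (1, "view"), (2, "view")],
)

sql = "SELECT permission FROM user_permissions WHERE user_id = ?"

# Ask SQLite how it would run the query; the last column is the description.
details = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, (1,))]
permissions = sorted(row[0] for row in conn.execute(sql, (1,)))

print(details)      # the plan mentions a COVERING INDEX
print(permissions)
```

Other engines have their own wording for the same thing. In MySQL, for instance, `EXPLAIN` reports "Using index" in the Extra column for such a query.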

But originally I never mentioned a database. What if you need to create a big memory footprint to grab some information that doesn’t change regularly? Anyway, I’m sold on the idea of not prematurely optimising… Cheers

There is some small speed difference, which differs depending on data volume and other factors, you would need to run your own tests. Apart from that, json appears to be more human readable than serialize() - if it matters to you at all. There is also a third option: var_export() to a plain php file - this may have the benefit of php using its opcode cache when reading the contents.
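
One argument against JSON that goes beyond speed is that it does not round-trip every data structure. The sketch below shows this with Python's `json` and `pickle` modules as rough analogues of PHP's `json_encode()` and `serialize()`. The analogy is an assumption and the details differ between the two languages, but the trade-off is similar: JSON is readable text with a small set of types, while a native serializer preserves more at the cost of readability.

```python
import json
import pickle

# Integer keys and a tuple: both are outside what JSON can represent directly.
data = {1: "admin", 2: "editor", "roles": ("read", "write")}

as_json = json.dumps(data)          # readable text
from_json = json.loads(as_json)     # int keys become strings, tuple becomes list

as_pickle = pickle.dumps(data)      # opaque bytes
from_pickle = pickle.loads(as_pickle)  # original types preserved

print(as_json)
print(from_json == data)    # False: the structure changed on the way through JSON
print(from_pickle == data)  # True
```

Native serializers should only ever be fed data you wrote yourself. Both Python's `pickle.loads` and PHP's `unserialize()` can be abused when given untrusted input, which is one more reason to keep cache files out of reach of anything user-controlled.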

It depends on what database we are talking about, since each has its own implementation and some don’t have a query cache at all. MySQL’s query cache is certainly used for simple queries and even more often than for complex ones - if many tables are referenced in a query then there is more likelihood that data from one of them may become invalidated, which clears the cache for the whole query. Therefore, the simplest queries selecting from one table have the highest chance of using the cache.

Found a nice article on optimizing PHP, which has a slightly different view to the one that @John_Betong posted, suggesting that you also shouldn’t completely dismiss premature optimization. Thought I’d share it with you: http://phplens.com/lens/php-book/optimizing-debugging-php.php

When to Start Optimizing?

Some people say that it is better to defer tuning until after the coding is complete. This advice only makes sense if your programming team’s coding is of a high quality to begin with, and you already have a good feel of the performance parameters of your application. Otherwise you are exposing yourselves to the risk of having to rewrite substantial portions of your code after testing.

My advice is that before you design a software application, you should do some basic benchmarks on the hardware and software to get a feel for the maximum performance you might be able to achieve. Then as you design and code the application, keep the desired performance parameters in mind, because at every step of the way there will be tradeoffs between performance, availability, security and flexibility.

Also choose good test data. If your database is expected to hold 100,000 records, avoid testing with only a 100 record database – you will regret it. This once happened to one of the programmers in my company; we did not detect the slow code until much later, causing a lot of wasted time as we had to rewrite a lot of code that worked but did not scale.

Also I think I should add as an example the purely programmatic case where I thought file caching could apply in my application at this stage. Each page is represented as a node within a tree that registers one or more URLs and subnodes. When you hit a URL, the node registered for that URL gets instantiated. That node may be a heavy object containing lots of data and other objects. I’ve actually included a modifier in the constructor that determines whether you get a ‘light’ instance or a ‘heavy’ instance. That node determines which sub-URLs can be accessed. In order to make a sitemap or tree navigation I create an instance of every single node in the application and gather the accessible URLs. Once the application grows and has hundreds of nodes it will take a lot of memory to create an instance of every single node in the application. So it makes sense that storing a simple tree of URLs in the file cache is going to be a great relief overall, especially as the application starts to grow…

My approach is to avoid premature optimization but to think about optimization from the start. What I mean by this is that while coding I first think about good and readable code structure but I also try to imagine how the data is expected to grow and to adapt the code to handle the volume. Also, when I can apply some optimizations without much effort and disrupting the code structure I will do that immediately. Usually, I will not dive deep into micro-optimizations but think about such things:

  • if I am pulling data from a db table and the table is expected to grow in the future I will optimize the code immediately, for example by implementing pagination or selecting only a limited number of rows. But at this stage I will not test the code against millions of records, where standard pagination techniques using LIMIT … OFFSET become slow.

  • if possible I’ll try to limit the number of db queries, for example by avoiding 1+n selects for lists.

  • if I see that my code does a lot of heavy stuff in frequent ‘spots’, for example on pages that are supposed to be visited frequently then I’ll try to optimize right away but only when I know the site or system is going to receive a lot of traffic.

This way a potential optimization need in the future is usually a simple task of adding db keys or tweaking some queries and it won’t involve rewriting large parts of the code.
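
The "1+n selects" item in the list above is easy to show in a few lines. The following sketch uses Python's `sqlite3` with made-up `authors` and `articles` tables (both hypothetical) and a small `run` helper that records every statement it sends, so the two approaches can be compared by query count rather than by timing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE articles (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
INSERT INTO authors VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO articles VALUES (1, 1, 'Caching'), (2, 2, 'Indexes'), (3, 1, 'Pagination');
""")

queries = []  # every statement sent through run() is recorded here

def run(sql, params=()):
    queries.append(sql)
    return conn.execute(sql, params).fetchall()

# 1+n pattern: one query for the list, then one more query per row.
queries.clear()
slow = []
for _article_id, author_id, title in run(
    "SELECT id, author_id, title FROM articles ORDER BY id"
):
    author = run("SELECT name FROM authors WHERE id = ?", (author_id,))[0][0]
    slow.append((title, author))
slow_queries = len(queries)

# Single query: let the database join the two tables.
queries.clear()
fast = run(
    "SELECT a.title, u.name FROM articles a "
    "JOIN authors u ON u.id = a.author_id ORDER BY a.id"
)
fast_queries = len(queries)

print(slow_queries, fast_queries)
```

Both approaches return the same rows. The first needs one query plus one per article, so its cost grows with the list, while the join stays at a single query.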

I just don’t think that your case of checking permissions by using 4 db queries might be a bottleneck worth implementing a cache system for. First, I would do my best to speed up the operations without any cache and only create caching when really necessary. For example, you might consolidate those queries into one view or stored procedure to avoid too many round trips, or send all 4 queries in one go - test which one performs best (e.g. uses the db query cache most efficiently) and use that.
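
As one possible way to consolidate the permission lookups, the sketch below gathers inherited group permissions and user-specific permissions in a single statement using a recursive common table expression. This is a swapped-in technique, not the view or stored procedure suggested above, and the whole schema (`user_groups`, `group_parents`, `group_permissions`, `user_permissions`) is a guess at what such a system might look like. It runs on Python's `sqlite3`; other databases support recursive queries with similar syntax, but check your version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_groups (user_id INTEGER, group_id INTEGER);
CREATE TABLE group_parents (group_id INTEGER, parent_id INTEGER);
CREATE TABLE group_permissions (group_id INTEGER, permission TEXT);
CREATE TABLE user_permissions (user_id INTEGER, permission TEXT);

-- User 1 is in group 10, which inherits from 20, which inherits from 30.
INSERT INTO user_groups VALUES (1, 10);
INSERT INTO group_parents VALUES (10, 20), (20, 30);
INSERT INTO group_permissions VALUES
    (10, 'edit'), (20, 'comment'), (30, 'view'), (40, 'delete');
INSERT INTO user_permissions VALUES (1, 'publish');
""")

ALL_PERMISSIONS = """
WITH RECURSIVE all_groups(group_id) AS (
    -- groups the user belongs to directly
    SELECT group_id FROM user_groups WHERE user_id = :user
    UNION
    -- plus every group those groups inherit from, at any depth
    SELECT gp.parent_id
    FROM group_parents gp
    JOIN all_groups ag ON gp.group_id = ag.group_id
)
SELECT permission FROM group_permissions
WHERE group_id IN (SELECT group_id FROM all_groups)
UNION
SELECT permission FROM user_permissions WHERE user_id = :user
ORDER BY permission
"""

permissions = [row[0] for row in conn.execute(ALL_PERMISSIONS, {"user": 1})]
print(permissions)
```

Using `UNION` rather than `UNION ALL` inside the recursive part drops duplicate groups, which also stops the recursion if the inheritance graph accidentally contains a cycle. Group 40 is never reached, so its `delete` permission is not returned.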

If you think this will become computing intensive then it might be reasonable to think about a cache. But first I’d try to optimize the application design if possible - for example a tree structure might be represented in a single tree/category table in the db and then fetching the whole tree is just one fast select query, in which case there is no more need for a cache. Your application might not allow for this but this is just an example.
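
The single tree table mentioned above can be sketched like this. The `nodes` table with its `parent_id` column (an adjacency list) and the sample URLs are assumptions for illustration, and Python's `sqlite3` stands in for the real database. One select fetches every row, and the tree is then assembled in memory without instantiating any heavy page objects.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nodes (id INTEGER PRIMARY KEY, parent_id INTEGER, url TEXT NOT NULL);
INSERT INTO nodes VALUES
    (1, NULL, '/'),
    (2, 1, '/news'),
    (3, 2, '/news/archive'),
    (4, 1, '/about');
""")

# One query for the whole tree.
rows = conn.execute("SELECT id, parent_id, url FROM nodes ORDER BY id").fetchall()

def build_tree(rows):
    """Link each row to its parent; assumes every parent_id refers to an existing row."""
    by_id = {node_id: {"url": url, "children": []} for node_id, _, url in rows}
    roots = []
    for node_id, parent_id, _ in rows:
        if parent_id is None:
            roots.append(by_id[node_id])
        else:
            by_id[parent_id]["children"].append(by_id[node_id])
    return roots

def flatten(nodes, depth=0):
    """Walk the tree depth-first, yielding (depth, url) pairs for a sitemap."""
    for node in nodes:
        yield depth, node["url"]
        yield from flatten(node["children"], depth + 1)

sitemap = list(flatten(build_tree(rows)))
for depth, url in sitemap:
    print("  " * depth + url)
```

Whether this replaces a cache depends on how the application defines its nodes. If some nodes exist only in code, they would have to be registered in the table first.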

This is a great answer, thanks a lot. It gives me ideas like using the database to consolidate programmatic nodes with dynamically generated nodes. Thank you!