Data Processing & Storage


I’m currently working on a project that generates a significant amount of raw data during its use, which I want to process and store the results of. Since the results will be used frequently throughout the application, simply generating them from the raw data whenever a user requests a page that makes use of them is unacceptable (especially since the project is designed to run on shared hosting). Additionally, some of the results need to be calculated on a daily basis and stored for historical reference (and for use in nice-looking graphs :D).

I was wondering what the best method would be for generating results from the raw data (cron? randomly on user page requests?) and the best method of caching/storing them.

Please remember this is designed to run on shared hosting by clients, so it can’t involve a non-PHP solution or require complex server-side setup.




The shared-hosting requirement makes this a tough one. Without it, a simple message queue like Beanstalkd would make the task easy.

With shared hosting, there might be some options…

  1. Which DB are you using? MySQL 5.1 and PostgreSQL both have a scheduler. As long as all the data crunching happens inside the DB, this is probably the simplest solution. You’ll have to learn some stored procedure syntax, but that’s pretty trivial.

Create a table called “queue”. Write a stored procedure, let’s call it consumer(), that takes jobs off the queue. Schedule it to run as often as you dare. You can place additional scheduling logic, e.g. detecting when the server is busy, inside consumer(), so keep the actual schedule simple. consumer() can just use a cursor/repeat loop to call another procedure that does the actual work.

Your web pages just place a new item in the queue table.

You’ll get a good, clean separation this way, and the procedures can simply be loaded when you load the schema.
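A minimal sketch of what this could look like in MySQL 5.1+ (the table, procedure, and event names are all illustrative, and process_job() stands in for whatever procedure does the real crunching):

```sql
-- Illustrative queue table: one row per pending job.
CREATE TABLE queue (
  id      INT AUTO_INCREMENT PRIMARY KEY,
  payload VARCHAR(255) NOT NULL,
  done    TINYINT NOT NULL DEFAULT 0
);

DELIMITER //
CREATE PROCEDURE consumer()
BEGIN
  DECLARE finished INT DEFAULT 0;
  DECLARE job_id INT;
  DECLARE cur CURSOR FOR SELECT id FROM queue WHERE done = 0;
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET finished = 1;

  OPEN cur;
  REPEAT
    FETCH cur INTO job_id;
    IF finished = 0 THEN
      CALL process_job(job_id);               -- hypothetical worker procedure
      UPDATE queue SET done = 1 WHERE id = job_id;
    END IF;
  UNTIL finished = 1 END REPEAT;
  CLOSE cur;
END //
DELIMITER ;

-- MySQL Event Scheduler (requires event_scheduler = ON): poll every 5 minutes.
CREATE EVENT run_consumer
  ON SCHEDULE EVERY 5 MINUTE
  DO CALL consumer();
```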

  2. If you have cron, or if not all of the data crunching is DB-related, you can do the same thing as above in PHP. Of course, the user will probably have to schedule the cron job themselves, making deployment harder. The queue table parts stay the same: same architecture, different polling mechanism.
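The PHP side of that might be sketched like this (table and column names are illustrative, process_job() is hypothetical, and the batch cap just keeps a single cron run short on shared hosting):

```php
<?php
// Pure helper: limit how many pending jobs one cron run will attempt.
function pick_batch(array $pendingIds, int $limit): array
{
    return array_slice($pendingIds, 0, $limit);
}

// The real script would poll the same queue table, e.g.:
// $pdo = new PDO('mysql:host=localhost;dbname=app', $user, $pass);
// $ids = $pdo->query('SELECT id FROM queue WHERE done = 0')
//            ->fetchAll(PDO::FETCH_COLUMN);
// foreach (pick_batch($ids, 50) as $id) {
//     process_job($pdo, $id); // hypothetical worker, then mark done = 1
// }
```

The crontab entry would be something like `*/5 * * * * php /path/to/consumer.php`.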

Without these two scheduling devices, things become harder…

  3. You could use the PHP garbage-collector trick: on each request there is a small percentage chance that processing will run afterwards. This may leave the web page apparently still loading and confuse the user, and you have to be able to handle the case where they navigate away.

This is trickier than it looks. Although there are PHP settings to keep the script running after the connection is closed, none of the page’s resources are released until the work is done. You could also end up with two copies of your consumer running at once.

Not only are there mechanical problems, but you may go to fetch the data and find it isn’t there yet. I guess you just have to pay the processing hit at that point. You can raise the percentage chance of processing as a job gets older.
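As a sketch, the trigger itself is just a lottery at the end of a page request (the 2% chance and the function names are illustrative, not from the original post):

```php
<?php
// Pure helper so the decision is easy to test: $roll is mt_rand(1, 100).
function should_process(int $percentChance, int $roll): bool
{
    return $roll <= $percentChance;
}

// At the end of a normal page request you might do:
// if (should_process(2, mt_rand(1, 100))) {
//     ignore_user_abort(true); // keep going if the user navigates away
//     set_time_limit(60);      // don't let one batch run forever
//     run_pending_jobs();      // hypothetical consumer
// }
```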

  4. Same as 3, but have your pages call file_get_contents($url) to hit a special job-runner page. The job-runner script closes the connection straight away (or returns an error string) but carries on processing. Again, you are relying on page hits to trigger processing on a regular basis. This is cleaner than 3, but you need to know the URL of the page to hit, and the sysadmin of the host may grow to hate you. It is better than leaving the page hanging, though.
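Both halves of that option might be sketched as follows (the URL and run_pending_jobs() are hypothetical; the short timeout keeps the calling page from blocking on the fetch):

```php
<?php
// Caller side (inside a normal page): fire the runner URL, don't wait long.
$ctx = stream_context_create(['http' => ['timeout' => 1]]);
// @file_get_contents('http://example.com/jobrunner.php', false, $ctx);

// Runner side (jobrunner.php): acknowledge immediately, then keep working.
// ignore_user_abort(true);        // survive the caller disconnecting
// header('Connection: close');    // let the caller's fetch return at once
// header('Content-Length: 0');
// flush();
// run_pending_jobs();             // now do the real work
```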

  5. When clients install your software, have it register automatically with an external ping service that you run. Your external server can then fetch the job-runner page for them.

  6. Can you do the crunching on a different server and just expose an API?

  7. You might be able to hijack a mail-queue package so that a deliberately failed e-mail triggers an error that you handle; the error handler then calls your consumer() function.

Apart from that I’m out of ideas :(.

yours, Marcus

A comment regarding option 3 above:

If you’re running PHP-FPM, you can tell the web server to send the response and continue processing.
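That is a one-liner under FPM; a sketch, with a guard so it is a no-op under other SAPIs (such as the CLI):

```php
<?php
// fastcgi_finish_request() flushes the whole response to the browser,
// then the script continues running in the background (PHP-FPM only).
$canDetach = function_exists('fastcgi_finish_request');
if ($canDetach) {
    fastcgi_finish_request(); // user gets the page immediately
}
// ... continue with the heavy data crunching here ...
```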

For graphing, maybe take a look at Protovis, if you can make the browser assumptions.

Many thanks for all the advice. Since all the data is stored in a MySQL database, I think the best option to look into is stored procedures and the Event Scheduler. I doubt it will be too complex, despite my never having used stored procedures before, since all the data crunching that needs doing can be expressed mathematically with some simple logic thrown in.

My guess is that two types of data storage will emerge in your solution.

One will be the items that are infrequently used, or not of critical importance (i.e. it’s OK for them to wait for a database hit), or both; these are most practical to store in a database.

Depending on your needs, you may be able to store just the results, or you may need the raw data too. In either case, testing your system with the amount of data you actually plan to have in it may be critical to ensuring your solution scales as the months and years go by.

If there is critical data that requires high availability, then you would look at a caching scenario. If there are certain items you know you will always need, it may be most sensible to simply populate them when the application starts (eager loading); otherwise (or for any items you can’t predict), a lazy-loading cache generally makes more sense (i.e. load into the cache on first use and serve it from the cache thereafter).
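A lazy-loading cache can be sketched in a few lines (the class name is illustrative; the backing store here is just an array, where on shared hosting you might use APC or a cache table instead):

```php
<?php
// Minimal lazy-loading cache: compute on first use, serve from cache after.
class LazyCache
{
    private array $store = [];

    public function get(string $key, callable $compute)
    {
        if (!array_key_exists($key, $this->store)) {
            $this->store[$key] = $compute(); // first use: compute and remember
        }
        return $this->store[$key];
    }
}
```

Usage is `$cache->get('daily_totals', fn () => crunchDailyTotals())`: the expensive callable runs only the first time a key is requested.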