Cronjobs & Scalability

Hi guys,

I was thinking about the implementation of the service I’m trying to build (the process is the same as mint.com’s, but the service is not in any way like the one mint.com provides and is not related to personal finances).

  1. User logs in.
  2. His account connects to external resources, grabs data, and updates it locally.
  3. His stats etc. are calculated.

I figured I would update the stats (steps 2 & 3 above) when the user logs in and also nightly via a cron job.

This is fine when I have a low number of users; however, as the number of users grows, the cron job might fail.

I’m thinking about implementing it the following way, however, I would appreciate any insight/advice.

Each user has a “last_updated” field, and the cron job uses it to decide whose stats to update. This still does not quite solve the scalability issue.

Alternatively, I could just update when the user logs in and that’s it, which would solve the issue.

However, I just wanted to see how you guys here would go about tackling this issue.

Any insight/implementation advice would be very much appreciated.

Thanks
NG

It is difficult to say anything specific, as the advice will depend on your particular case and what you're doing with the data.

But here are a few tips.

  1. Have the cron job run throughout the day, processing X people each time.
  2. Have the cron job attach a server and instance id when it's about to process X users. This allows you to run the same cron job across multiple servers, and even multiple instances on the same server, without them redoing the work, as they all “pick the work, and claim it” before they start.
  3. As you identify the parts which require the most processing, see if you can't get those sections rewritten in C, i.e. a faster language.

The first two points should get you a long way and, depending on your member numbers and of course the processing being done, might be all you need. The benefit is that as your site expands and you need a load balancer etc., this implementation lets you run the crons from all of the web servers behind the load balancer without causing a race condition (NOTE: this does mean you need to verify that no other part of your code can cause a race condition).

Then, when you are so successful that this is not cutting it any more, you can afford to hire one or more good C programmers to rewrite parts of your system as C applications/modules.
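
As a rough illustration of the “pick the work, and claim it” idea from point 2, here is a minimal sketch in Python using SQLite from the standard library. The `users` table, the `claimed_by` column and the worker id format are all assumptions made up for the example; your schema will differ, and other databases (MySQL, for instance) need a different query shape for the claiming step.

```python
import sqlite3


def claim_batch(conn, worker_id, batch_size):
    """Mark up to batch_size unclaimed users as owned by worker_id.

    The claim is a single UPDATE, so two workers cannot pick the same row.
    worker_id is assumed to be unique per run (e.g. server + instance id).
    """
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute(
            """
            UPDATE users SET claimed_by = ?
            WHERE id IN (
                SELECT id FROM users
                WHERE claimed_by IS NULL
                ORDER BY last_updated   -- stalest users first
                LIMIT ?
            )
            """,
            (worker_id, batch_size),
        )
    rows = conn.execute(
        "SELECT id FROM users WHERE claimed_by = ? ORDER BY id", (worker_id,)
    ).fetchall()
    return [row[0] for row in rows]


def release(conn, user_id, now):
    """Record the refresh time and free the row for a later run."""
    with conn:
        conn.execute(
            "UPDATE users SET last_updated = ?, claimed_by = NULL WHERE id = ?",
            (now, user_id),
        )
```

Each cron run would call `claim_batch`, process only the ids it gets back, then call `release` for each one. A real version would also need some way to recover rows claimed by a worker that died halfway through, which this sketch leaves out.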

If the information is only visible to a logged-in user, then only update on demand.

However…

If the marketing strategy of the site is dependent upon some or all of the
information being fresh, public & indexable by search engines (for example,
a job search website), it is totally impractical to run individual cron jobs for
each location.

a. Kick-start the content retrieval script each night with cron and get
the script to loop back on itself, passing a counter variable as it does so, or have it
query the database for the oldest row. When all the dates are less than 24 hours old, it stops itself.
NB: some web hosts are a little unhappy about continuously running scripts
like this on shared servers; having it sleep for random periods helps lower
the % server usage stats.
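
A minimal sketch of the “oldest row” variant, written in Python with SQLite purely for illustration. The `locations` table, the `fetch` callback and the sleep range are assumptions invented for the example, not part of any real setup.

```python
import random
import sqlite3
import time

MAX_AGE_SECONDS = 24 * 60 * 60  # rows younger than this count as fresh


def refresh_stale_rows(conn, fetch, now=time.time, sleep=time.sleep):
    """Refresh the oldest row over and over until every row is under 24 hours old.

    fetch(row_id) is a hypothetical callback that does the actual retrieval.
    """
    refreshed = []
    while True:
        row = conn.execute(
            "SELECT id, last_updated FROM locations ORDER BY last_updated LIMIT 1"
        ).fetchone()
        # Stop when the table is empty or even the oldest row is fresh.
        if row is None or now() - row[1] < MAX_AGE_SECONDS:
            break
        row_id = row[0]
        fetch(row_id)
        with conn:  # commit the new timestamp before moving on
            conn.execute(
                "UPDATE locations SET last_updated = ? WHERE id = ?",
                (now(), row_id),
            )
        refreshed.append(row_id)
        # Random pause so the script does not hog a shared server.
        sleep(random.uniform(0.5, 3.0))
    return refreshed
```

Cron only has to start this once a night; the loop ends by itself once nothing is stale. The `now` and `sleep` parameters exist only so the function can be tried out without waiting.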

b. Bolt on the Google App Engine.
Python is not all that different from PHP, and the free usage quota should
more than cope with your needs for some time. It can handle the raw data
retrieval, process and extract the pieces you want, then send them to your
website's database.

Not to hijack the thread, but I am in basically the same boat. I currently have 200+ cron jobs and have realized that if I don’t change this soon, I will be running into trouble.

My site is more in the realm of what Sogo7 mentioned, it needs to be updated constantly; it needs to be fresh all of the time. Currently, PHP and Cron allow me to run multiple processes at the same time, but working with Cron becomes a manual task and not one that I want to rely on as my site grows.

Realizing that I need to move away from PHP and from relying on Cron to get maximum performance, and seeing that C and Python have been suggested, which option would be best suited to running simultaneous tasks? In other words, I basically need to build my own cron for the processes that need to run, and these processes will overlap at some point and need to be executed at the same time.

Any help on deciding which language to proceed with would be greatly appreciated. And, if there are any other languages out there that might be even better, please suggest.