Running multiple cURL processes

Hi,

I’m currently running a PHP application on a linux VPS (MODS - I’ve posted here as I don’t think it’s specific to PHP, but please feel free to move it accordingly)

I’m currently retrieving domain information through an API, and enriching this with data from various web sources, some of which I have to access via cURL due to the lack of an API.

This didn’t pose a problem with 10,000 records, but as the number increases I can see a definite bottleneck about to happen.

On the most basic level, the process is

  • retrieve domain information then for each domain
  1. run internal processing (count number of characters etc)
  2. cURL information on PageRank
  3. cURL WHOIS data
  4. Repeat 1 - 3 for next domain

but naturally, with each cURL request taking up to ten seconds, this is a very slow process when there are lots of domains to check.

Would it be better design practice to run the cURL jobs for both PR and WHOIS as a separate script to take advantage of ‘multi-threading’ (so for example one cron job to retrieve the domain, one for the PR and one for the WHOIS) or would this make little overall difference?

Thanks

My main suggestion, as general advice, would be to decouple the steps wherever one doesn’t actually depend on another.

Does retrieving the WHOIS information need to come after getting the PageRank info? Does the internal processing? Not at all. Once the steps are independent, you can do several different things at the same time, or even several instances of the same thing at the same time; for example, if collecting WHOIS information is particularly slow, you could fire off multiple worker scripts for that job while a single worker handles the quicker tasks.
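
As a rough illustration of the idea (not your actual code), here is a minimal Python sketch that runs the two lookups for every domain concurrently instead of one after the other. The functions `fetch_pagerank`, `fetch_whois` and `internal_processing` are hypothetical stand-ins; the two fetchers just sleep to simulate a slow request and return placeholder values. In PHP, the `curl_multi_*` functions give you a similar effect for the HTTP requests.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_pagerank(domain):
    # Hypothetical stand-in for the slow PageRank request (simulated delay).
    time.sleep(0.2)
    return {"pagerank": "placeholder"}


def fetch_whois(domain):
    # Hypothetical stand-in for the slow WHOIS request (simulated delay).
    time.sleep(0.2)
    return {"whois": "placeholder"}


def internal_processing(domain):
    # Cheap local work, e.g. counting characters.
    return {"domain": domain, "length": len(domain)}


def enrich_all(domains, max_workers=8):
    """Run both lookups for every domain in a thread pool and merge the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every slow lookup up front; none of them waits on another.
        pagerank_jobs = {d: pool.submit(fetch_pagerank, d) for d in domains}
        whois_jobs = {d: pool.submit(fetch_whois, d) for d in domains}

        records = []
        for d in domains:
            record = internal_processing(d)
            record.update(pagerank_jobs[d].result())
            record.update(whois_jobs[d].result())
            records.append(record)
    return records
```

With three domains, the six simulated requests overlap instead of queuing up behind each other, so total time is close to one request rather than six. The pool size is an assumption; in practice it would be limited by what the remote services tolerate.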

That moves you in the direction of message queues, background job management and so on, which might also be an area worth looking into, especially if you are currently just firing off a script at intervals with cron.
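
To give a feel for the queue-and-worker pattern, here is a small in-process sketch in Python. It is only an approximation: a real setup would normally put a separate queue server between the producer and the workers, and the names `run_stage`, `worker` and the handler you pass in are all hypothetical.

```python
import queue
import threading

STOP = object()  # sentinel telling a worker to shut down


def worker(jobs, results, handler):
    # Pull jobs until the sentinel arrives; each result is paired with its job.
    while True:
        job = jobs.get()
        if job is STOP:
            break
        results.put((job, handler(job)))


def run_stage(domains, handler, num_workers):
    """Process every domain with `handler`, using `num_workers` parallel workers."""
    jobs = queue.Queue()
    results = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(jobs, results, handler))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for d in domains:
        jobs.put(d)
    for _ in threads:
        jobs.put(STOP)  # one sentinel per worker
    for t in threads:
        t.join()

    # All workers have finished, so the results queue is complete.
    collected = {}
    while not results.empty():
        domain, value = results.get()
        collected[domain] = value
    return collected
```

The point of the design is that each stage gets its own worker count: a slow stage such as WHOIS could run with, say, five workers while a quick one runs with a single worker, and neither blocks the other.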

Thanks Salathe,

This is an area that’s very new to me, so I appreciate the help.

I’ve googled background job management and message queues, but is there anywhere you would recommend as a starting point for learning more about this on a Linux / Apache server specifically?