Downloading Feed for Processing is a Bottleneck
Okay, so I've been profiling my recent project a bit, as the tests I wrote take a while to run (54 tests in 15-20 seconds). One of the tests takes 13-18 of those seconds.
I personally do not like tests taking longer than a second to run (in most cases that is doable), but I just can't seem to get around it with this one, and the biggest problem is that it will only get worse.
The test in question is given a URL; it downloads the contents of that URL and stores them locally (right now it uses file_get_contents). At this moment, I'm not certain whether the bottleneck is the network request or the actual processing (but I plan to run a few tests later to help narrow that down).
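To narrow that down, the fetch and the parse can be timed as two separate steps. This is only a minimal sketch: `timed()` and `profileFeed()` are hypothetical helper names, not part of the project, and it assumes the feed is XML that simplexml can load.

```php
<?php
/**
 * Hypothetical helper: run a step and return [result, elapsed milliseconds].
 */
function timed(callable $step): array
{
    $start = microtime(true);
    $result = $step();
    return [$result, (microtime(true) - $start) * 1000.0];
}

/**
 * Time the fetch and the parse separately.
 * $source may be a URL or a local path; file_get_contents() accepts both.
 */
function profileFeed(string $source): array
{
    [$body, $fetchMs] = timed(function () use ($source) {
        return file_get_contents($source);
    });
    if ($body === false) {
        throw new RuntimeException("Could not read feed from $source");
    }

    [$xml, $parseMs] = timed(function () use ($body) {
        return simplexml_load_string($body);
    });
    if ($xml === false) {
        throw new RuntimeException('Feed is not well-formed XML');
    }

    return [
        'fetch_ms' => $fetchMs,
        'parse_ms' => $parseMs,
        'bytes'    => strlen($body),
    ];
}
```

Calling `profileFeed()` once with the real feed URL and once with a saved local copy of the same feed should show whether the seconds are going to the network or to PHP.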
Here are the requirements:
- Must be run through a cronjob
- Feed is external to the website, so it isn't on the same network (must be this way as it is a third party system providing the data)
- Must download it and store it locally (takes 13+ seconds)
- Must process the feed into a set of MySQL tables and file system caches for use by the website (this literally takes 67 ms to process)
Here is what I'm using for the required steps:
- Feed is downloaded using file_get_contents
  - I want to experiment using wget and curl too. Moving the download to wget removes PHP from the equation for downloading
- Feed is processed using simplexml, which seems to be fine
- The download could cause max execution time to be reached, or exceed the memory limit (if the file is large enough)
- Currently I'm only receiving 5 records in the feed, and I expect once fully live it will be 50+ records.
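For the curl experiment, one option is to stream the response straight to disk with an explicit timeout, which addresses both the memory and the max execution time concerns. This is a rough sketch, assuming the PHP cURL extension is available; the function name `downloadFeed()`, the `.part` suffix and the timeout values are placeholders I made up, not anything from the project.

```php
<?php
/**
 * Stream a URL to disk with cURL so the body is never held in PHP memory.
 * Name, ".part" suffix and timeout defaults are illustrative assumptions.
 */
function downloadFeed(string $url, string $destination, int $timeoutSeconds = 30): void
{
    // Write to a temporary file first so a failed transfer
    // never overwrites the last good copy of the feed.
    $partial = $destination . '.part';
    $handle = fopen($partial, 'wb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $partial for writing");
    }

    $curl = curl_init($url);
    curl_setopt_array($curl, [
        CURLOPT_FILE           => $handle,         // write the body to the file handle
        CURLOPT_FOLLOWLOCATION => true,            // follow redirects
        CURLOPT_FAILONERROR    => true,            // treat HTTP status >= 400 as a failure
        CURLOPT_CONNECTTIMEOUT => 10,              // seconds allowed for connecting
        CURLOPT_TIMEOUT        => $timeoutSeconds, // seconds allowed for the whole transfer
    ]);

    $ok = curl_exec($curl);
    $error = curl_error($curl);
    unset($curl); // release the handle before closing the file
    fclose($handle);

    if ($ok === false) {
        unlink($partial);
        throw new RuntimeException("Download failed: $error");
    }
    if (!rename($partial, $destination)) {
        throw new RuntimeException("Could not move $partial to $destination");
    }
}
```

The processing step would then read the saved file instead of a string in memory. The timeout caps how long a slow third-party server can hold up the cron run, but it does not make the download itself any faster.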
Other notes of interest:
- The whole process is anonymous: it can take any feed and relay it into MySQL/file cache without any coding changes.
  - Though I don't recommend using it this way, as you lose key benefits, so there are ways to define a feed and process it with minimal coding
- All other functions of the system are under 200 ms, so this download is the odd man out.
So, I'd love to hear your feedback and experience with curl in a process similar to this, or whether you think ditching curl and running wget prior to the cronjob run is the best candidate (that is my feeling too). If there is another approach I should be looking at, I'm open to that too.
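If the download does move to a separate wget step, the PHP side only has to pick up a local file and decide whether it is fresh enough to process. Here is a minimal sketch of that idea; `loadLocalFeed()` and the one-hour default are assumptions for illustration, and it assumes the earlier job saves the feed as XML at a known path.

```php
<?php
/**
 * Load a feed file that an earlier job (for example wget run from cron)
 * saved to disk. Returns null when the file is missing or older than
 * $maxAgeSeconds, so the caller can skip the run and keep the existing
 * MySQL tables and file caches.
 * The function name and default age are illustrative assumptions.
 */
function loadLocalFeed(string $path, int $maxAgeSeconds = 3600): ?SimpleXMLElement
{
    // Drop any cached stat data so filemtime() reflects the latest download.
    clearstatcache(true, $path);

    if (!is_file($path)) {
        return null;
    }

    $age = time() - filemtime($path);
    if ($age > $maxAgeSeconds) {
        return null;
    }

    $xml = simplexml_load_file($path);
    if ($xml === false) {
        throw new RuntimeException("Feed at $path is not well-formed XML");
    }

    return $xml;
}
```

With this split, the test suite can run against a fixture file on disk and never touch the network, which should bring the slow test back in line with the others.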