Scraping

Hi,

Could you please tell me how I can get the scraped results of a few websites onto one page? For example, when I search 20 car classifieds websites, I want the results on one page, sorted by price or date. So the ads must be mixed together but sorted by price. I already have permission to scrape!
Could someone give me some advice please?

Rudy

First, start with cURL to actually grab the page data into a variable. Then, with functions like preg_replace, preg_match, and so forth, mine the data you want from the variable. Then, I would suggest formatting that data as cleanly as you can and saving it into a database. All of that happens from a script that's run by a cron job every so often.
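A rough sketch of that first step, assuming a hypothetical listing page where each price sits in a <span class="price"> element (the URL and the regex are placeholders you'd adapt to the real site's markup):

<?php
// Grab the page data into a variable with cURL (placeholder URL).
$ch = curl_init('http://example-classifieds.com/search?q=car');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$html = curl_exec($ch);
curl_close($ch);

// Mine the bits you want with preg_match_all (the pattern depends entirely on the site's markup).
preg_match_all('/<span class="price">\s*\$?([\d.,]+)\s*<\/span>/i', $html, $matches);
$prices = $matches[1];

// From here, clean the values up and INSERT them into your database from the cron script.
print_r($prices);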

Then, from your regular website, connect to MySQL, grab the data, and display it on the page. You can sort it however you like with MySQL.
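For the display side, something like this is a minimal sketch, assuming a table called ads with site, title, price and posted_at columns (all of those names are made up for the example):

<?php
// Connect to MySQL and pull every ad, mixed across sites but sorted by price.
$pdo = new PDO('mysql:host=localhost;dbname=scraper;charset=utf8', 'user', 'password');
$stmt = $pdo->query('SELECT site, title, price, posted_at FROM ads ORDER BY price ASC');

foreach ($stmt as $row) {
    echo htmlspecialchars($row['title']) . ' (' . $row['site'] . '): ' . $row['price'] . "<br>\n";
}

Swap ORDER BY price ASC for ORDER BY posted_at DESC if you want the newest ads first instead.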

What you’re asking is not a cut-and-dried question. There are lots of steps involved, and they are highly dependent on what data you’re grabbing and from where.

I know there are several other threads on SitePoint that deal with each of the specific issues I mentioned above - you might try searching them out.

Have you looked for any programs that will do this for you, or at least get you started in the right direction? I remember hearing about one a few months ago but just can’t think of the name right now. If I do remember, I will post it here.

Thank you for your reply. Would be cool if you could send me that info.

Thank you for the info. I will start as soon as possible. Thank you again. I will let you know about my progress.

Rudy

You could maybe edit this script to work for you:
http://r00tsecurity.org/forums/index.php/topic/9884-little-link-scraper/

First of all, check whether those websites offer some kind of RSS feed for the searches or, even better, an API.
If they do, then it’s much easier to work with the data in a predictable XML or JSON format.
If they don’t, then you can still scrape the HTML page. The way I would do it is to treat the HTML as a DOM tree using PHP’s DOMDocument class. The problem with this approach is that you must be sure the HTML is valid and the encoding is reliable. It does not have to be UTF-8, but you must know the page encoding and it must be correct. Some pages will declare their encoding as UTF-8 when in reality it is something else, and that is the cause of many problems when parsing the document with the DOMDocument class.
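If a site does offer a feed or an API, consuming it only takes a few lines. This sketch assumes a made-up JSON endpoint and RSS feed URL, purely to illustrate the two formats:

<?php
// JSON API: decode straight into an array (URL is a placeholder).
$json = file_get_contents('http://example-classifieds.com/api/search?q=car&format=json');
$ads = json_decode($json, true);

// RSS feed: SimpleXML gives you the items directly (URL is a placeholder).
$feed = simplexml_load_file('http://example-classifieds.com/search.rss?q=car');
foreach ($feed->channel->item as $item) {
    echo $item->title . ' - ' . $item->link . "\n";
}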

But still, it’s better to use DOM than preg_match, in my opinion.

So the flow of steps would be:

1) Load the HTML into a string.
2) Use the mbstring extension to guess the actual encoding, and recode into Latin-1 or UTF-8 if necessary.
3) Run it through Tidy to fix the string so that it becomes valid HTML.
4) Load it into DOMDocument; after that, parsing the DOM is easy. You can use standard DOM methods or XPath for that.
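Putting those four steps together, a minimal sketch might look like this (the XPath query and the list of candidate encodings are assumptions you'd adjust per site):

<?php
// 1) Load the HTML into a string (however you fetched it, e.g. with cURL).
$html = file_get_contents('http://example-classifieds.com/search?q=car');

// 2) Guess the real encoding with mbstring and recode to UTF-8 if needed.
$encoding = mb_detect_encoding($html, ['UTF-8', 'ISO-8859-1', 'Windows-1252'], true);
if ($encoding !== false && $encoding !== 'UTF-8') {
    $html = mb_convert_encoding($html, 'UTF-8', $encoding);
}

// 3) Run it through Tidy so DOMDocument gets valid markup.
$html = tidy_repair_string($html, ['output-xhtml' => true], 'utf8');

// 4) Load into DOMDocument and query with XPath.
libxml_use_internal_errors(true); // suppress warnings from any remaining oddities
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

foreach ($xpath->query('//span[@class="price"]') as $node) {
    echo trim($node->textContent) . "\n";
}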

Lastly, treat the parsed html as you would normally treat user input - don’t trust it.

<snip/>

You might want to look around on those sites for authorization to use the material…as has been pointed out in threads before, screen scraping without permission (especially of sensitive/secured data) is a… grey area at best, and a copyright violation at worst…