Feedback on this PHP XPath code?

Hey, everyone. I hammered the following code into shape line by line, but I’d appreciate it if someone with more knowledge on the matter gave me some feedback.

This is a web-scraping function that assumes the page to be scraped is part of a chain of pages containing search results, and that each page contains a link to the next page (unless it’s the last page) plus a series of rows holding the information to be scraped. Here’s the code.

public function scrape($url, $driver)
{
	// cURL options shared by every request in the chain.
	$curl_options = array(
		CURLOPT_USERAGENT => 'Mozilla/5.0 (Windows NT 6.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.73 Safari/537.36',
		CURLOPT_AUTOREFERER => true,
		CURLOPT_RETURNTRANSFER => true,
		CURLOPT_FOLLOWLOCATION => true,
		CURLOPT_MAXREDIRS => 4
	);
	$output = '';
	while ($url)
	{
		// Suppress libxml warnings from real-world (i.e. invalid) HTML.
		libxml_use_internal_errors(true);
		// Fetch the current page and load it into a DOM.
		$page = curl_init($url);
		curl_setopt_array($page, $curl_options);
		$dom = new DOMDocument;
		$dom->loadHTML(curl_exec($page), LIBXML_NOBLANKS | LIBXML_NOWARNING | LIBXML_NOERROR);
		curl_close($page);
		$xpath = new DOMXPath($dom);
		// The driver returns the next page's URL, or a falsy value on the last page.
		$url = $this->$driver->get_next_page_url($xpath);
		foreach ($this->$driver->get_rows($xpath) as $row)
		{
			// Copy each row into its own document so the driver's field
			// queries only ever see that row.
			$row_dom = new DOMDocument;
			$row_dom->appendChild($row_dom->importNode($row, true));
			$fields = $this->$driver->get_fields(new DOMXPath($row_dom));
			$output .= '"' . implode('","', $fields) . '"' . "\r\n";
		}
	}
	return $output;
}

To explain: given a starting URL and a $driver (the name of the class that contains the code to scrape a specific website), the function fetches the page with cURL, loads the result into a DOM object, and then wraps that in an XPath object (for easier searching).

The code searches for the next page’s URL, if any, and then gets down to the business of looping through the rows expected to be found on a page containing search results, parsing each one.

Curl options – those are the ones I needed to make the three sites I’m scraping work. Do you see other curl options I’d need?

Creating the DOM instance – are the LIBXML constants the appropriate ones?

Looping through the rows – I’m creating new DOM and XPath objects for each row. That makes them much easier to process, but is there a more efficient way?
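One alternative I’ve since read about: DOMXPath::query() takes a context node as its second argument, so a driver could run a relative query against each row without building a second document. A self-contained sketch (the sample HTML and the //tr and .//td expressions are placeholders, not my drivers’ real expressions):

```php
<?php
// Sketch: querying relative to a context node instead of
// importing each row into its own DOMDocument.
$html = '<table>
  <tr><td>Alice</td><td>alice@example.com</td></tr>
  <tr><td>Bob</td><td>bob@example.com</td></tr>
</table>';

libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$output = '';
foreach ($xpath->query('//tr') as $row) {
    $fields = array();
    // The leading "." anchors the expression to $row, so only
    // that row's cells are matched.
    foreach ($xpath->query('.//td', $row) as $cell) {
        $fields[] = $cell->textContent;
    }
    $output .= '"' . implode('","', $fields) . '"' . "\r\n";
}
echo $output;
// "Alice","alice@example.com"
// "Bob","bob@example.com"
```

The trade-off is that the drivers would have to receive the row node (or the XPath object plus the node) instead of a fresh per-row XPath object.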

Thanks for any comments!


Does the ToS of the pages allow you to scrape them?


Writing low-level scraping code by hand is for the birds.

I would use Guzzle and the Symfony Dom Crawler.
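Roughly, the fetch-and-parse step would look like this sketch (requires `composer require guzzlehttp/guzzle symfony/dom-crawler`; the URL and the XPath expressions are placeholders, not taken from your code):

```php
<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use Symfony\Component\DomCrawler\Crawler;

// Guzzle handles redirects, user agents, etc. via request options.
$client = new Client(['allow_redirects' => ['max' => 4]]);

$response = $client->get('https://example.com/search?page=1', [
    'headers' => ['User-Agent' => 'Mozilla/5.0'],
]);

$crawler = new Crawler((string) $response->getBody());

// filterXPath() works out of the box; the CSS-style filter()
// additionally needs symfony/css-selector.
$rows = $crawler->filterXPath('//tr')->each(function (Crawler $row) {
    return $row->filterXPath('.//td')->each(function (Crawler $cell) {
        return $cell->text();
    });
});
```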

Chirp, chirp, chirp! :smile:

It’s for personal use and very low volume, maybe five or six pages of results per day. The main benefit to me is getting the results into CSV format.
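On the CSV side, I’ve realized the `implode('","', ...)` approach breaks if a field itself contains a double quote. PHP’s built-in fputcsv() handles the escaping; a sketch writing to a memory stream (the sample rows are made up):

```php
<?php
// Sketch: building CSV in memory with fputcsv(), which quotes and
// escapes fields correctly (plain implode() does not).
$rows = array(
    array('Widget "Deluxe"', '19.99'),
    array('Plain, boring widget', '4.99'),
);

$handle = fopen('php://memory', 'r+');
foreach ($rows as $fields) {
    fputcsv($handle, $fields);   // doubles embedded quotes, quotes commas
}
rewind($handle);
$csv = stream_get_contents($handle);
fclose($handle);

echo $csv;
// "Widget ""Deluxe""",19.99
// "Plain, boring widget",4.99
```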

I’ve got something for that too.

The problem with your code is that it has so many failure points, and you aren’t accounting for any of them. The libraries I mentioned have much better handling of failures and put you on a path to write things correctly.

One example is that DOMDocument is not installed by default, and neither is cURL. You’re only thinking about your environment, and that is selfish. You should write software in a way that is portable between multiple environments, taking the necessary precautions when dependencies don’t exist.

Fair enough point about writing portable software, though wouldn’t those libraries and frameworks you suggest also require DOMDocument and CURL? I think so. Symfony’s DomCrawler specifically mentions using DOMXPath internally. I’m not against using libraries and frameworks. The code I posted is actually running under CodeIgniter. But like having to learn long division by hand before using a calculator, I like to learn how the inside work gets done.

Could you elaborate on the points of failure? I know you can use error handling to allow the application to recover gracefully instead of crashing. Is that what you mean?
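For reference, here’s roughly what checking the fetch and parse steps in my function might look like — the message wording and the choice of RuntimeException are just illustrative, and $url and $curl_options are assumed to be set as in the original:

```php
<?php
// Sketch: the same fetch/parse step with its failure points checked.
$page = curl_init($url);
curl_setopt_array($page, $curl_options);
$html = curl_exec($page);

if ($html === false) {
    // Network-level failure: DNS error, timeout, too many redirects, ...
    $error = curl_error($page);
    curl_close($page);
    throw new RuntimeException("Fetch failed for $url: $error");
}

$status = curl_getinfo($page, CURLINFO_HTTP_CODE);
curl_close($page);

if ($status >= 400) {
    // The server answered, but with an error page.
    throw new RuntimeException("HTTP $status for $url");
}

libxml_use_internal_errors(true);
$dom = new DOMDocument;
if (!$dom->loadHTML($html)) {
    // Completely unparseable response (empty body, binary data, ...).
    throw new RuntimeException("Could not parse HTML from $url");
}
```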

That doesn’t matter. If the ToS do not allow scraping, then you’re breaking the law.

Which can be fine for academic reasons, but for professional projects that will likely pass through many developers’ hands it is not. Be less selfish and use proven tools with thorough documentation.

[quote=“RobertSF, post:7, topic:215609”]
Fair enough point about writing portable software, though wouldn’t those libraries and frameworks you suggest also require DOMDocument and CURL? I think so. Symfony’s DomCrawler specifically mentions using DOMXPath internally.
[/quote]

Correct, but those solutions are well documented and typically have thorough error/exception handling in place.
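For instance, Guzzle throws typed exceptions for the failure modes the hand-rolled version silently swallows. A sketch (the URL is a placeholder):

```php
<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Exception\ConnectException;
use GuzzleHttp\Exception\RequestException;

$client = new Client();

try {
    $response = $client->get('https://example.com/search');
    $html = (string) $response->getBody();
} catch (ConnectException $e) {
    // Network-level failure: DNS error, refused connection, timeout.
    echo 'Could not reach the site: ' . $e->getMessage();
} catch (RequestException $e) {
    // 4xx/5xx responses (with the default http_errors option enabled).
    echo 'Request failed: ' . $e->getMessage();
}
```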

Ok, I’ve installed composer and I’m downloading Symfony’s DomCrawler. Thanks for the suggestions.

