PHP DOM parsing

tmrd · June 7, 2015, 2:10am

Hello.

I’m trying to get the values of the following table. I tried both curl/regex (I know it’s not recommended) and DOM separately, but wasn’t able to get the values properly. I need an exact match.

There are multiple rows in the page.

<tr>
    <td width="75" style="NS">
        <img src="NS" width="64" alt="INEEDTHISVALUE">
    </td>
    <td style="NS">
        <a href="NS">NS</a>
    </td>
    <td style="NS">INEEDTHISVALUETOO</td>
</tr>

NS = non-static values. They change for each td and a element because the table is colored with inline CSS.

I’m using the simple_html_dom class, which can be found here: http://htmlparsing.com/php.html

I’m using the code below to get all td’s, but I need more specific output.

$html = file_get_html("URL");
foreach($html->find('td') as $td) {
    echo $td."<br>";
}

REGEX & CURL

$site = "URL";
$ch = curl_init();
$hc = "YahooSeeker-Testing/v3.9 (compatible; Mozilla 4.0; MSIE 5.5; Yahoo! Search - Web Search)";
curl_setopt($ch, CURLOPT_REFERER, 'http://www.google.com');
curl_setopt($ch, CURLOPT_URL, $site);
curl_setopt($ch, CURLOPT_USERAGENT, $hc);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$site = curl_exec($ch);
curl_close($ch);
preg_match_all('@<tr><td width="75" style="(.*?)"><img src="(.*?)" width="64" alt="(.*?)"></td><td style="(.*?)"><a href="(.*?)">(.*?)</a></td><td style="(.*?)">(.*?)</td></tr>@', $site, $arr);
var_dump($arr); // returns empty array, why?
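
One likely reason the pattern matches nothing, assuming the real page is indented like the snippet above: the pattern expects the tags to sit directly next to each other, while the markup has newlines and spaces between them. Below is a minimal sketch that tolerates that whitespace with `\s*`. The `$sample` string and its attribute values are made up to stand in for the fetched page, and the pattern captures only the two wanted values.

```php
<?php
// Hypothetical sample standing in for the fetched page; attribute values are invented.
$sample = <<<HTML
<tr>
    <td width="75" style="color:red">
        <img src="a.png" width="64" alt="First">
    </td>
    <td style="color:red">
        <a href="/a">link</a>
    </td>
    <td style="color:red">Second</td>
</tr>
HTML;

// \s* tolerates whitespace between tags; [^"]* keeps each attribute match inside its quotes.
$pattern = '@<tr>\s*<td width="75" style="[^"]*">\s*<img src="[^"]*" width="64" alt="([^"]*)">\s*</td>'
         . '\s*<td style="[^"]*">\s*<a href="[^"]*">.*?</a>\s*</td>'
         . '\s*<td style="[^"]*">(.*?)</td>\s*</tr>@s';

preg_match_all($pattern, $sample, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    echo $m[1] . ' | ' . $m[2] . "\n"; // alt text | text of the last cell
}
```

Even with this change the pattern stays brittle: any reordering of attributes on the real page would break it again, which is the usual argument for a DOM parser.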

s_molinari · June 7, 2015, 4:56am

Have you looked at the methods available in this extension?

http://php.net/manual/en/book.dom.php

Scott
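
As a rough illustration of what the DOM extension offers, here is a minimal sketch that walks the rows with `getElementsByTagName` and skips rows that have no image. The `$sample` markup and the `img_alt` / `td_text` keys are assumptions for the example, not taken from the real page.

```php
<?php
// Hypothetical markup standing in for the real page.
$sample = '<table>
<tr><td><img src="a.png" alt="First"></td><td><a href="/a">x</a></td><td>Second</td></tr>
<tr><td>no image here</td></tr>
</table>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($sample);

$rows = [];
foreach ($doc->getElementsByTagName('tr') as $tr) {
    $imgs = $tr->getElementsByTagName('img');
    if ($imgs->length === 0) {
        continue; // skip rows without an image
    }
    $cells = $tr->getElementsByTagName('td');
    $rows[] = [
        'img_alt' => $imgs->item(0)->getAttribute('alt'),
        'td_text' => trim($cells->item($cells->length - 1)->textContent), // last cell
    ];
}

print_r($rows);
```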

megazoid · June 7, 2015, 5:39am

You can also try pQuery for that

tmrd · June 7, 2015, 8:05am

Okay, I got this far.

$site = "EXTERNAL URL";

$ch = curl_init();
$hc = "YahooSeeker-Testing/v3.9 (compatible; Mozilla 4.0; MSIE 5.5; Yahoo! Search - Web Search)";
curl_setopt($ch, CURLOPT_REFERER, 'http://www.google.com');
curl_setopt($ch, CURLOPT_URL, $site);
curl_setopt($ch, CURLOPT_USERAGENT, $hc);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$site = curl_exec($ch);
curl_close($ch);

libxml_use_internal_errors(true);

$results = array();
$doc = new DOMDocument();
$doc->strictErrorChecking = false;
$doc->loadHTML($site);
$xpath = new DOMXPath($doc);

foreach ($xpath->query('.//tr') as $tr) {
    $results[] = array(
      'img_alt' => $xpath->query('td[1]/img', $tr)->item(0)->getAttribute('alt'),
      'td_text' => $xpath->query('td[last()]', $tr)->item(0)->nodeValue
    );
}

echo $results[1]['img_alt'];
echo $results[1]['td_text'];

Here’s the error I get; I’m stuck again.

Fatal error: Call to a member function getAttribute() on null in /Applications/MAMP/htdocs/fetch/test.php on line 161

Line 161

'img_alt' => $xpath->query('td[1]/img', $tr)->item(0)->getAttribute('alt'),

I think the parsing breaks partway through the file while it looks for a match.

An excerpt from my file

<!DOCTYPE html>
<html prefix="og: http://ogp.me/ns#" lang="tr">
<head>

Here’s another excerpt from the ol.html file.

<link rel="alternate" type="application/rss+xml" title="title" href="/rss.xml" />
<link rel="search" type="application/opensearchdescription+xml" href="/search.xml" title="title2" />
<link rel="index" title="sitename is here" href="http://www.site.com" />
<link rel="canonical" href="http://www.site.com" />
<link rel="apple-touch-icon" href="/images/logo.png" />
<link rel="publisher" href="https://plus.google.com/+" />
<link rel="image_src" href="http://www.site.com" />
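
The fatal error most likely means that at least one `tr` on the page has no `img` in its first cell (a header row, for example), so `item(0)` returns null for that row; the `<head>` excerpts are probably unrelated. A minimal sketch, assuming that is the cause: an XPath predicate selects only the rows whose first cell contains an image. The `$sample` markup is invented for illustration.

```php
<?php
// Hypothetical markup: one header row without an image, two data rows with one.
$sample = '<table>
<tr><th>Icon</th><th>Name</th><th>Value</th></tr>
<tr><td><img src="a.png" alt="First"></td><td><a href="/a">A</a></td><td>10</td></tr>
<tr><td><img src="b.png" alt="Second"></td><td><a href="/b">B</a></td><td>20</td></tr>
</table>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($sample);
$xpath = new DOMXPath($doc);

$results = [];
// The predicate [td[1]/img] keeps only rows whose first td contains an img.
foreach ($xpath->query('//tr[td[1]/img]') as $tr) {
    $results[] = [
        'img_alt' => $xpath->query('td[1]/img', $tr)->item(0)->getAttribute('alt'),
        'td_text' => trim($xpath->query('td[last()]', $tr)->item(0)->nodeValue),
    ];
}

print_r($results);
```

An alternative with the same effect is to keep the `.//tr` query and check the result of `item(0)` for null before calling `getAttribute()`.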

Jeff_Mott · June 7, 2015, 2:25pm

You don’t have to use curl directly; you can use the Guzzle library instead. And you don’t have to use DOM directly; you can use the Goutte library, which wraps Guzzle and adds the ability to select elements with CSS selectors.

Quick example:

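use Goutte\Client; // assumes Goutte is installed via Composer and the autoloader is loaded

$client = new Client();
$crawler = $client->request('GET', 'http://www.symfony.com/blog/');

// Print the text of every link that is a direct child of an h2.
$crawler->filter('h2 > a')->each(function ($node) {
    print $node->text()."\n";
});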

tmrd · June 7, 2015, 11:55pm

Oh, thank you.

I managed to solve it by using simplehtmldom.

Here’s the solution, maybe it can help someone who’s looking for a solution in the future.

for($i=0; $i<50; $i++) { // there are 50 rows, so I use a for loop.
    $tr = $html->find('tr', $i);
    if(isset($tr->children(0)->children(0)->src)) { // I don't want to get the rows that don't have a child img element.
        echo "identified";
        echo $tr->children(0)->children(0)->src."<br>"; // image src
        echo $tr->children(0)->children(0)->alt."<br>"; // image alt text
        echo $tr->children(2)->plaintext."<br>"; // text of the third td (index 2)
        echo "<hr>";
        echo $tr->children(0)->children(0)->alt;
    } else {
        echo "unidentified";
    }
}
