  1. #1
    SitePoint Enthusiast
    Join Date
    Aug 2008
    Posts
    62

    Parsing HTML with XPath without SimpleXML

    I have been racking my brain trying to figure out how to use XPath alone to parse HTML code.
    I made a conscious effort to learn XPath to the exclusion of SimpleXML because, based on my research, SimpleXML seems quite limited, IMHO.

    Given the following sample HTML code:
    Code:
    <?php
    $html = <<<HTML
    <html>
    <head>
    	<title>Car info</title>
    	<meta name="robots" content="index,follow" />
    	<meta name="description" content="Info about makes and models of cars" />
    	<meta name="keywords" content="car makes, car models" />
    </head>
    
    <body>
    <div id="recordlist">
    	<div class="records">
    		<h2>Cars</h2>
    		<ul>
    			<li><a href="http://ford.com">Ford</a>
    				<ol start="1">
    					<li>Escort</li>
    					<li>Taurus</li>
    					<li>Mustang</li>
    				</ol>
    			</li>
    
    			<li><a href="http://Chevrolet.com">Chevrolet</a>
    				<ol start="1">
    					<li>Corvette</li>
    					<li>Cavalier</li>
    					<li>Suburban</li>
    				</ol>
    			</li>
    
    			<li><a href="http://Volkswagen.com">Volkswagen</a>
    				<ol start="1">
    					<li>New Beetle</li>
    					<li>Jetta</li>
    					<li>Toureg</li>
    				</ol>
    			</li>	
    		</ul>
    	</div>
    </div>
    </body>
    </html>
    HTML;
    ?>
    I want to create a string variable called $content that has the following format for each auto maker:
    Code:
    <?php
    $content = <<<CONTENT
    <h2>(Auto maker, i.e. Ford, Chevrolet or Volkswagen)</h2>
    	<h3>(model 1)</h3>
    	<h3>(model 2)</h3>
    	<h3>(etc.)</h3>
    <p><a href="(href attribute parsed from html)">Link to (Auto maker name) website</a></p>
    <hr />
    (next auto maker record)
    
    CONTENT;
    ?>
    Thus far the xpath code looks like this:
    Code:
    <?php
    //start new dom instance
    $dom = new DOMDocument();
    
    //put html into the dom
    @$dom->loadHTML($html);
    
    //initial xpath query
    $xpath_query = "//div[@id='recordlist']/div[@class='records']/ul/li";
    
    //start instance of xpath
    $xpath = new DOMXPath($dom);
    
    //get results of xpath query
    $xpath_query_results = $xpath->query($xpath_query);
    
    //loop and parse relevant parts
    foreach($xpath_query_results as $results)
    	{
    		/*
    		I am clueless on what to do from this point on to parse the Auto make,  auto models and website links.		
    
    		*/
    	}
    
    
    ?>
    What is an example of how to parse the values of nodes and sub-nodes using XPath alone?

  2. #2
    PHP Guru lampcms.com's Avatar
    Join Date
    Jan 2009
    Posts
    921
    If you are parsing this exact document, you can make your XPath simpler:

    $results = $xpath->query('//ul/li');
    // that's it
    Now $results is an object of type DOMNodeList. It is not an array, but it is iterable,
    so you can do your foreach:
    foreach ($results as $e) {
        $maker = $e->getElementsByTagName('a')->item(0)->nodeValue;
        // use a relative path here; '//li' would search the whole document again
        $cars = $xpath->query('ol/li', $e);
        $s = '';
        foreach ($cars as $car) {
            $s .= '<h3>' . $car->nodeValue . '</h3>';
        }
        // here you can use $maker and $s, the string built in the inner loop
    }

    Note that each $car in the inner loop is already a single DOMElement, so it has no ->item(0);
    just use $car->nodeValue.
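One detail worth checking when a context node is passed to query(): in XPath 1.0 a path that starts with // is evaluated from the document root, even when a context node is supplied. The following is a minimal sketch of that behaviour, assuming a cut-down two-maker version of the list from the first post (the shortened markup and the variable names are illustrative only).

```php
<?php
// Cut-down version of the sample list: two makers, three models in total.
$dom = new DOMDocument();
$dom->loadHTML(
    '<ul>'
    . '<li><a href="http://ford.com">Ford</a><ol><li>Escort</li><li>Taurus</li></ol></li>'
    . '<li><a href="http://Chevrolet.com">Chevrolet</a><ol><li>Corvette</li></ol></li>'
    . '</ul>'
);
$xpath = new DOMXPath($dom);

// The first maker's <li> serves as the context node.
$ford = $xpath->query('//ul/li')->item(0);

$absolute = $xpath->query('//li', $ford);   // starts at the document root
$relative = $xpath->query('.//li', $ford);  // descendants of $ford only
$children = $xpath->query('ol/li', $ford);  // <li> children of $ford's <ol>

// Number of matches for each of the three expressions.
printf("%d %d %d\n", $absolute->length, $relative->length, $children->length);
```

With this input the absolute path matches every li in the document (both makers and all models), while the two relative forms match only Ford's models.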
    My project: Open source Q&A
    (similar to StackOverflow)
    powered by php+MongoDB
    Source on github, collaborators welcome!

  3. #3
    PHP Guru lampcms.com's Avatar
    Join Date
    Jan 2009
    Posts
    921
    I should mention that I have not tested my example; it is just a basic outline of how to proceed. Just remember that you can iterate over XPath results, and each item is a DOMNode (here a DOMElement), so you can use the methods and properties of DOMNode on the items.
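Putting the outline together, here is one possible sketch that builds $content in the format requested in the first post. It assumes a shortened copy of the sample HTML. The use of evaluate() with string(), libxml_use_internal_errors() in place of the @ operator, and htmlspecialchars() on the output are choices made for this sketch, not something the thread prescribes.

```php
<?php
// Shortened copy of the sample page from the first post (two makers only).
$html = <<<HTML
<html><body>
<div id="recordlist"><div class="records">
<h2>Cars</h2>
<ul>
  <li><a href="http://ford.com">Ford</a>
    <ol start="1"><li>Escort</li><li>Taurus</li></ol>
  </li>
  <li><a href="http://Chevrolet.com">Chevrolet</a>
    <ol start="1"><li>Corvette</li></ol>
  </li>
</ul>
</div></div>
</body></html>
HTML;

$dom = new DOMDocument();
// Collect parser warnings internally instead of silencing them with @.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$content = '';

foreach ($xpath->query("//div[@id='recordlist']/div[@class='records']/ul/li") as $li) {
    // string() converts the first matching node to a plain PHP string.
    $maker = trim($xpath->evaluate('string(a)', $li));
    $url   = $xpath->evaluate('string(a/@href)', $li);

    $content .= '<h2>' . htmlspecialchars($maker) . "</h2>\n";

    // Relative path: only the <li> children of this maker's <ol>.
    foreach ($xpath->query('ol/li', $li) as $model) {
        $content .= "\t<h3>" . htmlspecialchars(trim($model->nodeValue)) . "</h3>\n";
    }

    $content .= '<p><a href="' . htmlspecialchars($url) . '">Link to '
        . htmlspecialchars($maker) . " website</a></p>\n<hr />\n";
}

echo $content;
```

Both inner expressions are written relative to $li, so each maker only picks up its own link and models.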

  4. #4
    SitePoint Member
    Join Date
    Aug 2008
    Posts
    3
    Quote Originally Posted by losirus View Post
    I made a conscious effort to learn XPath to the exclusion of SimpleXML because, based on my research, SimpleXML seems quite limited, IMHO.
    Out of curiosity, how did you find SimpleXML limiting? When it comes to XPath, SimpleXML's only real limitation is that it will only return elements (like DOMElement), not text nodes (DOMText) and other nodes.
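The point about node types can be seen by running the same text() expression through both APIs. This is a small sketch on a one-item fragment (the fragment and variable names are made up for illustration): SimpleXML's xpath() hands back SimpleXMLElement objects, while DOMXPath returns the DOMText nodes themselves.

```php
<?php
$xml = '<ul><li><a href="http://ford.com">Ford</a></li></ul>';

// SimpleXML: xpath() returns an array of SimpleXMLElement objects.
$sx = simplexml_load_string($xml);
$sxResult = $sx->xpath('//a/text()');
$sxClass = get_class($sxResult[0]);
$sxText = (string) $sxResult[0];

// DOM: the same expression yields DOMText nodes.
$dom = new DOMDocument();
$dom->loadXML($xml);
$domResult = (new DOMXPath($dom))->query('//a/text()');
$domClass = get_class($domResult->item(0));
$domText = $domResult->item(0)->nodeValue;

echo $sxClass, ' / ', $domClass, "\n";
```

For simple element content like this, casting the SimpleXML result to a string still gives the text, which is why the difference rarely matters for documents like the one in this thread.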

    Here's how I'd do what you're talking about in your first post with SimpleXML:
    PHP Code:
    $page = simplexml_load_string($html);
    $content = '';

    foreach ($page->xpath('//div[@id="recordlist"]/div[@class="records"]/ul/li') as $li)
    {
        $maker = (string) $li->a;
        $url   = $li->a['href'];

        $content .= '<h2>' . $maker . "</h2>\n";

        foreach ($li->ol->li as $make)
        {
            $content .= "\t<h3>" . $make . "</h3>\n";
        }

        $content .= '<p><a href="' . $url . '">Link to ' . $maker . " website</a></p>\n<hr />\n";
    }


