Processing HTML with Hpricot

In this world of Web 2.0 mashups and easy API access, it is refreshingly easy to pull data from third-party sites and re-mash it into something new. Unfortunately, not everyone has been bitten by this bug, so we as developers sometimes have to do a little more legwork to get the information we need. A common technique is screen scraping, where your application acts like a browser and parses the HTML returned from the third-party server.

Although this should be simple enough, anyone who has ever tried it knows the pain of dancing with regular expressions in an attempt to find the tags you need. Luckily, we Rubyists have the Hpricot library, which takes the hard work out of parsing HTML. Hpricot lets developers access HTML elements via CSS selectors and XPath, so you can target specific tags really easily. And because it is written in C, it is pretty fast too.
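To make the regular expression pain concrete, here is a small sketch in plain Ruby (no Hpricot needed). The pattern and both sample strings are made up for illustration. The naive pattern copes with tidy markup, but an added attribute or a line break inside a tag is enough to make it miss everything.

```ruby
# A naive pattern that tries to capture the text inside every <p> tag.
paragraph = /<p>(.*?)<\/p>/

# Hypothetical sample markup: one tidy, one closer to what real sites serve.
tidy  = "<p>This is paragraph 1</p><p>This is paragraph 2</p>"
messy = "<p class='intro'>This is paragraph 1</p>\n<p>This is\nparagraph 2</p>"

tidy_matches  = tidy.scan(paragraph).flatten
messy_matches = messy.scan(paragraph).flatten

puts tidy_matches.inspect   # both paragraphs are found
puts messy_matches.inspect  # nothing is found: the attribute and the newline defeat the pattern
```

You can keep patching the pattern for each new case, but a real parser saves you from that arms race.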

Installation

Hpricot is a gem, so installation is as easy as:

gem install hpricot

Then just require the library at the top of your Ruby file:


require 'hpricot'

Usage

Let's take this HTML snippet:


<html>
  <head>
    <title>Snippet</title>
  </head>
  <body>
    <div id="container">
      <div id="navigation">
        <ul>
          <li><a href="/">Home</a></li>
          <li><a href="/contact">Contact</a></li>
        </ul>
      </div>
      <div id="sub-content">
        <p>This would be some sort of sidebar</p>
      </div>
      <div id="content">
        <p>This is paragraph 1</p>
        <p>This is paragraph 2</p>
      </div>
    </div>
  </body>
</html>

We can easily pull out the content of the paragraphs like this (let’s assume the HTML is already stored in the variable @html):


doc = Hpricot(@html)

# Collect the inner HTML of every p directly inside the content div
pars = doc.search("div[@id=content]/p").map { |p| p.inner_html }

Yep – that’s it. You now have an array with two elements, one for the text of each p tag inside the content div. Notice that the p tag in the sub-content div isn’t pulled in.
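The example above assumes the page is already sitting in @html. As a minimal sketch of one way to get it there, the following uses Net::HTTP from Ruby's standard library. The helper name fetch_html and the example.com address are placeholders of mine, not part of Hpricot.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: download a page and return its body as a String.
def fetch_html(url)
  response = Net::HTTP.get_response(URI(url))
  # Treat anything other than a 2xx answer as a failure (redirects are not followed).
  unless response.is_a?(Net::HTTPSuccess)
    raise "Request failed with status #{response.code}"
  end
  response.body
end

# Placeholder address; swap in a page you have permission to scrape.
# @html = fetch_html("http://example.com/")
```

The returned string can then be handed to Hpricot(@html) exactly as in the snippet above.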

It doesn’t end there, though. You can also manipulate the HTML, which can come in handy if you want to, say, create a quick and dirty mobile version. Let’s say we want to remove the sub-content div from the mobile version. We could do this:


doc = Hpricot(@html)

doc.search("div[@id=sub-content]").remove

puts doc

The resulting HTML no longer has a div with the id sub-content!

Adding a class to the navigation ul is just as simple:


doc = Hpricot(@html)

doc.search("div[@id=navigation]/ul").set("class", "nav")

This is just the tip of the iceberg – the library is really powerful and simple to use. Go and check out the official page for more (less trivial) examples.

Disclaimer: You should make sure you have permission from the website owner before screen-scraping their site.

  • Jason Stirk

    I’ve also found hpricot insanely useful for processing RSS and Atom feeds. The number of malformed feeds out there is scary, and using a standard XML parser (like many RSS libraries do) means that a lot of these feeds can’t be read.

    hpricot doesn’t have that problem, and you can still use XPath and search for elements you want easily.

  • madpilot

    Good call Jason – that’s an awesome application for Hpricot

  • Ryan

    I don’t personally use Ruby, but using XPath to pull data out of XML (or almost-XML) is a hell of a lot more enjoyable than using DOM or similar.

    A relevant extension for Firefox is XPath Checker, which is very handy for testing out your XPath expressions.

  • Amit Kulkarni

    Another great tool, this one for PHP, is htmlsql …
    here you can extract all the links in your page by using the query

    select href from a

    and you have all links in an array …

    People with sql/php background find this very powerful

    A Ruby port is also underway