Processing HTML with Hpricot

In this world of Web 2.0 mashups and easy API access, it is refreshing how easy it is to pull data from third-party sites and remix it into something new. Unfortunately, not everyone has been bitten by this bug and offers an API, so we as developers sometimes have to do a little more legwork to get the information we need. A common technique is screen scraping, where your application acts like a browser and parses the HTML returned from the third-party server.

Although this should be simple enough, anyone who has tried it knows the pain of dancing with regular expressions in an attempt to find the tags you need. Luckily, we Rubyists have the Hpricot library, which takes the hard work out of parsing HTML. Hpricot lets developers access HTML elements via CSS selectors and XPath, so you can target specific tags really easily. And because its parser is written in C, it is pretty fast too.
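To make the regular-expression pain concrete, here is a rough sketch that uses only Ruby's standard library. The method name `paragraphs_by_regex`, the pattern, and the two sample strings are made up for illustration. It shows how a pattern that works on tidy markup silently returns nothing once the tags gain an attribute or change case.

```ruby
# Naive regex-based extraction (standard library only).
# The pattern and sample markup below are illustrative assumptions.
PARAGRAPH_PATTERN = /<p>(.*?)<\/p>/m

# Returns the text between every literal <p>...</p> pair in +html+.
def paragraphs_by_regex(html)
  html.scan(PARAGRAPH_PATTERN).flatten
end

TIDY_HTML  = '<div id="content"><p>This is paragraph 1</p><p>This is paragraph 2</p></div>'
MESSY_HTML = '<div id="content"><p class="intro">This is paragraph 1</p><P>This is paragraph 2</P></div>'

# Works while the markup matches the pattern exactly.
puts paragraphs_by_regex(TIDY_HTML).inspect

# An attribute on one tag and an uppercase tag name are enough to miss both.
puts paragraphs_by_regex(MESSY_HTML).inspect
```

The pattern also has no notion of which div a paragraph lives in, which is exactly the kind of targeting a selector-based parser gives you.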

Installation

Hpricot is a gem, so installation is as easy as:

gem install hpricot

Then just require the library at the top of your Ruby file:


require 'hpricot'

Usage

Let’s take this HTML snippet:


<html>
  <head>
    <title>Snippet</title>
  </head>
  <body>
    <div id="container">
      <div id="navigation">
        <ul>
          <li><a href="/">Home</a></li>
          <li><a href="/contact"></a></li>
        </ul>
      </div>
      <div id="sub-content">
        <p>This would be some sort of sidebar</p>
      </div>
      <div id="content">
        <p>This is paragraph 1</p>
        <p>This is paragraph 2</p>
      </div>
    </div>
  </body>
</html>

We can easily pull out the content of the paragraphs like this (let’s assume the HTML is already stored in the variable @html):


doc = Hpricot(@html)

# Collect the inner HTML of each paragraph inside the content div
pars = doc.search("div[@id=content]/p").map { |p| p.inner_html }

Yep, that’s it. You now have an array whose two elements match the text of the two p tags in the content div. Notice that the p tag in the sub-content div isn’t pulled in, because the selector only matches paragraphs inside the content div.
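The example above assumes @html is already populated. As a minimal sketch of the "act like a browser" step, the page could be fetched with Net::HTTP from Ruby's standard library. The helper name `fetch_html`, its `fetcher` parameter, and the URL in the usage comment are hypothetical and not part of Hpricot.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: returns the body of the page at +url+ as a String.
# +fetcher+ defaults to Net::HTTP.get and can be swapped for a stub, so the
# method can be tried out without touching the network.
def fetch_html(url, fetcher = Net::HTTP.method(:get))
  fetcher.call(URI.parse(url))
end

# Example usage (the URL is a placeholder):
#   @html = fetch_html("http://example.com/")
```

A real scraper would also want to handle redirects, timeouts, and error responses, which this sketch leaves out.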

It doesn’t end there, though. You can also manipulate the HTML, which can come in handy if you want to, say, create a quick and dirty mobile version. Let’s say we want to remove the sub-content div from the mobile version. We could do this:


doc = Hpricot(@html)

doc.search("div[@id=sub-content]").remove

puts doc

The resultant HTML no longer contains the div with the id sub-content!

Adding a class to the navigation ul is as simple as:


doc = Hpricot(@html)

doc.search("div[@id=navigation]/ul").set("class", "nav")

This is just the tip of the iceberg – the library is really powerful and simple to use. Go and check out the official page for more (less trivial) examples.

Disclaimer: You should make sure you have permission from the website owner before screen-scraping their site.