Processing HTML with Hpricot


by Myles Eftos

In this world of Web 2.0 mashups and easy API access, it is quite refreshing how easy it is to pull data from third-party sites and re-mash it into something new. Unfortunately, not everyone has been bitten by this bug, so we as developers sometimes have to do a little more leg work to get the information we need. A common technique is screen scraping, where your application acts like a browser and parses the HTML returned from the third-party server.
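As a rough illustration of the "acts like a browser" half of that description, here is a minimal sketch that uses only Ruby's standard library. The helper names `build_request` and `fetch_html`, and the `ExampleScraper/0.1` agent string, are made up for this example and are not part of Hpricot.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: builds a GET request that identifies itself with a
# User-Agent header, much as a browser would. The agent string is made up.
def build_request(url, user_agent = 'ExampleScraper/0.1')
  uri = URI.parse(url)
  request = Net::HTTP::Get.new(uri.request_uri)
  request['User-Agent'] = user_agent
  [uri, request]
end

# Hypothetical helper: sends the request and returns the response body,
# or nil if the server did not answer with a 2xx status.
def fetch_html(url)
  uri, request = build_request(url)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request)
  end
  response.is_a?(Net::HTTPSuccess) ? response.body : nil
end
```

Something like `@html = fetch_html('http://example.com/')` (a placeholder URL) would then give you a string to hand to a parser.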

Although this should be simple enough, anyone who has ever tried to do this knows the pain of dancing with regular expressions in an attempt to find the tags that you need. Luckily, we Rubyists have the Hpricot library, which takes the hard work out of parsing HTML. Hpricot allows developers to access HTML elements via CSS selectors and XPath, so you can target specific tags really easily. And because its parser is written in C, it is pretty fast too.
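To make the regular expression pain concrete, here is a small sketch in plain Ruby with no Hpricot involved. The HTML fragment is made up for the illustration, including the `class="intro"` attribute.

```ruby
# A made-up fragment: one sidebar paragraph and two content paragraphs.
html = <<~HTML
  <div id="sub-content">
    <p>This would be some sort of sidebar</p>
  </div>
  <div id="content">
    <p>This is paragraph 1</p>
    <p class="intro">This is paragraph 2</p>
  </div>
HTML

# Naive pattern: only matches a bare <p>, so the paragraph with a class
# attribute is silently skipped.
naive = html.scan(%r{<p>(.*?)</p>}m).flatten

# More tolerant pattern: allows attributes on the tag, but it still has no
# idea which div a paragraph lives in, so the sidebar text comes along too.
tolerant = html.scan(%r{<p\b[^>]*>(.*?)</p>}m).flatten

puts naive.inspect
puts tolerant.inspect
```

Neither pattern can express "only the paragraphs inside the content div" without getting considerably hairier, and that is the kind of question a selector-based parser answers directly.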

Installation

Hpricot is a gem, so installation is as easy as:

gem install hpricot

Then just require the library at the top of your Ruby file:


require 'hpricot'

Usage

Let's take this HTML snippet:


<html>
  <head>
    <title>Snippet</title>
  </head>
  <body>
    <div id="container">
      <div id="navigation">
        <ul>
          <li><a href="/">Home</a></li>
          <li><a href="/contact">Contact</a></li>
        </ul>
      </div>
      <div id="sub-content">
        <p>This would be some sort of sidebar</p>
      </div>
      <div id="content">
        <p>This is paragraph 1</p>
        <p>This is paragraph 2</p>
      </div>
    </div>
  </body>
</html>

We can easily pull out the content of the paragraphs like this (let's assume the HTML is already stored in the variable @html):


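# Parse the HTML string into an Hpricot document
doc = Hpricot(@html)

# Collect the inner HTML of each p inside the div with the id "content"
pars = doc.search("div[@id=content]/p").map do |para|
  para.inner_html
end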

Yep - that's it. You now have an array of two elements, one for the contents of each p tag in the content div. Notice that the p tag in the sub-content div isn't pulled in, because the selector only matches paragraphs inside the div with the id content.

It doesn't end there, though. You can also manipulate the HTML, which can come in handy if you wanted to, say, create a quick and dirty mobile version. Let's say we wanted to remove the sub-content div from the mobile version. We could do this:


doc = Hpricot(@html)

doc.search("div[@id=sub-content]").remove

puts doc

The resulting HTML no longer has a div with the id sub-content!

Adding a class to the navigation ul is as simple as:


doc = Hpricot(@html)

doc.search("div[@id=navigation]/ul").set("class", "nav")

This is just the tip of the iceberg - the library is really powerful and simple to use. Go and check out the official page for more (less trivial) examples.

Disclaimer: You should make sure you have permission from the website owner before screen-scraping their site.

This post has 5 responses so far

  1. I’ve also found hpricot insanely useful for processing RSS and Atom feeds. The number of malformed feeds out there is scary, and using a standard XML parser (like many RSS libraries do) means that a lot of these feeds can’t be read.

    hpricot doesn’t have that problem, and you can still use XPath and search for elements you want easily.

     
  2. Good call Jason - that’s an awesome application for Hpricot

     
  3. I don’t personally use Ruby, but XPath is a hell of a lot more enjoyable for pulling data out of XML (or almost XML) than trying to use DOM or similar.

    A relevant handy extension in Firefox is XPath Checker, which is very handy for testing out your XPath expressions.

     
  4.  
  5. Another great tool, written in PHP, is htmlsql …
    with it you can extract all the links in your page by using the query

    select href from a

    and you have all links in an array …

    People with sql/php background find this very powerful

    A Ruby port is also underway

     
