In this world of Web 2.0 mashups and easy API access, it is quite refreshing how easy it is to pull data from third-party sites and re-mash it into something new. Unfortunately, not everyone has been bitten by this bug, so we as developers sometimes have to do a little more legwork to get the information we need. A common technique is the screen scrape, where your application acts like a browser and parses the HTML returned from the third-party server.
Although this should be simple enough, anyone who has ever tried it knows the pain of dancing with regular expressions in an attempt to find the tags you need. Luckily, we Rubyists have the Hpricot library, which takes the hard work out of parsing HTML. Hpricot allows developers to access HTML elements via CSS selectors and XPath, so you can target specific tags really easily. And because its parser is written in C, it is pretty fast too.
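To make the regular expression pain concrete, here is a rough sketch using nothing but core Ruby. The HTML fragment and the class="intro" attribute are made up for illustration; the point is only that a naive pattern has no idea which div a tag lives in and trips over the first unexpected attribute.

```ruby
# A deliberately naive regex scrape (illustrative only, not a recommendation).
# The markup below is a hypothetical fragment, not a real page.
html = <<-HTML
<div id="sub-content">
<p>This would be some sort of sidebar</p>
</div>
<div id="content">
<p>This is paragraph 1</p>
<p class="intro">This is paragraph 2</p>
</div>
HTML

# Matches only bare <p> tags and knows nothing about the surrounding divs.
naive_paragraphs = html.scan(%r{<p>(.*?)</p>}m).flatten

puts naive_paragraphs.inspect
# The sidebar paragraph is picked up, while the paragraph carrying a
# class attribute is silently missed.
```

You can keep patching the pattern, but every patch makes it harder to read, which is exactly the problem a real HTML parser solves.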
Installation
Hpricot is a gem, so installation is as easy as:
gem install hpricot
Then just require the library at the top of your Ruby file:
require 'hpricot'
Usage
Let’s take this HTML snippet:
<html>
<head>
<title>Snippet</title>
</head>
<body>
<div id="container">
<div id="navigation">
<ul>
<li><a href="/">Home</a></li>
<li><a href="/contact">Contact</a></li>
</ul>
</div>
<div id="sub-content">
<p>This would be some sort of sidebar</p>
</div>
<div id="content">
<p>This is paragraph 1</p>
<p>This is paragraph 2</p>
</div>
</div>
</body>
</html>
We can easily pull out the content of the paragraphs like this (let’s assume the HTML is already stored in the variable @html):
# Parse the HTML string into a searchable document
doc = Hpricot(@html)
# Collect the inner HTML of each p directly inside the content div
pars = doc.search("div[@id=content]/p").map { |p| p.inner_html }
Yep – that’s it. You now have an array with two elements that are the same as the copy in the two p tags. Notice that the p tag in the sub-content div isn’t pulled in?
It doesn’t end there, though: you can also manipulate the HTML, which can come in handy if you want to, say, create a quick and dirty mobile version. Let’s say we want to remove the sub-content div from the mobile version. We could do this:
doc = Hpricot(@html)
doc.search("div[@id=sub-content]").remove
puts doc
The resultant HTML no longer has a div with the id sub-content!
Adding a new class to the navigation ul is as simple as:
doc = Hpricot(@html)
doc.search("div[@id=navigation]/ul").set("class", "nav")
This is just the tip of the iceberg – the library is really powerful and simple to use. Go and check out the official page for more (less trivial) examples.
Disclaimer: You should make sure you have permission from the website owner before screen-scraping their site.
I’ve also found hpricot insanely useful for processing RSS and Atom feeds. The number of malformed feeds out there is scary, and using a standard XML parser (like many RSS libraries do) means that a lot of these feeds can’t be read.
hpricot doesn’t have that problem, and you can still use XPath and search for elements you want easily.
November 21st, 2007 at 10:42 am
Good call Jason – that’s an awesome application for Hpricot
November 21st, 2007 at 4:53 pm
I don’t personally use Ruby, but XPath is a hell of a lot more enjoyable for pulling data out of XML (or almost-XML) than trying to use DOM or similar.
A relevant extension for Firefox is XPath Checker, which is very handy for testing out your XPath expressions.
November 21st, 2007 at 8:56 pm
November 22nd, 2007 at 6:13 pm
Another great tool, this one for PHP, is htmlSQL.
With it you can extract all the links in a page using a query like:
select href from a
and you get all the links back in an array.
People with an SQL/PHP background find this very powerful.
A Ruby port is also underway.
November 25th, 2007 at 5:55 pm