One of the (few) defining points of Web 2.0 is consuming remote data and services, which is great if your service provider is Amazon, Yahoo or Google but not so great if it’s your regional elected representatives, who may only just have arrived at Web 1.0. Being able to mine such sites for data is becoming more and more a part of everyday web development.
Anyway, while pondering what ForumMatrix or WikiMatrix is lacking, I figured this was a good excuse to take BeautifulSoup for a spin: “a Python HTML/XML parser designed for quick turnaround projects like screen-scraping”, and one of the better tools of its kind (if not the best, depending on who you ask). Note there’s also RubyfulSoup by the same author.
Beautiful Soup is capable of handling pretty much the worst HTML you can throw at it while still giving you a usable data structure. For example, given some HTML like;
<i><b>Aargh!</i></b>
…and running through Beautiful Soup like;
from BeautifulSoup import BeautifulSoup
print BeautifulSoup('<i><b>Aargh!</i></b>').prettify()
…I get;
<i>
 <b>
  Aargh!
 </b>
</i>
…notice how it has swapped the closing tags so that they nest properly. This clean-up allows me to access the inner text like;
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup('<i><b>Aargh!</i></b>')
print soup.i.b.string
This isn’t intended as a full tutorial – the documentation is extensive and excellent. Another link you should be aware of though is urllib2 – The Missing Manual, which describes Python’s urllib2 library (which, among other things, provides an HTTP client).
Anyway, the mission was to mine MARC for the Secunia advisories mailing list, to speed up evaluating security records.
MARC provides a search interface which displays results in pages of up to 30 at a time. Aside from the fact it’s all easily fetchable via HTTP GET requests, MARC doesn’t seem to undergo regular HTML changes (it still looks the same as I remember, and those <font/> tags are a giveaway), which hopefully means anything mining its HTML won’t be “broken” in the near future.
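As a rough sketch of what those GET requests look like, the snippet below builds the search URL for a given application and page using Python 3’s urllib.parse. The host and the l, s and r parameters are the ones used by the script in this post; the helper name build_search_url is made up for illustration, and nothing here actually contacts MARC.

```python
from urllib.parse import urlencode

# Base URL and list name as used by the script in this post.
MARC_BASE = "http://marc.theaimsgroup.com/"
MAILING_LIST = "secunia-sec-adv"


def build_search_url(application, page=1):
    """Return the MARC search URL for one page of results (hypothetical helper)."""
    # l = mailing list, s = search term, r = result page (1-based).
    query = urlencode({"l": MAILING_LIST, "s": application, "r": page})
    return MARC_BASE + "?" + query


if __name__ == "__main__":
    print(build_search_url("phpbb"))
    print(build_search_url("phpbb", page=2))
```

Letting urlencode do the escaping means an application name containing spaces or ampersands still produces a valid query string.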
The result in advisories.py;
#!/usr/bin/python
"""
Pulls out secunia security advisories from
http://marc.theaimsgroup.com/?l=secunia-sec-adv
DO NOT overuse!
Make sure you read the following:
http://marc.theaimsgroup.com/?q=about#Robots
Also be aware that secunia _may_ feel you may be making inappropriate
use of their advisories. For example they have strict rules regarding
content _on_ their site (http://secunia.com/terms_and_conditions/) but
this may not apply to the mailing list announcements
License on the script is GPL: http://www.gnu.org/copyleft/gpl.html
"""
import urllib2, re, time
from urllib import urlencode
from BeautifulSoup import BeautifulSoup
def fetchPage(application, page = 1):
    """
    Fetches a page of advisories, using the marc search interface
    """
    url = 'http://marc.theaimsgroup.com/?l=secunia-sec-adv&%s&%s' \
        % (urlencode({'s':application}), urlencode({'r':page}))
    return urllib2.urlopen(url)

def fetchMessage(mid):
    """
    Fetches a single advisory, given its marc message id
    """
    url = 'http://marc.theaimsgroup.com/?l=secunia-sec-adv&%s&q=raw'\
        % (urlencode({'m':mid}))
    return urllib2.urlopen(url).read()

class LastPage(Exception):
    """
    Used to flag that there are no more pages of advisories to process
    """
    pass

class FlyInMySoup(Exception):
    """
    Used to indicate the HTML being parsed varies wildly from what
    was expected.
    """
    pass

class NotModified(Exception):
    """
    Used to indicate there are no new advisories
    """
    pass

class Advisories:
    """
    Controls BeautifulSoup, pulling out relevant information from a page of advisories
    and 'crawling' for additional pages as needed
    """

    maxpages = 10 # If there are more than this num pages, give up
    requestdelay = 1 # Delay between successive requests - be kind to marc!

    __nohits = re.compile('^No hits found.*')
    __addate = re.compile(r'.*[0-9]+\. ([0-9]{4}-[0-9]{2}-[0-9]{2}).*', re.DOTALL)
    __messageid = re.compile('.*m=([0-9]+).*')

    def __init__(self, application, lastMsgId = None):
        self.__application = application
        self.__lastMsgId = lastMsgId
        self.__advisories = []
        self.__pages = []
        self.__loaded = 0

    def __loadPage(self, page = 0):
        """
        Load a page and store it in mem as BeautifulSoup instance
        """
        # Internal page numbers are 0-based, marc's are 1-based
        self.__pages.append(BeautifulSoup(fetchPage(self.__application, page+1)))
        time.sleep(self.requestdelay)

    def __hasAdvisories(self, page = 0):
        """
        Test whether page has advisories. To be regarded as not having advisories,
        it must contain a font tag with the words "No hits found". Other input
        raises FlyInMySoup and will typically mean something is badly broken
        """
        font = self.__pages[page].body.find(name='font', size='+1')
        if not font:
            if self.__pages[page].body.pre is None:
                raise FlyInMySoup, "body > pre tag ? advisories?\n%s"\
                    % self.__pages[page].prettify()
            return True
        if self.__nohits.match(font.string) == None:
            raise FlyInMySoup, "Nosir - dont like that font tag?\n%s"\
                % font.prettify()
        return False

    def __hasAnotherPage(self, page = 0):
        """
        Hunts for an img src = 'images/arrright.gif' (Next) in
        the advisories page and if found returns a page number
        to make another request with. Otherwise raises a LastPage
        exception
        """
        if page >= self.maxpages: raise LastPage
        pre = self.__pages[page].body.pre
        imgs = pre.findAll(name='img', src='images/arrright.gif', limit=5)
        if len(imgs) > 0:
            return page + 1
        raise LastPage

    def __fetchAdvisories(self, page = 0):
        """
        Fetches a page of advisories, recursing if more pages of advisories
        were found
        """
        self.__loadPage(page)

        if self.__hasAdvisories(page):
            advisory = {}
            in_advisory = 0
            pre = self.__pages[page].body.pre
            for child in pre:
                if not in_advisory:
                    # Looking for the "N. YYYY-MM-DD" text that starts an entry
                    m = self.__addate.match(str(child))
                    if m is not None:
                        in_advisory = 1
                        advisory['date'] = m.group(1)
                else:
                    # Looking for the link that carries the message id
                    try:
                        advisory['mid'] = self.__messageid.match(child['href']).group(1)
                        advisory['desc'] = child.string.strip()
                        self.__advisories.append(advisory)
                        advisory = {}
                        in_advisory = 0
                    except:
                        # Not the link we want - keep looking
                        pass

            # Some sanity checks...
            if len(self.__advisories) == 0:
                raise FlyInMySoup, "No advisories in body > pre!\n%s" % pre
            if in_advisory:
                raise FlyInMySoup, "Still looking for the last advisory"

            # More protection for marc
            if self.__lastMsgId and self.__advisories[0]['mid'] == str(self.__lastMsgId):
                raise NotModified, "Not modified - last message id: %s"\
                    % self.__lastMsgId

            try:
                nextpage = self.__hasAnotherPage(page)
            except LastPage:
                return
            self.__fetchAdvisories(nextpage)

    def __lazyFetch(self):
        """
        Fetch advisories but only when needed
        """
        if not self.__loaded:
            self.__fetchAdvisories()
            self.__loaded = 1

    def __iter__(self):
        self.__lazyFetch()
        return self.__advisories.__iter__()

    def __len__(self):
        self.__lazyFetch()
        return len(self.__advisories)

if __name__ == '__main__':
    import getopt, sys, csv
    from os import getcwd
    from os.path import isdir, isfile, realpath, join

    def usage():
        """
        advisories.py [-p=proxy_url] [-f] [-d=target_dir] <application>
        Pulls a list of security advisories for a given <application>
        Puts a summary list in <application>.csv and raw text in
        <application>_<msgid>.txt
        options:
            -d, --directory= (directory to write csv and raw msgs to)
            -f, --fetchmsgs (fetch raw messages announcements as well)
            -h, --help (display this message)
            -p, --proxy=http://user:pass@proxy.isp.com:8080
        """
        print usage.__doc__

    def lastMsgId(csvfile):
        """
        Pull out the last message id from the csvfile. Used to test for
        changes in the advisories page
        """
        if not isfile(csvfile): return None
        try:
            fh = open(csvfile, 'rb')
            csvreader = csv.reader(fh, dialect='excel')
            csvreader.next()            # skip the header row
            id = csvreader.next()[1]    # 'mid' column of the first data row
            fh.close()
            return id
        except:
            return None

    app = None
    proxy = None
    fetchMsgs = 0
    dir = getcwd()

    try:
        opts, args = getopt.getopt(sys.argv[1:], \
            "fhp:d:", ["help", "fetchmsgs", "proxy=", "directory="])
        for o, v in opts:
            if o in ("-h", "--help"):
                usage()
                sys.exit(0)
            if o in ("-f", "--fetchmsgs"):
                fetchMsgs = 1
            elif o in ("-p", "--proxy"):
                proxy = v
            elif o in ("-d", "--directory"):
                if isdir(realpath(v)):
                    dir = realpath(v)
                else:
                    # String exceptions are deprecated; raise a real one
                    raise getopt.error("Invalid dir %s" % v)
        if len(args) == 1:
            app = args[0]
        else:
            raise getopt.error("Supply an app name to fetch advisories for!")
    except getopt.error, msg:
        print msg
        print "for help use --help"
        sys.exit(2)

    if proxy:
        # Use the explicit proxy passed as a CLI option
        proxy_support = urllib2.ProxyHandler({"http" : proxy})
    else:
        # Prevent urllib2 from attempting to auto detect a proxy
        proxy_support = urllib2.ProxyHandler({})
    opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler)
    urllib2.install_opener(opener)

    csvfile = join(dir,app+'.csv')
    advs = Advisories(app, lastMsgId(csvfile))
    if len(advs) > 0:
        fh=open(csvfile, 'wb')
        csvwriter=csv.writer(fh, dialect='excel')
        csvwriter.writerow(('date','mid','desc'))
        for a in advs:
            csvwriter.writerow((a['date'], a['mid'], a['desc']))
            if fetchMsgs:
                mfh=open(join(dir, "%s_%s.txt" % (app, a['mid'])), 'wb')
                mfh.write(fetchMessage(a['mid']))
                mfh.close()
        fh.close()
        print "%s advisories found for %s" % (len(advs), app)
    else:
        print "No advisories found for %s" % app

Assuming you have a recent version of Python and Beautiful Soup 3.x+ installed (download the tarball, extract it somewhere and run $ python setup.py install to install it into your Python library), you can run this script from the command line (it’s intended for cron) like;
$ advisories.py phpbb
… and it will create a file phpbb.csv containing all the advisories it found. There are a few other features, like proxy support and the ability to download the raw advisories, which you can read about by running $ advisories.py --help. Make sure you read the warnings at the start of the script though!
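Because the script is meant for cron, it looks at the existing CSV before crawling any further: the first data row holds the message id that was at the top of the list on the previous run, and if MARC still lists that id first there is nothing new to fetch. Here is a minimal Python 3 sketch of that lookup, assuming the date, mid, desc column layout the script writes. The function name and the sample rows are invented for illustration.

```python
import csv
import io


def last_message_id(csv_text):
    """Return the 'mid' of the first data row, or None if there is none."""
    reader = csv.reader(io.StringIO(csv_text), dialect="excel")
    try:
        next(reader)            # skip the header row: date, mid, desc
        return next(reader)[1]  # second column of the first data row
    except (StopIteration, IndexError):
        # Empty file, header only, or a row with too few columns
        return None


if __name__ == "__main__":
    # Made-up rows in the layout the script writes.
    sample = (
        "date,mid,desc\n"
        "2006-06-28,1001,Example advisory A\n"
        "2006-06-20,1000,Example advisory B\n"
    )
    print(last_message_id(sample))
    print(last_message_id(""))
```

Catching only StopIteration and IndexError keeps the "no usable CSV yet" case quiet while letting genuinely unexpected errors surface.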
So mission basically complete. The interesting part is figuring out where to put checks in the code. While Beautiful Soup allows you to read pretty much anything SGML-like, a change in the HTML tag structure of MARC would break this script (it’s not an official API after all), so hopefully it’s primed to raise exceptions in the right places should manual intervention be required.
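To give a feel for those checks, here is a small Python 3 sketch built around the two regular expressions from the script. The sample strings are invented stand-ins for a results line and a message link, not real MARC output, and the ParseError class and helper names are hypothetical. The idea is simply that a parse which finds nothing raises an exception instead of quietly returning empty data.

```python
import re

# Patterns copied from the script in this post.
ADVISORY_DATE = re.compile(r'.*[0-9]+\. ([0-9]{4}-[0-9]{2}-[0-9]{2}).*', re.DOTALL)
MESSAGE_ID = re.compile(r'.*m=([0-9]+).*')


class ParseError(Exception):
    """Hypothetical stand-in for the script's FlyInMySoup exception."""


def parse_date(text):
    """Pull the YYYY-MM-DD date out of a results line, or complain loudly."""
    m = ADVISORY_DATE.match(text)
    if m is None:
        raise ParseError("no advisory date in %r" % text)
    return m.group(1)


def parse_message_id(href):
    """Pull the numeric message id out of a link, or complain loudly."""
    m = MESSAGE_ID.match(href)
    if m is None:
        raise ParseError("no message id in %r" % href)
    return m.group(1)


if __name__ == "__main__":
    # Invented sample input, shaped like what the patterns expect.
    print(parse_date("\n  1. 2006-06-28  ["))
    print(parse_message_id("?l=secunia-sec-adv&m=12345&w=2"))
    try:
        parse_date("markup we did not expect")
    except ParseError as err:
        print("manual intervention needed:", err)
```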
Otherwise another project to investigate, if you’re getting into HTML mining, is webstemmer (Python again), which in some cases (e.g. a news site) may be smart enough to get you what you want with very little effort.
Even more beautiful would be if it converted
<i><b>Aargh!</i></b>
to
<em> <strong> Aargh! </strong> </em>
June 30th, 2006 at 9:42 am
Seems that could be done: http://www.crummy.com/software/BeautifulSoup/documentation.html#Replacing%20one%20Element%20with%20Another
June 30th, 2006 at 5:01 pm
Do you happen to know if anything like this is available for PHP? :)
June 30th, 2006 at 11:22 pm
html_tidy
July 1st, 2006 at 1:41 am
Not exactly, but html_tidy + “something” e.g. PEAR’s XML_Serializer or SimpleXML (for DOM-like data structures) could be used to a similar effect, though you’d spend more time searching data structures.
One thing that’s nice about Beautiful Soup is the tag search capabilities, for example consider this from the above script;
imgs = pre.findAll(name='img', src='images/arrright.gif', limit=5)
if len(imgs) > 0:
    return page + 1
That allowed me to hunt for img tags with attribute src='images/arrright.gif' – these only exist on MARC if there are more than 30 messages listed (the “Next page” link basically) [side note - that limit=5 is redundant - should have changed that], so I can use it to check whether I need to fetch any more pages.
It’s the search API most of all that makes BeautifulSoup attractive over something html_tidy based.
July 1st, 2006 at 9:31 am
Are my eyes playing tricks on me? Python code in a Sitepoint article??? *marks calendar*
Really though, thanks for the info, this may come in handy! I am just jaded due to the lack of Python coverage and specific forum here at SP ;) Your code snippet thing doesn’t appear to actually support Python that well either :(
July 2nd, 2006 at 6:24 am
Another Python article! *yay*
Can’t wait till Python 2.5!
It’s such a shame this beautiful language doesn’t get the attention it deserves.
July 4th, 2006 at 5:22 am
Thankyou ;) Actually – improvements appreciated.
July 6th, 2006 at 1:14 am
Just to let you know: ForumMatrix now has Secunia advisories. WikiMatrix will follow when it’s upgraded to the new software.
July 12th, 2006 at 10:48 pm