In that case I'd check out their API. The "parse" action will parse a page and return the HTML for that page. Example URLs:
http://en.wikipedia.org/w/api.php?ac...int&format=php - Data for "SitePoint" page in PHP serialised format - to use with unserialize() in PHP.
http://en.wikipedia.org/w/api.php?ac...int&format=xml - Data for "SitePoint" page in XML format
No scraping needed; the data is already in an easy-to-use format for you.
I'd strongly recommend donating to Wikipedia if you use its data extensively. Heavy use of their servers means that your requests cost them quite a bit of money (bandwidth, server processing time, etc.)
Here's an example for you (PHP):
Code:
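<?php
$page = 'SitePoint';
$api_url = 'http://en.wikipedia.org/w/api.php?action=parse&page=%s&format=php';
// MediaWiki API needs a user-agent to be specified
$context = stream_context_create(array('http' => array(
    'user_agent' => 'SitePoint example for topic 748667',
)));
// urlencode() keeps page titles with spaces or special characters intact
$response = file_get_contents(sprintf($api_url, urlencode($page)), false, $context);
if ($response === false) {
    die('Request to the API failed');
}
$data = unserialize($response);
// The rendered HTML for the page is under parse > text > *
echo $data['parse']['text']['*'];
?>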
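If you'd rather not run unserialize() on data from a remote server, the same request should also work with format=json. Here's a rough sketch of that variant. The two helper functions (wiki_parse_url and wiki_extract_html) are names I made up for illustration, they're not part of the MediaWiki API, and I'm assuming the JSON response uses the same parse > text > * layout as the PHP format.

```php
<?php
// Hypothetical helper: builds the API URL with properly encoded parameters.
function wiki_parse_url($page, $format = 'json') {
    return 'http://en.wikipedia.org/w/api.php?' . http_build_query(array(
        'action' => 'parse',
        'page'   => $page,
        'format' => $format,
    ));
}

// Hypothetical helper: pulls the rendered HTML out of a decoded response.
// Returns null if the expected keys are missing (e.g. the API sent an error).
function wiki_extract_html($data) {
    return isset($data['parse']['text']['*']) ? $data['parse']['text']['*'] : null;
}

// Usage (needs network access and allow_url_fopen enabled):
// $context = stream_context_create(array('http' => array(
//     'user_agent' => 'SitePoint example for topic 748667',
// )));
// $json = file_get_contents(wiki_parse_url('SitePoint'), false, $context);
// echo wiki_extract_html(json_decode($json, true));
```

Building the query string with http_build_query() means you don't have to remember to encode the page title yourself, and checking for the keys before echoing avoids notices when the page doesn't exist.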