  1. #1
Sharkyx (SitePoint Enthusiast)

    How to code a crawl script that copies the first paragraph from Wikipedia?

    Hey guys,


Long time no see! I'm trying to write a PHP script for my WordPress blog so that whenever someone visits a website.com/tag/tag-name page, it shows a description of around 250 words pulled from Wikipedia for that term.

For example, if my tag page is websitename.com/tag/theory-of-relativity, the script should look up "Theory of Relativity" on Wikipedia and paste the first 250 words from the Theory of relativity article, so the term gets a short description before I list the WordPress posts tagged with it.

This may sound like black-hat SEO spam to some of you, but I actually believe it's good practice since it adds relevant context and isn't unethical. I want my readers to get an idea of what a term is about (a lot of the tags on my blog are very encyclopedic in nature), and this might help.

I have minimal programming experience, so I'd really appreciate some insight into how I could write a piece of code like this. If someone else finds the idea of such a script useful, why not write it and share it here? Thanks in advance for your efforts.
    ZMEmusic.com - music news and reviews
    ZMEtravel.com - travel and leisure blog

  2. #2
Mittineague (Programming Team)
IMHO no code is needed. When you write a post with a "tag" in it, include a link to the Wikipedia page. Finding the link takes a little time, but only a minimal amount compared to writing the post.

  3. #3
frank1 (SitePoint Wizard)

  4. #4
Daniel15 (SitePoint Enthusiast, Melbourne, Australia)
Another option is the Wikipedia database dumps: the entire Wikipedia database is available for download. Then you wouldn't need to scrape anything, you'd just look terms up in a local copy of the database.
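
For what it's worth, once a dump is imported into a local MySQL database, the lookup could look roughly like this. This is only a sketch: it assumes the classic page / revision / text schema with uncompressed text storage, and the connection details are placeholders:
Code:
<?php
// Rough sketch: fetch the latest wikitext for an article from a local
// MediaWiki database built from a dump. Assumes the classic
// page / revision / text schema and uncompressed text storage.
$pdo = new PDO('mysql:host=localhost;dbname=wikipedia;charset=utf8', 'db_user', 'db_pass');

$title = 'Theory_of_relativity'; // MediaWiki stores titles with underscores

$sql = 'SELECT text.old_text
        FROM page
        JOIN revision ON revision.rev_id = page.page_latest
        JOIN text ON text.old_id = revision.rev_text_id
        WHERE page.page_namespace = 0   -- 0 = main (article) namespace
          AND page.page_title = :title';

$stmt = $pdo->prepare($sql);
$stmt->execute(array('title' => $title));

echo $stmt->fetchColumn(); // raw wiki markup, not HTML
?>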

  5. #5
Sharkyx (SitePoint Enthusiast)
    Thank you for your replies.

@Mittineague: how can you customize tag pages so you can add individual content to each one? From my experience, you can't...

@frank1: scraping is exactly what this is about, I believe. Thanks for the suggestion.
@daniel15: yes, but that would mean downloading gigabytes of data and then figuring out their database structure. Plus, I need it to be up to date, and the best way to achieve that is by scraping Wikipedia directly.

I tried googling for something similar, but I couldn't find anything really relevant.

  6. #6
Daniel15 (SitePoint Enthusiast, Melbourne, Australia)
    In that case I'd check out their API. The "parse" action will parse a page and return the HTML for that page. Example URLs:

http://en.wikipedia.org/w/api.php?action=parse&page=SitePoint&format=php - data for the "SitePoint" page in PHP serialised format, to use with unserialize() in PHP.
http://en.wikipedia.org/w/api.php?action=parse&page=SitePoint&format=xml - data for the "SitePoint" page in XML format (there's a SimpleXML variant sketched after the code below).

    No scraping needed, the data is in an easy-to-use format for you.

I'd strongly recommend donating to Wikipedia if you use its data extensively. Heavy use of their servers costs them quite a bit of money (bandwidth, server processing time, etc.).

    Here's an example for you (PHP):
    Code:
    <?php
    $page = 'SitePoint';
    $api_url = 'http://en.wikipedia.org/w/api.php?action=parse&page=%s&format=php';
    
    // MediaWiki API needs a user-agent to be specified
    $context = stream_context_create(array('http' => array(
    	'user_agent' => 'SitePoint example for topic 748667',
    )));
    
    $data = unserialize(file_get_contents(sprintf($api_url, $page), null, $context));
    
    echo $data['parse']['text']['*'];
    ?>
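
If you'd rather use the XML format from the second URL, the same call with SimpleXML should look something like this (untested sketch, same user-agent caveat as above):
Code:
<?php
$page = 'SitePoint';
$api_url = 'http://en.wikipedia.org/w/api.php?action=parse&page=%s&format=xml';

// MediaWiki API needs a user-agent to be specified
$context = stream_context_create(array('http' => array(
	'user_agent' => 'SitePoint example for topic 748667',
)));

$xml = simplexml_load_string(
	file_get_contents(sprintf($api_url, urlencode($page)), false, $context)
);

// The parsed page HTML lives in <api><parse><text>...</text></parse></api>
echo (string) $xml->parse->text;
?>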

  7. #7
Sharkyx (SitePoint Enthusiast)
    Daniel,

    Your code is wonderful and hits the spot, but it's not there yet.

1. If you pass a phrase of more than one word, it won't return anything.

For example, "new york" won't show anything.

Then "new_york" will show you that you need a redirect and proper capitalization.

Ultimately, "New_York" will show the proper page.


2. I'd only like to show the first <p> from a Wikipedia listing. This will put less strain on my server as well as Wikipedia's. And yes, I've donated to Wikipedia multiple times now, whenever a call-out was made.

A perfect example of what I'm trying to replicate can be seen at PhysOrg.com.

Just click on any of their posts, check out the right sidebar and click on a tag or "more".

CLEAR EXAMPLE: PhysOrg.com - magnetic field

Thank you everyone for your help.


  8. #8
Daniel15 (SitePoint Enthusiast, Melbourne, Australia)
    Yes, my code was just an example, not a production-ready script. You will definitely have to modify it.

Quote Originally Posted by Sharkyx View Post
1. If you pass a phrase of more than one word, it won't return anything. For example, "new york" won't show anything; then "new_york" will show you that you need a redirect and proper capitalization; ultimately, "New_York" will show the proper page.
Modify the script to handle that: read MediaWiki's API documentation and see how to handle redirects (there's a rough sketch at the end of this post).

Quote Originally Posted by Sharkyx View Post
2. I'd only like to show the first <p> from a Wikipedia listing. This will put less strain on my server as well as Wikipedia's.
I think the API only returns the whole page; from that, you'd grab the first paragraph yourself. Returning the whole page should actually be less stressful on their servers, since they cache whole pages (they wouldn't cache just the first paragraph), so it should be relatively quick to retrieve anyway.

Quote Originally Posted by Sharkyx View Post
A perfect example of what I'm trying to replicate can be seen at PhysOrg.com.
    They might be using the API, or might just use the Wikipedia dumps I talked about earlier. Perhaps ask them what approach they used?
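
To give you a rough idea, something along these lines should cover both points: guess a title from the tag slug, let the API resolve redirects (via the parse module's redirects parameter, assuming the API version supports it), and pull the first non-empty <p> out of the returned HTML. This is an untested sketch, it uses format=json with json_decode() instead of the serialised PHP format, and the slug-to-title conversion is guesswork on my part:
Code:
<?php
// Untested sketch: tag slug -> first paragraph of the matching Wikipedia article.
$slug = 'theory-of-relativity';

// Guess a title from the slug and let the API's redirect resolution do the rest.
$title = ucfirst(str_replace('-', ' ', $slug)); // "Theory of relativity"

$api_url = 'http://en.wikipedia.org/w/api.php?action=parse&page=%s&redirects=1&format=json';

// MediaWiki API needs a user-agent to be specified
$context = stream_context_create(array('http' => array(
	'user_agent' => 'SitePoint example for topic 748667',
)));

$json = file_get_contents(sprintf($api_url, urlencode($title)), false, $context);
$data = json_decode($json, true);

if (!isset($data['parse']['text']['*'])) {
	die('No article found for ' . $title);
}

// Pull the first non-empty <p> out of the returned HTML.
$doc = new DOMDocument();
libxml_use_internal_errors(true); // Wikipedia's markup trips up DOMDocument otherwise
$doc->loadHTML('<?xml encoding="utf-8"?>' . $data['parse']['text']['*']);
libxml_clear_errors();

$first_paragraph = '';
foreach ($doc->getElementsByTagName('p') as $p) {
	$text = trim($p->textContent);
	if ($text !== '') {
		$first_paragraph = $text;
		break;
	}
}

echo $first_paragraph;
?>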

  9. #9
Non-Member (spb)
To Daniel15: thanks for sharing the idea about the Wikipedia API. Strange, but I'd never heard of it before and just scraped the site.
To Sharkyx:
I understand that you're not talking about black-hat SEO, but there are a few services that can show you which keywords Wikipedia is at the top of Google's SERPs for, so you can at least exclude those keywords from your "posts", because it's hard to outperform Wikipedia.
    Last edited by G.Suvorov; Sep 22, 2011 at 08:15. Reason: shhhh

