  1. #1
    SitePoint Member

    Wikipedia User Agent

    Hey,

First-time poster here; I could really do with some help, as I can't find a solution anywhere online.

I've made a script that uses cURL to take information from Wikipedia and put it into a MySQL DB. So far I have over 3,000 records, but I need more, and when I run the script I now get this error:

"Scripts should use an informative User-Agent string with contact information, or they may be IP-blocked without notice."

Now, I know what I need to do: I found instructions on the wiki (User-Agent policy - Meta), and apparently I need to set a User-Agent, but I just don't know how to do it.

    I've tried adding this code:
    PHP Code:
ini_set("user_agent", "MediaArchiver (+http://www.mywebsite.com/)");
    but that doesn't work.

Any idea how I should go about this? Thanks to anyone who can help, because I can't find anything else online to guide me.

  2. #2
    SitePoint Member
I'm still learning cURL myself, but I think you're going to need to use curl_setopt().

  3. #3
Immerse (SitePoint Wizard)
    Yep, curl_setopt is the way to go:

    PHP Code:
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, "MediaArchiver (+http://www.mywebsite.com/)");
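For what it's worth, ini_set('user_agent', ...) only changes the User-Agent for PHP's stream wrappers (file_get_contents() and friends), not for cURL, which is why the earlier attempt had no effect. A fuller sketch of a complete request might look like this (the URL is just a placeholder):

PHP Code:
// Minimal sketch: fetch one page with a descriptive User-Agent.
// The URL is a placeholder; substitute the page you actually need.
$ch = curl_init("http://en.wikipedia.org/wiki/Some_Page");
curl_setopt($ch, CURLOPT_USERAGENT, "MediaArchiver (+http://www.mywebsite.com/)");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of echoing it
$html = curl_exec($ch);
if ($html === false) {
    echo "cURL error: " . curl_error($ch);
}
curl_close($ch);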

  4. #4
    SitePoint Member
    Guys,

Thanks for the quick replies; I really appreciate the help. The above code didn't stop the error message, though.

Has anyone used cURL on Wikipedia and figured out how to get around this?

  5. #5
logic_earth
You have to slow down. If you pull too much from Wikipedia in too short a time, they will consider your script a bad bot.
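For instance, you could pace the loop with a pause between requests. A sketch (the one-second delay is a guess, not an official limit, and fetch_page()/save_to_db() are hypothetical helpers wrapping the cURL and MySQL code):

PHP Code:
// Sketch: pause between requests so the crawl stays polite.
// One second is an arbitrary choice; Wikipedia publishes no exact limit.
foreach ($pageTitles as $title) { // $pageTitles: your own list of pages
    $data = fetch_page($title);   // hypothetical helper doing the cURL request
    save_to_db($data);            // hypothetical helper writing to MySQL
    sleep(1);                     // wait a second before the next request
}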
    Logic without the fatal effects.
    All code snippets are licensed under WTFPL.


  6. #6
Cups (SitePoint Wizard)
Consider hitting DBpedia instead.

It'll mean learning some SPARQL, but it's not far removed from SQL as long as you grok what a namespace is.
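To give an idea of the shape, a query can be sent to DBpedia's public endpoint over plain HTTP. A rough sketch (the resource and predicate here are illustrative; adjust the query to whatever data you actually need):

PHP Code:
// Sketch: ask DBpedia's public SPARQL endpoint for an English abstract.
// The resource (Radiohead) is illustrative; swap in whatever you need.
$query = 'SELECT ?abstract WHERE { '
       . '<http://dbpedia.org/resource/Radiohead> '
       . '<http://dbpedia.org/ontology/abstract> ?abstract . '
       . 'FILTER (lang(?abstract) = "en") }';
$url = 'http://dbpedia.org/sparql?format=json&query=' . urlencode($query);

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_USERAGENT, 'MediaArchiver (+http://www.mywebsite.com/)');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$json = curl_exec($ch);
curl_close($ch);

$result = json_decode($json, true); // rows are in $result['results']['bindings']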

  7. #7
    SitePoint Member
Looking at Cups's reply, perhaps I'm not doing this the best way...

    What I'm trying to do is compile a database of English Music Artists, Movies, TV Shows, Video Games and Books.

The reason is that I want users on my site to be able to add these items to their "profiles", but I don't want them to be able to add false or made-up ones, which is why I'm going for the DB. I did consider writing a cURL script that fetches all the information about a media item at the moment the user adds it to their profile, but that would require too much processing time.

Thus I set about creating a cURL script that goes to Wikipedia, takes the name, image and description of the media item and sticks it in my DB (along with the original page so I can link back to Wikipedia as per their terms).

DBpedia looks like a better way at a quick glance, or perhaps there's some other suggestion? I did try exporting the MusicBrainz DB, but that only covers music.

Once again I want to thank everyone for their replies. I'm new to PHP but have been coding in Java for a while.

  8. #8
Cups (SitePoint Wizard)
Yeah, you can hit DBpedia and cache locally whatever you want; it sounds like you just want the contents of the infobox.

You then just need to routinely update your cached data.
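One simple way to handle that refresh (a sketch; the file-based cache and the one-week lifetime are my own assumptions, nothing DBpedia prescribes):

PHP Code:
// Sketch: serve a locally cached copy, refetching once it goes stale.
// A one-week lifetime is arbitrary; pick whatever suits your data.
function cached_fetch($url, $cacheFile, $maxAge = 604800) {
    if (file_exists($cacheFile) && (time() - filemtime($cacheFile)) < $maxAge) {
        return file_get_contents($cacheFile); // still fresh, use the local copy
    }
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_USERAGENT, 'MediaArchiver (+http://www.mywebsite.com/)');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $data = curl_exec($ch);
    curl_close($ch);
    if ($data !== false) {
        file_put_contents($cacheFile, $data); // refresh the cache
    }
    return $data;
}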

