  1. #1
    ian0502 (SitePoint Addict)
    Join Date: Jun 2002
    Posts: 240

    automatic content generation

    I need your help on this one, guys! I want to create a PHP script that automatically populates a content page with information about a specific subject from Wikipedia. For example, if my subject were "web design", the script would automatically pull in the Wikipedia page on web design. According to Wikipedia's TOS, I can't pull information directly from their site on every page load, so I'll create a caching script once the main script is working. So... how do I do this? It's been a while since I wrote any real code, so I'm feeling pretty rusty.

    Thanks Sitepointers!

    Ian


  2. #2
    Procode (SitePoint Addict)
    Join Date: Dec 2006
    Location: New York
    Posts: 371
    Wait, you lost me. Didn't you say it was illegal?

  3. #3
    ian0502 (SitePoint Addict)
    Join Date: Jun 2002
    Posts: 240
    No no no, it's NOT ILLEGAL. Wikipedia says that you can use their content (as long as credit is given on the page), but they don't want you to pull the content directly from their site into your site each time the page is loaded as this could overburden their servers if your site has a lot of traffic. I am trying to build the script to pull the info into my site, and then I will create a caching script for it so that it complies with their rules.


  4. #4
    Kailash Badu (SitePoint Wizard)
    Join Date: Nov 2005
    Posts: 2,560
    You can copy the content and reuse it on your website. However, I believe you cannot fetch it live from Wikipedia's site each time it is displayed on yours.

  5. #5
    Kailash Badu (SitePoint Wizard)
    Join Date: Nov 2005
    Posts: 2,560
    Well, ian0502 summed it up.

  6. #6
    chronic (SitePoint Zealot)
    Join Date: Dec 2005
    Posts: 101
    1. Pull content from site
    2. Write content to file on site
    3. Use if statement to see if file exists, if so, display the file
    4. If not, pull the content from the site (and repeat)

    PHP Code:
    if (file_exists("cache/webdesign.html")) {
        // a cached copy exists, so display it
        include "cache/webdesign.html";
    } else {
        // pull info from wikipedia/web_design
        // then put it in a file such as cache/webdesign.html
    }
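
For what it's worth, the four steps above could be filled in roughly like this. It is only a sketch: `cached_fetch()`, its parameters, and the one-day default lifetime are made-up names and values, not anything from Wikipedia's rules, and the optional `$fetcher` argument is there only so the download step can be swapped out (for testing, or for cURL). The default uses `file_get_contents()`, which needs `allow_url_fopen` enabled.

```php
<?php
// Sketch only: cached_fetch() and its parameters are hypothetical names.

/**
 * Return the cached copy of $url if it is fresh enough; otherwise fetch
 * the page once, store it in $cacheFile, and return the new copy.
 */
function cached_fetch(string $url, string $cacheFile, int $maxAge = 86400, ?callable $fetcher = null): string
{
    // Forget any remembered stat() result for this file.
    clearstatcache(true, $cacheFile);

    // Serve from the cache while the file is younger than $maxAge seconds.
    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $maxAge) {
        return (string) file_get_contents($cacheFile);
    }

    // Default fetcher: plain file_get_contents() (needs allow_url_fopen).
    $fetcher = $fetcher ?? 'file_get_contents';
    $html = $fetcher($url);
    if ($html === false) {
        // Fetch failed: fall back to a stale copy if one exists.
        return is_file($cacheFile) ? (string) file_get_contents($cacheFile) : '';
    }

    // Create the cache directory on first use, then write the file.
    $dir = dirname($cacheFile);
    if (!is_dir($dir)) {
        mkdir($dir, 0755, true);
    }
    file_put_contents($cacheFile, $html, LOCK_EX);

    return $html;
}
```

You would then call something like `echo cached_fetch("http://en.wikipedia.org/wiki/Web_design", "cache/webdesign.html");` in the page. Echoing the string (rather than using `include`) means any PHP that ended up in the cached file is never executed.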


  7. #7
    ian0502 (SitePoint Addict)
    Join Date: Jun 2002
    Posts: 240
    Quote Originally Posted by chronic
    1. Pull content from site
    2. Write content to file on site
    3. Use if statement to see if file exists, if so, display the file
    4. If not, pull the content from the site (and repeat)

    PHP Code:
    if (file_exists("cache/webdesign.html")) {
        // a cached copy exists, so display it
        include "cache/webdesign.html";
    } else {
        // pull info from wikipedia/web_design
        // then put it in a file such as cache/webdesign.html
    }

    Yes, but how do I pull the info from the site while stripping out the Wikipedia template? I've seen other sites do this successfully, but I can't seem to remember them now.


  8. #8
    Procode (SitePoint Addict)
    Join Date: Dec 2006
    Location: New York
    Posts: 371
    Quote Originally Posted by ian0502
    No no no, it's NOT ILLEGAL. Wikipedia says that you can use their content (as long as credit is given on the page), but they don't want you to pull the content directly from their site into your site each time the page is loaded as this could overburden their servers if your site has a lot of traffic. I am trying to build the script to pull the info into my site, and then I will create a caching script for it so that it complies with their rules.
    Oh, I must have misread your post. Pardon me.

  9. #9
    ian0502 (SitePoint Addict)
    Join Date: Jun 2002
    Posts: 240
    I just realized that I may never be able to cache all the relevant pages of content, since each page contains keywords that link to other internal pages. Theoretically, I might need to download a million pages just to cover all my bases with one subject. I heard recently that you can download the entire Wikipedia database for use on your own sites, but I have yet to find it on their site. Any thoughts?

    Ian


  10. #10
    viveknarula (Always learning)
    Join Date: Mar 2006
    Location: INDIA
    Posts: 418
    You can use regular expressions to fetch a specific part of a page. For example, open the page with a function such as file_get_contents(), then use a regular expression to pull the data out of that page and display or cache it on your site.

    NOTE -- For regular expressions to work, the fetched pages must all share the same markup structure.
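
As a rough illustration of the regular-expression idea, here is a sketch that grabs the first paragraph out of a chunk of HTML. `extract_first_paragraph()` is a made-up helper name, and the pattern assumes the content you want sits in an ordinary `<p>` element, which may not hold for every page.

```php
<?php
// Sketch only: extract_first_paragraph() is a hypothetical helper.

/**
 * Return the text of the first <p>...</p> element in $html,
 * or null if the pattern does not match.
 */
function extract_first_paragraph(string $html): ?string
{
    // "s" lets "." match newlines, "i" ignores case, and ".*?" is
    // non-greedy so the match stops at the first closing </p>.
    if (preg_match('#<p\b[^>]*>(.*?)</p>#is', $html, $m) !== 1) {
        return null;
    }

    // Drop inline tags, decode entities such as &amp;, and trim whitespace.
    return trim(html_entity_decode(strip_tags($m[1]), ENT_QUOTES, 'UTF-8'));
}
```

In practice `$html` would be the string returned by file_get_contents(). Regular expressions are brittle against markup changes, so it is worth checking for the `null` result before caching anything.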

  11. #11
    noddy (Fully Sweet Car)
    Join Date: Aug 2002
    Location: Perth, Western Australia
    Posts: 759
    Example

    This will get you the entire page's content.

    http://www.lionslair.net.au/~nathanr...t_contents.php

    I have passed it the URL of this thread.

    PHP Code:
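    <?php
    // Fetch the raw HTML of the page (requires allow_url_fopen to be enabled)
    $content = file_get_contents("http://www.sitepoint.com/forums/showthread.php?t=448499");
    // Escape the markup so it shows as text and cannot close the textarea early
    echo '<textarea cols="400" rows="100">' . htmlspecialchars($content) . '</textarea>';
    ?>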
    Here is the same script pointed at http://en.wikipedia.org/wiki/Web_design

    http://www.lionslair.net.au/~nathanr...t_contents.php

    Then I would take everything starting from
    Code:
    <h1 class="firstHeading">
    up to the end of the article body, which is around this marker:

    Code:
    <p><a name="See_also" id="See_also"></a></p>
    After that you can run strip_tags() on the result to format it how you want it.

    However, the site most likely formats its content in many different ways from page to page, so you need to define the main sections that hold the content you are looking for.
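
A minimal sketch of the marker approach described above, using plain string functions instead of a regex. `slice_between()` is a made-up helper name; the two markers are the ones quoted in this post and may well differ on other pages or change over time.

```php
<?php
// Sketch only: slice_between() is a hypothetical helper.

/**
 * Return the part of $html from $startMarker up to (not including)
 * $endMarker, or null if the start marker is missing.
 */
function slice_between(string $html, string $startMarker, string $endMarker): ?string
{
    $start = strpos($html, $startMarker);
    if ($start === false) {
        return null;
    }

    // Look for the end marker only after the start marker.
    $end = strpos($html, $endMarker, $start + strlen($startMarker));

    // If the end marker is absent, keep everything to the end of the page.
    return $end === false
        ? substr($html, $start)
        : substr($html, $start, $end - $start);
}
```

Calling `strip_tags()` on the returned slice then leaves plain text. strip_tags() also accepts a second argument listing tags to keep, which helps if you want to preserve paragraphs or headings.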

  12. #12
    ian0502 (SitePoint Addict)
    Join Date: Jun 2002
    Posts: 240
    You know, I think I used something like that before. If I remember correctly, I used explode() to split the content into an array around a certain tag that I declared in the code, which let me throw away all the crap located before and after it.
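
The explode() trick might have looked something like the sketch below. This is a guess at the general shape, not the original code: `explode_between()` is a made-up name, and the start and end tags are whatever markers you choose to declare.

```php
<?php
// Sketch only: explode_between() is a hypothetical helper.

/**
 * Keep only the text that sits after $startTag and before $endTag.
 */
function explode_between(string $html, string $startTag, string $endTag): string
{
    // Limit 2: split once, into [text before the tag, everything after it].
    $parts = explode($startTag, $html, 2);
    $rest = $parts[1] ?? '';   // empty string if the start tag was not found

    // Split again on the end tag and keep only what came before it.
    return explode($endTag, $rest, 2)[0];
}
```

If the end tag is missing, this returns everything after the start tag, so it is worth checking the result before writing it to the cache.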


