  1. #1
    SitePoint Enthusiast
    Join Date
    Jun 2008
    Posts
    29

    Need help with scraper script

    I am trying to make a guitar tab search engine.

    Here is my scraper script for one of the sites:

    Code:
    <?php
    
    set_time_limit(0);
    
    
    
    mysql_connect($dbhost, $dbuser, $dbpass);
    mysql_select_db($dbname);
    
    
    
    function get ($a,$b,$c)
    {
      $y = explode($b,$a);
      $x = explode($c,$y[1]);
      
      return $x[0];
    }
    
    function slug($str)
    {
    	$str = strtolower(trim($str));
    	$str = preg_replace("/[^a-z0-9-]/", "-", $str);
    	$str = preg_replace("/-+/", "-", $str);
    	$str = rtrim($str, "-");
    	
    	return $str;
    }
    
    
    
    for ($i = 1; $i <= 750000; $i++) 
    {
      $content = file_get_contents("site/print.php?what=tab&id=$i");
    
      if ($content != "tab not found")
      {
        $title = get($content, "<title>", "</title>");
    
        $matches = array();
        
        if (preg_match("/Bass Tab/", $title))
        {
          preg_match('$([a-z ]+) Bass Tab([a-z ]+)by ([a-z ]+)$i', $title, $matches);
    
          $type = "Bass";
          $song = $matches[1];
          $band = $matches[3];
          
        }
        else
        {
          preg_match('$([a-z ]+) Tab([a-z ]+)by ([a-z ]+)$i', $title, $matches);
    
          $type = "Guitar";
          $song = $matches[1];
          $band = $matches[3];
        }
        
        $slug_song = slug($song);
        $slug_band = slug($band);
    
        $tab = get($content, "<pre>", "</pre>");
    
        foreach ($remove as $value) 
        {
          $tab = str_replace($value, "", $tab);
        }
    
        $tab = trim($tab);
    
        $sql = mysql_query("SELECT * FROM tabs WHERE band='$band' AND song='$song' ORDER BY version DESC LIMIT 1");
        $sql_count = mysql_num_rows($sql);
        
        if ($sql_count == 0)
        {
          $version = 1;
        }
        else
        {
          while ($row = mysql_fetch_array($sql))
          {
            $version = $row["version"];
          }
        
          $version = $version + 1;
        }
        
        $date_posted = time();
        
        mysql_query("INSERT INTO tabs (xid, poster_id, poster_name, type, band, slug_band, song, slug_song, tab, version, date_posted, is_approved) VALUES ('$i', '0', 'Guest', '$type', '$band', '$slug_band', '$song', '$slug_song', '$tab', '$version', '$date_posted', '1')");
        
        echo " yes ";
      }
      else
      {
        echo " no ";
      }
    }
    
    ?>
    Now this is running into problems. It just stops inserting into the database after 1000 rows, and sometimes it isn't even scraping the page content at all.

    Can I get some help?

  2. #2
    AnthonySterling (Twitter: @AnthonySterling)
    Join Date
    Apr 2008
    Location
    North-East, UK.
    Posts
    6,111
    What is 'xid' in your SQL table? I noticed you're populating it using a loop, so the next time the loop runs you'll get the same id again. Is this planned?

    Maybe an auto increment field would be better suited for the ID?

    SilverB.

  3. #3
    SitePoint Enthusiast
    Join Date
    Jun 2008
    Posts
    29
    Quote, originally posted by SilverBulletUK:
    What is 'xid' in your SQL table? I noticed you're populating it using a loop, so the next time the loop runs you'll get the same id again. Is this planned?

    Maybe an auto increment field would be better suited for the ID?

    SilverB.
    xid is the ID of the tab on the site being scraped.

    The table also has its own auto_increment id field.
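Because xid identifies the tab on the source site, a re-run could use it to skip tabs that are already stored. The sketch below assumes PDO is available (the script above uses the older `mysql_*` functions instead) and uses placeholders, so an apostrophe in `$band`, `$song`, or `$tab` cannot break the SQL the way direct interpolation normally would. The function names `already_imported` and `insert_tab`, and the reduced column list, are hypothetical and only for illustration.

```php
<?php

/** Return true if a tab with this source-site ID is already stored. */
function already_imported(PDO $db, int $xid): bool
{
    $stmt = $db->prepare('SELECT 1 FROM tabs WHERE xid = ? LIMIT 1');
    $stmt->execute([$xid]);

    // fetchColumn() returns false when the query produced no row.
    return $stmt->fetchColumn() !== false;
}

/**
 * Insert one tab. The placeholders keep quotes in the data from
 * being interpreted as part of the SQL statement.
 * (Hypothetical, reduced column list; the real table has more columns.)
 */
function insert_tab(PDO $db, int $xid, string $band, string $song, string $tab): void
{
    $stmt = $db->prepare(
        'INSERT INTO tabs (xid, band, song, tab) VALUES (?, ?, ?, ?)'
    );
    $stmt->execute([$xid, $band, $song, $tab]);
}
```

In the loop, calling `already_imported($db, $i)` before fetching the page would let an interrupted run resume without inserting the same xid twice. Setting `PDO::ATTR_ERRMODE` to `PDO::ERRMODE_EXCEPTION` on the connection makes a failed insert raise an exception instead of failing silently.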

  4. #4
    spikeZ
    Join Date
    Aug 2004
    Location
    Manchester UK
    Posts
    13,807
    Can you show us the CREATE TABLE syntax please, DionDev?
    Mike Swiffin - Community Team Advisor
    Only a woman can read between the lines of a one word answer.....

  5. #5
    SitePoint Enthusiast
    Join Date
    Jun 2008
    Posts
    29
    Quote, originally posted by spikeZ:
    Can you show us the CREATE TABLE syntax please, DionDev?
    Here it is:

    Code:
    CREATE TABLE tabs (
      id int unsigned NOT NULL auto_increment,
      xid int unsigned NOT NULL,
      poster_id int unsigned NOT NULL,
      poster_name text NOT NULL,
      poster_ip text NOT NULL,
      type text NOT NULL,
      band text NOT NULL,
      slug_band text NOT NULL,
      song text NOT NULL,
      slug_song text NOT NULL,
      tab text NOT NULL,
      version int NOT NULL,
      date_posted int unsigned NOT NULL,
      rating int NOT NULL default 0,
      views int unsigned NOT NULL default 0,
      is_approved int unsigned NOT NULL default 0, 
      PRIMARY KEY (id)
    );
    Could the problem be that I am trying to insert 750,000 rows and the script has to connect to the website for every single one, so some kind of overload is being caused?
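One thing worth ruling out here: `file_get_contents()` returns `false` when a request fails, and the script above only compares the result to "tab not found", so a failed fetch falls through to the parsing and insert branch. Below is a minimal sketch of a more defensive fetch, assuming a 10-second timeout, a few retries, and a short pause between attempts are acceptable. `fetch_page`, its parameters, and the injectable `$fetcher` hook are hypothetical names and not part of the original script.

```php
<?php

/**
 * Fetch a URL with a timeout and a few retries.
 * Returns the body as a string, or null if every attempt failed.
 *
 * $fetcher is a hypothetical hook so the function can be exercised
 * without network access; by default it wraps file_get_contents().
 */
function fetch_page(string $url, int $retries = 3, int $pauseMs = 500, ?callable $fetcher = null): ?string
{
    if ($fetcher === null) {
        $fetcher = function (string $url) {
            // Give up on a request after 10 seconds instead of hanging.
            $context = stream_context_create(['http' => ['timeout' => 10]]);
            return @file_get_contents($url, false, $context);
        };
    }

    for ($attempt = 1; $attempt <= $retries; $attempt++) {
        $body = $fetcher($url);
        if ($body !== false) {
            return $body;
        }
        if ($attempt < $retries) {
            // Pause before retrying so the remote server is not hammered.
            usleep($pauseMs * 1000);
        }
    }

    return null;
}
```

In the loop, a `null` result from `fetch_page()` could be logged and skipped rather than parsed, which would separate network failures from pages that really say "tab not found".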

  6. #6
    spikeZ
    Join Date
    Aug 2004
    Location
    Manchester UK
    Posts
    13,807
    Possibly. I am setting it up on my test server so I can have a good look at it.
    Which site do you scrape from? (PM me if you want!)

  7. #7
    SitePoint Member
    Join Date
    Jun 2007
    Posts
    2
    Why don't you try echoing the data after 1000 queries instead of adding it to the database (or maybe storing it in a flat-file database)? Then we would know whether it is a database problem or a scraping problem.
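A minimal sketch of that idea: parse each page title and append the result to a flat file instead of touching the database, so scraping failures and insert failures can be told apart. The title format ("Song Tab by Band" or "Song Bass Tab by Band") is an assumption taken from the regular expressions in the original script; `parse_title`, `log_row`, and the one-JSON-array-per-line file layout are hypothetical choices.

```php
<?php

/**
 * Split a page title such as "Song Name Tab by Band Name" into its parts.
 * Returns null when the title does not match, instead of leaving the
 * song and band silently empty.
 */
function parse_title(string $title): ?array
{
    // (.+?) and (.+) also accept digits and punctuation, which [a-z ]+ rejects.
    if (!preg_match('/^(.+?) (Bass )?Tab\b(.*?)\bby (.+)$/i', trim($title), $m)) {
        return null;
    }

    return [
        'type' => $m[2] !== '' ? 'Bass' : 'Guitar',
        'song' => trim($m[1]),
        'band' => trim($m[4]),
    ];
}

/**
 * Append one scrape result to a flat file (one JSON array per line)
 * so the run can be inspected without involving the database.
 */
function log_row(string $file, int $xid, ?array $parsed): void
{
    $row = $parsed === null
        ? [$xid, 'PARSE_FAILED', '', '']
        : [$xid, $parsed['type'], $parsed['song'], $parsed['band']];

    if (file_put_contents($file, json_encode($row) . "\n", FILE_APPEND | LOCK_EX) === false) {
        throw new RuntimeException("Cannot write to $file");
    }
}
```

If the log keeps growing past row 1000 while the table does not, the database insert is the likely culprit. Lines marked `PARSE_FAILED` would instead point at titles the pattern does not cover.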

