  1. #1
    SitePoint Zealot abstraktmedia's Avatar
    Join Date
    Feb 2004
    Location
    Ljubljana
    Posts
    191

    grabbing a part of a page

    Hi there...

    Can anyone help me extract one part of a page?

    The part I want to grab is between <div id="content"> and </div>.
    I have a feeling I'll have to use a regexp here.

    Or if there's a class out there for doing this, even better.

    Thanks...
    exit(0);

  2. #2
    SitePoint Member
    Join Date
    Mar 2004
    Location
    New Delhi
    Posts
    8

    A simple solution

    Hi,

    Regex would certainly help in solving your problem.

    I wrote this some time back to extract all the mp3 links from a website.

    It's a slightly modified version of that script to suit you. Mind you, it's probably the worst way this could be done, but it does the job. It will have problems with nested divs.

    PHP Code:
    <?php

    // Load the page into a string for manipulation.
    function load_file($address)
    {
        return file_get_contents($address);
    }

    // Fetch the next $opentag ... $closetag element from $source, starting at $offset.
    // The matched text (tags included) is stored in $fetch.
    // Returns the offset just past the closing tag, or false when nothing more is found.
    function fetch_next($opentag, $closetag, $source, &$fetch, $offset = 0)
    {
        $start = strpos($source, $opentag, $offset);
        if ($start === false)
            return false;

        // look for the closing tag after the opening tag, not from $offset
        $end = strpos($source, $closetag, $start);
        if ($end === false)
            return false;
        $end += strlen($closetag);

        $fetch = substr($source, $start, $end - $start);
        return $end;
    }

    // Arguments: URL to extract from, opening tag, closing tag.
    // Every opening-tag ... closing-tag element found at the given URL is returned in an array.
    function get($url, $opentag, $closetag)
    {
        $page = load_file($url);
        if (!$page)
            die("can't load specified page");

        $matches = array();
        $offset = 0;
        while (($offset = fetch_next($opentag, $closetag, $page, $found, $offset)) !== false)
        {
            $matches[] = $found;
        }
        return $matches;
    }

    // usage example
    print_r(get("http://test/getdiv.htm", "<div id=\"content\">", "</div>"));
    ?>
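
For the nested-div problem mentioned above, one alternative to string searching is PHP's DOM extension. This is only a minimal sketch, assuming the DOM extension is enabled; the helper name inner_html_by_id() and the sample markup are made up for illustration and are not part of the script above.

```php
<?php
// Hypothetical helper: returns the inner HTML of the first element with the
// given id, or null when no such element exists.
// Assumes $id contains no double quotes (it is placed inside an XPath string).
function inner_html_by_id($html, $id)
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // real-world pages are rarely valid markup
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//*[@id="' . $id . '"]');
    if ($nodes === false || $nodes->length === 0)
        return null;

    // Serialize each child in turn, so nested <div>s are kept intact.
    $inner = '';
    foreach ($nodes->item(0)->childNodes as $child)
        $inner .= $doc->saveHTML($child);
    return $inner;
}

// Sample markup (an assumption, not the original poster's page).
$html = '<html><body><div id="content">outer <div>nested</div> tail</div><p>footer</p></body></html>';
$content = inner_html_by_id($html, 'content');
echo $content, "\n";
```

Because the parser tracks which closing tag belongs to which opening tag, the inner div no longer cuts the result short.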

  3. #3
    PHP manual bot bronze trophy Gaheris's Avatar
    Join Date
    Oct 2003
    Location
    Germany
    Posts
    2,195
    PHP Code:
    preg_match('~<div id="content">(.*?)</div>~i', $str, $match);
    $content = $match[1];
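
One caveat with this pattern: without the s modifier, "." does not match newlines, so a div whose content spans several lines would not match at all. A hedged sketch (the sample $str is made up) that adds the modifier and checks the return value of preg_match() before reading $match:

```php
<?php
// Sample input (an assumption for illustration): content spanning several lines.
$str = "<body>\n<div id=\"content\">\nline one\nline two\n</div>\n<div>other</div>\n</body>";

// i = case-insensitive, s = let "." match newlines too.
// (.*?) is lazy, so it stops at the first closing </div>.
if (preg_match('~<div id="content">(.*?)</div>~is', $str, $match)) {
    $content = trim($match[1]);
} else {
    $content = null;   // pattern not found
}

var_dump($content);
```

The lazy match still ends at the first </div>, so a div nested inside the content div remains a problem for this approach.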

  4. #4
    SitePoint Zealot abstraktmedia's Avatar
    Join Date
    Feb 2004
    Location
    Ljubljana
    Posts
    191
    Thanx guys...
    exit(0);

  5. #5
    SitePoint Evangelist
    Join Date
    Feb 2004
    Location
    Sofia, Bulgaria
    Posts
    421
    check this function i wrote a month ago.. i tested it against regex and it was faster.. you can read my post on this topic if you wish: http://www.sitepoint.com/forums/showthread.php?t=154958

    PHP Code:
    // method of a content-grabbing class; uses $this->content, $this->offset and $this->max_offset
    function adv_copy_str($start_str, $end_str) {
      $start_pos = strpos($this->content, $start_str, $this->offset);
      // NOTE: the comparison operators in this function were lost from the post; >= and < are assumed
      if (($start_pos !== false) AND ($start_pos >= $this->offset)) {
        $start_pos += strlen($start_str);
        $end_pos = strpos($this->content, $end_str, $start_pos);
        if ($end_pos !== false) {
          if ($end_pos < $this->max_offset) {
            $this->offset = $end_pos;
          }
          else {
            $this->offset = $this->max_offset;
          }
          $temp = substr($this->content, $start_pos, ($end_pos-$start_pos));
          return $temp;
        }
      }
      return 0;
    }

    guys, if someone could suggest improvements for my function.. 10x in advance..

  6. #6
    SitePoint Member
    Join Date
    Mar 2004
    Location
    New Delhi
    Posts
    8
    hi dacool,

    I looked at your function and it looks great. If you have a look at the function I gave above, it's very much the same, apart from the variable names.
    Great minds think alike.

    PHP Code:
    // Fetch the next $opentag ... $closetag element from $source, starting at $offset.
    // The matched text (tags included) is stored in $fetch.
    // Returns the offset just past the closing tag, or false when nothing more is found.
    function fetch_next($opentag, $closetag, $source, &$fetch, $offset = 0)
    {
        $start = strpos($source, $opentag, $offset);
        if ($start === false)
            return false;

        // look for the closing tag after the opening tag, not from $offset
        $end = strpos($source, $closetag, $start);
        if ($end === false)
            return false;
        $end += strlen($closetag);

        $fetch = substr($source, $start, $end - $start);
        return $end;
    }
    I was just wondering why I didn't use regex for this task. Actually, it gave me problems when the pattern occurred more than once on the page.

    For example, say we had to extract all the div tags from a web page like
    Code:
    <html>
    <body>
    
    <div content='right'>
     my content here
    </div>
    
    This is some text that shouldn't be included if I want to scrape the data between the div's right?
    
    <div id='content'>
    this is scraped fine..
    </div>
    
    </body>
    </html>
    When I ran a regex like the one given above
    PHP Code:
    preg_match('~<div id="content">(.*?)</div>~i', $str, $match);
    $content = $match[1];
    it got me everything from the first opening div tag to the last closing div tag. Thus the unwanted text, i.e. "This is some text that shouldn't be included if I want to scrape the data between the div's right?", was included.

    Please guys, if you have an elegant way of doing this task with regex, I would love to hear from you.
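
One possible regex answer to this request, as a sketch only: it assumes the divs are not nested, and it matches any opening div tag so that the quote style of the attributes does not matter. The sample page is the one from this post, with the middle sentence shortened.

```php
<?php
// The sample page from the post above, as a string.
$page = <<<HTML
<html>
<body>

<div content='right'>
 my content here
</div>

This is some text that shouldn't be included.

<div id='content'>
this is scraped fine..
</div>

</body>
</html>
HTML;

// <div\b[^>]*> matches any opening div tag, whatever its attributes or quote style.
// (.*?) is lazy and, with the s modifier, may span several lines.
preg_match_all('~<div\b[^>]*>(.*?)</div>~is', $page, $matches);

// $matches[1] holds the captured contents, one entry per div.
$contents = array_map('trim', $matches[1]);
print_r($contents);
```

The text between the two divs is skipped because the lazy (.*?) stops at the first closing tag after each opening tag.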

  7. #7
    PHP manual bot bronze trophy Gaheris's Avatar
    Join Date
    Oct 2003
    Location
    Germany
    Posts
    2,195
    I'm not so sure; the regex works correctly for me. If you modified it, can I see the regex you used?

  8. #8
    Now with customized title Jump's Avatar
    Join Date
    Sep 2002
    Location
    The Restaurant at The End of The Universe
    Posts
    1,423
    Your example above has

    HTML Code:
    <div id='content'>
    But your regex has

    HTML Code:
    <div id="content">
    You are looking for double quotes when the page has single quotes?

  9. #9
    PHP manual bot bronze trophy Gaheris's Avatar
    Join Date
    Oct 2003
    Location
    Germany
    Posts
    2,195
    If you are referring to me Jump: my regex is based on the first example given in his first post.

  10. #10
    Now with customized title Jump's Avatar
    Join Date
    Sep 2002
    Location
    The Restaurant at The End of The Universe
    Posts
    1,423
    Nah, I was referring to the post above yours. But now I see it was posted by anurag, not the originator of the question.

  11. #11
    SitePoint Evangelist
    Join Date
    Feb 2004
    Location
    Sofia, Bulgaria
    Posts
    421
    Quote Originally Posted by anurag
    I looked at your function and it looks great. If you have a look at the function I gave above, it's very much the same, despite variable name differences
    great minds think alike
    hi anurag,
    i use this function in one class for content grabbing.. but i'm currently working on some projects and have no time to complete it..
    Quote Originally Posted by anurag
    I was just wondering as to why I didn't use regex for this task.
    i ran some tests of my function and regex on a page of about 50K and my function was up to 18 times faster than preg_match().. i'd recommend using regex for more complicated matches, but not when you have just simple begin-end patterns..
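
For anyone who wants to check the speed claim on their own data, a rough timing sketch could look like this. The page is synthetic (about 50K of filler plus one target div) and between() is a made-up helper, not the function from this thread. Results will vary with PHP version and input, so no numbers are claimed here.

```php
<?php
// Hypothetical helper: text between $open and $close using plain string functions.
function between($source, $open, $close)
{
    $start = strpos($source, $open);
    if ($start === false)
        return false;
    $start += strlen($open);
    $end = strpos($source, $close, $start);
    if ($end === false)
        return false;
    return substr($source, $start, $end - $start);
}

// Synthetic page of roughly 50K (an assumption, not the page used in the test above).
$page = str_repeat('<p>filler text</p>', 2800) . '<div id="content">wanted</div>';

$runs = 1000;

// Time the strpos/substr version.
$t = microtime(true);
for ($i = 0; $i < $runs; $i++)
    $a = between($page, '<div id="content">', '</div>');
$time_strpos = microtime(true) - $t;

// Time the preg_match version on the same input.
$t = microtime(true);
for ($i = 0; $i < $runs; $i++) {
    preg_match('~<div id="content">(.*?)</div>~s', $page, $m);
    $b = $m[1];
}
$time_regex = microtime(true) - $t;

printf("strpos: %.4fs  preg_match: %.4fs  same result: %s\n",
       $time_strpos, $time_regex, ($a === $b) ? 'yes' : 'no');
```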

  12. #12
    SitePoint Member
    Join Date
    Mar 2004
    Location
    London
    Posts
    18
    This was exactly what I was looking for. Thank you, anurag, for the code in your first post.

    I have now obtained the data I need for my script. As a next step, is it possible to take the returned data and insert it into a MySQL database instead of sending it to the page for display?

  13. #13
    SitePoint Member
    Join Date
    Mar 2004
    Location
    New Delhi
    Posts
    8
    Hi zelda,

    Great to know that I could be of some help to you. The second part is really easy.

    PHP Code:
    // connect to mysql using this code
    $link = mysql_connect('server', 'user', 'password') or die('error msg here..');

    mysql_select_db('ur_database', $link);

    $query = "insert into ur_table(col1, col2, col3) values('val1', 'val2', 'val3')";

    mysql_query($query) or die('show errors, if any');
    That's pretty much it. Replace ur_database and ur_table with your actual database and table names.
    Values for string-type columns go inside single quotes. Values for other column types, such as numbers, don't need quotes around them.

    Cheers!

  14. #14
    SitePoint Member
    Join Date
    Mar 2004
    Location
    London
    Posts
    18
    Thanks anurag, I feel I am slowly getting closer to a finished result. I do have one question now.

    Currently I am scraping a string of data which includes text and numbers, and with your code it is output on the page like so:
    Code:
    Array ( [0] => 
    text, text,text, 1078947446, 47, 96, 19
    )
    How do I assign each of these to separate fields? I'm assuming there is some sort of code to convert the returned value from one long string into separate variables.

  15. #15
    SitePoint Evangelist
    Join Date
    Feb 2004
    Location
    Sofia, Bulgaria
    Posts
    421
    dirty way.. if $values is the name of the array you posted.. instead of using $values[0] you can use a counter within a FOREACH or WHILE or some other loop.. this only works if you put single quotes around the string values in the array..
    PHP Code:
    $q = "insert into ur_table(col1, col2, col3) values(".$values[0].")";
    mysql_query($q) or die('there is an error!');
    better way.. $values is again your array with values.. $fields is the array with the names of the fields in your table.. and be sure that both arrays have the same keys.. again you can use a counter instead of $values[0]..
    PHP Code:
    $fields = array('field1', 'field2', 'field3', 'field4', 'field5', 'field6', 'field7');
    // put here the loop condition if any
    $arr_values = explode(',', $values[0]);
    $q = "INSERT INTO table SET ";
    foreach ($arr_values as $k => $val) {
      $q .= ($fields[$k] . '=' . ((is_string($val)) ? ("'$val'") : $val) . ',');
    }
    $q = substr($q, 0, -1); // this will remove the last ','
    mysql_query($q) or die('there is an error!');
    // end of loop if any
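
The splitting step on its own can be sketched as follows, using the sample row posted earlier and the placeholder field names from this post. The count check is an added safeguard, not something from the code above.

```php
<?php
// One scraped row, as in the output posted above.
$values = array("\ntext, text,text, 1078947446, 47, 96, 19\n");

$fields = array('field1', 'field2', 'field3', 'field4', 'field5', 'field6', 'field7');

// Split on commas, then trim the whitespace left around each piece.
$parts = array_map('trim', explode(',', $values[0]));

if (count($parts) === count($fields)) {
    // array_combine() pairs each field name with its value.
    $row = array_combine($fields, $parts);
    print_r($row);
} else {
    // A comma inside a text value would change the number of pieces.
    echo "unexpected number of values\n";
}
```

Note that every piece returned by explode() is a string, so is_string() is true for the numbers as well; is_numeric() is the check that tells them apart.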

  16. #16
    SitePoint Member
    Join Date
    Mar 2004
    Location
    London
    Posts
    18
    Thank you dacool, but I get a "there is an error!" message when I use this. Instead of INSERTing a new record, I just need existing records to be updated. I'm thinking that I have the code the wrong way round or something. Below is the code I currently have, if one of you sweet guys could help me debug it.

    What I wanted was for the $sql to UPDATE the database record by matching on the stored "server_name" (as that is the only data that will not change, and every site in the scrape is already in the database). An output of the database update queries is all I really need to see as a result.
    PHP Code:
    <?php

    // Load file into a string for manipulation
    function load_file($address)
    {
        return file_get_contents($address);
    }

    // Fetch the next $opentag ... $closetag element from $source, starting at $offset.
    // The matched text (tags included) is stored in $fetch.
    // Returns the offset just past the closing tag, or false when nothing more is found.
    function fetch_next($opentag, $closetag, $source, &$fetch, $offset = 0)
    {
        $start = strpos($source, $opentag, $offset);
        if ($start === false)
            return false;

        // look for the closing tag after the opening tag, not from $offset
        $end = strpos($source, $closetag, $start);
        if ($end === false)
            return false;
        $end += strlen($closetag);

        $fetch = substr($source, $start, $end - $start);
        return $end;
    }

    // Arguments : url to extract, opening tag, closing tag
    // Every opening-tag ... closing-tag element found at the given url is returned in an array
    function get($url, $opentag, $closetag)
    {
        $page = load_file($url);
        if (!$page)
            die("Statistics File is missing!");

        $matches = array();
        $offset = 0;
        while (($offset = fetch_next($opentag, $closetag, $page, $found, $offset)) !== false)
        {
            $matches[] = $found;
        }
        return $matches;
    }

    // Urls to crawl for statistics files
    print_r(get("http://yourdomain.com/_stats.php", "<div>", "</div>"));
    print_r(get("http://yourdomain2.com/_stats.php", "<div>", "</div>"));

    // Connect to the database
    $link = mysql_connect('localhost', '****', '****') or die('Could not connect to database!');

    mysql_select_db('****', $link);

    // Update sites with new statistics
    $fields = array('sitename', 'site_desc', 'server_name', 'board_startdate', 'users', 'posts', 'unique_hits');
    // put here the loop condition if any
    $arr_values = explode(',', $values[0]);
    $sql = "UPDATE site_toplist SET ";
    foreach ($arr_values as $k => $val) {
      $sql .= ($fields[$k] . '=' . ((is_string($val)) ? ("'$val'") : $val) . ',');
    }
    $sql = substr($sql, 0, -1); // this will remove the last ','
    mysql_query($sql) or die('Could not update the database!');
    // end of loop if any

    ?>
    Last edited by zelda; Mar 30, 2004 at 21:33.

  17. #17
    SitePoint Evangelist
    Join Date
    Feb 2004
    Location
    Sofia, Bulgaria
    Posts
    421
    you should add a WHERE clause to tell SQL exactly which row you want to update.. set $sql_unique_field to the name of your row identification field.. try it this way:
    PHP Code:
    // Update sites with new statistics
    $fields = array('sitename', 'site_desc', 'server_name', 'board_startdate', 'users', 'posts', 'unique_hits');
    $sql_unique_field = 'server_name';
    // put here the loop condition if any
    $arr_values = explode(',', $values[0]);
    $sql = "UPDATE site_toplist SET ";
    foreach ($arr_values as $k => $val) {
      if ($fields[$k] != $sql_unique_field) {
        $sql .= ($fields[$k] . '=' . ((is_string($val)) ? ("'$val'") : $val) . ',');
      } else {
        $sql_where = (" WHERE $sql_unique_field=" . ((is_string($val)) ? ("'$val'") : $val));
      }
    }
    $sql = substr($sql, 0, -1); // this will remove the last ','
    if (!empty($sql_where)) $sql .= $sql_where;
    mysql_query($sql) or die('Could not update the database!');
    // end of loop if any

  18. #18
    SitePoint Addict moonchild's Avatar
    Join Date
    Nov 2003
    Location
    U$A
    Posts
    258
    Quote Originally Posted by Gaheris
    PHP Code:
    preg_match('~<div id="content">(.*?)</div>~i', $str, $match);
    $content = $match[1];
    you're the man when it comes to regular expressions... i still need to learn them. lol

  19. #19
    SitePoint Member
    Join Date
    Mar 2004
    Location
    London
    Posts
    18
    I knew I had to fit a WHERE in there somewhere, I just wasn't sure how to do it. I am working with two example sites at the moment, and when I run the script both sitenames are set to blank.

    I echoed the $sql and received
    Code:
    UPDATE site_toplist SET sitename=''
    and that's all I got; it does not seem to be picking up the scraped data at all. Any more ideas? This is way over my head now.

  20. #20
    SitePoint Evangelist
    Join Date
    Feb 2004
    Location
    Sofia, Bulgaria
    Posts
    421
    check if the values are really in the $values array, and if their index is 0..

  21. #21
    SitePoint Member
    Join Date
    Mar 2004
    Location
    New Delhi
    Posts
    8
    Yup, your array is definitely empty.

    I think this is the mistake
    PHP Code:
    // Urls to crawl for statistics files
    print_r(get("http://yourdomain.com/_stats.php", "<div>", "</div>"));
    print_r(get("http://yourdomain2.com/_stats.php", "<div>", "</div>"));
    I only printed the array for demonstration purposes; you should assign the return value of the get() function to $values.
    Something like
    PHP Code:
    // Urls to crawl for statistics files
    $values = get("http://yourdomain.com/_stats.php", "<div>", "</div>");
    Apart from that, dacool has given some great ideas.

  22. #22
    SitePoint Member
    Join Date
    Mar 2004
    Location
    London
    Posts
    18
    Spot on, anurag. Assigning the return value of get() to $values worked perfectly; I am now getting data for each value and it is being written to the database.
    Code:
    UPDATE site_toplist SET sitename = '
    Test.com',site_desc = ' Just a test site',board_startdate = ' 1022524652',users = ' 710',posts = ' 8222',unique_hits = ' 507
    ' WHERE server_name = ' www.yourdomain.com'


    The only problem I can see now is that it has only read and updated the last of the two sites I have in the file. In other words, it is not looping through the site list, only reading the last site in it.

    I assume there needs to be a loop set up around the get() function, so that it reads through each site and updates the database accordingly?

    Again, thank you for your help guys. I would not even have come close to the result I have so far without it.

    **** Just noticed: if sitename or site_desc contains a comma or an apostrophe, the code breaks and cannot update the database. Is there a way of stripping these to avoid the errors?
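
On the apostrophe question: rather than stripping characters, a common approach is to keep the values out of the SQL text altogether by using placeholders. The sketch below only builds the query and its parameter list; build_update() and the sample row are made up for illustration, and the PDO call in the last comment assumes a connection that is not shown. (The mysql_* functions used in this thread were removed in PHP 7, so newer code would need PDO or mysqli anyway.)

```php
<?php
// Hypothetical helper: builds "UPDATE ... SET a = ?, b = ? WHERE key = ?"
// plus the matching list of parameters. Values never become part of the
// SQL text, so apostrophes in them cannot break the query.
// $table, the field names and $key_field must come from your own code,
// never from scraped data, because they are inserted into the SQL as-is.
function build_update($table, array $row, $key_field)
{
    $set = array();
    $params = array();
    foreach ($row as $field => $value) {
        if ($field === $key_field)
            continue;               // the key goes into the WHERE clause instead
        $set[] = "$field = ?";
        $params[] = $value;
    }
    $params[] = $row[$key_field];
    $sql = "UPDATE $table SET " . implode(', ', $set) . " WHERE $key_field = ?";
    return array($sql, $params);
}

$row = array(
    'sitename'    => "Zelda's Test Site",   // made-up value with an apostrophe
    'server_name' => 'www.yourdomain.com',
    'users'       => '710',
);
list($sql, $params) = build_update('site_toplist', $row, 'server_name');
echo $sql, "\n";
print_r($params);

// With a PDO connection in $pdo (not shown here), this would run as:
// $pdo->prepare($sql)->execute($params);
```

A comma inside sitename is a separate problem: it breaks the explode(',') split before any SQL is built, so the stats file would need a separator that cannot occur in the text, such as a tab.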

  23. #23
    SitePoint Member
    Join Date
    Mar 2004
    Location
    New Delhi
    Posts
    8
    Actually, the get() function returns everything in an array, so you would have to loop through the array it returns.

    I can't seem to copy and paste anything, apparently, so I am just going to have to approximate things.

    In the code posted by dacool to generate the SQL queries, make the following change.

    At the beginning of that portion of code, add this
    PHP Code:
    foreach($values as $arr_values)
    {

    And simply close this opening brace at the end of that portion of code

    PHP Code:
    }

    Hope that works

    Cheers!

  24. #24
    SitePoint Member
    Join Date
    Mar 2004
    Location
    London
    Posts
    18
    That did not seem to change anything; it is still only returning the one site scrape.
    PHP Code:
    mysql_select_db('****', $link);

    // Urls to crawl for statistics files
    $values = get("http://www.yourdomain.com/_stats.php", "<div>", "</div>");
    $values = get("http://www.yourdomain1.com/_stats.php", "<div>", "</div>");

    // Update sites with new statistics
    foreach($values as $arr_values)
    {
        $fields = array('sitename', 'site_desc', 'server_name', 'board_startdate', 'users', 'posts', 'unique_hits');
        $sql_unique_field = 'server_name';

        $arr_values = explode(',', $values[0]);
        $sql = "UPDATE site_toplist SET ";
        foreach ($arr_values as $k => $val)
        {
            if ($fields[$k] != $sql_unique_field)
            {
                $sql .= ($fields[$k] . ' = ' . ((is_string($val)) ? ("'$val'") : $val) . ',');
            }
            else
            {
                $sql_where = (" WHERE $sql_unique_field = " . ((is_string($val)) ? ("'$val'") : $val));
            }
        }
        $sql = substr($sql, 0, -1); // this will remove the last ','
        if (!empty($sql_where)) $sql .= $sql_where;
        mysql_query($sql) or die('Could not update the database!');
    }


  25. #25
    SitePoint Evangelist
    Join Date
    Feb 2004
    Location
    Sofia, Bulgaria
    Posts
    421
    this line of code always reads the first row of values, so you have to change it to explode the current row of the loop instead:
    PHP Code:
    $arr_values = explode(',', $values[0]);
    cheers to anurag.. he gave great ideas too.. now change the code as shown below.. the function get() returns all matches in an array (am i right?) so you don't need to call it two times.. and put the lines where you set the 'constant' variables outside the loop..
    PHP Code:
    mysql_select_db('****', $link);

    // Urls to crawl for statistics files
    $values = get("http://www.yourdomain.com/_stats.php", "<div>", "</div>");
    $dest_table = 'site_toplist';
    $fields = array('sitename', 'site_desc', 'server_name', 'board_startdate', 'users', 'posts', 'unique_hits');
    $sql_unique_field = 'server_name';
    // Update sites with new statistics
    foreach($values as $row)
    {
        $arr_values = explode(',', $row); // split the current row, not $values[0]
        $sql = "UPDATE $dest_table SET ";
        $sql_where = ''; // reset for every row
        foreach ($arr_values as $k => $val)
        {
            if ($fields[$k] != $sql_unique_field)
            {
              $sql .= ($fields[$k] . ' = ' . ((is_string($val)) ? ("'$val'") : $val) . ',');
            }
            else
            {
              $sql_where = (" WHERE $sql_unique_field = " . ((is_string($val)) ? ("'$val'") : $val));
            }
        }
        $sql = substr($sql, 0, -1); // this will remove the last ','
        if (!empty($sql_where)) $sql .= $sql_where;
        mysql_query($sql) or die('Could not update the database!');
    }
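
If the statistics really live on two different URLs, a single get() call only covers one of them, and assigning to $values twice keeps only the second result. One way to handle that is to collect the rows from every page into one list before looping. A sketch under stated assumptions: rows_from_pages() is a made-up helper, the two strings stand in for pages that have already been downloaded, and it uses preg_match_all() instead of the get() function from this thread.

```php
<?php
// Hypothetical helper: collects the trimmed text of every <div>...</div>
// from each page in $pages into one flat list of rows.
function rows_from_pages(array $pages)
{
    $rows = array();
    foreach ($pages as $page) {
        if (preg_match_all('~<div>(.*?)</div>~is', $page, $m)) {
            foreach ($m[1] as $row)
                $rows[] = trim($row);
        }
    }
    return $rows;
}

// Stand-ins for file_get_contents() results from the two stats URLs (sample data).
$pages = array(
    "<div>\nSite A, first site, www.a.example, 1022524652, 710, 8222, 507\n</div>",
    "<div>\nSite B, second site, www.b.example, 1022524999, 12, 340, 25\n</div>",
);

foreach (rows_from_pages($pages) as $row) {
    // trim() each value so a leading space cannot end up in the WHERE comparison
    $arr_values = array_map('trim', explode(',', $row));
    echo $arr_values[2], ': ', $arr_values[4], " users\n";
}
```

The inner loop body is where the UPDATE for each site would be built and run.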


