  1. #1
    SitePoint Mentor silver trophy
    Rubble's Avatar
    Join Date
    Dec 2005
    Location
    Cambridge, England
    Posts
    2,367
    Mentioned
    80 Post(s)
    Tagged
    3 Thread(s)

    Curl and preg_match

    I do not think I am doing this correctly. What I want to do is get the HTML page title from a page into a variable; the next step is to put the URL into a variable as well.
    I am trying to create a sitemap, but my pages are dynamic and I want something like: http://www.rubblewebs.co.uk/imagemagick/xml/sitemap.php

    This puts "Help" into a variable, so this part seems to be working:
    PHP Code:
    <?php
    $result = "<title>Help</title>";
    preg_match('^<title>([a-zA-Z]+)</title>$^', $result, $matches);
    print_r($matches);
    ?>
    But when I put it together with CURL it does not work.
    PHP Code:
    <?php
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, "http://www.rubblewebs.co.uk/imagemagick/index.php");
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $result = curl_exec($ch);
    curl_close($ch);

    preg_match('^<title>([a-zA-Z]+)</title>$^', $result, $matches);

    print_r($matches);
    ?>
    The CURL part works on its own: if I echo $result; I get the webpage. But together they do not work.
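
One way to narrow this down is to check the fetch and the match separately. The following is only a sketch: fetch_page() and diagnose() are hypothetical names, and the 10-second timeout is an arbitrary choice, not something from the original script.

```php
<?php
/**
 * Fetch $url and return the body, or null if the transfer failed.
 * Hypothetical helper; name and timeout are assumptions for this sketch.
 */
function fetch_page($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $body = curl_exec($ch);
    if ($body === false) {
        // curl_error() describes the transport failure (DNS, timeout, ...).
        error_log('cURL failed: ' . curl_error($ch));
        $body = null;
    }
    curl_close($ch);
    return $body;
}

/**
 * Say which step went wrong: the fetch, or the pattern.
 * Hypothetical helper; $pattern must contain one capture group.
 */
function diagnose($body, $pattern)
{
    if ($body === null || $body === false) {
        return 'fetch failed';
    }
    if (preg_match($pattern, $body, $matches)) {
        return 'matched: ' . $matches[1];
    }
    // Reaching this point means the fetch worked and the pattern needs attention.
    return 'fetched ' . strlen($body) . ' bytes, but the pattern did not match';
}
```

It could be called as echo diagnose(fetch_page('http://www.rubblewebs.co.uk/imagemagick/index.php'), '^<title>([a-zA-Z]+)</title>$^'); so that the message shows which of the two steps to look at.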

  2. #2
    Programming Team silver trophybronze trophy
    Mittineague's Avatar
    Join Date
    Jul 2005
    Location
    West Springfield, Massachusetts
    Posts
    17,036
    Mentioned
    187 Post(s)
    Tagged
    2 Thread(s)

    curlopt

    Are the non-alphanumeric characters being encoded? Maybe try
    PHP Code:
    preg_match('^<title>([a-zA-Z]+)</title>$^', rawurldecode($result), $matches);

  3. #3
    ✯✯✯ silver trophybronze trophy php_daemon's Avatar
    Join Date
    Mar 2006
    Posts
    5,284
    Mentioned
    2 Post(s)
    Tagged
    0 Thread(s)
    You might wanna take a look at get_meta_tags.
    Saul
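
For reference, a small sketch of what get_meta_tags returns. The page content here is made up, and it is written to a temporary file only so the example runs without a network connection; the function accepts a URL in the same way.

```php
<?php
// Write a stand-in page to a temporary file (assumption: contents invented for this sketch).
$file = tempnam(sys_get_temp_dir(), 'meta');
file_put_contents(
    $file,
    "<html><head>\n"
    . "<title>ImageMagick - Index page</title>\n"
    . "<meta name=\"description\" content=\"ImageMagick examples\">\n"
    . "<meta name=\"keywords\" content=\"php, imagemagick\">\n"
    . "</head><body></body></html>"
);

// Returns an array keyed by each meta tag's name attribute (lower-cased).
$tags = get_meta_tags($file);
unlink($file);

print_r($tags);
// The <title> element is not a meta tag, so it has no entry in $tags.
```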

  4. #4
    SitePoint Mentor silver trophy
    Rubble's Avatar
    Thanks for the replies.
    Mittineague, your suggestion had no effect. php_daemon, I can see how I could use the meta tags.

    It looks like the problem is in the CURL code: if I hard-code the URL, the meta tags info is displayed, but if I use the $result from CURL the output is empty.

    I will have to look for some more detailed information on CURL.

  5. #5
    SitePoint Mentor silver trophy
    Rubble's Avatar
    Some progress: the CURL part is working; it's getting the data from the string where I am going wrong.
    I will have another look later, as this seems to work:
    PHP Code:
    <?php
    // http://davidwalsh.name/download-urls-content-php-curl/

    /* gets the data from a URL */
    function get_data($url)
    {
        $ch = curl_init();
        $timeout = 5;
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
        $data = curl_exec($ch);
        curl_close($ch);
        return $data;
    }

    $returned_content = get_data('http://www.rubblewebs.co.uk/imagemagick/index.php');
    $string = "<title>";
    $container = $returned_content;

    if (strstr($container, $string)) {
        echo "found it.";
    } else {
        echo "not found.";
    }

    ?>
    This returns "found it." So now I need another way of getting the text between the two <title> tags.

  6. #6
    ✯✯✯ silver trophybronze trophy php_daemon's Avatar
    Oh yeah, sorry, get_meta_tags won't get you the title.

    Looking back at the regexp, this should work:
    Code php:
    preg_match('#<title>(.+?)</title>#i',$result, $matches);
    Saul
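
A possible way to wrap that pattern in a reusable function is sketched below. extract_title() is a hypothetical name, and the extra touches (the s modifier, allowing attributes on the tag, decoding entities) are assumptions about what a real page might need rather than part of the reply above.

```php
<?php
/**
 * Return the text of the first <title> element in $html, or null if there is none.
 * Hypothetical helper; the name and the null return are choices made for this sketch.
 */
function extract_title($html)
{
    // 'i' ignores case, 's' lets '.' span newlines; [^>]* tolerates attributes on the tag.
    if (!preg_match('#<title[^>]*>(.*?)</title>#is', (string) $html, $matches)) {
        return null;
    }
    // Collapse runs of whitespace, then decode entities such as &amp;.
    $title = preg_replace('/\s+/', ' ', $matches[1]);
    return trim(html_entity_decode($title, ENT_QUOTES, 'UTF-8'));
}

var_dump(extract_title("<TITLE>\n  Fish &amp; Chips\n</TITLE>"));
var_dump(extract_title("<p>no title here</p>"));
```

Returning null when nothing matches lets the caller skip the page instead of reading an undefined $matches[1].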

  7. #7
    SitePoint Mentor silver trophy
    Rubble's Avatar
    That worked, php_daemon, thank you.

    I had not thought this out fully and have now hit a stumbling block at the next stage.

  8. #8
    Programming Team silver trophybronze trophy
    Mittineague's Avatar

    bad regex

    Your CURL stuff looks basically the same, but the page you are testing for has this
    HTML Code:
    <title>ImageMagick - Index page</title>
    and the regex doesn't have a space or a dash so it won't match
    Code:
    ^<title>([a-zA-Z]+)</title>$^
    As long as you don't have another title tag in the string, you should be "safe" using the accursed everything atom (the dot) here, but you may only need to add a space and a dash to the character class in the regex.
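
The difference can be checked without cURL at all. This sketch uses a cut-down stand-in for the fetched page (an assumption; the real page has more markup) and compares the original pattern with one whose character class also allows a space and a dash.

```php
<?php
// Stand-in for the fetched page (assumption: only the title line is taken from the thread).
$page = "<html>\n<head>\n<title>ImageMagick - Index page</title>\n</head>\n<body></body>\n</html>";

// Original pattern: '^' is only the delimiter here, and the class allows letters only.
$original = '^<title>([a-zA-Z]+)</title>$^';
// Same idea with a space and a dash added to the class and no end-of-string anchor.
$widened = '#<title>([a-zA-Z -]+)</title>#';

$hitOriginal = preg_match($original, $page, $m1); // 0: the title contains spaces and a dash
$hitWidened  = preg_match($widened, $page, $m2);  // 1

var_dump($hitOriginal, $hitWidened, $m2[1]);
```

The widened class still fails on titles with digits or punctuation, which is the trade-off against the catch-all (.+?) version.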

  9. #9
    SitePoint Mentor silver trophy
    Rubble's Avatar
    Good point, Mittineague.

    This brings me to my next stumbling block: I need to get my links, and some links are relative, e.g. <a href="server/server.php">; some have a class, e.g. <a class="index" href="server/server.php">; and some are up a directory, e.g. <a href="../forum">. I do not think I have any absolute links, though.
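
One way to turn those relative hrefs into full URLs is to resolve each one against the page it was found on. The function below is only a sketch: resolve_url() is a hypothetical name, and it deliberately ignores cases such as mailto: links, fragments, protocol-relative links and query strings containing slashes.

```php
<?php
/**
 * Resolve $href against the URL of the page it was found on.
 * Hypothetical helper for this sketch: it handles absolute URLs, root-relative
 * paths and "./" or "../" segments, and ignores any query string on $base.
 */
function resolve_url($base, $href)
{
    // Already absolute (starts with a scheme such as http://): return unchanged.
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $href)) {
        return $href;
    }

    $parts = parse_url($base);
    $root  = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $root .= ':' . $parts['port'];
    }
    $path = isset($parts['path']) ? $parts['path'] : '/';

    if ($href !== '' && $href[0] === '/') {
        // Root-relative link: it replaces the whole path.
        $joined = $href;
    } else {
        // Keep the directory of the base page and append the link to it.
        $joined = substr($path, 0, strrpos($path, '/') + 1) . $href;
    }

    // Walk the path segments, dropping "." and stepping up a level on "..".
    $segments = array();
    foreach (explode('/', $joined) as $segment) {
        if ($segment === '..') {
            array_pop($segments);
        } elseif ($segment !== '.' && $segment !== '') {
            $segments[] = $segment;
        }
    }

    $resolved = implode('/', $segments);
    if ($resolved !== '' && substr($joined, -1) === '/') {
        $resolved .= '/'; // preserve a trailing slash such as "../forum/"
    }
    return $root . '/' . $resolved;
}

$page = 'http://www.rubblewebs.co.uk/imagemagick/index.php';
echo resolve_url($page, 'server/server.php'), "\n";
echo resolve_url($page, '../forum'), "\n";
echo resolve_url($page, 'http://www.example.com/'), "\n";
```

The class attribute does not matter once the href has been extracted, so only the href value is passed in.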

  10. #10
    SitePoint Wizard silver trophybronze trophy Cups's Avatar
    Join Date
    Oct 2006
    Location
    France, deep rural.
    Posts
    6,869
    Mentioned
    17 Post(s)
    Tagged
    1 Thread(s)
    You'll need to extract and rewrite the urls.
    PHP Code:
    $html = "";
    // $str holds the HTML to search.
    // This particular regex was to extract only the query string value after links with "redirect.asp?id="
    // ([^"]*) stops at the closing quote, so several links on one line are matched separately.
    preg_match_all('#(href="redirect\.asp\?id=)([^"]*)(">)#', $str, $titles);

    foreach ($titles[2] as $k => $work) {

        // echo '<hr /> '. $work . '<hr /> '; //dbg
        // print_r($titles[2]); //dbg

        $html .= ' <a href="http://www.example.com/redirect.asp?id=' . $work . '">' . $work . '</a><br />';
    }

    echo $html;
    Should work; it's from something a bit more complicated that extracted items from the query string, but reduced like this it should still give you the gist.

    Also, I found a very nice cURL wrapper on phpclasses: curl_http_client (also available on the owner's site).

  11. #11
    SitePoint Mentor silver trophy
    Rubble's Avatar
    Taken a while to get back to this as I kept going around in circles!
    I found some interesting code on http://www.merchantos.com/makebeta/p.../#put_together; although I do not know anything about DOM (or curl, for that matter), it does what I want, and when I get time I will look into it more.
    The code has a couple of "bodges" to get the output I wanted; I could modify the site links to make it better. Again, I may do that later.
    Anyway, my code is below. It needs quite a bit of cleaning up, but what it does, simply, is read the links on the web pages and then use this information to get the page titles, all of which is saved into a database. I am then going to use this to create a site map. The next stage is to try and get a cron job running, probably a couple of times a month, to update it.
    PHP Code:
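    <?php
    // ********** FUNCTIONS **********
    // http://www.merchantos.com/makebeta/php/scraping-links-with-php/#put_together
    // Note: the mysql_* functions used below were removed in PHP 7; mysqli or PDO are the current equivalents.
    function storeLink( $url, $gathered_from ) {
        // Escape the values so a quote in a URL cannot break the query
        $url = mysql_real_escape_string( $url );
        $gathered_from = mysql_real_escape_string( $gathered_from );
        $query = "INSERT INTO links ( url, gathered_from ) VALUES ( '$url', '$gathered_from' )";
        mysql_query( $query ) or die( 'Error: Main INSERT query failed' );
    }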
    // ********** Initial settings **********
    // Database variables
    $database = '';
    $username = '';
    $host = 'localhost';
    $password = '';
    // Set the user agent as some servers will error without one
    $userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
    // Array of pages to parse
    $page = array( '', 'art.php', 'other.php', 'resize.php', 'watermark.php', 'mosaic.php', 'text.php', 'server/server.php', 'codes/codes.php', 'notes/notes.php' );
    // Start url
    $start_page = "http://www.rubblewebs.co.uk/imagemagick/";

    // ********** Start the code **********
    // Connect to the database using the details entered into the variables above
    $conn = mysql_connect( $host, $username, $password );
    // If the connection cannot be made, print "Could not connect to MySQL server"
    if ( !$conn ) die ( "Could not connect to MySQL server" );
    // If the database could not be opened or found, print "Could not open database"
    mysql_select_db( $database, $conn ) or die ( "Could not open database" );

    // Start off by emptying the table
    $query = "TRUNCATE TABLE links";
    mysql_query( $query ) or die( 'Error: TRUNCATE query failed' );

    // Read the pages from the array finding the links
    foreach( $page as $value ){
        $target_url = $start_page.$value;
        // make the cURL request to $target_url
        $ch = curl_init();
        curl_setopt( $ch, CURLOPT_USERAGENT, $userAgent );
        curl_setopt( $ch, CURLOPT_URL, $target_url );
        curl_setopt( $ch, CURLOPT_FAILONERROR, true );
        curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );
        curl_setopt( $ch, CURLOPT_AUTOREFERER, true );
        curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
        curl_setopt( $ch, CURLOPT_TIMEOUT, 10 );
        $html = curl_exec( $ch );
        if ( !$html ) {
            echo "<br />cURL error number:" . curl_errno( $ch );
            echo "<br />cURL error:" . curl_error( $ch );
            exit;
        }
        // Release the handle once the page has been fetched
        curl_close( $ch );
        // parse the html into a DOMDocument
        $dom = new DOMDocument();
        @$dom->loadHTML( $html );
        // grab all the links on the page
        $xpath = new DOMXPath( $dom );
        $hrefs = $xpath->evaluate( "/html/body//a" );
        // Save the links etc. into the database
        for ( $i = 0; $i < $hrefs->length; $i++ ) {
            $href = $hrefs->item( $i );
            $url = $href->getAttribute( 'href' );
            storeLink( $url, $target_url );
        }
    }

    // http://www.justin-cook.com/wp/2006/12/12/remove-duplicate-entries-rows-a-mysql-database-table/
    // Remove duplicate data based on the url column
    // Create a new table with the data from the current table without the duplicates
    mysql_query( "CREATE TABLE temp_table AS SELECT * FROM links WHERE 1 GROUP BY url" )
        or die( 'Error: CREATE TABLE failed'.mysql_error() );
    // Delete the first table
    mysql_query( "DROP TABLE links" )
        or die( 'DROP TABLE failed'.mysql_error() );
    // Rename the new table to the original name
    mysql_query( "RENAME TABLE temp_table TO links" )
        or die( 'RENAME TABLE failed'.mysql_error() );

    // Get the title from the pages put into the database by the original curl code
    $query = "SELECT url FROM links";
    $returned = mysql_query( $query ) or die( 'Error: SELECT url query failed' );
    // Loop through all the urls getting the title tag from each page then saving it with the relevant url in the database
    while ( $row = mysql_fetch_array( $returned ) )
    {
        $file = $row['url'];
        // Get the page titles
        $ch = curl_init();
        curl_setopt( $ch, CURLOPT_URL, $start_page.$file );
        curl_setopt( $ch, CURLOPT_USERAGENT, $userAgent );
        curl_setopt( $ch, CURLOPT_RETURNTRANSFER, 1 );
        $result = curl_exec( $ch );
        curl_close( $ch );
        // Skip pages where no title could be found
        if ( !preg_match( '#<title>(.+?)</title>#i', $result, $matches ) ) continue;
        // Update the rows with the page title, escaping both values for the query
        $title = mysql_real_escape_string( $matches[1] );
        $safe_file = mysql_real_escape_string( $file );
        $query = "UPDATE links SET title = '$title' WHERE url = '$safe_file'";
        mysql_query( $query ) or die( 'Error: Update query failed'.mysql_error() );
    }

    // Code to sort out the pages that returned a 404 error. This was caused by the files with links inside a folder e.g. server/server.php
    $query = "SELECT * FROM links ORDER BY title";
    $returned = mysql_query( $query ) or die( 'Error, SELECT query for 404 error failed' );

    while ( $row = mysql_fetch_array( $returned ) )
    {
        // Only select the results that have a 404 error as the title
        if ( $row['title'] == 'Error 404 page' ){
            $location = $row['gathered_from'];
            $exploded = explode( "/", $row['gathered_from'] );
            // Count the number of parts there are when $row['gathered_from'] is exploded
            $count = count( $exploded );
            $last_item = $exploded[$count-1];
            // Strip the file name to leave the folder the link was found in
            $target = str_replace( $last_item, '', $location );
            $path = $target.$row['url'];
            $file = $row['url'];
            // Get the page titles
            $ch = curl_init();
            curl_setopt( $ch, CURLOPT_URL, $path );
            curl_setopt( $ch, CURLOPT_USERAGENT, $userAgent );
            curl_setopt( $ch, CURLOPT_RETURNTRANSFER, 1 );
            $result = curl_exec( $ch );
            curl_close( $ch );
            // Skip pages where no title could be found
            if ( !preg_match( '#<title>(.+?)</title>#i', $result, $matches ) ) continue;
            // Update the rows with the page title, escaping both values for the query
            $title = mysql_real_escape_string( $matches[1] );
            $safe_file = mysql_real_escape_string( $file );
            $query = "UPDATE links SET title = '$title' WHERE url = '$safe_file'";
            mysql_query( $query ) or die( 'Error: Update the 404 error query failed' );
        }
    }
    ?>
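
The link and title gathering described above can be tried without a database or a network connection. The sketch below is not the script from this post: it parses a made-up HTML string (an assumption standing in for the curl_exec() result), reads the title from the DOM instead of using a regex, and removes duplicate links with array_unique() rather than copying the table.

```php
<?php
// Stand-in for a fetched page (assumption: in the real script this comes from curl_exec()).
$html = '<html><head><title>ImageMagick - Index page</title></head><body>'
      . '<a href="art.php">Art</a> <a class="index" href="server/server.php">Server</a>'
      . '<a href="art.php">Art again</a> <a name="top">no href</a></body></html>';

// Collect parser warnings quietly instead of using the @ operator.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// The title is an ordinary element, so no regex is needed once the page is parsed.
$titleNodes = $dom->getElementsByTagName('title');
$title = $titleNodes->length > 0 ? trim($titleNodes->item(0)->textContent) : '';

// Only anchors that actually carry an href attribute.
$xpath = new DOMXPath($dom);
$links = array();
foreach ($xpath->query('//a[@href]') as $anchor) {
    $links[] = $anchor->getAttribute('href');
}
// Remove duplicates in PHP; array_values() re-indexes the result.
$links = array_values(array_unique($links));

echo $title, "\n";
print_r($links);
```

Filtering duplicates before the INSERT would make the CREATE/DROP/RENAME step unnecessary, at the cost of holding all the links in memory for one run.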

