  1. #1
    SitePoint Zealot thetzfreak
    Join Date: Aug 2004 · Location: United States · Posts: 154

    Is there a better way to do this?

    Hello,

    I am currently writing a script that performs a Google search and grabs files from the pages within the search results. This is how I do it: I use the cURL code shown below to open the Google result page, then use preg_match_all() to find all 10 result URLs on that page. When I make the search, I search only within "index of" directories to make finding files easier. I then use cURL again to open each of those 10 URLs, and preg_match_all() again to find the URLs of the files (JPEGs, PNGs, etc.) whose names contain the Google search string (like "cool car"). Here's the cURL code I use to open each page:

    PHP Code:
    $curl_handle = curl_init();
    curl_setopt($curl_handle, CURLOPT_URL, $value);
    curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 2);
    curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1);
    $buffer1 = curl_exec($curl_handle);
    curl_close($curl_handle);
    if (!empty($buffer1)) {
        // do stuff here
    }
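
    For reference, the matching step described above might look roughly like this; the regular expression and variable names are assumptions for illustration, not code from this post:

    PHP Code:
    // $buffer1 holds the HTML of one "Index of" page fetched with the cURL code above.
    // Grab every href that ends in a common image extension.
    preg_match_all('/href="([^"]+\.(?:jpe?g|png|gif))"/i', $buffer1, $matches);
    foreach ($matches[1] as $file) {
        echo $file . "\n"; // each matched image URL (often relative to the directory)
    }
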
    Now, as you can guess, this is a very unwieldy way of getting the URLs. Running cURL like this on EVERY URL in the Google search, and then preg_match_all() on every page, is quite ridiculous. However, due to my inexperience with PHP, it's the only way I can think of doing it. I don't like this method because each Google SERP can take up to 2 minutes to yield its file URLs.

    Is there any easier way to do all of this?

  2. #2
    Programming Team Mittineague
    Join Date: Jul 2005 · Location: West Springfield, Massachusetts · Posts: 17,036

    curling google

    You may want to check Google's Terms of Service before you develop a script that hits them too often, too fast. AFAIK they say something about not using scripts to access their search results.

  3. #3
    SitePoint Zealot thetzfreak
    Join Date: Aug 2004 · Location: United States · Posts: 154
    Hmm, perhaps you are right. However, I have seen sites that grab results from search engines (three at a time) within 1 or 2 seconds, which makes me think those scripts were not "accessing" the SERPs the way mine does, because opening every page would take much longer. So there must be another method of doing this that doesn't hit their pages directly.

    What do you think? Is there another way?
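
    One way a site could fetch several pages that quickly is to issue the requests in parallel rather than one after another. This is only a sketch of that idea using PHP's curl_multi functions; the URLs are placeholders, not anything from this thread:

    PHP Code:
    // Fetch several URLs in parallel with curl_multi.
    $urls = array('http://example.com/page1', 'http://example.com/page2');
    $mh = curl_multi_init();
    $handles = array();
    foreach ($urls as $i => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 2);
        curl_multi_add_handle($mh, $ch);
        $handles[$i] = $ch;
    }
    $running = null;
    do { // drive all transfers until every one has finished
        curl_multi_exec($mh, $running);
    } while ($running > 0);
    $pages = array();
    foreach ($handles as $i => $ch) {
        $pages[$i] = curl_multi_getcontent($ch); // response body for each URL
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
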

  4. #4
    SitePoint Wizard kyberfabrikken
    Join Date: Jun 2004 · Location: Copenhagen, Denmark · Posts: 6,157
    Quote Originally Posted by thetzfreak
    PHP Code:
    $curl_handle = curl_init();
    curl_setopt($curl_handle, CURLOPT_URL, $value);
    curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 2);
    curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1);
    $buffer1 = curl_exec($curl_handle);
    curl_close($curl_handle);
    if (!empty($buffer1)) {
        // do stuff here
    }
    (...)
    Is there any easier way to do all of this?
    PHP's URL file wrappers are enabled by default (allow_url_fopen), so you can use:
    PHP Code:
    $buffer1 = file_get_contents($value);
    if (!empty($buffer1)) {
        // do stuff here
    }
    Or just:
    PHP Code:
    if ($buffer1 = file_get_contents($value)) {
        // do stuff here
    }
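
    One caveat with the shorthand above: file_get_contents() returns FALSE on failure, but an empty (yet successful) response is also treated as empty/false by these checks. A stricter version, added here only as a sketch, would be:

    PHP Code:
    // Distinguish "request failed" from "request succeeded but the body is empty".
    $buffer1 = file_get_contents($value);
    if ($buffer1 !== false) {
        // do stuff here
    }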

  5. #5
    SitePoint Zealot thetzfreak
    Join Date: Aug 2004 · Location: United States · Posts: 154
    kyberfabrikken: Yes, that's true, but what if a result in Google is a bad link? file_get_contents() will hang for a few minutes trying to open it. With cURL I can set the connection timeout to 2 seconds (or at least I think that's what it does) in this line: curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 2);

    That's the only reason I see to use cURL here. If it weren't for that, I'd use your method.

    Does my cURL method take more CPU than file_get_contents(), or vice versa?
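
    A side note on that option: CURLOPT_CONNECTTIMEOUT only limits how long cURL waits to establish the connection; a server that accepts the connection but then stalls can still hold the script up. Capping the whole transfer as well might look like this (an illustrative sketch, not part of the original post; the 10-second value is arbitrary):

    PHP Code:
    curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 2); // max seconds to connect
    curl_setopt($curl_handle, CURLOPT_TIMEOUT, 10);       // max seconds for the entire transfer
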

  6. #6
    SitePoint Zealot thetzfreak
    (Duplicate of post #5; the double post is acknowledged in post #8.)

  7. #7
    shoooo... logic_earth
    Join Date: Oct 2005 · Location: CA · Posts: 9,013
    default_socket_timeout
    http://www.php.net/manual/en/ref.fil...socket-timeout

    PHP Code:
    ini_set('default_socket_timeout', 2);
    Logic without the fatal effects.
    All code snippets are licensed under WTFPL.


  8. #8
    SitePoint Zealot thetzfreak
    Join Date: Aug 2004 · Location: United States · Posts: 154
    Quote Originally Posted by logic_earth
    default_socket_timeout
    http://www.php.net/manual/en/ref.fil...socket-timeout

    PHP Code:
    ini_set('default_socket_timeout', 2);
    Thanks! Where do I put this code? Before each file_get_contents() call, or at the beginning of my script?

    I also just realized my last post got posted twice. Sorry about that.

    EDIT: Ah, I see now, it goes in the beginning of the script. Thanks =)
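
    For what it's worth, default_socket_timeout is a script-wide setting for the stream functions (it does not affect cURL's own CURLOPT_* timeouts), so setting it once at the top covers every later file_get_contents() call. A per-request alternative is a stream context; this is only a sketch with a placeholder URL:

    PHP Code:
    // Script-wide: every stream-based call (file_get_contents, fopen, ...) now times out after 2 seconds.
    ini_set('default_socket_timeout', 2);

    // Or per request, via a stream context:
    $context = stream_context_create(array('http' => array('timeout' => 2)));
    $buffer1 = file_get_contents('http://example.com/some-index-page/', false, $context);
    if ($buffer1 !== false) {
        // do stuff here
    }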

