  1. #1
    SitePoint Member
Join Date: Jul 2005 | Posts: 19

    Scraping images from a website

I am trying to scrape product images from a website to save myself a large manual task.

    Each product has its own page: http://www.example.com/product.php?p=1234

Within this page there is one product image with an ambiguous name such as random_product.png. The only thing distinguishing it from the other images on the page is that it is located at catalog/random_product.png.

What I would like to do is have a script scan all the product pages (1 to 6000) and save each image under its product ID, e.g. if random_product.png belonged to product 1234, the script would save the file as 1234.png.

    Are there any scripts available that would handle this?

    Many thanks in advance.

  2. #2
SitePoint Evangelist asprookie
Join Date: May 2005 | Posts: 539
    A site sucker or a bot.

  3. #3
    SitePoint Member
Join Date: Jul 2005 | Posts: 19
I've looked into website grabbers, but they do not save the image with the ID in the name. Are there any you can recommend that would handle this?

  4. #4
SitePoint Wizard Hammer65
Join Date: Nov 2004 | Location: Lincoln, Nebraska | Posts: 1,161
Who owns these images? Do they know you are using them? Have they given you permission? Images and video are protected works; you can't just grab them and use them for your own convenience.

  5. #5
SitePoint Wizard cranial-bore
Join Date: Jan 2002 | Location: Australia | Posts: 2,634
    This should give you the gist of it:
    PHP Code:
<?php
function save_image($pageID) {

    $base = 'http://example.com/';
    $success = false;   // stays false if the page yields no images

    // Use cURL functions to "open" the page:
    // load $page with the source code of the target page.

    // Find catalog/ images on this page
    preg_match_all('~catalog/([a-z0-9\.\_\-]+(\.gif|\.png|\.jpe?g))~i', $page, $matches);

    /*
    $matches[0] => array of image paths (as in source code)
    $matches[1] => array of file names
    $matches[2] => array of extensions
    */

    for ($i = 0; $i < count($matches[0]); $i++) {
        $source = $base . $matches[0][$i];
        $tgt = $pageID . $matches[2][$i];   // NEW file name: ID + extension

        if (copy($source, $tgt)) $success = true;
        else $success = false;
    }

    return $success; // Rough validation. Only reports the last image from the source
}

// Download an image from each page
for ($i = 1; $i <= 6000; $i++) {
    if (!save_image($i)) echo "Error with page $i<br>";
}
?>
    You'll have to add your own cURL code to load the HTML source of each page into the $page variable.
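
For that part, something along these lines should do the trick inside save_image() (an untested sketch; the product URL pattern comes from the first post, and the timeout is just a sensible default):
PHP Code:
<?php
// Inside save_image(): load the product page's HTML into $page
$ch = curl_init($base . 'product.php?p=' . $pageID);  // URL pattern taken from the first post
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);       // return the HTML as a string instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);       // follow any redirects
curl_setopt($ch, CURLOPT_TIMEOUT, 30);                // give up on a dead page after 30 seconds
$page = curl_exec($ch);
curl_close($ch);
if ($page === false) return false;                    // treat a failed fetch as a failed page
?>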

It'd probably be kinder to the hosting web server not to do all 6000 pages in one go, and even for smaller runs you may need to increase your max execution time.
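
For longer runs, lifting the execution limit and pausing briefly between pages would cover both points (a quarter of a second is just a guess at a polite rate):
PHP Code:
<?php
set_time_limit(0);   // lift PHP's default 30-second execution limit for long runs

// Fetch pages in a gentler loop, pausing between requests
for ($i = 1; $i <= 6000; $i++) {
    if (!save_image($i)) echo "Error with page $i<br>";
    usleep(250000);  // sleep a quarter of a second to go easy on the server
}
?>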

    And remember that any copyright restrictions will still apply regardless of how you get the images.

  6. #6
    SitePoint Member
Join Date: Jul 2005 | Posts: 19
    Quote Originally Posted by Hammer65 View Post
Who owns these images? Do they know you are using them? Have they given you permission? Images and video are protected works; you can't just grab them and use them for your own convenience.
Thanks for the lowdown on the law, but you can sleep easy: the images belong to my client, and I need to scrape them from his old website for his new one.

  7. #7
    SitePoint Member
Join Date: Jul 2005 | Posts: 19
Thanks for this, I'll give it a shot and let you know how I get on; much appreciated. The web server is our own physical dedicated box, so there are no problems with using resources; it only hosts a few websites at present.

Quote Originally Posted by cranial-bore View Post
This should give you the gist of it: …

