Image Scraping with Symfony’s DomCrawler

A photographer friend of mine asked me to find and download images of picture frames from the internet. I eventually landed on a web page that offered a number of them for free, but there was a problem: there was no link to download all the images at once.

I didn’t want the tedium of downloading the images one by one, so I wrote this PHP class to find, download and zip all the images found on the website.

How the Class Works

It searches a URL for images, downloads and saves the images into a folder, creates a ZIP archive of the folder and finally deletes the folder.

The class uses Symfony’s DomCrawler component to search for all image links found on the webpage and a custom zip function that creates the zip file. Credit to David Walsh for the zip function.

Coding the Class

The class consists of five private properties and eight public methods including the __construct magic method.

Below is the list of the class properties and their roles.
1. $folder: stores the name of the folder that contains the scraped images.
2. $url: stores the webpage URL.
3. $html: stores the HTML document code of the webpage to be scraped.
4. $fileName: stores the name of the ZIP file.
5. $status: saves the status of the operation, i.e. whether it was a success or a failure.

Let’s get started building the class.

Create the class ZipImages containing the above five properties.

<?php
class ZipImages {
    private $folder;
    private $url;
    private $html;
    private $fileName;
    private $status;

Create a __construct magic method that accepts a URL as an argument. It stores the URL, fetches the page’s HTML and sets the default folder name.

public function __construct($url) {
    $this->url = $url;
    $this->html = file_get_contents($this->url);
    // bail out early if the page could not be fetched
    if ($this->html === false) {
        throw new \Exception('Failed to fetch the page at '.$url);
    }
    $this->setFolder();
}

The created ZIP archive has a folder that contains the scraped images. The setFolder method below configures this.

By default, the folder name is set to image, but the method provides an option to change the name of the folder by simply passing the new name as its argument.

public function setFolder($folder = "image") {
    // if the folder doesn't exist, attempt to create it
    if (!file_exists($folder) && !mkdir($folder)) {
        throw new \Exception('Failed to create folder '.$folder);
    }
    // store the folder name in the $folder property
    $this->folder = $folder;
}

setFileName provides an option to change the name of the ZIP file with a default name set to zipImages:

public function setFileName($name = "zipImages") {
    $this->fileName = $name;
}

At this point, we instantiate the Symfony crawler component to search for images, then download and save all the images into the folder.

public function domCrawler() {
    // instantiate the Symfony DomCrawler component
    $crawler = new Crawler($this->html);
    // create an array of all scraped image links
    $result = $crawler
        ->filterXPath('//img')
        ->extract(array('src'));

    // download and save each image to the folder
    foreach ($result as $image) {
        $path = $this->folder."/".basename($image);
        $file = file_get_contents($image);
        if ($file === false || file_put_contents($path, $file) === false) {
            throw new \Exception('Failed to download image '.$image);
        }
    }
}
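One caveat: extract(array('src')) returns the src values exactly as they appear in the markup, so relative paths will fail when passed to file_get_contents. A hypothetical helper (not part of the original class) could resolve each src against the page URL before fetching:

```php
// Hypothetical helper: resolves an image src against the page URL so
// relative paths can be downloaded. Only http(s) URLs are considered.
function resolveImageUrl($src, $baseUrl) {
    // already absolute: leave it alone
    if (preg_match('#^https?://#i', $src)) {
        return $src;
    }
    $parts = parse_url($baseUrl);
    $root  = $parts['scheme'] . '://' . $parts['host'];
    // root-relative src: append to scheme + host
    if (substr($src, 0, 1) === '/') {
        return $root . $src;
    }
    // path-relative src: append to the base URL's directory
    $dir = isset($parts['path']) ? rtrim(dirname($parts['path']), '/') : '';
    return $root . $dir . '/' . $src;
}
```

With such a helper in place, the foreach loop could call resolveImageUrl($image, $this->url) before downloading.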

After the download is complete, we compress the image folder to a ZIP Archive using our custom create_zip function.

public function createZip() {
    $folderFiles = scandir($this->folder);
    if (!$folderFiles) {
        throw new \Exception('Failed to scan folder');
    }
    $fileArray = array();
    foreach($folderFiles as $file){
        if (($file != ".") && ($file != "..")) {
            $fileArray[] = $this->folder."/".$file;
        }
    }

    if (create_zip($fileArray, $this->fileName.'.zip')) {
        $this->status = <<<HTML
File successfully archived. <a href="$this->fileName.zip">Download it now</a>
HTML;
    } else {
        $this->status = "An error occurred while creating the ZIP archive.";
    }
}

Lastly, we delete the created folder after the ZIP file has been created.

public function deleteCreatedFolder() {
    $dp = opendir($this->folder) or die('ERROR: Cannot open directory');
    while (($file = readdir($dp)) !== false) {
        if ($file != '.' && $file != '..') {
            if (is_file("$this->folder/$file")) {
                unlink("$this->folder/$file");
            }
        }
    }
    // release the directory handle before removing the now-empty folder
    closedir($dp);
    rmdir($this->folder) or die('ERROR: Could not delete folder');
}

getStatus reports the status of the operation, i.e. whether it succeeded or an error occurred.

public function getStatus() {
    echo $this->status;
}

Process all the methods above.

public function process() {
    $this->domCrawler();
    $this->createZip();
    $this->deleteCreatedFolder();
    $this->getStatus();
}

You can download the full class from GitHub.

Class Dependency

For the class to work, the DomCrawler component and the create_zip function need to be included. You can download the code for this function here.
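For reference, here is a minimal sketch of a create_zip() function built on PHP’s ZipArchive, compatible with how createZip calls it; David Walsh’s original differs in its details, so treat this as an approximation:

```php
// Minimal sketch of a create_zip() function compatible with the calls
// in the ZipImages class; not David Walsh's exact implementation.
function create_zip($files = array(), $destination = '', $overwrite = false) {
    // refuse to clobber an existing archive unless asked to
    if (file_exists($destination) && !$overwrite) {
        return false;
    }
    // keep only the files that actually exist on disk
    $validFiles = array();
    foreach ($files as $file) {
        if (file_exists($file)) {
            $validFiles[] = $file;
        }
    }
    if (empty($validFiles)) {
        return false;
    }
    $zip = new ZipArchive();
    $flags = $overwrite
        ? ZipArchive::CREATE | ZipArchive::OVERWRITE
        : ZipArchive::CREATE;
    if ($zip->open($destination, $flags) !== true) {
        return false;
    }
    foreach ($validFiles as $file) {
        // store each file under its base name inside the archive
        $zip->addFile($file, basename($file));
    }
    $zip->close();
    return file_exists($destination);
}
```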

Download and install the DomCrawler component via Composer simply by adding the following require statement to your composer.json file:

"symfony/dom-crawler": "2.3.*@dev"

Run $ php composer.phar install to download the library and generate the vendor/autoload.php autoloader file.
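If your project doesn’t have a composer.json yet, a minimal one for this class could look like the following (the version constraint matches the require statement above):

```json
{
    "require": {
        "symfony/dom-crawler": "2.3.*@dev"
    }
}
```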

Using the Class

  • Make sure all required files are included, via autoload or explicitly.
  • Call the setFolder and setFileName methods and pass in their respective arguments. Only call the setFolder method when you need to change the default folder name.
  • Call the process method to put the class to work.
<?php
    require_once 'zipfunction.php';
    require_once 'vendor/autoload.php';
    use Symfony\Component\DomCrawler\Crawler;

    //instantiate the ZipImages class
    $object = new ZipImages('http://sitepoint.com');
    // set the folder name
    $object->setFolder('pictureFrames');
    // set the zip file name
    $object->setFileName('myframes');
    // initialize the class process
    $object->process();

Summary

In this article, we learned how to create a simple PHP image scraper that automatically compresses downloaded images into a ZIP archive. If you have alternative solutions or suggestions for improvement, please leave them in the comments below; all feedback is welcome!


  • http://www.bitfalls.com/ Bruno Skvorc

    You can’t prevent it from happening by reducing the exposure of known techniques – quite the opposite. Now that you know this approach, you can easily block it directly. Stay tuned for an upcoming article on prevention of hotlinking and static resource theft.

  • chronicler_Isiah

    That’s fine, Bruno, but I think if SitePoint sees itself as a responsible resource for web developers then it needs an article on how this scraping code can be countered.

    It would go someway to mitigating the decision to publish it in the first place.

    Best regards
    Isiah

    • http://www.bitfalls.com/ Bruno Skvorc

      We’ve published articles on SQL injection, too, and how to perform this attack. The end result is that at least a few websites are now immune to it. We’ll continue publishing shortcuts, exploits and hacks alongside regular tutorials and articles, simply because they both demonstrate popular vulnerabilities, and because they’re a good show of PHP’s power in simplicity. We don’t regret publishing this, nor will we, so we don’t feel it needs mitigating. But yes, an anti-hotlink article is coming soon.

    • Taylor Ren

      Resources and copyright shall be preserved. But technology itself is neutral.

      In general, knowing the unknown is the best defense.

  • sebastiaan hilbers

    You know that doing stuff in the constructor is considered a bad practice? Just saying…

  • Ionut Iulian

    What about using “copy” instead of file_get_contents and file_put_contents?

    • http://www.bitfalls.com/ Bruno Skvorc

      Feel free to use copy if you like it more, sure

  • R CHAINI

    If you try to save the page from IE with extension as *.html , then all the images will get downloaded to a folder.

    • http://www.bitfalls.com/ Bruno Skvorc

      Indeed, but this approach lets you loop through paginated entries and harvest an entire domain as long as you know its URL pattern.

  • http://www.php.net/ Kalle Sommer Nielsen

    I honestly don’t see the big point in this, considering you can write this with PHP’s built-in ZipArchive class and the DOM extension without adding the extra layer. Not only is it faster, they already have a great API without having to clutter a framework into the mix.
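    For readers curious about the dependency-free route Kalle describes, a rough sketch using only PHP’s bundled DOM and ZipArchive extensions might look like this (the function name is a placeholder, and relative src paths are not resolved):

```php
// Rough sketch of a dependency-free variant using PHP's bundled DOM and
// ZipArchive extensions; hypothetical, not the article's implementation.
function zipImagesWithoutSymfony($html, $zipPath) {
    $doc = new DOMDocument();
    // suppress warnings from imperfect real-world markup
    @$doc->loadHTML($html);
    $zip = new ZipArchive();
    if ($zip->open($zipPath, ZipArchive::CREATE | ZipArchive::OVERWRITE) !== true) {
        return false;
    }
    foreach ($doc->getElementsByTagName('img') as $img) {
        $src  = $img->getAttribute('src');
        $data = @file_get_contents($src);
        if ($data !== false) {
            // write straight into the archive; no temp folder needed
            $zip->addFromString(basename($src), $data);
        }
    }
    return $zip->close();
}
```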