Fast, reliable method to check an external image URL

I’m working on an app that exports data to a KML file. It pulls the data from a database, then uses SimpleXML to write the file.
Some entities in the dataset have an associated image URL. As KML allows HTML in the descriptions, I’m adding the images to the description in <img> tags.
But before I do, I would like to check the image URL, and only add it if the URL works.

My first attempt was with file_exists(), but I found that unreliable: it always seems to return false for URLs (it really only works on local file paths), so I get no pictures at all.

I did a bit of searching for other solutions.
This was a nice short one I found on SO:-

function url_exists($url) {
    return curl_init($url) !== false;
}

It was very fast, like file_exists(), but it also seems unreliable, as it was passing broken URLs, which defeats the object of having a test.
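The reason that one can't work: curl_init() only allocates a handle and never actually contacts the server, so it returns a handle for virtually any string you give it (the hostname below is made up):

```php
<?php
// curl_init() performs no network request at all; it just creates a
// handle. So this "test" succeeds even for a host that cannot exist.
$handle = curl_init('http://this-host-does-not-exist.invalid/image.jpg');
var_dump($handle !== false); // true, yet the URL was never checked
```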

I found some using @get_headers(), e.g.:-

// Initialize an URL to the variable 
$url = "https://www.geeksforgeeks.org"; 
  
// Use get_headers() function 
$headers = @get_headers($url); 
  
// Use condition to check the existence of URL 
if($headers && strpos( $headers[0], '200')) { 
    $status = "URL Exist"; 
} 
else { 
    $status = "URL Doesn't Exist"; 
} 
  
// Display result 
echo($status); 

And using curl:-

$url = "http://www.domain.com/demo.jpg";
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_NOBODY, true);
$result = curl_exec($curl);
if ($result !== false)
{
  $statusCode = curl_getinfo($curl, CURLINFO_HTTP_CODE);
  if ($statusCode == 404)
  {
    echo "URL Not Exists";
  }
  else
  {
    echo "URL Exists";
  }
}
else
{
  echo "URL Not Exists";
}
curl_close($curl);
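Most of that slowness is usually DNS lookups and waiting on dead or sluggish hosts. As a sketch (the timeout values here are illustrative, not from the thread), capping the request with hard time limits stops one bad URL from stalling the whole export:

```php
<?php
// Sketch: HEAD-style check with hard time limits, so a dead server
// can only cost a couple of seconds at most. Tune the timeouts for
// your own data.
function url_ok(string $url): bool
{
    $curl = curl_init($url);
    curl_setopt_array($curl, [
        CURLOPT_NOBODY         => true,  // HEAD request: no image data
        CURLOPT_FOLLOWLOCATION => true,  // follow 3xx to the real image
        CURLOPT_CONNECTTIMEOUT => 2,     // give up connecting after 2 s
        CURLOPT_TIMEOUT        => 3,     // give up entirely after 3 s
        CURLOPT_RETURNTRANSFER => true,  // don't echo the response
    ]);
    $ok = curl_exec($curl) !== false
        && curl_getinfo($curl, CURLINFO_HTTP_CODE) === 200;
    curl_close($curl);
    return $ok;
}
```

Tighter timeouts trade a little accuracy (a very slow but working server counts as broken) for a bounded worst case per URL.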

Both of these appear to work, but I find they make things run quite slowly.
I did slim them down a bit and wrap them in functions, but they are essentially what I tried. These are posted as copied.

Is it just a case of having to swallow the fact that I am at the mercy of external server responses?
Or is there a reliable, fast way to get just enough info to say the URL is good, without downloading any image data at this time?

FWIW I ran some performance tests on those two.
The curl version is the faster of the two, at 7.92 sec.
The get_headers() one took 13.14 sec.
With no test, the time is 0.07 sec.

Clearly curl is the faster test, but still not what you'd call fast.

Here they are wrapped in functions as I’m testing:-

	function url_exists($url) {
		$headers = @get_headers($url);
		if ($headers && strpos($headers[0], '200') !== false) {
			return true;
		}
		return false;
	}

	function exists_url($url) {
		$curl = curl_init($url);
		curl_setopt($curl, CURLOPT_NOBODY, true);
		if (curl_exec($curl) !== false) {
			$statusCode = curl_getinfo($curl, CURLINFO_HTTP_CODE);
			if ($statusCode != 404) {
				curl_close($curl);
				return true;
			}
		}
		curl_close($curl);
		return false;
	}

file_get_contents()? I remember there is some setting that allows a URL as the argument of this function. Or am I wrong?

I can give it a try.
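For reference: yes, file_get_contents() accepts URLs when the allow_url_fopen ini setting is on, and the same stream-wrapper machinery lets you pass a context to get_headers(), so the check becomes a HEAD request with a timeout rather than a full GET. A sketch, with a placeholder URL:

```php
<?php
// Sketch: get_headers() with a stream context (needs allow_url_fopen=On),
// so the check is a HEAD request with a timeout rather than a full GET.
// The URL below is just a placeholder.
$context = stream_context_create([
    'http' => [
        'method'  => 'HEAD', // headers only, no image data
        'timeout' => 3,      // seconds; illustrative value
    ],
]);
$headers = @get_headers('https://www.example.com/image.jpg', false, $context);
$exists  = $headers !== false && strpos($headers[0], ' 200') !== false;
```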

Did you try using curl_multi_exec(…)?

These Topics may be of interest:

https://www.sitepoint.com/community/t/how-to-get-a-sites-last-modifed-date/326391/19”

https://www.sitepoint.com/community/t/quickly-test-your-domains-names-web-pages-and-web-files/326091”

Well, yes. You cannot know whether something exists without asking the server if it exists. The nicest way to do that is to issue an HTTP HEAD request and see what the response code is, which seems to be exactly what you’re doing (even though treating everything != 404 as existing is questionable, but ok).

The main problem you’re having is that everything is running serially, meaning you have to actively wait for a response before you can act on it. That’s time wasted where you could have done other things instead.

You could have a look at async requests in Guzzle for a way to send off multiple requests at once and then wait for the answers of all of them in parallel rather than serially.

So schematically, over time instead of this:

|--- request 1 ----|
                   | ----- request 2 ------|
                                           |------------- request 3 -------------|

You’d end up doing this

|--- request 1 ----|
|----- request 2 ------|
|------------- request 3 -------------|

Which saves [wall clock] time.

This is basically what @John_Betong was hinting at, but with a nicer interface IMO.

See https://docs.guzzlephp.org/en/stable/quickstart.html#async-requests

Don’t go overboard with this, i.e., don’t fire off 1000 requests at once; that will just eat up your CPU and get you nowhere. Also try to be nice to external servers, i.e. don’t throw more than ~8 requests at any one server at the same time, at the risk of getting rate-limited and/or banned.
How many you can do at once depends on CPU speed, amount of memory, OS settings, network speed etc etc. Best to just experiment with it and see what works best for you.
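For completeness, the same parallel idea works without adding Guzzle, using the curl extension's curl_multi API. A rough sketch (the function name and timeout values are mine, not from the thread):

```php
<?php
// Sketch: check several URLs in parallel with curl_multi.
// Returns [url => HTTP status code]; 0 means the request failed outright.
function check_urls(array $urls): array
{
    $multi   = curl_multi_init();
    $handles = [];
    foreach ($urls as $url) {
        $curl = curl_init($url);
        curl_setopt_array($curl, [
            CURLOPT_NOBODY         => true, // HEAD: no body downloaded
            CURLOPT_CONNECTTIMEOUT => 2,
            CURLOPT_TIMEOUT        => 5,
            CURLOPT_RETURNTRANSFER => true,
        ]);
        curl_multi_add_handle($multi, $curl);
        $handles[$url] = $curl;
    }
    // Drive all transfers at once; wall-clock time is roughly the
    // slowest single request, not the sum of all of them.
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running && curl_multi_select($multi) === -1) {
            usleep(1000); // avoid busy-looping if select() fails
        }
    } while ($running && $status === CURLM_OK);

    $results = [];
    foreach ($handles as $url => $curl) {
        $results[$url] = curl_getinfo($curl, CURLINFO_HTTP_CODE);
        curl_multi_remove_handle($multi, $curl);
        curl_close($curl);
    }
    curl_multi_close($multi);
    return $results;
}
```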


Can you supply the list of URLs that you tested, so that I can try them on my site which uses curl_multi_exec(…)?

This is a bit of a cop-out, but given the lag I was getting, I think it’s going to be more efficient to do a bit of housekeeping, weed out the broken links, and omit the test; in fact it is already done.

Funnily, all the examples I found would either accept only 200 or anything except 404, like that’s the only good response and that’s the only bad one.

Well, with 200 you’re sure it’s there. That’s the only one where you’re sure.
Any other 2xx response is weird, shouldn’t happen, and I’m not sure what it means for the existence of the image.
Any 3xx response should be followed to find the actual location of the image, and then check that instead.
Any 4xx means you’re doing something wrong with the request, which says nothing at all about the image. For a trivial HEAD request this shouldn’t happen, but you never know.
Any 5xx response means the server has problems and you can’t know whether or not the image exists, i.e., please check back later.
Lastly, any 1xx response also shouldn’t happen - I would retry on that too.
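Those rules boil down to a small decision table, sketched here as a helper (the function name and return labels are invented for illustration):

```php
<?php
// Sketch of the decision table above: given an HTTP status code from a
// HEAD request, decide what to do with the image URL. The function name
// and return labels are invented for illustration.
function classify_status(int $code): string
{
    if ($code === 200) {
        return 'exists';        // the only code that guarantees it
    }
    if ($code >= 300 && $code < 400) {
        return 'follow';        // redirect: check the Location target instead
    }
    if ($code >= 400 && $code < 500) {
        return 'request-error'; // our request is wrong; says nothing about the image
    }
    if ($code >= 500) {
        return 'retry';         // server trouble: check back later
    }
    return 'retry';             // 1xx or odd 2xx: shouldn't happen, retry
}
```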

I have an online test site that now uses the fastest method for finding the HTTP response codes. I tried numerous methods and found curl_multi_exec(…) is by far the best method.

Try inserting a list of the image URLs into the following site. The test/demo retrieves 44 results in about one and a half seconds.

The tests are being run between online servers and only the results rendered.

https://supiet2.tk/test

If you are interested and I am back on the desktop I will supply the relative function to obtain the results.

Your test is giving me a 404. :smile:

Testing in parallel certainly makes sense. It would need a bit of re-jigging to make it happen, as the DB results are processed serially in a foreach loop (actually a loop within a loop).

But I think I have done the right thing now by purging the old URLs.
It’s an old project I’m refactoring. From a code point of view I started from scratch, as I wrote it many years ago; the code was pretty awful and utilised some now-obsolete libraries. The part that did not start anew was the dataset: I kept the old data and just revamped the structure a bit, so there were quite a few obsolete URLs in there from way back. With those removed, I don’t need to lag the script with all those requests. That was maybe just a lazy way of dealing with outdated data, but it was taxing for the server(s).
But still, it’s good to know these things for times I may need to use it.


Try the link without the quotes.


That’s it, the quote is getting added to the end.