cURL Failing To Download Page

Hello There!

I just signed up, so how about you seniors welcome me here? ;)

I do not understand why the following PHP cURL code fails to fetch a webpage.

<?php 

//ERROR REPORTING CODES. 
declare(strict_types=1); 
ini_set('display_errors', '1'); 
ini_set('display_startup_errors', '1'); 
error_reporting(E_ALL); 
mysqli_report(MYSQLI_REPORT_ERROR | MYSQLI_REPORT_STRICT); 

/*
Download a Webpage via the HTTP GET Protocol using libcurl
*/
function _http ( $target, $referer ) {
	//Initialize Handle
	$handle = curl_init();
	//Define Settings
	curl_setopt ( $handle, CURLOPT_HTTPGET, true );
	curl_setopt ( $handle, CURLOPT_HEADER, true );
	curl_setopt ( $handle, CURLOPT_COOKIEJAR, "cookie_jar.txt" );
	curl_setopt ( $handle, CURLOPT_COOKIEFILE, "cookies.txt" );
	curl_setopt ( $handle, CURLOPT_USERAGENT, "web-crawler-tutorial-test" );
	curl_setopt ( $handle, CURLOPT_URL, $target );
	curl_setopt ( $handle, CURLOPT_REFERER, $referer );
	curl_setopt ( $handle, CURLOPT_FOLLOWLOCATION, true );
	curl_setopt ( $handle, CURLOPT_MAXREDIRS, 4 );
	curl_setopt ( $handle, CURLOPT_RETURNTRANSFER, true );
	//Execute Request
	$output = curl_exec ( $handle );
	//Close cURL handle
	curl_close ( $handle );
	//Separate Header and Body
	$separator = "\r\n\r\n";
	$header = substr( $output, 0, strpos( $output, $separator ) );
	$body_start = strlen( $header ) + strlen( $separator );
	$body = substr( $output, $body_start, strlen( $output ) - $body_start );
	//Parse Headers
	$header_array = Array();
	foreach ( explode ( "\r\n", $header ) as $i => $line ) {
		if($i === 0) {
			$header_array['http_code'] = $line;
			$status_info = explode( " ", $line );
			$header_array['status_info'] = $status_info;
		} else {
			list ( $key, $value ) = explode ( ': ', $line );
			$header_array[$key] = $value;
		}
	}
	//Form Return Structure
	$ret = Array("headers" => $header_array, "body" => $body );
	return $ret;
}
$page = _http( "https://potentpages.com", "" );
$headers = $page['headers'];
$http_status_code = $headers['http_code'];
$body = $page['body'];

?>

What is wrong? It seems Firefox is not interpreting the code.

In what way?

Your code doesn’t actually output anything. You set some values, but never echo them out.

OK, I echoed $headers and $body, but no luck.
This is what I see in my browser:

$line ) { if($i === 0) { $header_array['http_code'] = $line; $status_info = explode( " ", $line ); $header_array['status_info'] = $status_info; } else { list ( $key, $value ) = explode ( ': ', $line ); $header_array[$key] = $value; } } //Form Return Structure $ret = Array("headers" => $header_array, "body" => $body ); return $ret; } $page = _http( "https://potentpages.com", "" ); $headers = $page['headers']; $http_status_code = $headers['http_code']; $body = $page['body']; echo "HEADERS: $headers";
echo "BODY: $body"; ?>

What is wrong?
Why don’t you load my code in your browser and see for yourself?

If you SEE that code in your browser, your server isn’t running the PHP engine correctly.

But why is it not running properly? I use XAMPP. I have been using it for 2.5 years now, and the engine has always interpreted the code.
Do you see anything wrong in the code? If not, then it should work, right?
XAMPP is running.

What URL are you putting into your browser?

file:///C:/xampp/htdocs/work/download_pages_test.php
I get to it like this:
Notepad++: Menu -> Run -> Firefox

Yeah, I thought so.

Accessing a file:// address isn’t going through XAMPP. The browser loads the file directly from disk, which means it never goes through the PHP interpreter.

Try putting this into your browser:
http://localhost/work/download_pages_test.php


You are right. But why was Notepad++ opening the wrong path? In the past, I opened files like this in my browsers and they worked. Why the problem this time?
Anyway, I am getting this error now:

**Parse error**: syntax error, unexpected '<', expecting end of file in **C:\xampp\htdocs\work\download_pages_test.php** on line **57**

And so, I replaced this:

echo "HEADERS: $headers"; <br>

with this:

echo "HEADERS: $headers"; ?> <br> <?php 

Now I see cURL fetched the page. But at the top I get this error:

**Notice**: Array to string conversion in **C:\xampp\htdocs\work\download_pages_test.php** on line **57**
HEADERS: Array
BODY:

Line 57 is this:
echo "HEADERS: $headers"; ?>
<?php

$headers is an array, so echo tries an implicit conversion from array to string, and that does not give you anything useful: PHP converts any array to the literal string “Array”, no matter what the contents of the array are, and raises the notice you see.

For the headers array, you’ll need to use print_r($headers).
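
A minimal sketch of that suggestion. The $headers values here are made-up sample data, not a real fetch. Passing true as the second argument makes print_r() return the dump instead of printing it, so it can be wrapped in a pre tag:

```php
<?php
// Hypothetical sample data standing in for the parsed headers.
$headers = [
    'http_code'   => 'HTTP/2 200',
    'status_info' => ['HTTP/2', '200', ''],
    'server'      => 'nginx/1.15.6',
];

// echo "HEADERS: $headers"; would print "HEADERS: Array" plus a notice.
// print_r(..., true) returns a readable dump of the whole structure instead.
$dump = print_r($headers, true);

// <pre> keeps the line breaks; htmlspecialchars() keeps any markup visible.
echo '<pre>' . htmlspecialchars($dump) . '</pre>';
```

The nested status_info array is dumped too, which is why print_r() does not trigger the notice.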


Thanks. I forgot how to echo array values, so I googled it and did it like this:

$page = _http( "https://potentpages.com", "" );
$headers = $page['headers'];
$http_status_code = $headers['http_code'];
$body = $page['body'];

//echo "HEADERS: $headers"; ?> <br> <?php 
//echo "BODY: $body"; 

foreach($headers as $h => $h_value) {
    echo "Key=" . $h . ", Value=" . $h_value;
    echo "<br>";
}

foreach($body as $b => $b_value) {
    echo "Key=" . $b . ", Value=" . $b_value;
    echo "<br>";
}

And now I get this error:

Key=http_code, Value=HTTP/2 200

**Notice** : Array to string conversion in **C:\xampp\htdocs\work\download_pages_test.php** on line **61**
Key=status_info, Value=Array
Key=server, Value=nginx/1.15.6
Key=date, Value=Tue, 10 Sep 2019 15:22:06 GMT
Key=content-type, Value=text/html; charset=UTF-8
Key=vary, Value=Accept-Encoding
Key=x-powered-by, Value=PHP/7.2.21
Key=link, Value=; rel=shortlink
Key=expires, Value=Tue, 10 Sep 2019 15:22:05 GMT
Key=cache-control, Value=no-cache
Key=set-cookie, Value=browser=235511156a103a09c176e6cceee3306f;Path=/;Max-Age=31536000
Key=x-backend, Value=Apache

**Warning** : Invalid argument supplied for foreach() in **C:\xampp\htdocs\work\download_pages_test.php** on line **65**

Line 61 looks like this:

    echo "Key=" . $h . ", Value=" . $h_value;

Line 65 looks like this:

foreach($body as $b => $b_value) {

I will test with print_r later when I get back. In the meantime, you are welcome to help me get rid of these 2 errors.

Thanks for all your help so far.

$body is NOT an array, so trying to treat it as one will fail.

Within the loop, test the variable type and act accordingly, because the array items are not all strings:

foreach($body as $b => $b_value)
{
  if( is_string($b_value) ):  
    echo "Key=" . $b . ", Value=" . $b_value;
  else:
    // added pre to add line feeds
    echo '<pre>'; print_r($b_value); echo '</pre>';
  endif;
  echo "<br>";
}

Is this what is required?

http://johns-jokes.com/downloads/sp-d/johnyboy-curl-test/?url=https://potentpages.com
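
For reference, here is a sketch of the same type test applied to the headers array, with the body handled as a plain string. The $page array is made-up sample data shaped like the return value of _http(), and format_value() is a hypothetical helper name, not something from the original script:

```php
<?php
// Hypothetical sample of what _http() returns; values are illustrative.
$page = [
    'headers' => [
        'http_code'   => 'HTTP/2 200',
        'status_info' => ['HTTP/2', '200', ''],
        'server'      => 'nginx/1.15.6',
    ],
    'body' => '<html><body><h1>Hello</h1></body></html>',
];

// Render one header value: arrays go through print_r(), the rest is cast.
function format_value($value): string
{
    return is_array($value) ? print_r($value, true) : (string) $value;
}

$lines = [];
foreach ($page['headers'] as $key => $value) {
    $lines[] = 'Key=' . $key . ', Value=' . format_value($value);
}
echo '<pre>' . htmlspecialchars(implode("\n", $lines)) . '</pre>';

// The body is one string, so it is echoed directly, never looped over.
// htmlspecialchars() makes the browser show the markup instead of rendering it.
echo '<pre>' . htmlspecialchars($page['body']) . '</pre>';
```

Only the headers go through foreach; running foreach on the string body is what produces the "Invalid argument supplied for foreach()" warning.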


Yes, as expected, your print_r() produced this:

Array ( [http_code] => HTTP/2 200 [status_info] => Array ( [0] => HTTP/2 [1] => 200 [2] => ) [server] => nginx/1.15.6 [date] => Tue, 10 Sep 2019 16:41:19 GMT [content-type] => text/html; charset=UTF-8 [vary] => Accept-Encoding [x-powered-by] => PHP/7.2.21 [link] => ; rel=shortlink [expires] => Tue, 10 Sep 2019 16:41:18 GMT [cache-control] => no-cache [set-cookie] => browser=286194e16df63a75b50cd7225dc3632b;Path=/;Max-Age=31536000 [x-backend] => Apache ) 

Now, what I actually want is for cURL to put the whole page’s HEADER into one variable,
and the whole page’s HTML code into another variable.
So, which lines should I modify, and to what?

My code now looks like this:

<?php 

//ERROR REPORTING CODES. 
declare(strict_types=1); 
ini_set('display_errors', '1'); 
ini_set('display_startup_errors', '1'); 
error_reporting(E_ALL); 
mysqli_report(MYSQLI_REPORT_ERROR | MYSQLI_REPORT_STRICT); 

/*
Download a Webpage via the HTTP GET Protocol using libcurl
*/
function _http ( $target, $referer ) {
	//Initialize Handle
	$handle = curl_init();
	//Define Settings
	curl_setopt ( $handle, CURLOPT_HTTPGET, true );
	curl_setopt ( $handle, CURLOPT_HEADER, true );
	curl_setopt ( $handle, CURLOPT_COOKIEJAR, "cookie_jar.txt" );
	curl_setopt ( $handle, CURLOPT_COOKIEFILE, "cookies.txt" );
	curl_setopt ( $handle, CURLOPT_USERAGENT, "web-crawler-tutorial-test" );
	curl_setopt ( $handle, CURLOPT_URL, $target );
	curl_setopt ( $handle, CURLOPT_REFERER, $referer );
	curl_setopt ( $handle, CURLOPT_FOLLOWLOCATION, true );
	curl_setopt ( $handle, CURLOPT_MAXREDIRS, 4 );
	curl_setopt ( $handle, CURLOPT_RETURNTRANSFER, true );
	//Execute Request
	$output = curl_exec ( $handle );
	//Close cURL handle
	curl_close ( $handle );
	//Separate Header and Body
	$separator = "\r\n\r\n";
	$header = substr( $output, 0, strpos( $output, $separator ) );
	$body_start = strlen( $header ) + strlen( $separator );
	$body = substr( $output, $body_start, strlen( $output ) - $body_start );
	//Parse Headers
	$header_array = Array();
	foreach ( explode ( "\r\n", $header ) as $i => $line ) {
		if($i === 0) {
			$header_array['http_code'] = $line;
			$status_info = explode( " ", $line );
			$header_array['status_info'] = $status_info;
		} else {
			list ( $key, $value ) = explode ( ': ', $line );
			$header_array[$key] = $value;
		}
	}
	//Form Return Structure
	$ret = Array("headers" => $header_array, "body" => $body );
	return $ret;
}
$page = _http( "https://potentpages.com", "" );
$headers = $page['headers'];
$http_status_code = $headers['http_code'];
$body = $page['body'];

//echo "HEADERS: $headers"; ?> <br> <?php 
//echo "BODY: $body"; 

foreach($headers as $h => $h_value) {
    echo "Key=" . $h . ", Value=" . $h_value;
    echo "<br>";
}

foreach($body as $b => $b_value) {
    echo "Key=" . $b . ", Value=" . $b_value;
    echo "<br>";
}

/*
print_r($headers) ?> <br> <?php 
print_r($body)
*/
?>

And it shows this error:

Key=http_code, Value=HTTP/2 200

**Notice** : Array to string conversion in **C:\xampp\htdocs\work\download_pages_test.php** on line **61**
Key=status_info, Value=Array
Key=server, Value=nginx/1.15.6
Key=date, Value=Tue, 10 Sep 2019 16:46:48 GMT
Key=content-type, Value=text/html; charset=UTF-8
Key=vary, Value=Accept-Encoding
Key=x-powered-by, Value=PHP/7.2.21
Key=link, Value=; rel=shortlink
Key=expires, Value=Tue, 10 Sep 2019 16:46:47 GMT
Key=cache-control, Value=no-cache
Key=set-cookie, Value=browser=a7af8b7f00036600fbc1e794f0f367f8;Path=/;Max-Age=31536000
Key=x-backend, Value=Apache

**Warning** : Invalid argument supplied for foreach() in **C:\xampp\htdocs\work\download_pages_test.php** on line **65**

Folks,

I am actually having trouble here:

foreach($headers as $h) {
    echo "$h";
    echo "<br>";
}

foreach($body as $b) {
    echo "$b";
    echo "<br>";
}

I get error:

HTTP/2 200

**Notice** : Array to string conversion in **C:\xampp\htdocs\work\download_pages_test.php** on line **61**
Array
nginx/1.15.6
Tue, 10 Sep 2019 16:52:43 GMT
text/html; charset=UTF-8
Accept-Encoding
PHP/7.2.21
; rel=shortlink
Tue, 10 Sep 2019 16:52:42 GMT
no-cache
browser=e05c0d69b7d2c6f31d40f2e72e937a7a;Path=/;Max-Age=31536000
Apache

**Warning** : Invalid argument supplied for foreach() in **C:\xampp\htdocs\work\download_pages_test.php** on line **65**

Did you read my post #13 and follow the suggestions and view the link which not only shows the correct curl output but also the source code to display the output?

Sorry, I missed your post 13.
I checked out your link. NEAT! I like the spreadsheet values!
I also want a variable containing all the page’s HTML code, just like a variable contains all the page’s headers. Can you show me how to do that?

Also, care to explain what these sections and their code are on your page?

File:

Source: htmlspecialchars($tmp)

Source: strip_tags($tmp)

Source: htmlentities($tmp)


I tested your code:

foreach($body as $b => $b_value)
{
  if( is_string($b_value) ):  
    echo "Key=" . $b . ", Value=" . $b_value;
  else:
    // added pre to add line feeds
    echo '<pre>'; print_r($b_value); echo '</pre>';
  endif;
  echo "<br>";
}

Look:

<?php 

//ERROR REPORTING CODES. 
declare(strict_types=1); 
ini_set('display_errors', '1'); 
ini_set('display_startup_errors', '1'); 
error_reporting(E_ALL); 
mysqli_report(MYSQLI_REPORT_ERROR | MYSQLI_REPORT_STRICT); 

/*
Download a Webpage via the HTTP GET Protocol using libcurl
*/
function _http ( $target, $referer ) {
	//Initialize Handle
	$handle = curl_init();
	//Define Settings
	curl_setopt ( $handle, CURLOPT_HTTPGET, true );
	curl_setopt ( $handle, CURLOPT_HEADER, true );
	curl_setopt ( $handle, CURLOPT_COOKIEJAR, "cookie_jar.txt" );
	curl_setopt ( $handle, CURLOPT_COOKIEFILE, "cookies.txt" );
	curl_setopt ( $handle, CURLOPT_USERAGENT, "web-crawler-tutorial-test" );
	curl_setopt ( $handle, CURLOPT_URL, $target );
	curl_setopt ( $handle, CURLOPT_REFERER, $referer );
	curl_setopt ( $handle, CURLOPT_FOLLOWLOCATION, true );
	curl_setopt ( $handle, CURLOPT_MAXREDIRS, 4 );
	curl_setopt ( $handle, CURLOPT_RETURNTRANSFER, true );
	//Execute Request
	$output = curl_exec ( $handle );
	//Close cURL handle
	curl_close ( $handle );
	//Separate Header and Body
	$separator = "\r\n\r\n";
	$header = substr( $output, 0, strpos( $output, $separator ) );
	$body_start = strlen( $header ) + strlen( $separator );
	$body = substr( $output, $body_start, strlen( $output ) - $body_start );
	//Parse Headers
	$header_array = Array();
	foreach ( explode ( "\r\n", $header ) as $i => $line ) {
		if($i === 0) {
			$header_array['http_code'] = $line;
			$status_info = explode( " ", $line );
			$header_array['status_info'] = $status_info;
		} else {
			list ( $key, $value ) = explode ( ': ', $line );
			$header_array[$key] = $value;
		}
	}
	//Form Return Structure
	$ret = Array("headers" => $header_array, "body" => $body );
	return $ret;
}
$page = _http( "https://potentpages.com", "" );
$headers = $page['headers'];
$http_status_code = $headers['http_code'];
$body = $page['body'];

foreach($body as $b => $b_value)
{
  if( is_string($b_value) ):  
    echo "Key=" . $b . ", Value=" . $b_value;
  else:
    // added pre to add line feeds
    echo '<pre>'; print_r($b_value); echo '</pre>';
  endif;
  echo "<br>";
}

/*
print_r($headers) ?> <br> <?php 
print_r($body)
*/
?>

And I get this error:

**Warning** : Invalid argument supplied for foreach() in **C:\xampp\htdocs\work\download_pages_test.php** on line **57**

Line 57 is:

foreach($body as $b => $b_value)

Do you mind replying to posts 17 & 18?
I need a variable to hold all the page’s HTML as its value.
The $headers variable is holding the page’s headers. So far so good.
Now, $body should have held all the page’s HTML. So, show me how to do that.

And show me how to get a variable (e.g. $page_content) to hold all the page’s content with the HTML, CSS, XML, etc. code parsed out. $page_content should hold just plain text, like what you would show to a website visitor, not what you would show to a web crawler.
That way, I can save the variable’s value to a file. Voilà! I have the fetched page’s content saved on my HDD like an article. You understand?
In the end, I will get my script to dump the header, body and plain-text content of the page into my MySQL db like so:

COLS
url|header|body|plain_text_content

That way, I build my search engine’s INDEX. Understand?

Thanks!

Anyone else is welcome to get creative here and show me how you did it.

Thanks

It looks like $body is not an array or an object and the foreach loop is complaining that the variable type is incorrect.

Try these statements before the loop:

  1. echo gettype($body);
  2. echo '<pre>'; var_dump($body); echo '</pre>';
  3. echo '<pre>'; print_r($body); echo '</pre>';
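
If those checks report string, then $body already holds the page’s HTML and needs no loop at all. Below is a rough sketch of splitting a raw response into header text and body by header size rather than by searching for the first blank line. It assumes the size comes from curl_getinfo($handle, CURLINFO_HEADER_SIZE), read before curl_close(). The function name split_response() and the sample response are made up for illustration, so no network is needed to try it:

```php
<?php
// Split a raw response (headers + body in one string) at a known header size.
// $headerSize is assumed to come from
// curl_getinfo($handle, CURLINFO_HEADER_SIZE), read before curl_close().
function split_response(string $raw, int $headerSize): array
{
    $headerText = substr($raw, 0, $headerSize);
    $body       = substr($raw, $headerSize); // everything after the headers

    $headers = [];
    foreach (explode("\r\n", trim($headerText)) as $line) {
        // A limit of 2 keeps values that contain ':' intact. Lines without
        // a colon (status lines, blank separators) are skipped.
        $parts = explode(':', $line, 2);
        if (count($parts) === 2) {
            $headers[strtolower(trim($parts[0]))] = trim($parts[1]);
        }
    }

    return ['header_text' => $headerText, 'headers' => $headers, 'body' => $body];
}

// Hypothetical raw response used for illustration only.
$raw = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html><body>Hi</body></html>";
$headerSize = strpos($raw, "\r\n\r\n") + 4; // stand-in for CURLINFO_HEADER_SIZE

$page = split_response($raw, $headerSize);
echo gettype($page['body']), "\n";          // the body is a string
echo htmlspecialchars($page['body']), "\n"; // shown as text, not rendered
```

Using the reported header size matters when redirects are followed, because the raw output can then contain more than one header block before the body.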

I also want a variable containing all the page’s html code just like a variable contains all the page’s header. Can you show me how to do that ?

cURL returns all of the page’s HTML code; the following native PHP functions are used so that the output is shown in a human-readable format.

Check the free, excellent online PHP Manual for a detailed explanation of these functions:

htmlspecialchars($tmp)

strip_tags($tmp)

htmlentities($tmp)

$tmp is the raw curl output result.
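
A quick sketch of what each of the three does. The HTML assigned to $tmp is a made-up stand-in for real cURL output, and the html_entity_decode() step at the end is an extra assumption about how one might get closer to plain text, not part of the linked page:

```php
<?php
// Hypothetical stand-in for the raw HTML that cURL returned.
$tmp = "<h1>Fish &amp; Chips</h1>\n<p>Open <b>daily</b></p>";

// htmlspecialchars(): escapes & < > and quotes, so the browser displays
// the markup as text instead of rendering it.
$escaped = htmlspecialchars($tmp);

// htmlentities(): does the same, and also converts every other character
// that has a named entity, e.g. U+00E9 (e with acute) becomes &eacute;.
$entities = htmlentities("caf\u{e9}", ENT_QUOTES, 'UTF-8');

// strip_tags(): removes the tags and keeps the text between them.
$text = strip_tags($tmp);

// Decoding the remaining entities gives text closer to what a visitor reads.
$plain = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $escaped, "\n", $entities, "\n", $plain, "\n";
```

Note that strip_tags() only drops the tags themselves: text inside script and style elements is kept, and adjacent elements are joined with no added whitespace, so real pages usually need more cleanup than this.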

It’s very late here and I am off to read my book :)
