How can I pull content from another website?

For example, I have my own website and I want to pull content from another website,
but not the whole content, only the a href links. How can I do that?

Thanks

You can do it by sending an HTTP request to the other website and processing the reply.
In JavaScript you can use Ajax to send your request and POST/GET data. In PHP, you can use cURL (http://www.php.net/manual/en/book.curl.php) or the PECL HTTP extension (http://www.php.net/manual/en/book.http.php) to send requests and receive responses.

I use cURL, then parse the returned code to extract the data you're looking for into an array. From there your script can do with it as it wishes.

Can you show me an example script, please?


<?php
  // open a cURL session and fetch the page
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_URL, 'http://www.example.com');
  curl_setopt($ch, CURLOPT_HEADER, 1);          // include the response headers in the output
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);  // return the result instead of printing it
  $data = curl_exec($ch);                       // note: curl_exec() needs the handle
  curl_close($ch);
?>

The page HTML is now contained in $data.
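And since your original question was about pulling only the a href links, here is a rough sketch of how you could parse them out with DOMDocument (assuming the DOM extension is available; note CURLOPT_HEADER is set to 0 here so $data is only the HTML body):

<?php
  // fetch the page first (CURLOPT_HEADER = 0 so no headers are mixed into $data)
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_URL, 'http://www.example.com');
  curl_setopt($ch, CURLOPT_HEADER, 0);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
  $data = curl_exec($ch);
  curl_close($ch);

  // parse the HTML and collect every href into an array
  $dom = new DOMDocument();
  @$dom->loadHTML($data);   // @ hides warnings from sloppy real-world markup
  $links = array();
  foreach ($dom->getElementsByTagName('a') as $a) {
    if ($a->hasAttribute('href')) {
      $links[] = $a->getAttribute('href');
    }
  }
  print_r($links);          // $links now holds every href on the page
?>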

Thanks my friend, but I always receive the same error

"Fatal error: Call to undefined function curl_init() "

What is wrong?

Then it would seem that cURL isn't installed on your server. Run

phpinfo();

and look at the results to see if it lists cURL as an active extension.
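If you'd rather check from a script than dig through the phpinfo() output, a quick sketch like this also works:

<?php
  // extension_loaded() reports whether the cURL extension is enabled
  if (extension_loaded('curl')) {
    echo 'cURL is available';
  } else {
    echo 'cURL is NOT available - enable it in php.ini or ask your host';
  }
?>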

Never mind, I fixed it :] but I have another question

I’m trying to pull the results from Google with this code, searching for PHP:


<?php
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_URL, 'http://www.google.com/#q=PHP');
  curl_setopt($ch, CURLOPT_HEADER, 1);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
  $data = curl_exec($ch);
  file_put_contents("text.txt", $data);
  curl_close($ch);
?>

But inside the text.txt file I see NOT FOUND. Here it is, what's wrong here?

HTTP/1.1 404 Not Found
Content-Type: text/html; charset=UTF-8
X-Content-Type-Options: nosniff
Date: Mon, 28 Feb 2011 17:24:45 GMT
Server: sffe
Content-Length: 1354
X-XSS-Protection: 1; mode=block

Try adding some timeout options and looking at a proper page of a website.


<?php 
  $ch = curl_init(); 
  curl_setopt($ch, CURLOPT_URL, 'http://www.sitepoint.com'); 
  curl_setopt($ch, CURLOPT_HEADER, 0); 
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); 
  curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 300);
  curl_setopt($ch, CURLOPT_TIMEOUT, 300);
  $data = curl_exec($ch); 
  file_put_contents("text.txt", $data);
  curl_close($ch); 
?>

Google seems to have its main page at google.nl for me; if you scrape google.com you just get a redirect script, so that's probably why you were getting a Not Found on your test. The script above works OK here.
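If you do want to stick with google.com, you could also let cURL follow the redirect, something along these lines (a sketch, not tested against Google specifically):

<?php
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_URL, 'http://www.google.com/');
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
  curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // follow any Location: redirects that come back
  curl_setopt($ch, CURLOPT_MAXREDIRS, 5);      // safety limit so it can't loop forever
  $data = curl_exec($ch);
  curl_close($ch);
  file_put_contents("text.txt", $data);
?>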

There is also file_get_contents().
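For example, something as simple as this works (assuming allow_url_fopen is on in php.ini):

<?php
  // one call, no cURL needed, but requires allow_url_fopen = On
  $data = file_get_contents('http://www.example.com/');
?>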

:nono: No. Ignore this advice; it is dangerous.

If PHP is configured to allow file_get_contents() to pull files from another server, include and require will be able to do the same. This is a SERIOUS security flaw. Outside of that, file_get_contents isn’t designed to handle external requests and is vulnerable to DDoS attacks and buffer overflows.

It is advisable to configure PHP at all times so that external file reads are not allowed except through the cURL library. If the cURL library isn’t needed (which is often the case), it should be turned off as well.

I’m not sure where you are getting your info, but it is blatantly wrong. There is nothing insecure about setting allow_url_fopen to ON. First of all, this setting DOES NOT allow you to include() external URLs - there is a separate directive for that called “allow_url_include”, and even that is not insecure unless you write horrible code, or the server is compromised, in which case it’s a moot point. Yes, if you allow a user to set their own includes or to type in an unvalidated URL and eval the code or something, you could be in trouble - but that would be silly, wouldn’t it? And you could do the same thing with cURL.

Any time you are playing with third-party information you have to take precautions, but your statement above is a massive exaggeration. Want an ultra-secure server? Don’t connect it to the internet.

Want a secure one? Use functions as intended. file_get_contents has remote opening capability for historical backwards compatibility reasons, not because it was ever a good idea to latch that functionality onto that function to begin with.

While I believe a buffer overflow may be technically possible, wouldn’t setting the maxlen parameter to something sensible prevent it?
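For what it's worth, the kind of cap I mean would look something like this (the 1 MB limit is just an arbitrary example):

<?php
  // offset 0, read at most 1 MB of the remote response via the maxlen parameter
  $data = file_get_contents('http://www.example.com/', false, null, 0, 1048576);
?>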

I fail to see how file_get_contents() directly leads to a DDoS attack, though.

Also, the attacker has to have control over the content you are fetching; if you’re fetching data from a specific source this shouldn’t be a problem anyway.

The cURL library is technically just as susceptible to buffer overflows as the native fopen() function (which file_get_contents calls).

As far as I’m concerned, the reason to use cURL is the added features, not security.
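For example, cURL makes things like sending a POST with a custom user agent a one-step job (rough sketch; the URL and field names here are made up):

<?php
  // POST some form fields and identify ourselves with a custom user agent
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_URL, 'http://www.example.com/form');
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
  curl_setopt($ch, CURLOPT_USERAGENT, 'MyScraper/1.0');
  curl_setopt($ch, CURLOPT_POST, 1);
  curl_setopt($ch, CURLOPT_POSTFIELDS, array('q' => 'PHP', 'page' => 2));
  $data = curl_exec($ch);
  curl_close($ch);
?>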

Are you trying to scrape data from auto suggest? If so, try this URL:

'http://clients1.google.com/complete/search?hl=en&q=' . $keyword

If not, try this one:

'http://clients1.google.com/search?hl=en&q=' . $keyword
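Either way, you'd fetch it the same way as before and dump the raw response first to see what format comes back (a sketch, with $keyword just hard-coded here):

<?php
  $keyword = 'PHP'; // whatever you are searching for
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_URL,
    'http://clients1.google.com/complete/search?hl=en&q=' . urlencode($keyword));
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
  $data = curl_exec($ch);
  curl_close($ch);
  file_put_contents("text.txt", $data); // inspect this file to see the response format
?>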

The fact that the very first example for the function in the manual shows how to get the contents of an external URL would seem to disprove your point.

And people use cURL for a reason. file_get_contents is slow and limited. Maybe you’re not used to scraping, but file_get_contents is not the answer for more than basic stuff.