How can I pull content from another website?

For example, I have my own website and I want to pull content from another website,
but not the whole content, only the a href links. How can I do that?

Thanks

You can do it by sending an HTTP request to the other website and processing the reply.
In JavaScript you can use Ajax to send your request and POST/GET data. In PHP, you can use cURL (http://www.php.net/manual/en/book.curl.php) or the PECL HTTP extension (http://www.php.net/manual/en/book.http.php) to send requests and receive responses.

I use cURL, then parse the returned code to extract the data you're looking for into an array. From there your script can do with it as it wishes.

Can you show me an example script, please?


<?php
  // open a cURL session and fetch the page
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_URL, 'http://www.example.com');
  curl_setopt($ch, CURLOPT_HEADER, 1);          // include the response headers in the output
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);  // return the result instead of printing it
  $data = curl_exec($ch);                       // note: curl_exec() needs the handle
  curl_close($ch);
?>

The page HTML is now contained in $data.
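And since your original question was about pulling only the a href links, here is a rough sketch of how you could parse them out with DOMDocument (assuming the DOM extension is available; note CURLOPT_HEADER is set to 0 here so $data is only the HTML body):

<?php
  // fetch the page first (CURLOPT_HEADER = 0 so no headers are mixed into $data)
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_URL, 'http://www.example.com');
  curl_setopt($ch, CURLOPT_HEADER, 0);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
  $data = curl_exec($ch);
  curl_close($ch);

  // parse the HTML and collect every href into an array
  $dom = new DOMDocument();
  @$dom->loadHTML($data);   // @ hides warnings from sloppy real-world markup
  $links = array();
  foreach ($dom->getElementsByTagName('a') as $a) {
    if ($a->hasAttribute('href')) {
      $links[] = $a->getAttribute('href');
    }
  }
  print_r($links);          // $links now holds every href on the page
?>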

Thanks my friend, but I always receive the same error

"Fatal error: Call to undefined function curl_init() "

What is wrong?

Then it would seem that cURL isn't installed on your server. Run

phpinfo();

and look at the results to see if it lists cURL as an active extension.
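If you'd rather check from a script than dig through the phpinfo() output, a quick sketch like this also works:

<?php
  // extension_loaded() reports whether the cURL extension is enabled
  if (extension_loaded('curl')) {
    echo 'cURL is available';
  } else {
    echo 'cURL is NOT available - enable it in php.ini or ask your host';
  }
?>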

Never mind, I fixed it :] but I have another question

I’m trying to pull the results from Google with this code, searching for PHP:


<?php
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_URL, 'http://www.google.com/#q=PHP');
  curl_setopt($ch, CURLOPT_HEADER, 1);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
  $data = curl_exec($ch);
  file_put_contents("text.txt", $data);
  curl_close($ch);
?>

But inside the text.txt file I see NOT FOUND. Here it is, what's wrong here?

HTTP/1.1 404 Not Found
Content-Type: text/html; charset=UTF-8
X-Content-Type-Options: nosniff
Date: Mon, 28 Feb 2011 17:24:45 GMT
Server: sffe
Content-Length: 1354
X-XSS-Protection: 1; mode=block

Try adding some timeout options and looking at a proper page of a website.


<?php 
  $ch = curl_init(); 
  curl_setopt($ch, CURLOPT_URL, 'http://www.sitepoint.com'); 
  curl_setopt($ch, CURLOPT_HEADER, 0); 
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); 
  curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 300);
  curl_setopt($ch, CURLOPT_TIMEOUT, 300);
  $data = curl_exec($ch); 
  file_put_contents("text.txt", $data);
  curl_close($ch); 
?>

Google seems to have its main page at google.nl for me; if you scrape google.com you just get a redirect script, so that's probably why you were getting a Not Found on your test. The script above works OK here.
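If you do want to stick with google.com, you could also let cURL follow the redirect, something along these lines (a sketch, not tested against Google specifically):

<?php
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_URL, 'http://www.google.com/');
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
  curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // follow any Location: redirects that come back
  curl_setopt($ch, CURLOPT_MAXREDIRS, 5);      // safety limit so it can't loop forever
  $data = curl_exec($ch);
  curl_close($ch);
  file_put_contents("text.txt", $data);
?>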

There is also file_get_contents().
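For example, something as simple as this works (assuming allow_url_fopen is on in php.ini):

<?php
  // one call, no cURL needed, but requires allow_url_fopen = On
  $data = file_get_contents('http://www.example.com/');
?>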

:nono: No. Ignore this advice; it is dangerous.

If PHP is configured to allow file_get_contents() to pull files from another server, include and require will be able to do the same. This is a SERIOUS security flaw. Outside of that, file_get_contents isn’t designed to handle external requests and is vulnerable to DDoS attacks and buffer overflows.

It is advisable to configure PHP at all times so that external file reads are not allowed except through the cURL library. If the cURL library isn’t needed (which is often the case), it should be turned off as well.

I’m not sure where you are getting your info, but it is blatantly wrong. There is nothing insecure about setting allow_url_fopen to ON. First of all, this setting DOES NOT allow you to include() external URLs - there is a separate directive for that called “allow_url_include”, and even that is not insecure unless you write horrible code, or the server is compromised, in which case it’s a moot point. Yes, if you allow a user to set their own includes or to type in an unvalidated URL and eval the code or something, you could be in trouble - but that would be silly, wouldn’t it? And you could do the same thing with cURL.

Any time you are playing with third-party information you have to take precautions, but your statement above is a massive exaggeration. Want an ultra-secure server? Don’t connect it to the internet.

Want a secure one? Use functions as intended. file_get_contents has remote opening capability for historical backwards compatibility reasons, not because it was ever a good idea to latch that functionality onto that function to begin with.

While I believe a buffer overflow may be technically possible, wouldn’t setting the maxlen parameter to something sensible prevent it?
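For what it's worth, the kind of cap I mean would look something like this (the 1 MB limit is just an arbitrary example):

<?php
  // offset 0, read at most 1 MB of the remote response via the maxlen parameter
  $data = file_get_contents('http://www.example.com/', false, null, 0, 1048576);
?>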

I fail to see how file_get_contents() directly leads to a DDoS attack, though.

Also, the attacker has to have control over the content you are fetching; if you’re fetching data from a specific source this shouldn’t be a problem anyway.

The cURL library is technically just as susceptible to buffer overflows as the native fopen() function (which file_get_contents calls).

As far as I’m concerned, the reason to use cURL is the added features, not security.
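For example, cURL makes things like sending a POST with a custom user agent a one-step job (rough sketch; the URL and field names here are made up):

<?php
  // POST some form fields and identify ourselves with a custom user agent
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_URL, 'http://www.example.com/form');
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
  curl_setopt($ch, CURLOPT_USERAGENT, 'MyScraper/1.0');
  curl_setopt($ch, CURLOPT_POST, 1);
  curl_setopt($ch, CURLOPT_POSTFIELDS, array('q' => 'PHP', 'page' => 2));
  $data = curl_exec($ch);
  curl_close($ch);
?>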

Are you trying to scrape data from auto suggest? If so, try this URL:

'http://clients1.google.com/complete/search?hl=en&q=' . $keyword

If not, try this one:

'http://clients1.google.com/search?hl=en&q=' . $keyword
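Either way, you'd fetch it the same way as before and dump the raw response first to see what format comes back (a sketch, with $keyword just hard-coded here):

<?php
  $keyword = 'PHP'; // whatever you are searching for
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_URL,
    'http://clients1.google.com/complete/search?hl=en&q=' . urlencode($keyword));
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
  $data = curl_exec($ch);
  curl_close($ch);
  file_put_contents("text.txt", $data); // inspect this file to see the response format
?>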

The fact that the very first example for the function in the manual shows how to get the contents of an external URL would seem to disprove your point.

And people use cURL for a reason. file_get_contents is slow and limited. Maybe you’re not used to scraping, but file_get_contents is not the answer for more than basic stuff.