Interpreting CURL response when accessing a website?

carlosbcg · January 8, 2012, 12:14am

I am rather confused regarding the response that CURL is returning me when I try and access a website.

What I am after is the response code. Which is all well and good. Only thing is that CURL seems to return a variety of response codes and I am unclear as to how they tie together.

Take for example the following response…

HTTP/1.1 301 Moved Permanently
Location: http://www.google.com/
Content-Type: text/html; charset=UTF-8
Date: Sat, 07 Jan 2012 23:49:39 GMT
Expires: Mon, 06 Feb 2012 23:49:39 GMT
Cache-Control: public, max-age=2592000
Server: gws
Content-Length: 219
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN

HTTP/1.1 200 OK
Date: Sat, 07 Jan 2012 23:49:39 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
Set-Cookie: PREF=ID=3bdab1cd4225c488:FF=0:TM=1325980179:LM=1325980179:S=9V1GOM2Gf8DlN_-k; expires=Mon, 06-Jan-2014 23:49:39 GMT; path=/; domain=.google.com
Set-Cookie: NID=54=dZFexKNdSVB943cwresQuwA4wJVZiuar4BLjbEJ-EuUZblmkOaNDMiUBvACmxSzOMF_ZedjapSR_zkP4oPku7kBUhLx6l6rxnDr_CYtawAOPlFLWy7xLE0oIAKOP0DTM; expires=Sun, 08-Jul-2012 23:49:39 GMT; path=/; domain=.google.com; HttpOnly
P3P: CP="This is not a P3P policy! See http://www.google.com/support/accounts/bin/answer.py?hl=en&answer=151657 for more info."
Server: gws
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
Transfer-Encoding: chunked

Array
(
[0] => HTTP/1.1 301
[1] => 301
)

Here is the PHP code using CURL that produced it.



<?php
// code mostly from: http://w-shadow.com/blog/2007/08/02/how-to-check-if-page-exists-with-curl/
$url = "http://google.com";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);

/* set the user agent - might help, doesn't hurt */
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)');
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);

/* try to follow redirects */
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);

/* timeout after the specified number of seconds. assuming that this script runs
 on a server, 20 seconds should be plenty of time to verify a valid URL.  */
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 15);
curl_setopt($ch, CURLOPT_TIMEOUT, 20);

/* don't download the page, just the header (much faster in this case) */
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_HEADER, true);

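/* handle HTTPS links */
/* strpos() returns 0 (falsy) when the URL starts with "https", so compare explicitly */
if(strpos($url, 'https') === 0) {
 /* 2 checks the certificate's host name; the value 1 is deprecated in libcurl */
 curl_setopt($ch, CURLOPT_SSL_VERIFYHOST,  2);
 curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
}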

$response = curl_exec($ch);
//curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

print_r($response);

/*  get the status code from HTTP headers */
if(preg_match('/HTTP\\/1\\.\\d+\\s+(\\d+)/', $response, $matches)) {
   print_r($matches);
}
?>

My question is this…

There is one response which says “HTTP/1.1 301 Moved Permanently” and another which says “HTTP/1.1 200 OK”. Are they both correct? How so?

I mean the url I am going to is “http://google.com”. If I understand the response received correctly is it saying that this url is first redirected to “http://www.google.com” and that the response from an attempt to access that page is 200 OK? Is that how this works?

Are responses always a string of them such that one follows what happens down the response tree like this?

Any input anyone cares to share with me would be appreciated.

Thanks.

Carlos

JamesKenny · January 12, 2012, 4:08pm

http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html

carlosbcg · January 12, 2012, 8:01pm

Thanks for the link James but that doesn’t help much at all. I mean the particular codes are easily gotten over the internet. I had looked at them and know where I might be able to access them.

My question revolved around mapping the response given to me by CURL (which is not numeric) to the particular codes and otherwise making heads or tails of what it tells me.

In the case of the example I gave CURL returns TWO HTTP status codes for example. So it’s not a simple matter of saying to myself “Aha! The status code is —”. Which of the two is it? Or is it both? How do these two (or more) status codes relate to each other and to the one request for a page that I might make through CURL?

See what I mean?

Giving me a page of numeric descriptions telling me what this or that code means doesn’t help much, in that said page does not help me understand the relationships inherent in the CURL response and map its word responses to particular codes.

Carlos

JamesKenny · January 12, 2012, 8:34pm

I am sorry, I am guilty of the very same thing I already told somebody else they were wrong for: not reading the entire thread (what an ass, huh?).

Yes, your assumption is correct: it first connects to Google and is then redirected to a TOS-type page, most likely because Google is one of the most widely scraped sites on the planet, and accessing their site the way you are trying to connect violates Google’s TOS. They will ban your IP (they want you to buy into their API services and use those domains).
When I saw:
Array
(
[0] => HTTP/1.1 301
[1] => 301
)

I thought you wanted to know how to handle them.

carlosbcg · January 12, 2012, 9:21pm

No problem James.

Regarding it being against the Google TOS to access Google SERPs the way I am, can you do me a favor and point me to the specific paragraph in the Google TOS that says…

That saving a SERP page to my local computer is against the TOS?
That accessing that SERP page any way I please on my own computer is against the TOS?

What Google is concerned with is people throwing automated scripts at Google to scrape its results, taking up bandwidth and disrupting its services. What I do with saved pages on my own computer is my business and none of Google’s per se, any more than how I read a book that I save in ebook format to my own computer is anyone’s business. I seriously doubt that Google will sue me (and win) regarding my reading the links out of its SERP pages stored on my computer. There is no harm done to Google at all, and as far as Google knows, it is no different than if I went through its pages saved to my local computer one by one and copied each link to a local spreadsheet manually (as in my doing the work myself one link at a time). The results would be the same, with no difference at all regarding any impact (or not) on Google.

Incidentally, if what I am doing is indeed against the Google TOS, does that make writing Greasemonkey JavaScript to present Google SERPs in a more useful way to myself locally also a violation?

Where does fair use of Google SERP results kick in under copyright laws (which laws override the Google TOS by the way)?

If Google is able to scrape (and blatantly scrape) whatever it pleases online, including portions of books copyrighted by others under Fair Use, should we not be allowed, at bare minimum, to access and read their SERP results saved locally to our computers any way we please?

Incidentally they have never banned my IP for saving SERP results locally to my computer one page at a time :). Which is the only thing they could possibly ban me for regarding what I do over the internet.

Carlos

JamesKenny · January 12, 2012, 9:53pm

I don’t see one at the top of a search. My experience was with creating a keyword tool, and I am 99% positive the TOS is worded in a way that only allows direct access via a browser or device intended for “viewing” the content, along with quite a few other stipulations. Hence the reason I combined the BOSS and AdWords APIs to complete the tool.

LOL, please, someone answer this one. Between the government and Google, God only knows.

Honestly, I couldn’t care less what “you” do as long as it doesn’t affect “me”, but running into similar errors with cURL and Google myself is why I responded and gave my two cents.

carlosbcg · January 12, 2012, 10:09pm

What I do locally on my computer definitely does not affect you :).

Nor do I ever access Google directly using CURL or anything other than my browser. I view Google results through a browser just like anybody else over the Internet. The only difference is that I save what I view (manually through standard browser functionality) to locally stored HTML pages which I later, apart from accessing Google over the internet, view in whatever way I please…offline as far as Google is concerned.

Carlos

JamesKenny · January 12, 2012, 10:24pm

You keep saying localhost, but Location: http://www.google.com/ says differently.

Back on topic though: what exactly do you want to do with the error codes?
And what’s the “entire” array format you are having trouble with?

Is your problem implementing the above array?
Array
(
[0] => HTTP/1.1 301
[1] => 301
)

JamesKenny · January 12, 2012, 11:07pm

http://en.wikipedia.org/wiki/List_of_HTTP_header_fields <<< can explain headers better than me.
Mix that (note the P3P entry on Wikipedia) with
P3P: CP="This is not a P3P policy! See http://www.google.com/support/accoun...&answer=151657 for more info." <<< in your response, select the link.

carlosbcg · January 13, 2012, 12:16am

Oops. You are correct James. My bad. You are absolutely correct in saying that such access is a violation of their TOS. I stand corrected. It didn’t even dawn on me that I was doing that.

I used Google because I had been looking for a website that would be convenient to use in this thread without giving away the domain my client is having me work on: a website that was returning a two-tiered status code as Google does. I tried IBM, Microsoft and other popular sites, and none was returning the type of response that I was looking for (in line with what I was getting from my client site).

Back on topic though: what exactly do you want to do with the error codes?
And what’s the “entire” array format you are having trouble with?

You see the two status codes?

The first being HTTP/1.1 301 Moved Permanently and second being HTTP/1.1 200 OK?

My original post was mainly about my confusion as to which one was the HTTP status. I was after the HTTP status.

Turns out, if I am understanding this correctly, that they both are. The first one (the 301) is returned for the original request, followed by the next one for the page to which the request is redirected (which itself returns an HTTP status of 200 OK).
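
To make that chain visible, here is a minimal sketch that splits the raw header text into one entry per response. It assumes the text comes from curl_exec() with CURLOPT_HEADER and CURLOPT_FOLLOWLOCATION enabled, as in the script from the first post. The function name parse_redirect_chain and the sample string are made up for illustration and do not come from the thread.

```php
<?php
// Hypothetical helper (not from the thread): split the raw header text into
// one entry per response so a 301 -> 200 chain can be inspected hop by hop.
function parse_redirect_chain(string $rawHeaders): array
{
    $hops = [];
    // Each response's header block is separated from the next by a blank line.
    $blocks = preg_split('/\R\R+/', trim($rawHeaders));
    foreach ($blocks as $block) {
        // Status line, e.g. "HTTP/1.1 301 Moved Permanently" or "HTTP/2 200".
        if (!preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#', $block, $m)) {
            continue; // not a header block
        }
        // A redirect response normally names its target in a Location header.
        $location = null;
        if (preg_match('/^Location:\s*(.+)$/mi', $block, $loc)) {
            $location = trim($loc[1]);
        }
        $hops[] = ['status' => (int) $m[1], 'location' => $location];
    }
    return $hops;
}

// Shortened sample modelled on the response shown in the first post.
$raw = "HTTP/1.1 301 Moved Permanently\r\nLocation: http://www.google.com/\r\n\r\n"
     . "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n";

foreach (parse_redirect_chain($raw) as $i => $hop) {
    echo $i, ': ', $hop['status'], ' ', $hop['location'] ?? '(no redirect)', "\n";
}
```

For the sample string this lists the 301 with its Location target first and the 200 second, which matches the reading that one request produced two responses in sequence.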

Some pages over the internet appear to return a veritable cascade of redirections. I guess my best bet when reporting the HTTP status is to report the last one since that is the one that counts I suppose.
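
Reporting the last status could look roughly like the sketch below. The first function collects every status line from the raw header text and keeps the final one; the preg_match() call in the original script stops at the first match, which is why it printed 301. The second function is an alternative that asks cURL directly through curl_getinfo(), which reports the code of the last response received. Both function names are hypothetical, and the second one makes a real network request when called.

```php
<?php
// Hypothetical helper (not from the thread): the status code of the LAST
// response found in raw header text, or null if there is no status line.
function final_status_code(string $rawHeaders): ?int
{
    // The "m" flag lets ^ match at the start of every line, so every status
    // line in a redirect chain is collected, not just the first one.
    if (!preg_match_all('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#m', $rawHeaders, $m)) {
        return null;
    }
    return (int) end($m[1]);
}

// Alternative sketch: let cURL do the bookkeeping. curl_getinfo() has to be
// called after curl_exec(). Calling this function hits the network.
function fetch_final_status(string $url): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_NOBODY         => true,
        CURLOPT_TIMEOUT        => 20,
    ]);
    curl_exec($ch);
    return [
        'status'    => curl_getinfo($ch, CURLINFO_HTTP_CODE),      // last response code
        'redirects' => curl_getinfo($ch, CURLINFO_REDIRECT_COUNT), // hops followed
        'final_url' => curl_getinfo($ch, CURLINFO_EFFECTIVE_URL),  // where it ended up
    ];
}

// Shortened sample modelled on the response shown in the first post.
$raw = "HTTP/1.1 301 Moved Permanently\r\nLocation: http://www.google.com/\r\n\r\n"
     . "HTTP/1.1 200 OK\r\n\r\n";
var_dump(final_status_code($raw)); // int(200)
```

Parsing the text keeps the whole chain available for logging, while curl_getinfo() avoids regex work when only the final code matters.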

Is your problem implementing the above array?

No. It had nothing to do with understanding or implementing the array. I was just confused as to how to tell which one was THE HTTP status since there were multiple ones.

I believe I understand things now though.

Thanks for your input.

Carlos
