Folks,
Why do I keep getting this error every time I enter a URL in the URL field of the following web proxy:
“The specified URL could not be returned due to a status code of 400.”
<?php
error_reporting(0);
session_start();
//Settings Instructions: https://darkpolitics.wordpress.com/2009/12/29/create-your-own-web-proxy-server/
// turn debug messages on when debugging your proxy
//$DEBUG = true;
$DEBUG = false;
// set this to the location of the webproxy page if you know where it's going to be, otherwise this function will work it out.
// for performance you should hardcode this to your webproxy location
//$PROXYURL = "http://www.mysite.com/myproxy.php";
$PROXYURL = get_current_location(); // works out current scripts location
// urls from orig search will be $_POST but then future links we proxify will be $_GET
$url = isset($_REQUEST["url"]) ? $_REQUEST["url"] : "";
$useragent = isset($_POST["useragent"]) ? $_POST["useragent"] : ""; // will only be a POST from search form
ShowDebug("useragent posted from search form = $useragent");
// set the user-agent we will surf with. We only set on initial search and then use a session to pass this var to any
// other content passed through the proxy. Make sure you have session cookies enabled for your proxy page!
if(!empty($useragent)){
if($useragent=="us"){
$surf_useragent = $_SERVER["HTTP_USER_AGENT"]; // use current agent
}else if($useragent=="ie"){
$surf_useragent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)"; // use IE 7
}else{ // must be ff as we only have 2 choices!! Add as required
$surf_useragent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 (.NET CLR 3.5.30729)"; // use FF3
}
// set a session for future calls through the proxy
$_SESSION["surf_useragent"] = $surf_useragent;
}else{
$surf_useragent = isset($_SESSION["surf_useragent"]) ? $_SESSION["surf_useragent"] : "";
}
ShowDebug("surf with agent = $surf_useragent");
$err = false;
$msg = "";
$content = "";
$subpathurl ="";
$pathurl = "";
$siteurl = "";
// this list contains domains that this proxy will allow obviously in your own proxy you can remove this!!
$whitelist = "technicallypolitical.com,strictly-software.com,infowars.com,prisonplanet.com,hashemian.com";
$cansearch = false;
ShowDebug("url = $url");
ShowDebug("useragent = $useragent");
ShowDebug("PROXYURL = $PROXYURL");
if(!empty($url)){
ShowDebug("url = $url");
// make sure its valid with a protocol at the start
if($url == "http://"){
$err = true;
$msg = "Please specify a full URL to access e.g http://www.darkpolitricks.com";
}else if(!preg_match("/https?:\/\//",$url)){
$err = true;
$msg = "Please specify the protocol within the URL e.g http://";
ShowDebug("error = $msg");
}else{
ShowDebug("get content from remote url $url");
if(!empty($whitelist)){
// check whether url is allowed
$allowed = explode(",",$whitelist);
$count = count($allowed);
$lowurl = strtolower($url);
ShowDebug("check whether $lowurl is in whitelist of $whitelist");
foreach($allowed as $val){
ShowDebug("check whether ".$val." is in $url");
if( strripos($lowurl, $val) !== false){
ShowDebug("This url $url is on whitelist matching $val");
$cansearch = true;
break;
}
}
}else{
$cansearch = true;
}
if(!$cansearch){
$err = true;
$msg = "The url is not allowed to be accessed from this web proxy server.";
}else{
// crawl item e.g URL, script, CSS, image
$html = mycrawler_single($url,$surf_useragent);
$content = $html["html"];
$status = $html["status"];
$headers = $html["header"];
$content_type = $html["content_type"];
$connect_error = $html["message"];
ShowDebug("connect error = $connect_error");
ShowDebug("status = $status");
// a status code of 200 means we got a successful response back; if we didn't then we have an issue
if($status!="200"){
// 404 = Page not found
if($status=="404"){
$err = true;
$msg = "The specified URL could not be located.";
}else if(!empty($connect_error)){
$err = true;
$msg = $connect_error;
ShowDebug("CONNECT ERROR = $connect_error; msg = $msg");
}else{
$err = true;
$msg = "The specified URL could not be returned due to a status code of $status.";
}
}else{
// need to replace all links in our returned content with links to the proxy so that future clicks are proxified
$urlinfo = parse_url($url);
// get root url to extend any relative links e.g http://www.mysite.com
$siteurl = $urlinfo["scheme"]."://".$urlinfo["host"];
if(!empty($urlinfo["path"])){
$pathurl = $siteurl.$urlinfo["path"];
// make sure file is removed in case we need current sub directory
$pospath = strripos($pathurl, "/");
if($pospath!==false){
ShowDebug( "take up to / as pos $pospath in $pathurl<br />");
$subpathurl = substr($pathurl,0,$pospath);
}else{
$subpathurl = $pathurl."/";
}
}else{
$pathurl = $siteurl;
$subpathurl = $pathurl."/";
}
ShowDebug("SiteURL = $siteurl path = $pathurl");
// for text related content we scan for links so that we can change them all to go through our proxy
// for images and other non textual content we have no need to change the links
if(preg_match("/(text|html|xml|xhtml|css|javascript)/i", $content_type )){
//if(preg_match("/(text|html|xml|xhtml)/", $content_type )){
ShowDebug("parse links");
// make sure all links are rerouted through proxy
$content = reformat_links($content,$siteurl,$subpathurl);
}
// As all links/src values from the page we visit also pass through the proxy we need to ensure we output the
// correct header for each file. For example a PNG image needs to have the correct header e.g image/png
ShowDebug("output content-type: $content_type");
header( $content_type );
ShowDebug("output content = $content");
// output content to screen
echo $content;
}
}
}
}else{
// default url to http://
$url = "http://";
}
// Will return the current location of the script running. If the proxy page is moved around a lot then this
// will work out where it is but for performance set the value at the top in $PROXYURL
function get_current_location(){
$url = "";
if( $_SERVER["SERVER_PORT"]== 443){
$protocol = "https://";
}else{
$protocol = "http://";
}
$url = $protocol . $_SERVER["SERVER_NAME"] . $_SERVER["SCRIPT_NAME"];
return $url;
}
// retrieve link destinations and modify them so that when they are clicked the content is passed through the proxy
// as well. I look for src/href tags. Currently this does not handle URLs defined like so href="../"
function reformat_links($content,$siteurl,$subpathurl){
// need to make all URLs go through our proxy! Use ISAPI rewriting to make it nicer; this is just a guide
global $PROXYURL;
$relurl = $PROXYURL . "?url=" .$siteurl; // for urls like url="/sub/page.htm"
$cururl = $PROXYURL . "?url=" .$subpathurl; // for urls like url="page.htm"
$absurl = $PROXYURL . "?url="; // for urls like url="http://www.mysite.com/page.htm"
ShowDebug("reformat rel urls = $relurl");
ShowDebug("reformat cur urls = $cururl");
ShowDebug("reformat abs urls = $absurl");
$newcontent = $content;
// get all links and reformat
// as we don't want to process the same links multiple times (which can happen) I use placeholders first and then
// once every possible location has been marked I insert the link to the proxy
// look for absolute urls e.g url="http://www.mysite.com/blah.asp"
$newcontent = preg_replace("/((?:href|src)=['\"])(http.*?)(['\"])/i","$1##ABSURL##$2$3",$newcontent);
// get links starting with / e.g url="/sub/page.htm"
$newcontent = preg_replace("/((?:href|src)=['\"])(\/.*?)(['\"])/i","$1##RELURL##$2$3",$newcontent);
// get links starting like url="page.htm"
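// (the character classes below are a crude filter: they skip values that already begin with a placeholder or
// in-page anchor (#), a root-relative path (/) or anything that looks like "http...", so only
// document-relative links like "page.htm" get tagged)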
$newcontent = preg_replace("/((?:href|src)=['\"])([^#h\/][^#t][^t][^p].*?)(['\"])/i","$1##CURURL##$2$3",$newcontent);
// now replace placeholders
$newcontent = str_replace("##RELURL##",$relurl,$newcontent);
$newcontent = str_replace("##CURURL##",$cururl,$newcontent);
$newcontent = str_replace("##ABSURL##",$absurl,$newcontent);
ShowDebug("return content");
return $newcontent;
}
// code to load remote content such as HTML files, CSS, Images etc
// To follow more than 3 redirects (e.g ISAPI rewrites then change $maxredirs=XX)
function mycrawler_single($url, $useragent="",$timeout=10, $maxredirs=3)
{
ShowDebug( "IN mycrawler_single Get URL content from $url $useragent maxredirs = $maxredirs");
$urlinfo = parse_url($url);
if (empty($urlinfo["scheme"])) {$urlinfo = parse_url("http://".$url);}
if (empty($urlinfo["path"])) {$urlinfo["path"]="/";}
if (empty($urlinfo["port"]))
{
switch($urlinfo["scheme"])
{
case "http":
$urlinfo["port"] = 80;
break;
case "https":
$urlinfo["port"] = 443;
break;
}
}
// if no agent is supplied use default agent
if (empty($useragent)) $useragent = $_SERVER["HTTP_USER_AGENT"];
ShowDebug("useragent to use = $useragent");
if (isset($urlinfo["query"]))
{
$request = "GET ".$urlinfo["path"]."?".$urlinfo["query"]." ";
} else {
$request = "GET ".$urlinfo["path"]." ";
}
// form request
$request .= "HTTP/1.0\r\n";
$request .= "Host: ".$urlinfo["host"]."\r\n";
$request .= "User-Agent: ".$useragent."\r\n";
$request .= "Connection: close\r\n\r\n";
ShowDebug( "request = ".$request);
ShowDebug( "open ".$urlinfo["host"].":".$urlinfo["port"]);
$fp = @fsockopen($urlinfo["host"], $urlinfo["port"], $errno, $errstr, $timeout);
if (!$fp)
{
ShowDebug( "ERROR! (".$errno.")".$errstr);
$urlinfo["header"] = "";
$urlinfo["html"] = "Error: $errno $errstr";
$urlinfo["status"] = 400; // bad request
$urlinfo["content_type"] = "";
$urlinfo["message"] = "The request could not be made. $errno $errstr";
return $urlinfo;
}
else
{
ShowDebug($request);
fwrite($fp, $request);
while (!feof($fp))
{
if(isset($data)){
$data .= fgets($fp, 4096);
}else{
$data = fgets($fp, 4096);
ShowDebug( "take status code from 9,4 in data = ".$data);
// status code should be here! if not it's a bad request
$code = trim(substr($data,9,4));
ShowDebug( "Status Code = ".$code);
}
}
ShowDebug( "Status Code = ".$code);
// if no status code default to 400 = Bad Request
if(empty($code) || !is_numeric($code)){
$code = 400;
ShowDebug("default to bad request 400");
}
ShowDebug("status code = $code - response = $data");
fclose($fp);
$tmp = explode("\r\n\r\n", $data, 2);
// We will return an array with these parts header, html, status code and content-type
$urlinfo["header"] = $tmp[0];
$urlinfo["html"] = $tmp[1];
$urlinfo["status"] = $code;
$urlinfo["content_type"] = get_content_type($tmp[0]);
$urlinfo["message"] = "";
ShowDebug( "The Status Code = ".$urlinfo["status"]." from header: ".$urlinfo["header"]);
// handle redirects
ShowDebug( "do we need to redirect? pos of location in header = ". stripos($urlinfo["header"], "location:"). " maxredirs = $maxredirs");
if ((stripos($urlinfo["header"], "location:")) && ($maxredirs > 0))
{
ShowDebug( "found location in header and we CAN REDIRECT");
preg_match("/\r\nlocation:(.*)/i", $urlinfo["header"], $match);
if ($match)
{
$redirect = trim($match[1]);
ShowDebug( "Redirecting to ".$redirect);
ShowDebug( "$maxredirs is currently $maxredirs");
$maxredirs--;
ShowDebug( "$maxredirs after count down is now $maxredirs");
ShowDebug( "DO A REDIRECT TO $redirect");
return mycrawler_single($redirect, $useragent, $timeout, $maxredirs);
}
}
ShowDebug( "RETURN FROM mycrawler_single");
// return array of header/html
return $urlinfo;
}
}
// will check headers for the content-type. We need this so that images are displayed correctly
function get_content_type($headers){
$content_type = "";
if(!empty($headers)){
$headerarray = explode("\r\n", $headers);
foreach($headerarray as $head){
ShowDebug( "header item = ".$head);
if(preg_match("/Content-Type: .+$/i",$head)){
$content_type = $head;
break;
}
}
}
ShowDebug("return $content_type");
return $content_type;
}
// Debug function. If you want to show debug messages e.g when testing your proxy then set $DEBUG = true at the top of the page.
// For performance all ShowDebug statements should be removed in production to reduce unnecessary function calls
function ShowDebug($msg){
global $DEBUG;
if(!$DEBUG) return;
if(!empty($msg)){
echo htmlentities($msg)."<br />";
}
}
if(empty($url) || $url=="http://" || $err){
?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="en-US">
<head>
<title>Dark Politricks Web Proxy Example</title>
<meta content='text/html; charset=UTF-8' http-equiv='Content-Type'/>
<meta name="keywords" content="DarkPolitricks, WebProxy, Proxy, Proxies, Proxi, Proxied, Forwarded-For" />
<meta name="description" content="An example of a web proxy, how you can make your own web proxy to bypass basic filtering" />
<!-- Put all these in an external stylesheet -->
<style>
body{background:lightblue;}
p{font-weight:bold;}
.error{color:red;}
.msg{color:green;}
#main{margin:auto;width:600px;}
#search{margin:auto;width:600px;}
label{font-weight:bold;font-family:Tahoma,Arial;}
#url{width:300px;}
#searchflds{border:1px solid black;}
dt{float:left;}
dd{float:left;}
#domainlist{font-style:italic;color:navy;}
#searchbutton{text-align:right;}
#agent{clear:both;}
.agent{margin-top:10px;}
#ie{margin-left:-12px;}
</style>
</head>
<body>
<div id="main">
<h1>Example of a WebProxy</h1>
<?php
if(!empty($msg)){
if($err){
echo "<p class='error'>$msg</p>";
}else{
echo "<p class='msg'>$msg</p>";
}
}
?>
<p>This is an example page and can only be used to access the following domains:</p>
<p id="domainlist">technicallypolitical.com, strictly-software.com, infowars.com, prisonplanet.com</p>
<p>Please read the related article at <a href="http://www.darkpolitricks.com/2009/12/create-your-own-web-proxy-server" title="Create your own web proxy">www.darkpolitricks.com</a> to get more information as well as a link to download the code so that you can create your own web proxy.</p>
<div id="search">
<form id="searchanon" name="searchanon" method="POST">
<fieldset id="searchflds">
<dl>
<dt><label for="where">Where To</label></dt>
<dd><input type="text" id="url" name="url" value="<?php echo $url ?>" maxlength="100" />
</dl>
<dl id="agent">
<dt class="agent"><label for="useragent">User-Agent</label></dt>
<dd class="agent"><input type="radio" name="useragent" id="ie" value="ie" <?php if($useragent=="ie"){ echo 'checked="true"'; } ?> /><label for="ie" title="Use IE 7 user-agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)">IE 7</label>
<input type="radio" name="useragent" id="ff" value="ff" <?php if($useragent=="ff"){ echo 'checked="true"'; } ?> /><label for="ff" title="Use FireFox 3 user-agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 (.NET CLR 3.5.30729)">FireFox 3</label>
<input type="radio" name="useragent" id="us" value="us" <?php if($useragent=="us"){ echo 'checked="true"'; } ?> /><label for="ff" title="Keep existing agent: <?php echo $_SERVER["HTTP_USER_AGENT"] ?>">Keep Existing User-Agent</label>
</dd>
</dl>
</fieldset>
<p id="searchbutton"><input type="submit" value="Go There" id="submitsearch" name="submitsearch" />
</form>
</div>
</div>
</body>
</html>
<?php
}
?>
Also, how do I remove the restriction so that any website can be viewed, not just the domains in:
$whitelist = "technicallypolitical.com,strictly-software.com,infowars.com,prisonplanet.com,hashemian.com";
I removed the above-mentioned URLs from $whitelist and it worked: I was able to view Google, but then the 400 error started appearing.
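For reference, the change I made was roughly the following (assuming an empty string is the intended way to disable the check, since the script only runs the whitelist loop inside if(!empty($whitelist)) and otherwise sets $cansearch = true):
// leave the whitelist empty so the !empty($whitelist) branch is skipped and every URL is allowed
$whitelist = "";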
How would you change the code, and where?