  1. #1
    SitePoint Member
    Join Date
    May 2008
    Posts
    11

    script to run search query

    hi,

    I am trying to write a script that does the following. I have no idea how to write it, but I do know how it should work.

    I have a list of words in an Excel file (almost 5000 words). I need to take each word, put it in the search query of a search engine (for example ask.com), and store the results that appear under "related words", "suggested words", or "narrow your search" in a file.

    Example:
    If you search for the word "forum" in ask.com, we can see the url as
    ask.com/web?q=forum&search=search&qsrc=0&o=0&l=dir

    and the search result page contains "Narrow Your Search"

    Roman Forum
    Free Forum
    ..
    ...
    etc.

    So I need a script that takes each word from my Excel list in turn, puts it in the URL in place of the search term, and stores only the results under "Narrow Your Search" in a file. For the next word the URL would be:
    ask.com/web?q=nextword&search=search&qsrc=0&o=0&l=dir

    Please let me know if this is possible and whether there are any examples I can look at. I need to complete this as soon as possible.

    I can explain it in more detail if the above is not clear.

    Regards
    Last edited by pragan; Jun 18, 2008 at 14:18.

  2. #2
    SitePoint Evangelist superuser2's Avatar
    Join Date
    Aug 2006
    Posts
    598
    You'll want to look into cURL for fetching the pages.

    Pulling the related words out of the fetched page is HTML parsing. I'm not familiar with it myself, so try googling for it.
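
    To make the cURL suggestion a bit more concrete, here is a minimal sketch of the two pieces involved: building the search URL for each word and fetching a page. The function names build_search_url() and fetch_page() are made up for this example, and the query parameters are simply copied from the URL in the first post, so treat it as a starting point rather than a finished script.

```php
<?php
// Hypothetical helper: build the search URL for one word.
// The parameters are the ones from the URL quoted in the first post.
function build_search_url($word) {
    $params = array(
        'q'      => $word,
        'search' => 'search',
        'qsrc'   => 0,
        'o'      => 0,
        'l'      => 'dir',
    );
    // http_build_query() URL-encodes each value, so a space in the word cannot break the URL
    return 'http://www.ask.com/web?' . http_build_query($params, '', '&');
}

// Hypothetical helper: fetch a page with cURL and return its HTML, or false on failure.
function fetch_page($url) {
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($curl, CURLOPT_TIMEOUT, 10);          // give up after 10 seconds
    $html = curl_exec($curl);
    curl_close($curl);
    return $html;
}

// Stand-in for the word list; in practice this would come from the Excel export
$words = array('forum', 'free forum');
foreach ($words as $word) {
    echo build_search_url($word), "\n";
    // $html = fetch_page(build_search_url($word)); // uncomment to actually request the page
}
```

    So yes, cURL only does the fetching. Changing the search term is just string building in a loop, and saving the result is a separate step once the page has been parsed.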

  3. #3
    SitePoint Member
    Join Date
    May 2008
    Posts
    11
    Thanks for the reply. I don't understand whether cURL does what I want to achieve. I mean, can it change the search term in the URL using my Excel list and then save the result page?

    Please guide me more.

    Thanks.

  4. #4
    SitePoint Member
    Join Date
    May 2008
    Posts
    11
    any help??

  5. #5
    SitePoint Addict silentcollision's Avatar
    Join Date
    Jun 2006
    Location
    New Zealand
    Posts
    388
    I'm by no means an expert, but I'll run through what I would do.

    Firstly, I'd make a MySQL table and insert all the words into it. It should have a field for an id, one for the word, another for the results, and one more to tell whether the word has been indexed or not. You can import .csv files in phpMyAdmin, and you can export your Excel document as a .csv file (although I've never actually done this).
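
    If you would rather read the exported .csv file from PHP instead of importing it through phpMyAdmin, something along these lines should work. It is only a sketch: read_words() is a made-up name, and it assumes the words sit in the first column of the export.

```php
<?php
// Hypothetical helper: read the first column of a CSV export into an array of words.
function read_words($csvPath) {
    $words = array();
    $handle = fopen($csvPath, 'r');
    if ($handle === false) {
        return $words; // file could not be opened
    }
    while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
        // fgetcsv() returns array(null) for a blank line, so skip empty cells
        if (isset($row[0]) && trim($row[0]) !== '') {
            $words[] = trim($row[0]);
        }
    }
    fclose($handle);
    return $words;
}

// Demo with a temporary file standing in for the exported spreadsheet
$path = tempnam(sys_get_temp_dir(), 'words');
file_put_contents($path, "forum\ncartoon\n\nfree forum\n");
print_r(read_words($path));
unlink($path);
```

    Each word in the returned array can then be inserted into the table with its indexed field set to 0.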

    Then, in your PHP script, you'll need a function to retrieve the page contents. As posted above, you could use cURL, but you could use file_get_contents() as well. You could use something like this for your cURL.

    Once you've retrieved the page contents, the narrow-your-search items are in a div tag with the id 'narrow'.

    You'll have to use regex to find this area, and then look for the words within it. I'm useless with regex, but you might be able to make use of the following:

    Code:
    ;">Free Web <b>Forum</b></a>
    If you find the text between the <a> and the </a> and strip the tags (strip_tags()), you'll have your word. Put these in an array (maybe look at preg_match_all()), implode() it, and then insert this into the database with your relevant word.
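
    As a rough illustration of that step, the sketch below pulls the text out of every link in a piece of HTML using preg_match_all() and strip_tags(), then joins the results with implode(). The sample markup is invented to resemble the fragment above, and link_texts() is a made-up name, so the pattern will probably need adjusting against the real page.

```php
<?php
// Sample markup only: the real page structure is an assumption based on the fragment above.
$html = '<div id="narrow">'
      . '<a href="/web?q=roman+forum" style="x;">Roman <b>Forum</b></a>'
      . '<a href="/web?q=free+web+forum" style="x;">Free Web <b>Forum</b></a>'
      . '</div>';

// Hypothetical helper: return the plain text of every link in $html.
function link_texts($html) {
    // Non-greedy (.*?) stops at the first </a>; the s flag lets . match newlines
    preg_match_all('|<a\b[^>]*>(.*?)</a>|is', $html, $matches);
    $texts = array();
    foreach ($matches[1] as $inner) {
        $texts[] = trim(strip_tags($inner)); // drop inner tags such as <b>
    }
    return $texts;
}

$keywords = link_texts($html);
echo implode(', ', $keywords), "\n"; // one string, ready to store in a single database field
```

    Narrowing $html down to just the 'narrow' div before calling this would keep links from the rest of the page out of the results.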

    You could put all of this inside a loop scheduled as a Cron Job, doing several words per minute (or more depending on your server/host), and you'll be done in no time.

    Hope that helps. I'm not sure about the legality of using ask.com in such a way, and I don't know the appropriate regex. I'm sure someone else can help you more.
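
    If the regex turns out to be too fiddly, PHP's DOM extension is an alternative way to get at the links. This is a different technique from the regex approach described above, shown as a sketch only: links_in_element() is a made-up name, the sample HTML is invented, and the id 'narrow' is taken from what I saw on the page, which may change.

```php
<?php
// Hypothetical helper: return the text of each link inside the element with the given id.
// $id is assumed to be a trusted, hard-coded value (it is placed directly into the XPath).
function links_in_element($html, $id) {
    $doc = new DOMDocument();
    // Real-world pages are rarely valid HTML; collect parser warnings instead of printing them
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $texts = array();
    foreach ($xpath->query('//*[@id="' . $id . '"]//a') as $node) {
        $texts[] = trim($node->textContent); // text of the link without any inner tags
    }
    return $texts;
}

// Invented sample page: two links inside the target div, one outside it
$sample = '<html><body><div id="narrow">'
        . '<a href="#">Roman <b>Forum</b></a><a href="#">Free <b>Forum</b></a>'
        . '</div><a href="#">Unrelated</a></body></html>';
print_r(links_in_element($sample, 'narrow'));
```

    The parser copes with nested tags and attribute order on its own, which is the main reason to consider it over a hand-written pattern.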

    Edit: I was bored, and so I wrote this. My regex isn't very good but it might get you started.

    Code php:
    <?php
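    $word = 'forum';
    // urlencode() keeps multi-word terms and special characters from breaking the query string
    $url = 'http://www.ask.com/web?q=' . urlencode($word) . '&search=search';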
     
    # Function I showed earlier
    function content($url) {
     
    	// Try file_get_contents() first and keep the result, so the page is only requested once
    	$html = @file_get_contents($url);
     
    	// Fall back to cURL if that returned nothing
    	if(!$html) {
     
    		$curl = curl_init();
     
    		// Setup headers - I used the same headers from Firefox version 2.0.0.6
    		// below was split up because php.net said the line was too long. :/
    		$header = array();
    		$header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
    		$header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    		$header[] = "Cache-Control: max-age=0";
    		$header[] = "Connection: keep-alive";
    		$header[] = "Keep-Alive: 300";
    		$header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    		$header[] = "Accept-Language: en-us,en;q=0.5";
    		$header[] = "Pragma: "; // browsers keep this blank.
     
    		curl_setopt($curl, CURLOPT_URL, $url);
    		curl_setopt($curl, CURLOPT_USERAGENT, 'Googlebot/2.1 (+http://www.google.com/bot.html)');
    		curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
    		curl_setopt($curl, CURLOPT_REFERER, 'http://www.google.com');
    		curl_setopt($curl, CURLOPT_ENCODING, 'gzip,deflate');
    		curl_setopt($curl, CURLOPT_AUTOREFERER, true);
    		curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    		curl_setopt($curl, CURLOPT_TIMEOUT, 10);
     
    		$html = curl_exec($curl); // execute the curl command
    		curl_close($curl); // close the connection
     
    	}
     
    	return $html; // and finally, return $html
    }
     
    preg_match_all('|<div class="zm">.*</div>|', content($url), $matches);
     
    echo '<pre>';
    print_r($matches);
    echo '</pre>';
    ?>
    Last edited by silentcollision; Jun 19, 2008 at 18:40.

  6. #6
    SitePoint Addict silentcollision's Avatar
    Join Date
    Jun 2006
    Location
    New Zealand
    Posts
    388
    Ok, so I was very bored, and I wanted some practice writing classes. It's not as good as others could write (the regex is terrible), but it does work.

    Download

    The files, in case someone doesn't trust me.

    index.php
    Code php:
    <?php
    /*
    	* index.php
    	* Chris Myers
    	* A class for generating keywords similar to other keywords
    		* Written by Chris Myers @ SquaredProductions.com
    		* All original unless where stated
    	* June, 2008
     
    	* Instructions
    		* Open up classes/class.database.php
    			* Insert your host, username, password, and database
     
    		* Open settings.php
    			* You can edit the $script array from here
    			* You can leave them as they are, but it is recommended you have a table prefix (k_ by default)
    			* In settings.php you can edit the number of words to index per tick
    				* I've set this to two, but even on a shared host you could probably run up to 5 or 10
    				* If you run this once every OTHER minute, at 5 words, that's 150 words per hour
    				* That means that 5000 words will be finished in 33 1/3 hours.
    				* You could make this a LOT faster if you want
     
    		* Upload all the files
    			* No chmoding or anything is necessary
     
    		* Open your browser, and locate the directory which contains the files
    			* To install, you MUST run index.php?mode=install
    			* This sets up the database tables - the script won't run without them
     
    		* From then, you can set a cron to run index.php (without any $_GET requests) as normal
     
    		* Other notes
    			* If you want to uninstall, just point your browser to index.php?mode=uninstall
    			* If you just want to see the results for a single word (maybe it didn't work?),
    			  point your browser to index.php?mode=word&word=xxxx where xxxx is your word
    			  The script should print an array of the values received for that word
     
    */
     
    # Include the database and settings
    include('settings.php');
    include('classes/class.database.php');
     
    # Initiate the database (the DB() constructor runs automatically and opens the connection)
    $link = new DB;
     
    # Require the new class
    include('classes/class.keywords.php');
     
    # Initiate the keywords class
    $keywords = new keywords;
     
    # Work out which mode was requested; a cron run passes no $_GET values, so default to ''
    $mode = isset($_GET['mode']) ? $_GET['mode'] : '';
     
    # Check whether we're supposed to run installation or uninstallation
    if($mode == 'install') {
     
    	# Run the installation
    	$keywords->install($script['tables_install']);
    	echo 'Successfully installed';
     
    	# Check the PHP version
    	if (version_compare(phpversion(), '5.0.0') === -1) {
    		echo '<br><br>This script has not been tested on your PHP version! (' . phpversion() . '). It may not function as expected.<br>';
    	}
     
    	# Insert the introductory values if its defined as such
    	if($script['initial_insert'] == true) {
    		$link->query("INSERT INTO " . $script['table_prefix'] . "words (id , word , keywords , indexed ) VALUES ( '1', 'forum', '', '0' ), ( '2', 'cartoon', '', '0' )");
    	}
     
    } elseif($mode == 'uninstall') {
     
    	$keywords->uninstall($script['tables_uninstall']);
    	echo 'Successfully uninstalled';
     
    # Check for a specific word
    } elseif($mode == 'word') {
     
    	echo '<pre>';
    	print_r($keywords->words($_GET['word']));
    	echo '</pre>';
     
    } else {
     
    	# We'll run the script as normal
    	$query = mysql_query("SELECT * FROM " . $script['table_prefix'] . "words WHERE indexed=0 " . $script['sql_order'] . "LIMIT " . $script['number_words']);
    	if(mysql_num_rows($query)) {
    		while($row = mysql_fetch_array($query)) {
    			$link->query("UPDATE " . $script['table_prefix'] . "words SET keywords='" . mysql_real_escape_string(implode($script['sql_implode'], $keywords->words($row['word']))) . "', indexed=1 WHERE id='" . $row['id'] . "' LIMIT 1");
    		}
    	}
     
    }
     
    ?>

    settings.php
    Code php:
    <?php
    /*
    	* settings.php
    	* Chris Myers
    	* A class for generating keywords similar to other keywords
    		* Define some settings for general use
    	* May, 2008
    */
     
    # A table prefix if you want one
    $script['table_prefix'] = 'k_';
     
    # A list of tables and their requirements to install
    $script['tables_install'] = array($script['table_prefix'] . "words" => "id int(11) NOT NULL auto_increment, word varchar(48) NOT NULL, keywords text NOT NULL, indexed int(1) NOT NULL default 0, PRIMARY KEY(id)");
     
    # A list of tables to uninstall if necessary
    $script['tables_uninstall'] = array($script['table_prefix'] . 'words');
     
    # The number of words to index each time the script runs
    $script['number_words'] = 2;
     
    # SQL to order the queries (optional)
    $script['sql_order'] = "ORDER BY RAND() ";
     
    # SQL insert implosion
    $script['sql_implode'] = ', ';
     
    # Insert some opening words into the database?
    $script['initial_insert'] = true;
    ?>

    classes/class.keywords.php
    Code php:
    <?php
    /*
    	* class.keywords.php
    	* Chris Myers
    	* A class for generating keywords similar to other keywords
    		* 
    	* May, 2008
    */
     
    class keywords {
     
    	/*======================================================================*\
    	Function:	words($word)
    	Purpose:	Provide all words which respond to narrowed searches for $word
    	Input:		$word - the word to search for
    	Output:		An array of values
    	\*======================================================================*/
    	function words($word) {
     
    		# Open an array for the final results
    		$this->keywords = array();
     
    		# Try and retrieve the contents from a provided url (urlencode() keeps multi-word terms intact)
    		$this->url = 'http://www.ask.com/web?q=' . urlencode($word) . '&search=search';
    		$this->html = $this->content($this->url);
     
    		# Find the matches
    		preg_match_all('|<div class="zm">.*</div>|', $this->html, $this->matches);
     
    		# Run through all the matches
    		for($this->i = 0; $this->i <= (count($this->matches[0]) - 2); $this->i++) {
     
    			# Get the matches as an array by exploding the regex we used to find them
    			$this->matches[0][$this->i] = explode('<div class="zm">', $this->matches[0][$this->i]);
     
    			# Run through each of the matches sets, check its not an empty value, and add it to our final array
    			foreach($this->matches[0][$this->i] as $this->insert) {
    				if(!empty($this->insert)) {
    					$this->keywords[] = strip_tags($this->insert);
    				}
    			}
     
    		}
     
    		return $this->keywords;
    	}
     
    	/*======================================================================*\
    	Function:	content($url)
    	Purpose:	Retrieve the HTML of a website
    	Input:		$url - the http location of a website
    	Output:		The HTML of the website
    	Credit:		http://php.benscom.com/manual/fr/ref.curl.php#78046
    	\*======================================================================*/
    	function content($url) {
     
    		// Try file_get_contents() first and keep the result, so the page is only requested once
    		$this->html = @file_get_contents($url);
     
    		// Fall back to cURL if that returned nothing
    		if(!$this->html) {
     
    			$this->curl = curl_init();
     
    			// Setup headers - I used the same headers from Firefox version 2.0.0.6
    			// below was split up because php.net said the line was too long. :/
    			// Reset the array first, otherwise every call appends the same headers again
    			$this->head = array();
    			$this->head[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
    			$this->head[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    			$this->head[] = "Cache-Control: max-age=0";
    			$this->head[] = "Connection: keep-alive";
    			$this->head[] = "Keep-Alive: 300";
    			$this->head[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    			$this->head[] = "Accept-Language: en-us,en;q=0.5";
    			$this->head[] = "Pragma: "; // browsers keep this blank.
     
    			curl_setopt($this->curl, CURLOPT_URL, $url);
    			curl_setopt($this->curl, CURLOPT_USERAGENT, 'Googlebot/2.1 (+http://www.google.com/bot.html)');
    			curl_setopt($this->curl, CURLOPT_HTTPHEADER, $this->head);
    			curl_setopt($this->curl, CURLOPT_REFERER, 'http://www.google.com');
    			curl_setopt($this->curl, CURLOPT_ENCODING, 'gzip,deflate');
    			curl_setopt($this->curl, CURLOPT_AUTOREFERER, true);
    			curl_setopt($this->curl, CURLOPT_RETURNTRANSFER, 1);
    			curl_setopt($this->curl, CURLOPT_TIMEOUT, 10);
     
    			$this->html = curl_exec($this->curl); // execute the curl command
    			curl_close($this->curl); // close the connection
     
    		}
     
    		return $this->html; // and finally, return $this->html
    	}
     
    	/*======================================================================*\
    	Function:	installation functions
    	Purpose:	Install the class system and remove the tables when required
    	Input:		$tables
    	Output:		None
    	\*======================================================================*/
    	function install($tables) {
    		global $link;
    		foreach($tables as $this->table => $this->fields)
    		$link->query("CREATE TABLE " . $this->table . " (" . $this->fields . ")");
    	}
     
    	function uninstall($tables) {
    		global $link;
    		$link->query('DROP TABLE ' . implode(', ', $tables));
    	}
     
     
    }
    ?>

    classes/class.database.php
    Code php:
    <?php
    /*
    	* class.database.php
    	* Chris Myers
    	* Database file
    		* I didn't write this, and I can't remember where I got it from
    	* May, 2008
    */
     
       class DB {
       function DB() {
           $this->host = "localhost";
           $this->db = "";
           $this->user = "";
           $this->pass = "";
           $this->link = mysql_connect($this->host, $this->user, $this->pass) or die("Database connection failed. Please check your details.");
           mysql_select_db($this->db);
       }
       function query($query) {
           $result = mysql_query($query, $this->link) or print($query . ', ' . mysql_error());
           return $result;
       }
       function fetcharray($query) {
           $result = $this->query($query);
           $result = mysql_fetch_assoc($result);
           return $result;
       }
       function numrows($query) {
           $result = $this->query($query);
           $result = mysql_num_rows($result);
           return $result;
       }
       function close() {
           mysql_close($this->link);
       }
    }
    ?>
    Last edited by silentcollision; Jun 20, 2008 at 17:58.

  7. #7
    SitePoint Member
    Join Date
    May 2008
    Posts
    11
    Hey

    Thanks a lot. Your reply should really help me achieve what I want!
    Nice explanation. I will try this out and reply once I am done.

    thanks again

