-
Dec 10, 2008, 19:08 #1
- Join Date
- Sep 2008
- Location
- New York, NY
- Posts
- 1
Script to search text and produce HTML page
I am not new to programming, but am not an expert either. For an upcoming project, we are looking to do the following:
1. A back-end web app/script that can read text from a blog post (in particular, the keywords in the headline or title of the post or, if the post has tags in a tag cloud, the tags) and then automatically search Google using these keywords or tags. We want to collect the top three (3) Google search results and post them on a separate HTML page.
2. For example, here is a possible blog post:
---
iPhone vs the Google Phone: Who Wins?
iPhone is the best phone. google phone is not the best phone. on and on, ad infinitum, etc etc, this is the blog post here. etc etc.
---
3. The web app/script needs to read the headline and pick out the keywords (i.e. iPhone, Google Phone)- notice that it needs to know to pick out the phrase "Google Phone" and not just 'Google' or 'Phone' separately.
4. Then the app/script will search Google for these keywords and then collect the links for the top three (3) search results and then make a simple HTML page that displays the links for the Google searches.
I know there's a lot of information here, but I just want to know what languages (perl, php?) should be used to do something like this. What is the flow of information and how would such an app/script(s) be built?
Thanks for all your help. This forum is great!
-
Dec 10, 2008, 23:46 #2
- Join Date
- Dec 2005
- Location
- Florida
- Posts
- 150
But you would need a database of keywords then.
A script can't just KNOW that "Google Phone" is a keyword; most humans wouldn't even know.
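The keyword-database idea can be sketched roughly as follows. This is only an illustration in Python (the thread does not settle on a language, so that choice is an assumption), and `KNOWN_PHRASES` and `extract_keywords` are hypothetical names. A hand-maintained phrase list is matched against the headline, longest phrase first, so that "Google Phone" wins over "Google" or "Phone" on their own.

```python
import re

# Hypothetical keyword "database"; in practice this would be a real table or file.
KNOWN_PHRASES = ["iPhone", "Google Phone", "Google", "Phone"]


def extract_keywords(headline, phrases=KNOWN_PHRASES):
    """Return the known phrases found in the headline, in order of appearance."""
    # Longest first, so multi-word phrases beat their component words.
    ordered = sorted(phrases, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in ordered) + r")\b",
        re.IGNORECASE,
    )
    # Map lowercase matches back to the spelling stored in the list.
    canonical = {p.lower(): p for p in phrases}
    found = []
    for match in pattern.finditer(headline):
        phrase = canonical[match.group(0).lower()]
        if phrase not in found:
            found.append(phrase)
    return found


print(extract_keywords("iPhone vs the Google Phone: Who Wins?"))
```

Anything not in the list is simply ignored, which is the limitation pointed out above: the script only "knows" the phrases someone has put into the database.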
-
Dec 11, 2008, 11:46 #3
- Join Date
- Nov 2004
- Location
- Moon Base Alpha
- Posts
- 1,053
Perl or PHP should both be able to handle your requirements. It would have to be a CGI script, obviously. So you start with an HTML form that the user types his/her search query into. That data is sent to the server where the searching program is located. A CGI program gathers the data and parses it into memory for your search program to use. It then opens the blog and parses through the text on the page, looking for matching phrases. When it finds a match, it stores the appropriate data, which is eventually returned to the client that called the script in the first place.
There are too many ways to write a search program to give you any specific suggestions. You can do simple brute-force word/pattern matches and hope they return relevant matches, or you can apply ever more complicated algorithms that make sure the search results are as targeted as possible.
You can download search programs and check the source code to see how they are written.
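For the flow asked about in the first post (keywords in, top three links out, HTML page written), the last step might look something like the Python sketch below. Everything named here is an assumption made for illustration: `build_results_page` is a hypothetical helper, and `fake_search` is a stand-in that returns made-up example.com URLs. However the real results are fetched, that part would replace `fake_search`.

```python
from html import escape


def build_results_page(keywords, search, limit=3):
    """Render a simple HTML page linking to the top `limit` results per keyword."""
    parts = ["<html><body>"]
    for keyword in keywords:
        parts.append("<h2>" + escape(keyword) + "</h2>")
        parts.append("<ul>")
        # `search` is any callable that returns a list of URLs for a keyword.
        for url in search(keyword)[:limit]:
            safe = escape(url, quote=True)
            parts.append('<li><a href="' + safe + '">' + safe + "</a></li>")
        parts.append("</ul>")
    parts.append("</body></html>")
    return "\n".join(parts)


def fake_search(keyword):
    # Placeholder only: made-up URLs, not real search results.
    slug = keyword.lower().replace(" ", "-")
    return ["http://example.com/%s/%d" % (slug, n) for n in range(1, 6)]


page = build_results_page(["iPhone", "Google Phone"], fake_search)
print(page)
```

Passing the search function in as an argument keeps the page-building step testable without touching the network.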
-
Jan 30, 2009, 15:19 #4
For parsing, I think the best language is Rebol: you don't need to cope with regular expressions as in Perl or PHP (PHP itself is often said to have started out as a set of Perl scripts), because PARSE is one of the most powerful features in Rebol. Its capabilities range from simple string splitting to matching against parse rules. PARSE is Rebol's basis for pattern matching, the job that regular expressions do in other languages. I find this much more natural.
Search Google for "rebol parse html" and you should find an example from someone who, I think, wants to do something similar to you: How-to-properly-parse-HTML-and-XHTML-Meta-Tags-td19448593
-
Jan 30, 2009, 16:49 #5
-
Jan 31, 2009, 03:47 #6
Was it a sandwich
OK, just a little comparison with Perl on a real-world problem from a SitePoint member here, http://www.sitepoint.com/forums/showthread.php?t=594722, who got no answer to his question from any of you:
"Using perl regular expression i need to read the text between a xml tag. like <title>this is test</title>"
The problem is: what happens if there are several newlines etc. between the two tags?
In Rebol you don't have to bother, as you would just do:
parse html [thru "<title>" copy title to "</title>"]
The expression in brackets is called a rule, and it reads almost like plain English. The string, even if it contains newlines, will be stored in the variable title (or any other variable you choose, of course). By the way, Rebol can store strings with newlines without bothering with escape sequences as in many other languages. You can have a string like this:
title: {This is the title of my Blog
This is the sub-title of my Blog}
To come back to Rebol's powerful PARSE feature: it is so powerful that you can create a full language with it, which is what is today called a Domain Specific Language, promoted by Microsoft and soon others as a good alternative to UML for building Domain Application Architecture. That is in fact the main reason I was attracted to Rebol, because professionally I'm a Functional Software Architect and Project Manager for a Fortune 500 company. I do not recommend Rebol as a language for building scalable software, as it still lacks some features like namespaces (I mentioned that in the C++ OOps thread), but I do recommend it as a tool for generating code for other languages like Java, ASP.NET, PHP, XML, UML, etc.
-
Jan 31, 2009, 13:27 #7
- Join Date
- Nov 2004
- Location
- Moon Base Alpha
- Posts
- 1,053
He got his answer on another forum, so why would I want to answer it here too?
-
Jan 31, 2009, 14:09 #8
- Join Date
- Nov 2004
- Location
- Moon Base Alpha
- Posts
- 1,053
btw, all he needed to do was add the "s" option to his regexp to make it work:
Code:
# The "s" option lets "." match newlines; [^>]* keeps the opening tag from over-matching.
$pageTitle = $1 if $file_content =~ m/<title[^>]*>(.*?)<\/title>/s;
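For comparison, here is a rough Python equivalent of the same idea, offered as a sketch rather than a drop-in replacement. `re.DOTALL` plays the role of Perl's "s" option, letting `.` match newlines. The helper name `page_title` and the sample document are invented for illustration. The pattern uses `[^>]*` rather than `.*` inside the opening tag so that any attributes are skipped without running past the closing `>`.

```python
import re

# DOTALL: "." also matches newlines, like the "s" option in Perl.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.DOTALL | re.IGNORECASE)


def page_title(file_content):
    """Return the text between <title> and </title>, or None if there is no title."""
    match = TITLE_RE.search(file_content)
    return match.group(1).strip() if match else None


sample = "<html><head><title>\nthis is\n test\n</title></head></html>"
print(repr(page_title(sample)))
```

A regex like this is fine for pulling out one well-behaved tag; for arbitrary HTML, a real parser is usually the safer choice.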
-
Feb 2, 2009, 06:56 #9
- Join Date
- Jul 2008
- Posts
- 17
We can provide this for you. PM me about it.