  1. #1
    SitePoint Member
    Join Date
    Sep 2008
    Location
    New York, NY
    Posts
    1

    Script to search text and produce HTML page

    I am not new to programming, but am not an expert either. For an upcoming project, we are looking to do the following:

    1. A back-end web app/script that can read text from a blog post (in particular, the keywords in the headline or title of the post, or the tags if the post has a tag cloud) and then automatically search Google using these keywords or tags. We want to collect the top three (3) Google search results and then post them on a separate HTML page.

    2. For example, here is a possible blog post:

    ---
    iPhone vs the Google Phone: Who Wins?

    iPhone is the best phone. Google Phone is not the best phone. On and on, ad infinitum, etc. etc. This is the blog post here, etc. etc.
    ---

    3. The web app/script needs to read the headline and pick out the keywords (i.e., iPhone and Google Phone). Notice that it needs to know to pick out the phrase "Google Phone" and not just "Google" or "Phone" separately.

    4. Then the app/script will search Google for these keywords, collect the links for the top three (3) search results, and make a simple HTML page that displays those links.
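The flow in steps 1 to 4 can be sketched roughly as follows. This is a minimal illustration in Python rather than Perl or PHP, and it makes several assumptions: `search` is a hypothetical callable standing in for whatever search API ends up being used (no real Google API is called here), `fake_search` returns made-up placeholder results so the sketch runs offline, and the function names are invented for this example.

```python
from html import escape
from typing import Callable, Iterable, List, Tuple

# A search result is a (title, url) pair. How results are fetched is left open:
# SearchFn describes a hypothetical callable standing in for a real search API.
SearchFn = Callable[[str], Iterable[Tuple[str, str]]]


def render_links_page(query: str, results: List[Tuple[str, str]]) -> str:
    """Build a minimal HTML page listing the given (title, url) results."""
    items = "\n".join(
        f'    <li><a href="{escape(url, quote=True)}">{escape(title)}</a></li>'
        for title, url in results
    )
    return (
        "<!DOCTYPE html>\n"
        "<html>\n<head>\n"
        f"  <title>Top results for {escape(query)}</title>\n"
        "</head>\n<body>\n"
        f"  <h1>Top results for {escape(query)}</h1>\n"
        f"  <ol>\n{items}\n  </ol>\n"
        "</body>\n</html>\n"
    )


def build_page(keywords: List[str], search: SearchFn, top_n: int = 3) -> str:
    """Search for the keywords and render the first `top_n` results."""
    # Multi-word keywords are quoted so they are searched as a phrase.
    query = " ".join(f'"{k}"' if " " in k else k for k in keywords)
    results = list(search(query))[:top_n]
    return render_links_page(query, results)


def fake_search(query: str):
    """Placeholder returning made-up results so the sketch runs offline."""
    return [(f"Result {i} for {query}", f"https://example.com/{i}") for i in range(1, 6)]


page = build_page(["iPhone", "Google Phone"], fake_search)
```

Keeping the search step behind a plain function argument means the keyword extraction and the page generation can be written and tested before deciding how the real search results will be obtained.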

    I know there's a lot of information here, but I just want to know what languages (Perl, PHP?) should be used to do something like this. What is the flow of information, and how would such an app/script be built?

    Thanks for all your help. This forum is great!

  2. #2
    SitePoint Zealot execute's Avatar
    Join Date
    Dec 2005
    Location
    Florida
    Posts
    150
    But you would need a database of keywords then.
    A script can't just KNOW that "Google Phone" is a keyword; most humans wouldn't even know.
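One way to picture the keyword database idea is a lookup against a list of known phrases, trying the longest phrases first so that "Google Phone" is preferred over "Google" or "Phone" alone. The sketch below is in Python and is only an illustration: the `KNOWN` list is hypothetical and would in practice come from a database or from the post's own tags.

```python
import re
from typing import Iterable, List


def extract_keywords(headline: str, known_phrases: Iterable[str]) -> List[str]:
    """Return the known phrases found in the headline, in order of appearance.

    Longer phrases are tried first, so "Google Phone" wins over "Google".
    """
    # Sort longest first so the regex alternation prefers multi-word phrases.
    phrases = sorted(known_phrases, key=len, reverse=True)
    if not phrases:
        return []
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in phrases) + r")\b",
        re.IGNORECASE,
    )
    # Map lowercased matches back to the spelling used in the phrase list.
    canonical = {p.lower(): p for p in phrases}
    found: List[str] = []
    for match in pattern.finditer(headline):
        phrase = canonical[match.group(0).lower()]
        if phrase not in found:
            found.append(phrase)
    return found


# Hypothetical keyword list for illustration only.
KNOWN = ["iPhone", "Google", "Google Phone", "Phone"]
print(extract_keywords("iPhone vs the Google Phone: Who Wins?", KNOWN))
# prints ['iPhone', 'Google Phone']
```

The hard part remains building the phrase list; the matching itself is simple once that list exists.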

  3. #3
    SitePoint Wizard bronze trophy KevinR's Avatar
    Join Date
    Nov 2004
    Location
    Moon Base Alpha
    Posts
    1,053
    Perl or PHP should both be able to handle your requirements. It would obviously have to be a CGI script. You start with an HTML form that the user types a search query into. That data is sent to the server where the search program is located. A CGI program gathers the data and parses it into memory for use by your search program, which then opens the blog and parses through the text on the page looking for matching phrases. When it finds a match, it stores the appropriate data, which is eventually returned to the client that called the script in the first place.

    There are too many ways to write a search program for me to give you any specific suggestions. You can do simple brute-force word/pattern matches and hope they return relevant results, or you can apply ever more complicated algorithms to make the search results as targeted as possible.
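As a rough illustration of the brute-force end of that range, the Python sketch below scores each page by how often the query words occur in its text. The function name and the `pages` dictionary layout are assumptions made for this example, and the matching is deliberately crude: it counts substrings, so "phone" also matches inside "iPhone".

```python
from typing import Dict, List, Tuple


def brute_force_search(query: str, pages: Dict[str, str]) -> List[Tuple[str, int]]:
    """Rank pages by how many times the query words occur in their text."""
    words = query.lower().split()
    scores = []
    for name, text in pages.items():
        lowered = text.lower()
        # Plain substring counting: no stemming, no word boundaries, no weighting.
        score = sum(lowered.count(word) for word in words)
        if score > 0:
            scores.append((name, score))
    # Highest score first.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

More targeted approaches would replace the scoring line with something smarter (word boundaries, phrase matching, term weighting) while keeping the same overall loop.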

    You can download search programs and check the source code to see how they are written.

  4. #4
    SitePoint Addict reboltutorial's Avatar
    Join Date
    Jan 2009
    Posts
    309
    Quote Originally Posted by jpaulllc View Post
    I am not new to programming, but am not an expert either. For an upcoming project, we are looking to do the following:

    I know there's a lot of information here, but I just want to know what languages (Perl, PHP?) should be used to do something like this. What is the flow of information, and how would such an app/script be built?

    Thanks for all your help. This forum is great!
    The best language to do parsing is Rebol: you don't need to cope with regular expressions as in Perl or PHP (in fact PHP was originally some Perl macros), because PARSE is one of the most powerful features in Rebol. It handles everything from simple string splitting to matching against parse expressions. PARSE forms the basis of pattern matching, which other languages implement as regular expression matching. This is much more natural.

    Search Google for "rebol parse html" and you should find an example from someone who wants to do something similar to you, I think: How-to-properly-parse-HTML-and-XHTML-Meta-Tags-td19448593

  5. #5
    SitePoint Wizard bronze trophy KevinR's Avatar
    Join Date
    Nov 2004
    Location
    Moon Base Alpha
    Posts
    1,053
    Quote Originally Posted by reboltutorial View Post
    The best language to do parsing is Rebol:

    Almost lost my lunch.

  6. #6
    SitePoint Addict reboltutorial's Avatar
    Join Date
    Jan 2009
    Posts
    309
    Quote Originally Posted by KevinR View Post
    Almost lost my lunch.
    Was it a sandwich?

    OK, just a little comparison with Perl on a real-world problem from a SitePoint member here, http://www.sitepoint.com/forums/showthread.php?t=594722, who got no answer to his question from any of you:
    "Using perl regular expression i need to read the text between a xml tag. like <title>this is test</title>"

    The problem is: what happens if there are several newlines between the two tags?

    In Rebol you don't have to worry about that; you would just do:

    parse html [thru "<title>" copy title to "</title>"]

    The expression in brackets is called a rule, and it reads almost like plain English. The string will be stored in the variable title (or any other variable you choose), even if it contains newlines. By the way, Rebol can store strings with newlines without the escape sequences needed in many other languages. You can have a string like this:

    title: {This is the title of my Blog
    This is the sub-title of my Blog}

    To come back to Rebol's powerful PARSE feature: it is so powerful that you can create a full language with it. That is what is today called a Domain Specific Language, promoted by Microsoft (and soon others) as a good alternative to UML for building Domain Application Architecture. That is in fact the main reason I was attracted to Rebol, because professionally I'm a Functional Software Architect and Project Manager for a Fortune 500 company. I do not recommend Rebol as a language for building scalable software, as it still lacks some features like namespaces (I mentioned that in the C++ OOps thread), but rather as a tool for generating code for other languages like Java, ASP.NET, PHP, XML, UML, etc.

  7. #7
    SitePoint Wizard bronze trophy KevinR's Avatar
    Join Date
    Nov 2004
    Location
    Moon Base Alpha
    Posts
    1,053
    He got his answer on another forum, so why would I want to answer it here too?

  8. #8
    SitePoint Wizard bronze trophy KevinR's Avatar
    Join Date
    Nov 2004
    Location
    Moon Base Alpha
    Posts
    1,053
    btw, all he needed to do was add the "s" option to his regexp to make it work:

    Code:
    # [^>]* stops the opening tag from swallowing title text; /s lets . match newlines
    if ($file_content =~ m/<title[^>]*>(.*?)<\/title>/s) {
        $pageTitle = $1;
    }
    Of course this is not actual parsing of the file; it's pattern matching. But there are a number of parsing modules that could be used. If Rebol has this type of functionality built in, that's great. Maybe someday people will actually use it too.
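To show the difference between pattern matching and using a parsing module, here is a minimal sketch in Python (not Perl) built on the standard library's `html.parser`. The class and function names are made up for this example; the point is only that a parser handles newlines and tag attributes without any special regex flags.

```python
from html.parser import HTMLParser
from typing import List, Optional


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element it sees."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self._chunks: List[str] = []
        self.title: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is captured; later ones are ignored.
        if tag == "title" and self.title is None:
            self._in_title = True

    def handle_data(self, data):
        if self._in_title:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self.title = "".join(self._chunks)


def extract_title(html_text: str) -> Optional[str]:
    """Return the text of the first <title> element, or None if there is none."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.title
```

Calling `extract_title` on a page whose title spans several lines returns the text with its newlines intact, and a page with no title element returns `None`.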

  9. #9
    SitePoint Member
    Join Date
    Jul 2008
    Posts
    17
    We can provide this for you. PM me.
