  1. #1
    SitePoint Addict djh's Avatar
    Join Date
    Apr 2000
    Location
    Long Beach, CA
    Posts
    333

    Generate keyword summaries

    I'm going to put all of my company's articles into a database. I want a search interface for these articles, and I'm wondering if anyone has good suggestions.

    here's what I'm considering:

    1. Every time an article is added, automatically generate 30 or so relevant keywords that adequately describe the article. Conduct the search only on this keyword field.

    2. Somehow index the full-text articles (we're using MS SQL Server; has anyone used indexing services/tools for SQL Server yet?) and run the search against the index.

    Any ideas, folks? Especially re: generating the 30 or so most important keywords? I need the search to be pretty accurate, as some of these articles will be pay-access only. I can do this by hand, but that'll take so long!

    thanks!

  2. #2
    grasshoppa Snowbird122's Avatar
    Join Date
    Apr 2001
    Location
    Austin
    Posts
    353
    Hmm. Interesting post. If you wanted to do #1, maybe you could write a program that goes through the articles, deletes all the common words (the, and, if, when, to, from, a, because...) and stores the rest of the words in the keywords field. You would be left with some nice keywords, but depending on the length of the article, you would probably have a LOT more than 30 left. It would be very, very accurate though.

    Now you just need to find a list of a few thousand common English words. I assume Google uses something like this to remove common words from its search queries.
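
    One way to get from "a LOT more than 30" down to about 30 is to rank the surviving words by how often they appear and keep the top of the list. Here is a minimal Python sketch of that idea. The stop-word list is a tiny stand-in for a real one, and the name `top_keywords` and the "skip words of two letters or fewer" rule are assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would hold hundreds of entries.
STOP_WORDS = {
    "the", "and", "if", "when", "to", "from", "a", "because",
    "of", "in", "is", "it", "for", "on", "that", "this", "with",
}

def top_keywords(text, limit=30):
    """Return up to `limit` non-stop-words, most frequent first."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(
        w for w in words
        if w not in STOP_WORDS and len(w) > 2  # assumption: skip very short words
    )
    return [word for word, _ in counts.most_common(limit)]
```

    Frequency alone is a rough measure of importance, so the result would still want a quick human check for pay-access articles.
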
    Last edited by Snowbird122; Jun 7, 2001 at 22:25.
    http://www.echo-consulting.net - Sound Solutions for Online Inspriations.

  3. #3
    Say WHA?! goober's Avatar
    Join Date
    Sep 2000
    Location
    United States
    Posts
    1,921
    Well, theoretically, it seems that you'd have to do this:

    1. Grab the entire text of the article.
    2. Subtract all 'dead words' (e.g. "the", "at", "end", "of", "stuff", "hi", "hello", etc.).
    3. Separate the remaining words with commas (where there is a space, change it to a comma and a space).

    The main obstacle here, obviously, is subtracting the list of words, which seems like it could be quite a task. First, think of all the dead words out there that don't describe anything. That's a ton of words. But theoretically, I'm seeing it as your best option.
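
    The three steps above can be sketched in a few lines of Python. This is only an illustration: the dead-word set holds just the examples from this post, `keyword_field` is a made-up name, and dropping repeated words is an extra assumption that keeps the field short.

```python
import re

# Only the examples named in the post; a real list would be much longer.
DEAD_WORDS = {"the", "at", "end", "of", "stuff", "hi", "hello"}

def keyword_field(article_text, dead_words=DEAD_WORDS):
    """Build a comma-separated keyword string from an article's text."""
    # Step 1: take the whole text and split it into lowercase words.
    words = re.findall(r"[a-z0-9]+", article_text.lower())
    seen = set()
    kept = []
    for word in words:
        # Step 2: subtract dead words (and, as an assumption, repeats).
        if word in dead_words or word in seen:
            continue
        seen.add(word)
        kept.append(word)
    # Step 3: separate the remaining words with a comma and a space.
    return ", ".join(kept)
```

    For example, `keyword_field("Hello, the end of the search index at the search page")` gives `"search, index, page"`.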

    Anyone disagree? Am I 'theoretically' correct? Let me know please, as this topic has grabbed my interest as well. If I'm right, maybe we can start to bang out some code.

    'Till next time..
    Sean Killeen [LinkedIn] [Twitter] [Web]

    Warning: Reality.sys corrupted. Universe halted. Reboot? (Y/N)

  4. #4
    SitePoint Addict djh's Avatar
    Join Date
    Apr 2000
    Location
    Long Beach, CA
    Posts
    333
    Thanks for the suggestions. I agree that removing the common English words and leaving the rest is probably a good, if not the best, option. But here are a couple of thoughts:

    1. Conceivably, there could still be a large number of words left. The original thought was to narrow down the number of words that a search term is compared against, in hopes of making the search process quicker. But with that many words, wouldn't it be too slow? We have well over 2000 articles!

    2. I've heard of indexing programs... has anyone used one for SQL, or doesn't SQL have a built-in one?

  5. #5
    SitePoint Wizard westmich's Avatar
    Join Date
    Mar 2000
    Location
    Muskegon, MI
    Posts
    2,328
    I moved the thread because it didn't seem aimed at any one technology; I think you're looking for any suggestion.

    However, to point to a specific technology, MS SQL Server has a built-in English Query engine. It knows how to intelligently index articles, including knowing that the words swim, swam, and swimming are all related and should be pulled up in the same search based on how many times they're referenced. I believe Oracle offers similar functionality, and there are third-party vendors also.

    These may not be cheap options, but for a company they may be viable. A developer could easily spend 100 hours trying to create a similar search feature.
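
    To make the swim/swam/swimming point concrete, here is a toy Python sketch that maps word forms to a shared base form. It is an illustration of the idea only, not how SQL Server or Oracle do it: the irregular-form table, the suffix list, and the names `base_form` and `group_forms` are all invented here, and the rules will get plenty of real words wrong. In SQL Server itself, I believe this kind of matching is exposed through full-text search, via a `FORMSOF(INFLECTIONAL, ...)` term inside `CONTAINS`.

```python
# Hand-made table for forms that suffix rules cannot handle (assumption).
IRREGULAR = {"swam": "swim", "swum": "swim"}
SUFFIXES = ("ing", "ed", "s")

def base_form(word):
    """Reduce a word to a rough base form using a lookup plus suffix stripping."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo a doubled final consonant, e.g. "swimm" -> "swim".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word

def group_forms(words):
    """Group words that share the same base form."""
    groups = {}
    for word in words:
        groups.setdefault(base_form(word), []).append(word)
    return groups
```

    With this, `group_forms(["swim", "swam", "swimming"])` puts all three words under `"swim"`, so a search for any one of them could match the others.
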
    Westmich
    Smart Web Solutions for Smart Clients
    http://www.mindscapecreative.com

  6. #6
    SitePoint Addict djh's Avatar
    Join Date
    Apr 2000
    Location
    Long Beach, CA
    Posts
    333
    Hey westmich,

    OK... but I'm only interested in using ASP. Sorry if I didn't make that explicit.

    So if the SQL Query engine is built in, how do I access it? How often does it update? Do you know where I can get more info about it?

    thanks!

  7. #7
    SitePoint Wizard westmich's Avatar
    Join Date
    Mar 2000
    Location
    Muskegon, MI
    Posts
    2,328
    Moved back

    I would start here for checking into it - http://www.microsoft.com/sql/default.asp

    One thing I didn't think about, though, is that if this is for an internal system, you can develop a more specific search page, since you know what fields your database contains: date range, categories, etc., instead of just a generic text box like Yahoo's.
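
    A field-specific search page like this boils down to building a query from whichever filters the user filled in. Here is a minimal Python sketch, using an in-memory SQLite database as a stand-in for SQL Server. The `articles` table, its columns, the sample rows, and the `search_articles` function are all hypothetical.

```python
import sqlite3

def search_articles(conn, keyword=None, category=None, date_from=None, date_to=None):
    """Return (id, title) rows matching only the filters that were supplied."""
    clauses, params = [], []
    if keyword:
        clauses.append("title LIKE ?")
        params.append(f"%{keyword}%")
    if category:
        clauses.append("category = ?")
        params.append(category)
    if date_from:
        clauses.append("published >= ?")
        params.append(date_from)
    if date_to:
        clauses.append("published <= ?")
        params.append(date_to)
    sql = "SELECT id, title FROM articles"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY published DESC"
    # Values go in as parameters, never pasted into the SQL string.
    return conn.execute(sql, params).fetchall()

# Hypothetical sample data; dates are ISO strings so they compare correctly.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, category TEXT, published TEXT)"
)
conn.executemany(
    "INSERT INTO articles VALUES (?, ?, ?, ?)",
    [
        (1, "Indexing basics", "databases", "2001-03-01"),
        (2, "Search page design", "design", "2001-05-15"),
        (3, "Full-text search tips", "databases", "2001-06-01"),
    ],
)
```

    For example, `search_articles(conn, category="databases", date_from="2001-05-01")` returns only article 3. The same pattern should carry over to ASP with parameterized queries.
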
    Westmich
    Smart Web Solutions for Smart Clients
    http://www.mindscapecreative.com

  8. #8
    Serial Publisher silver trophy aspen's Avatar
    Join Date
    Aug 1999
    Location
    East Lansing, MI USA
    Posts
    12,937
    Check out this article:

    http://www.phpbuilder.com/columns/clay19990421.php3

    It describes in detail what you want to do... in PHP...

    However, I'm sure you can adapt it to ASP; the idea is the same, just the coding would be different.
    Chris Beasley - I publish content and ecommerce sites.
    Featured Article: Free Comprehensive SEO Guide
    My Guide to Building a Successful Website
    My Blog|My Webmaster Forums

  9. #9
    SitePoint Addict djh's Avatar
    Join Date
    Apr 2000
    Location
    Long Beach, CA
    Posts
    333
    Hmm, very interesting. Thank you, aspen.

    Anyone else read it? Can someone explain how his relevancy thing works?

    So he takes out the noise words, and then puts each remaining word in its own row? Why? And how does that constitute relevancy? I'm definitely missing something here.

    thanks
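
    On the one-row-per-word question: I can't vouch that this matches the linked article, but a common reason for storing each word separately is that a search then becomes a direct lookup by word instead of a scan through every article. Relevancy can fall out of the same table if each row also records how many times the word appears in that article. Here is a minimal Python sketch under those assumptions. The noise-word list, the names `build_index` and `search`, and the "sum of counts" scoring rule are all illustrative.

```python
import re
from collections import Counter, defaultdict

# Small illustrative noise-word list (assumption).
NOISE_WORDS = {"the", "a", "and", "of", "to", "in", "is", "for"}

def build_index(articles):
    """articles: dict of article_id -> text. Returns word -> {article_id: count}.

    Each (word, article_id, count) entry plays the role of one table row.
    """
    index = defaultdict(dict)
    for article_id, text in articles.items():
        words = [
            w for w in re.findall(r"[a-z0-9]+", text.lower())
            if w not in NOISE_WORDS
        ]
        for word, count in Counter(words).items():
            index[word][article_id] = count
    return index

def search(index, query):
    """Return article ids, best match first, scored by summed word counts."""
    scores = Counter()
    for word in re.findall(r"[a-z0-9]+", query.lower()):
        for article_id, count in index.get(word, {}).items():
            scores[article_id] += count
    return [article_id for article_id, _ in scores.most_common()]
```

    An article that mentions the search terms more often gets a higher score and comes back first. Because the lookup goes by word, the cost depends on how many articles contain the query words rather than on the total number of articles.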

