Thread: Search Theory

  1. #1
    SitePoint Zealot csi95's Avatar
    Join Date
    Jan 2005
    Location
    Albany, NY
    Posts
    151
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Search Theory

    This may be a little off topic, but I'm not really sure where else to start this thread...

    I'm wondering how search engines like Yahoo and Google work on a pseudo code level. What's really going on behind the scenes?

    For the sake of simplicity, let's throw out things like TITLE and META tags, and just focus on page content.

    My guess is that they're creating a set of tables to track the occurrence of words on a given page. At a simplistic level, that might involve three tables:

    tableSites

    id url
    1 http://www.one.com
    2 http://www.two.com
    3 http://www.three.com

    tableWords

    id word
    1 dog
    2 cat
    3 fish
    4 horse
    5 cow

    tableOccurrences

    url word occurrences
    1 1 3
    1 2 5
    2 4 14
    3 1 1
    3 2 3
    3 3 12
    3 4 4
    3 5 2

    Using an approach like this, it would just be a matter of running a query that joins these tables. For example:

    -- rank the pages that contain the word by how often it appears
    SELECT ts.url
    FROM tableSites ts
    INNER JOIN tableOccurrences occ ON ts.id = occ.url
    INNER JOIN tableWords tw ON tw.id = occ.word
    WHERE tw.word = 'cat'
    ORDER BY occ.occurrences DESC
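
    To try the idea out without a database server, here is a minimal sketch that loads the three tables above into an in-memory SQLite database and runs the same kind of query. The function names build_demo_db and search are made up for illustration, SQLite stands in for SQL Server, and the alias occ is used because "to" is a reserved word in SQL.

```python
import sqlite3

def build_demo_db():
    """Create the three example tables in memory (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE tableSites (id INTEGER PRIMARY KEY, url TEXT);
        CREATE TABLE tableWords (id INTEGER PRIMARY KEY, word TEXT);
        CREATE TABLE tableOccurrences (url INTEGER, word INTEGER, occurrences INTEGER);
    """)
    conn.executemany("INSERT INTO tableSites VALUES (?, ?)", [
        (1, "http://www.one.com"),
        (2, "http://www.two.com"),
        (3, "http://www.three.com"),
    ])
    conn.executemany("INSERT INTO tableWords VALUES (?, ?)", [
        (1, "dog"), (2, "cat"), (3, "fish"), (4, "horse"), (5, "cow"),
    ])
    conn.executemany("INSERT INTO tableOccurrences VALUES (?, ?, ?)", [
        (1, 1, 3), (1, 2, 5), (2, 4, 14),
        (3, 1, 1), (3, 2, 3), (3, 3, 12), (3, 4, 4), (3, 5, 2),
    ])
    return conn

def search(conn, word):
    """Return URLs containing `word`, most occurrences first."""
    rows = conn.execute("""
        SELECT ts.url
        FROM tableSites ts
        JOIN tableOccurrences occ ON ts.id = occ.url
        JOIN tableWords tw ON tw.id = occ.word
        WHERE tw.word = ?
        ORDER BY occ.occurrences DESC
    """, (word,)).fetchall()
    return [row[0] for row in rows]
```

    With the sample rows, search(conn, 'cat') puts http://www.one.com (5 occurrences) ahead of http://www.three.com (3 occurrences).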


    Can it really be that simple?
    Join the EasyImage Affiliate Program!
    30% commission on all sales
    Conversion rates as high as 20%
    Dedicated Affiliate Manager to help you succeed!

  2. #2
    SQL Consultant
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,247
    Mentioned
    59 Post(s)
    Tagged
    3 Thread(s)
    i'd be willing to bet they don't use sql

    probably bitmap indexes or something
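
    For anyone unfamiliar with the term, a bitmap index keeps one bit per page for each word, so a multi-word search becomes a bitwise AND. The sketch below is only a guess at the general idea, not a description of what Google or Yahoo actually run; build_bitmaps and pages_with_all are hypothetical names, and Python integers stand in for the bitmaps.

```python
def build_bitmaps(pages):
    """Map each word to an integer whose bit N is set when page N contains it.

    `pages` is a dict of page id -> iterable of words (illustrative input shape).
    """
    bitmaps = {}
    for page_id, words in pages.items():
        for word in words:
            bitmaps[word] = bitmaps.get(word, 0) | (1 << page_id)
    return bitmaps

def pages_with_all(bitmaps, query_words):
    """Return the sorted ids of pages that contain every query word."""
    if not query_words:
        return []
    result = bitmaps.get(query_words[0], 0)
    for word in query_words[1:]:
        result &= bitmaps.get(word, 0)  # AND keeps only pages that have both
    return [i for i in range(result.bit_length()) if (result >> i) & 1]
```

    Note that a plain bitmap only records whether a word is present, so ranking by occurrence count would need extra data kept alongside it.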

    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  3. #3
    SitePoint Zealot csi95's Avatar
    Join Date
    Jan 2005
    Location
    Albany, NY
    Posts
    151
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Probably. I'm looking for a more conceptual answer, however. (and something I can implement on SQL Server)...

  4. #4
    SQL Consultant
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,247
    Mentioned
    59 Post(s)
    Tagged
    3 Thread(s)
    gee, i guess you really fooled me, i got the distinct impression you were asking how google and yahoo did it

    "I'm wondering how search engines like Yahoo and Google work on a pseudo code level. What's really going on behind the scenes?"


  5. #5
    SitePoint Zealot csi95's Avatar
    Join Date
    Jan 2005
    Location
    Albany, NY
    Posts
    151
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    I am.

    "probably bitmap indexes or something" isn't very helpful though. It's like answering "probably an engine or something" when I ask "How do cars work?"

    I'm looking for the conceptual approach on how they turn a page of HTML into a searchable index. I don't need the specific platform, architecture or database construct they use to implement their strategy, just the basics on how the system works.

    Sure, you'll probably use a different implementation path if you're trying to index the entire web versus a hundred thousand products in an online store, but the concept of the indexing and matching should be the same. That's what I'm looking for.

  6. #6
    SitePoint Guru asterix's Avatar
    Join Date
    Jun 2003
    Posts
    847
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    "I'm looking for the conceptual approach on how they turn a page of HTML into a searchable index."

    Low level concept:

    1) Tokenize
    2) Generate morphemes
    3) Lecalize word sense
    4) Build index entry, related entry pointers etc.
    5) Defragment index if necessary
    6) Re-balance the index tree
    7) Update statistics
    8) Repeat
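
    A rough sketch of steps 1, 2, 4 and 7 in code, assuming everything fits in memory. The names tokenize, crude_stem and InvertedIndex are invented for this example, and the suffix stripping is a deliberately naive stand-in for real morphological analysis. Step 3 is left out, and steps 5 and 6 only matter once the index lives in an on-disk tree, so they are skipped too.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Step 1: split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def crude_stem(token):
    """Step 2 (naive stand-in): strip a few common English suffixes."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token

class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {doc_id: occurrence count}
        self.doc_lengths = {}              # step 7: simple per-document statistic

    def add(self, doc_id, text):
        """Step 4: record every term of the document in the index."""
        terms = [crude_stem(t) for t in tokenize(text)]
        self.doc_lengths[doc_id] = len(terms)
        for term in terms:
            counts = self.postings[term]
            counts[doc_id] = counts.get(doc_id, 0) + 1

    def search(self, word):
        """Return matching doc ids, most occurrences first (ties by id)."""
        hits = self.postings.get(crude_stem(word.lower()), {})
        return sorted(hits, key=lambda doc_id: (-hits[doc_id], doc_id))
```

    Step 8 is then just calling add() once for every page the crawler fetches. Because the query word goes through the same stemming as the documents, a search for "cats" also finds pages that only say "cat".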

  7. #7
    SQL Consultant
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,247
    Mentioned
    59 Post(s)
    Tagged
    3 Thread(s)
    Lecalize?

  8. #8
    SitePoint Guru asterix's Avatar
    Join Date
    Jun 2003
    Posts
    847
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Those **** keys just keep moving. I swear, from one day to the next they have swapped positions.

    I wanted to write "lexicalize word sense" which I realize now also means absolutely nothing. It is in any case only necessary for natural language processing, and then the correct terminology is "disambiguate word sense".

  9. #9
    SQL Consultant
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,247
    Mentioned
    59 Post(s)
    Tagged
    3 Thread(s)
    ah, disambiguate, now that i can understand!

    however, i doubt that csi95 will need to disambiguate products in an online store


