SitePoint Sponsor

User Tag List

Results 1 to 4 of 4
  1. #1
    SitePoint Member
    Join Date
    Dec 2007
    Posts
    19
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Inverted index in a search engine

    Hello there,

    I'm trying to write some code to make a small application for searching text from files.

    Files should be crawled, and I need to put an inverted index to boost searches.

    My problem is that I kind of have ideas about how the parser would be, I'm willing to implement the AND, NOT, OR in the query.

    Whereas, I couldn't figure out how my index should be... I have never created an inverted index so if any body could suggest a feasable way to do it I would be very grateful... I do know in theory how it works but my problem is I absolutely have no idea to make happen in MySql I need to give keywords being indexed a weight too...

    Thank you so much.

  2. #2
    SitePoint Zealot
    Join Date
    May 2008
    Location
    Montreal
    Posts
    155
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    First, have you considered pre-existing libraries that already solve your problem? Is PHP really the language you want to be doing this project in?

  3. #3
    SitePoint Member
    Join Date
    Dec 2007
    Posts
    19
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Thank you Peter for your answer, for a learning purpose I don't want to look at existing libraries at the moment and I do prefer to make with php.

    I just want few simple suggestions to start with and later on, I will be improving but now I'm really stuck

  4. #4
    ********* Victim lastcraft's Avatar
    Join Date
    Apr 2003
    Location
    London
    Posts
    2,423
    Mentioned
    2 Post(s)
    Tagged
    0 Thread(s)
    Hi...

    Which version of MySQL are you using?

    The reason I ask is that there are two SQL choices you should consider before writing any PHP code or burdening yourself with extra components to be installed:
    1. If the table is MyIsam, or can be shadowed to MyIsam, then you can use the "fulltext" index of mysql.
    2. If you can use triggers, then make each insert into the master table cause entries to be added to a "words" table.


    The latter solution is pretty straightforward...
    Code:
    create table words (
        word varchar(255),
        id integer,
        position integer);
    You want a unique index on the whole lot and an index on word.

    "id" points to rows on the master table, but I wouldn't put a foreign key constraint on it. If it points nowhere the later select statements will skip it anyway, and it just slows things down. The exception to this is if you are deleting data frequently and you want to use a cascading delete to keep the index small.

    "position" is the character position in the row. You only need this if the data in each row is large and you want to detect multiple hits.

    Add the trigger to the master table such that it loops through the text position with your favourite MySQL loop construct. Writing some helper functions like next_word() and skip_whitespace() will be useful here.

    Once you have the words table, just assemble the select query for the master data given the row ids given by words table.

    The whole task is a couple of hundred lines of code. Should take 2-4 days.

    Using the DB to do all the work should be fast enough (up to a hundred MBytes). If it's not, then look at Lucene or Sphinx.

    yours, Marcus
    Marcus Baker
    Testing: SimpleTest, Cgreen, Fakemail
    Other: Phemto dependency injector
    Books: PHP in Action, 97 things


Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •