SitePoint Sponsor

User Tag List

Results 1 to 9 of 9
  1. #1
    Community Advisor ULTiMATE's Avatar
    Join Date
    Aug 2003
    Location
    Bristol, United Kingdom
    Posts
    2,160
    Mentioned
    46 Post(s)
    Tagged
    0 Thread(s)

    Storing an Index in SQL Server

    I'm not too sure if this is the correct forum to post this in, as if a database isn't the best solution then it becomes a programming question, nevertheless...

    I have developed an Inverted Index script using C# to create a structure of Dictionary<string, List<Word>> where the struct Word contains the row ID of the table and its whereabouts within the entry. Within my database I have a table consisting roughly of the following. To make the question easier to follow I'll claim to have the following within my table taken from this Wikipedia entry:

    Code:
    ID        URL            CONTENT
    ------------------------------------------------------------
    1         a.com          it is what it is
    2         b.com          what is it
    3         c.com          it is a banana

    So, my inverted index will contain the following:

    Code:
    "a":      {(2, 2)}
    "banana": {(2, 3)}
    "is":     {(0, 1), (0, 4), (1, 1), (2, 1)}
    "it":     {(0, 0), (0, 3), (1, 2), (2, 0)} 
    "what":   {(0, 2), (1, 0)}
    All I want to know is how I can store the above within my SQL Server database and whether it is wise to do so. I am writing an information-retrieval system that is queried to return a list of the most relevant rows within the table and whilst I would like to utilise SQL Server for this I have opted not to choose SQL Server Full Text Search because it is not flexible enough to accommodate the extensions that will be required once this is finished.

    Can a .NET/SQL Server guru help me out?

  2. #2
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,341
    Mentioned
    63 Post(s)
    Tagged
    3 Thread(s)
    yes, it is wise to store this into a database table
    Code:
    CREATE TABLE wordindex
    ( word VARCHAR(37) NOT NULL 
    , urlid INTEGER NOT NULL
    , pos SMALLINT NOT NULL
    , PRIMARY KEY ( word,urlid,pos )
    );
    INSERT INTO wordindex VALUES
     ('a',3,3)
    ,('banana',3,4)
    ,('is',1,2),('is',1,5),('is',2,2),('is',3,2)
    ,('it',1,1),('it',1,4),('it',2,3),('it',3,1)
    ,('what',1,3),('what',2,1)
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  3. #3
    Community Advisor ULTiMATE's Avatar
    Join Date
    Aug 2003
    Location
    Bristol, United Kingdom
    Posts
    2,160
    Mentioned
    46 Post(s)
    Tagged
    0 Thread(s)
    I originally thought of trying this, but when I realised the scope of what I want this system to do (handle around 100-1000 separate pages of pure text at around 300-500 words per page) the size of said index would be huge. I've never handled a table larger than a few hundred entries and I can easily see a table like this reaching into the thousands.

    Additionally, to carry on with the examples on the Wikipedia page, if I were to search for "what is it" what query would be required to get a list of the relevant examples ordered by relevancy?

  4. #4
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,341
    Mentioned
    63 Post(s)
    Tagged
    3 Thread(s)
    how do you define relevancy?
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  5. #5
    Community Advisor ULTiMATE's Avatar
    Join Date
    Aug 2003
    Location
    Bristol, United Kingdom
    Posts
    2,160
    Mentioned
    46 Post(s)
    Tagged
    0 Thread(s)
    The document(s) that best match the search term. As stated in the Wikipedia entry if the user were to "search" for the term "what is it" the set:

    Code:
    {0,1} ∩ {0,1,2} ∩ {0,1,2} = {0,1}
    Is returned, meaning that the terms exist in all current rows. As the term occurs consecutively in row 1 that would be the closest-matching row and would appear first. If I were to search for "what is it" I would like to see the following returned:

    Code:
    ID        URL            CONTENT
    ------------------------------------------------------------
    2         b.com          what is it
    1         a.com          it is what it is
    3         c.com          it is a banana

  6. #6
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,341
    Mentioned
    63 Post(s)
    Tagged
    3 Thread(s)
    "as the term occurs consecutively" is gonna be really hard to do with sql

    here's a query which searches for three words and returns urlids sequenced by how many of those words are found in the content column --
    Code:
    SELECT urlid
         , COUNT(*) AS hits
      FROM wordindex
     WHERE word IN ('what','is','it')
    GROUP
        BY urlid
    ORDER 
        BY COUNT(*) DESC
    maybe you can do the relevancy from there in your application logic
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  7. #7
    Community Advisor ULTiMATE's Avatar
    Join Date
    Aug 2003
    Location
    Bristol, United Kingdom
    Posts
    2,160
    Mentioned
    46 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by r937 View Post
    "as the term occurs consecutively" is gonna be really hard to do with sql
    I guess when the SQL expert on SitePoint says that something is really hard it's gonna be really hard...

    Quote Originally Posted by r937 View Post
    "here's a query which searches for three words and returns urlids sequenced by how many of those words are found in the content column --
    Code:
    SELECT urlid
         , COUNT(*) AS hits
      FROM wordindex
     WHERE word IN ('what','is','it')
    GROUP
        BY urlid
    ORDER 
        BY COUNT(*) DESC
    maybe you can do the relevancy from there in your application logic
    I was considering using code like this to do what you've said, and then moving the results into a DataTable and manipulating them from there, although as I am not an expert with C#/.NET it'll be a struggle.

    If I read this correctly, this will rank the rows based on how many of the words are found within the text, right? Wit not be possible to write some kind of stored procedure to do multiple checks, first for "what", "is" and "it" (as individual hits), then for "what is" or "is it" (as double hits) and finally for "what is it" as the main hit, then to rank the overall results by all three? To be honest this is the first time I've ever had to do such a thing with SQL so it's all a bit confusing.

  8. #8
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,341
    Mentioned
    63 Post(s)
    Tagged
    3 Thread(s)
    yes, you could do it the way you suggest, first testing for individual words, then for two, then for three, and assign relevancy that way

    however, you wouldn't need your inverted index table for that, you'd use LIKE on the original content column, and the query would be really slow
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  9. #9
    Community Advisor ULTiMATE's Avatar
    Join Date
    Aug 2003
    Location
    Bristol, United Kingdom
    Posts
    2,160
    Mentioned
    46 Post(s)
    Tagged
    0 Thread(s)
    That's true, I guess I won't be getting a true database solution for this outside of SQL Server's Full Text Search, although storing the inverted index within the database and calling out the referenced rows will probably speed things up. Thanks for the help!


Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •