SitePoint Sponsor

User Tag List

Results 1 to 7 of 7
  1. #1
    SitePoint Member
    Join Date
    Apr 2010
    Posts
    4
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Complex Keyword + Content Searching Help

    Hi,
    so ive been trying to figure this one out for quite some time here, and I've looked all over the internet... Seems simple enough, but I've been stuck for days.
    Basically my problem is trying to figure out how to efficiently search based on both keywords and content when the keywords are normalized, and ideally count how many of my search keywords were found.

    So I have a database of resumes. Each resume includes information about where the user is willing to work in the form of associations to entire states as well as associations to rectangular regions defined by 4 latitude and longitude numbers. The resume may have some associated text, like a description of the person, a desired job title, as well as the extracted text of the resume. Each resume can also be described by keywords, which are stored in a normalized keywords table.

    The tables:
    ====================
    Resumes
    -resume_id
    -desired_job_title
    -personal_statement
    -extracted_text

    States
    -state_id
    -state_name
    -state_abbr

    Resumes_states
    -resume_id
    -state_id

    Resumes_regions
    -resume_id
    -north (these are all decimal numbers which represent a lat-lon box)
    -south
    -east
    -west

    Keywords
    -keyword_id
    -keyword
    -rank (increases as people use it over time, used for suggestions and clouds)

    Resumes_keywords
    -resume_id
    -keyword_id
    ====================

    Ideally, I want to be able to search based on a location, a set of keywords, and a minimum number of matching keywords. Searching based on a set of locations and a set of keywords is fairly easy, if I'm ok with showing all results which contain at least one keyword, and not ranking the results in any way. But I'd like to be able to order the results by the number of keywords that matched and then apply a minimum threshold, for example, "show me all resumes that are willing to work in DC and contain at least 2 of the keywords from {'director', 'manager', 'computers', 'technology'}.

    Less ideally, some sort of efficient query that will still order things appropriately would be acceptable, but the only solution I can think of for that would be using fulltext search on the text and on a concatenated list of the keywords.

    Using 4 joins, I get a big table where each resume has LxK rows where L is the number of locations and K is the number of keywords. I've thought about searching this, but problems arise because of finding the search term in both the keywords AND in the text; Every row matches when the term is in the text (ie description or extracted text) and only a single row matches when its a keyword. I thought about using a UNION to only include the text data in one row, but then the row only counts once even if multiple keywords are present. I suspect the answer lies in using some addition to add these things up (which I cant figure out) but even then, problems also arise because of multiple locations (say I search for resumes in DC and in VA and the user has listed both!)

    So I thought about first applying all my location and other meta-data query and then using a sub-query, but I know it's bad practice and gets slow real fast. Maybe that doesnt matter because the query optimizer is smart and will actually only do text comparisons once? I hope?!

    Any ideas out there? Is this even possible? Seems like lots of people need to search based on text content and on normalized keywords but I can't find anything about it! Any help is MUCH appreciated!!!! (Also probably important to mention, is that my tables are quite small, but I'd really like to figure this out and do it the right way so that things can grow. Also, using ajaxy searches, many queries are happening so its gotta be fast).

    Thanks!

  2. #2
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,247
    Mentioned
    59 Post(s)
    Tagged
    3 Thread(s)
    "show me all resumes that are willing to work in DC and contain at least 2 of the keywords from {'director', 'manager', 'computers', 'technology'}.
    Code:
    SELECT t.columns
         , k.how_many_keywords
      FROM Resumes AS t
    INNER
      JOIN ( SELECT resume_id
                  , COUNT(*) AS how_many_keywords
               FROM Resumes_keywords
              WHERE keyword_id IN 
                    ( SELECT keyword_id
                        FROM Keywords
                       WHERE keyword IN
                             ( 'director'
                             , 'manager'
                             , 'computers'
                             , 'technology' ) )
             GROUP
                 BY resume_id
             HAVING COUNT(*) >= 2 ) AS k
        ON k.resume_id = t.resume_id
     WHERE "willing to work in DC" -- you do this part
    and of course if you want to know which keywords, you'll have to join back to the Resumes_keywords and Keywords tables to get them
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  3. #3
    SitePoint Member
    Join Date
    Apr 2010
    Posts
    4
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Hi Rudy,
    Thanks so much for the quick reply. I was actually posting here hoping that you would respond (saw some of your other posts and you really know what you're talking about). A few issues with your suggestion...
    1- Ideally I would like to use LIKE or LOCATION to include substring matches as well ('computer' would also match 'computers', etc). If this is too big of a performance hit though, I'm willing to sacrifice it, and maybe do some suffix removal before searching... your thoughts?
    2- While your solution works for checking the keywords table and requiring a minimum number of matches (and is probably way more efficient than the joins I was thinking about), it doesn't solve the real problem; I need to search the text information as well, ie personal_statement and extracted_text. It gets more complicated to avoid counting them twice if they were found as a keyword AND in there...

  4. #4
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,247
    Mentioned
    59 Post(s)
    Tagged
    3 Thread(s)
    not really sure how to respond to those additional nuances

    perhaps you should declare your keywords in advance? or filter them somehow? i'm not sure how i would handle the "computer" versus "computers" keyword issue other than not allowing bofadem

    to search the personal information, i would just use additional LIKE conditions in the WHERE clause, alongside whatever it is you're doing with the regions
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  5. #5
    SitePoint Member
    Join Date
    Apr 2010
    Posts
    4
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    One idea comes to mind of creating 2 binary variables (columns) for each search term, one for if its found as a keyword and one if its found in the text, then adding them up using an OR to avoid double counting. But I'm worried that doing it this way (if its even possible) would end up searching redundant text because of all the joins.

    but if this all sounds silly, what would you recommend?
    should I just give up on normalization, concatenate the keywords, and use full-text search?

  6. #6
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,247
    Mentioned
    59 Post(s)
    Tagged
    3 Thread(s)
    normalization should never be abandoned, full stop

    maybe you're over-thinking this
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  7. #7
    SitePoint Member
    Join Date
    Apr 2010
    Posts
    4
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    sorry, shouldnt have said "give up on normalization" but rather "denormalize"

    I think for now, Im going to just do a bunch of joins and look for anything that matches any of the search terms. I'm just stumped and am tired of being stuck for so long on this...

    Searching normalized keywords AND full text content at the same time seems like a common enough situation. I really wonder how others are doing it.


Tags for this Thread

Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •