  1. #1
    Spirit Coder allspiritseve's Avatar
    Join Date
    Dec 2002
    Location
    Ann Arbor, MI (USA)
    Posts
    648

    Searching with weighted tags

    I would like to build a search system that can factor weighted tags into its results. Is this feasible to do with PHP/MySQL, or should I be looking at something like Lucene (or is there something better)? I'd like to build something similar to Google's autosuggest, and will probably be using a jQuery plugin to request results using AJAX, so the faster the search, the better.

  2. #2
    SitePoint Evangelist stonedeft's Avatar
    Join Date
    Aug 2009
    Posts
    590
    I've been trying the same thing in my own projects and have found no real automated way to do it. My solution relies on tag clouds, with the weighted tags entered manually into the database.

    I wonder if someone has an automated solution for this. Let me subscribe to this thread.
    Don't Panic

  3. #3
    SitePoint Enthusiast
    Join Date
    Jul 2005
    Location
    Norway
    Posts
    88

    One approach I have used with articles is to parse each article into words and store them in a global word table. If a word is already registered (which will often happen), its weight is increased by one.

    I also record in a separate table which words each article contains and how many times each word occurs in it.

    So when you do it this way, words like "you, this, I, have" and other very common words will have a very high weight in the global words table, while uncommon words will have a low weight.

    Now, to search, you parse the search input into words, look those words up in your global table, ignore the words that have a high weight, and use only the words that have a low weight (if every word has a high weight, you have to use them anyway). Then you look these words up in the other table and find the articles that contain the most occurrences of them.

    So:
    Global table:
    id INT(11) AUTO_INCREMENT PRIMARY KEY,
    word/tag VARCHAR(200) NOT NULL,
    weight INT(11) NOT NULL DEFAULT 0,
    UNIQUE INDEX word (word)

    Article words:
    articleId INT(11) NOT NULL,
    wordId INT(11) NOT NULL,
    weight INT(11) NOT NULL DEFAULT 0,
    PRIMARY KEY (articleId, wordId)

    In short: find the words from the search input that are the least common, then use those to find the articles that contain the most occurrences of these words.
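
The scheme described in this post could be sketched in Python roughly as follows. This is only an in-memory illustration, not the poster's implementation: the two dictionaries stand in for the two tables, and the names `WordIndex`, `tokenize` and `max_terms` are hypothetical. It also assumes that every occurrence of a word adds one to its global weight, and it handles the "every word is common" case by always keeping the rarest `max_terms` query words rather than applying a fixed cut-off.

```python
import re
from collections import Counter, defaultdict

WORD_RE = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return WORD_RE.findall(text.lower())


class WordIndex:
    """In-memory stand-in for the 'global' and 'article words' tables."""

    def __init__(self):
        # word -> total occurrences across all articles (the global weight)
        self.global_weight = Counter()
        # article id -> word -> occurrences in that article
        self.article_words = defaultdict(Counter)

    def add_article(self, article_id, text):
        for word in tokenize(text):
            self.global_weight[word] += 1
            self.article_words[article_id][word] += 1

    def search(self, query, max_terms=2, limit=10):
        # Keep only query words the index has seen at least once.
        known = [w for w in set(tokenize(query)) if w in self.global_weight]
        # Rarest words first; common words drop off the end of the list.
        rare = sorted(known, key=lambda w: (self.global_weight[w], w))[:max_terms]
        scores = {}
        for article_id, words in self.article_words.items():
            score = sum(words[w] for w in rare)
            if score:
                scores[article_id] = score
        # Articles containing the most occurrences of the rare words come first.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        return [article_id for article_id, _ in ranked[:limit]]
```

In a MySQL version the same two steps would presumably become one query that picks the lowest-weight words from the global table and a second that sums the per-article weights for those word ids.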

  4. #4
    SitePoint Wizard silver trophybronze trophy Cups's Avatar
    Join Date
    Oct 2006
    Location
    France, deep rural.
    Posts
    6,869
    This post cropped up at the weekend: Alternative Term Weighting. Even though it does not deal with databases, it looks like a useful site to follow all the same.

  5. #5
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Join Date
    Jun 2004
    Location
    Copenhagen, Denmark
    Posts
    6,157
    What kind of data are you indexing? Lucene is a full-text engine, so you would generally use it for indexing and searching text data (such as the pages of a website).

  6. #6
    Spirit Coder allspiritseve's Avatar
    Join Date
    Dec 2002
    Location
    Ann Arbor, MI (USA)
    Posts
    648
    Quote Originally Posted by kyberfabrikken View Post
    What kind of data are you indexing? Lucene is a full-text engine, so you would generally use it for indexing and searching text data (such as the pages of a website).
    Just tags for now.

  7. #7
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Join Date
    Jun 2004
    Location
    Copenhagen, Denmark
    Posts
    6,157
    Quote Originally Posted by allspiritseve View Post
    Just tags for now.
    So what do you mean by search? If the only input to the search is a tag (category), then what is it that you weight?

  8. #8
    Spirit Coder allspiritseve's Avatar
    Join Date
    Dec 2002
    Location
    Ann Arbor, MI (USA)
    Posts
    648
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by kyberfabrikken View Post
    So what do you mean by search? If the only input to the search is a tag (category), then what is it that you weight?
    I don't know, maybe the tags do not need to be weighted.

    Essentially what I'm trying to do is make a dropdown autosuggest feature that takes any user input, and over time trains itself to offer more relevant results based on user selections. I read a paper about user vocabulary for a class I'm taking, and a system like this was proposed but not implemented. I thought it could be applicable to web applications with a lot of techie language that most users might not understand.

    So let's say, for example, we have a page called "Insert page". "Insert" is commonly used in a CRUD context, but might not mean much to the user. They might type in "add page" or "create a page" or "new page", etc.

    Let's take "create a page", for example. An untrained system would come up with a couple of weak results, such as "create a post", "create profile", "insert page", "edit page". The user, hopefully, would then see "insert page" and select it, thus tagging the command "insert page" with "create a page". The training therefore happens behind the scenes, and with enough use the user should either 1. be able to type in commands using their own vocabulary or 2. learn the system commands through trial and error.

    Here's where I thought the weighting would come in: if the system receives the same input and selection multiple times, I wanted that tag to be weighted more heavily so that the association would be stronger. I'm now thinking that just having the tag available to search might be enough of a boost in relevancy for those terms that weighting won't be needed.

  9. #9
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Join Date
    Jun 2004
    Location
    Copenhagen, Denmark
    Posts
    6,157
    Aha. Then you mean something different from what I was thinking of. You want a thesaurus that users can add to, and you want to select from that thesaurus based on whether a given match was previously found acceptable. How about a table with alias, command and number_of_selections? When a user selects an already existing phrase, you increment the number_of_selections counter. When building the suggestion list, you sort by the same field.
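
A minimal sketch of that table, using Python's built-in `sqlite3` module in place of MySQL so that it runs on its own. The column names come from the post; the table name `thesaurus`, the function names, the composite primary key, the lower-casing of aliases and the prefix match used for suggestions are all assumptions made for illustration.

```python
import sqlite3


def open_thesaurus(path=":memory:"):
    """Create the alias -> command table if it does not exist yet."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS thesaurus (
               alias TEXT NOT NULL,
               command TEXT NOT NULL,
               number_of_selections INTEGER NOT NULL DEFAULT 0,
               PRIMARY KEY (alias, command)
           )"""
    )
    return conn


def record_selection(conn, alias, command):
    """Count one more selection of `command` for the text the user typed."""
    alias = alias.strip().lower()
    cur = conn.execute(
        "UPDATE thesaurus SET number_of_selections = number_of_selections + 1 "
        "WHERE alias = ? AND command = ?",
        (alias, command),
    )
    if cur.rowcount == 0:
        # First time this alias was paired with this command.
        conn.execute(
            "INSERT INTO thesaurus (alias, command, number_of_selections) "
            "VALUES (?, ?, 1)",
            (alias, command),
        )
    conn.commit()


def suggest(conn, typed, limit=5):
    """Return commands whose aliases start with `typed`, most selected first."""
    prefix = typed.strip().lower()
    rows = conn.execute(
        "SELECT command, SUM(number_of_selections) AS total "
        "FROM thesaurus "
        "WHERE substr(alias, 1, length(?)) = ? "
        "GROUP BY command "
        "ORDER BY total DESC, command "
        "LIMIT ?",
        (prefix, prefix, limit),
    ).fetchall()
    return [command for command, _ in rows]
```

An autosuggest endpoint would call `suggest` on each keystroke and `record_selection` once the user picks an entry from the dropdown.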

  10. #10
    Spirit Coder allspiritseve's Avatar
    Join Date
    Dec 2002
    Location
    Ann Arbor, MI (USA)
    Posts
    648
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by kyberfabrikken View Post
    Aha. Then you mean something different from what I was thinking of. You want a thesaurus that users can add to, and you want to select from that thesaurus based on whether a given match was previously found acceptable. How about a table with alias, command and number_of_selections? When a user selects an already existing phrase, you increment the number_of_selections counter. When building the suggestion list, you sort by the same field.
    Ah, ok.

    Should I be doing a phrase -> command thesaurus, or split the words up, i.e.:

    "add a new page" split into:
    Code:
    add -> insert page
    new -> insert page
    page -> insert page
    (common words like "a" ignored, obviously)

    Which should, after a number of training sessions, have the effect of listing all "insert ___" commands when you start typing "add ".

    Or maybe commands should be split too, so:
    Code:
    add -> insert
    add -> page
    new -> insert
    new -> page
    page -> insert
    Thus, a search for "add " would list all insert commands whether they are paired with the alias "add" or not (given enough pairing between "add" and "insert").
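
The first variant (alias words mapped to whole commands) might look something like the sketch below. It is only an illustration of the idea: the class name `WordAliasIndex`, the small stop-word list and the choice to treat the last typed word as a prefix are assumptions, not something settled in this thread.

```python
from collections import Counter, defaultdict

# Hypothetical stop-word list; a real one would be longer.
STOP_WORDS = {"a", "an", "the", "to", "of"}


def content_words(phrase):
    """Split a phrase into lower-case words, dropping common words."""
    return [w for w in phrase.lower().split() if w not in STOP_WORDS]


class WordAliasIndex:
    def __init__(self):
        # alias word -> command -> number of times the pair was selected
        self.links = defaultdict(Counter)

    def train(self, typed_phrase, chosen_command):
        """Link every content word the user typed to the command they picked."""
        for word in set(content_words(typed_phrase)):
            self.links[word][chosen_command] += 1

    def suggest(self, typed, limit=5):
        words = content_words(typed)
        scores = Counter()
        for i, word in enumerate(words):
            is_last = i == len(words) - 1
            for alias, commands in self.links.items():
                # The last word may still be incomplete, so match it as a prefix.
                if alias == word or (is_last and alias.startswith(word)):
                    scores.update(commands)
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        return [command for command, _ in ranked[:limit]]
```

The second variant would apply `content_words` to the command as well and store word -> word pairs, at the cost of an extra step to map the matched words back to full commands.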

  11. #11
    SitePoint Member
    Join Date
    Dec 2009
    Posts
    6
    The tagadelic module generates a page with weighted tags, indicating how many times a category or tag has been used to categorize content on the site. The cool thing is that by merely altering font sizes, these lists suddenly gain a dimension: the more often a tag is used

  12. #12
    SitePoint Zealot
    Join Date
    May 2008
    Location
    Montreal
    Posts
    155
    This is very interesting. Is using the weights as an ordering not satisfactory? That is, do a basic search, but order the tags by weight in non-ascending order.

    Are you thinking of a global weighting system or per-user ratings? Once you settle on how to implement the search, it might be interesting to use a global ratings system and then, on top of that, a per-user system that is interested only in exceptions to the global rule, e.g. when your system suggests a bunch of stuff but the user decides to take the 5th choice instead of the 1st or 2nd.
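
One way the global-plus-exceptions idea could be sketched is shown below. The class name `Ranker`, the `user_boost` factor and the rule for what counts as an exception (the user picked something other than the top suggestion) are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict


class Ranker:
    def __init__(self, user_boost=10):
        # (alias, command) -> selections by all users
        self.global_counts = Counter()
        # user -> (alias, command) -> times the user overrode the top suggestion
        self.user_exceptions = defaultdict(Counter)
        # How strongly a personal exception outweighs the global count.
        self.user_boost = user_boost

    def rank(self, user, alias, candidates):
        """Order candidate commands by global weight plus personal exceptions."""
        mine = self.user_exceptions.get(user, Counter())

        def score(command):
            key = (alias, command)
            return self.global_counts[key] + self.user_boost * mine[key]

        return sorted(candidates, key=lambda c: (-score(c), c))

    def record(self, user, alias, chosen, shown):
        """Store a selection; `shown` is the ranked list the user chose from."""
        key = (alias, chosen)
        self.global_counts[key] += 1
        # Only deviations from the top suggestion are stored per user.
        if shown and shown[0] != chosen:
            self.user_exceptions[user][key] += 1
```

Because only the deviations are stored per user, the per-user data should stay small for anyone whose choices mostly agree with the global ordering.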

