  1. #1 DR_LaRRY_PEpPeR

    designing a search engine for a large forum?

    MattR and other DB gurus, i'm looking at you!

    i'm trying to figure out the best way to implement a search engine (using MySQL) for forum software. it seems like the 2 options are: 1) MySQL's FULLTEXT indexes or 2) searchword and searchindex (wordid+postid lookup) tables like vBulletin, phpBB2, etc.
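
    to be clear, here's roughly what i mean by the 2 options (just a sketch; table/column names are made up, not from any particular package):

    Code:
    -- option 1: FULLTEXT index right on the post table (MyISAM only)
    CREATE TABLE post (
      postid INT NOT NULL AUTO_INCREMENT,
      post   TEXT NOT NULL,
      PRIMARY KEY (postid),
      FULLTEXT (post)
    ) TYPE=MyISAM;

    -- option 2: vB/phpBB2-style word lookup tables
    CREATE TABLE searchword (
      wordid INT NOT NULL AUTO_INCREMENT,
      word   CHAR(50) NOT NULL,
      PRIMARY KEY (wordid),
      UNIQUE (word)
    );

    CREATE TABLE searchindex (
      wordid INT NOT NULL,
      postid INT NOT NULL,
      PRIMARY KEY (wordid, postid)
    );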

    i want something that would work with more than 1 million posts (e.g. HardForum: 2 million+, FanHome: 4 million+), if possible. HardForum seems to be doing all right with vB's search engine (it must have, what, 50-100 mil. rows in the searchindex table?), and FanHome's search is dead (i know, they're using the old vB, which has a different search).

    FULLTEXT indexes seem easier, but MySQL 3.23's support for it isn't nearly as good as 4.0's. not sure about their speed or storage requirements with lots and lots of posts.

    word lookup tables might give more options for wild-card suffix searches, but keeping these tables updated manually seems like such a pain. how do this method's speed/storage reqs compare to FULLTEXT? not to mention how large these extra tables would make dumps. and if you don't dump them, it would take forever to reindex 1,000,000+ posts.


    SO, starting from scratch, what would be the best way to create a good, scalable search for huge forums in MySQL? what could you do differently with a different database system?

    thanks!
    Last edited by DR_LaRRY_PEpPeR; Jul 24, 2002 at 20:57.

  2. #2 MattR


    Well, I'd say use MySQL 4 and get their full-text indexing. It's certainly a lot easier to set up, maintain, etc.

    Our search actually is using Sybase EFTS, but is not online yet (still working out the bugs, etc.).

    There are several disadvantages to the vBulletin (et al) table-based method:
    • Doesn't retain paragraph or sentence info -- cannot perform 'near'/sentence queries, 'paragraph' queries, etc., which the full-text method can.
    • Doesn't retain phrases, so you cannot search for "George W" (or exclude certain phrases from searches); it only sees the occurrence of George and W somewhere in a post.
    • Doesn't retain word occurrence ranking: a post with many occurrences of the phrase "MySQL statistics" may be more relevant than a post with only one occurrence.


    You could get around the third one by storing a 'count' in the (wordid, postid) table, although that again doesn't help for phrase searches.

    But I would say the full-text search method, properly implemented in MySQL, would be far preferable to the table-based one due to the significant advantages it has in finding qualifying rows. Not only that, but you don't have to write any complex logic to handle deleting posts, searching, etc.

    Using the table method you simply have another table, which means you can dump/load it like any other. But the row count can get extraordinarily huge, which doesn't occur with full-text indexes, since they don't chop up posts in such a manner.

    I would think the table method would cost more in terms of disk space if you have more large posts than not; you need to store the words and then the relations in a table, so a post with 3 different words would cost:
    3 x row in word table (4-byte id + CHAR(5) word = 9 bytes) = 27 bytes
    3 x row in word/post intersection (4 + 4 + 4 bytes plus row overhead, say 16) = 48 bytes

    Of course, as the number of posts goes up, the number of duplicate words goes up (removing the need to create new word rows), so as # posts -> infinity the per-post word cost drops to zero.

    So our only cost is the word/post intersection which is 16 bytes / row (wordid, postid, count; not counting indexes).

    This needs to have a row for each word, so here would be the break-even point:
    PHP Code:
    if( (row_length * number_words + index_length * number_words) > length_of_off_row_text_index ) {
        echo "use full-text";
    } else {
        echo "table-based";
    }
    Now, the question is how MySQL stores text indexes.

  3. #3 DR_LaRRY_PEpPeR
    thanks very much for the insightful reply (as always). i didn't reply sooner because i've been doing some more research (to ask more questions).

    Originally posted by MattR
    Well, I'd say use MySQL 4 and get their full-text indexing. It's certainly a lot easier to set up, maintain, etc.
    yeah, but the problem is with other people using 3.23. well, maybe in 6 months or something MySQL 4 will be closer to being officially stable and hosts, etc. will upgrade.

    what do you think about using FULLTEXT with 3.23? should it absolutely not be used? the main thing it's missing is the boolean mode search operators. its behavior is that all words are optional, so there's no way to make certain words required in the result, as you can by prefixing a word with + in MySQL 4. however, i was thinking that maybe i could get around this in 3.23 by [still] using the FULLTEXT search to narrow down the rows, then parsing out any word prefixed with + in PHP (easy), and also putting post LIKE '%$word_with_plus_sign%' in the query to make sure those words are in the result. at least the LIKE would only have to look at the rows found by the FULLTEXT search and not the whole table. i could do something similar for the other boolean operators.
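
    something like this is what i have in mind (a rough sketch -- $search is the raw user input, and there's no real error checking):

    PHP Code:
    // 3.23 workaround sketch: FULLTEXT narrows the rows,
    // LIKE then makes sure the +words are really in them
    $against = array();
    $like    = '';
    foreach (explode(' ', $search) as $word) {
        if ($word == '') {
            continue;
        }
        if ($word[0] == '+') {
            $word  = substr($word, 1);
            $like .= " AND post LIKE '%" . addslashes($word) . "%'";
        }
        $against[] = $word;
    }
    $against = addslashes(implode(' ', $against));

    $query = "SELECT postid FROM post
              WHERE MATCH(post) AGAINST('$against')$like";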

    BTW, what's your opinion on each word being optional by default? how does it work in Sybase?

    Our search actually is using Sybase EFTS, but is not online yet (still working out the bugs, etc.).
    what is EFTS? Enhanced Full-Text Search...?

    does a FULLTEXT search system come with Sybase or is it an add-on?

    how does what you're using handle 4 million+ posts? easily? or is it close to its limit and really hard on the server? just curious.

    There are several disadvantages to the vBulletin (et al) table-based method:
    • Doesn't retain paragraph or sentence info -- cannot perform 'near'/sentence queries, 'paragraph' queries, etc., which the full-text method can.
    • Doesn't retain phrases, so you cannot search for "George W" (or exclude certain phrases from searches); it only sees the occurrence of George and W somewhere in a post.
    • Doesn't retain word occurrence ranking: a post with many occurrences of the phrase "MySQL statistics" may be more relevant than a post with only one occurrence.
    • MySQL's FULLTEXT can't do near/sentence/paragraph queries, either (yet).
    • i'm not sure how MySQL 4's FULLTEXT would handle a "George W" exact phrase search. i'm thinking the "W" would be excluded 'cause it's too short. but i understand what you mean.
    • again, one of the problems (?) is that search words are optional without a + prefix. so if you searched for "MySQL Sybase statistics" (not an exact phrase), the posts won't necessarily contain all 3 words. although, i guess if they're sorted by relevancy, the most relevant posts should be listed first, in theory.


    You could get around the third one by storing a 'count' in the (wordid, postid) table, although that again doesn't help for phrase searches.
    heh, i don't think i'd bother keeping track of the word count if i was using that method.

    But I would say the full-text search method, properly implemented in MySQL, would be far preferable to the table-based one due to the significant advantages it has in finding qualifying rows. Not only that, but you don't have to write any complex logic to handle deleting posts, searching, etc.
    do you think FULLTEXT could handle millions of posts OK? it does mention "a 1GB table" on this page, which would be about 1,000,000 posts, so i guess it's designed for that. it just seems like so much data. i guess it's better than vB's method, though.

    Now, the question is how MySQL stores text indexes.
    what do you mean? the FULLTEXT indexes or were you talking in regard to vB's "word" table method?

    this brings me to 1 last question. if you were talking about how MySQL stores the FULLTEXT indexes, it's right in the .MYI file with other indexes. this made me wonder, if a FULLTEXT index is so large, does it hurt the other small indexes' performance (postid, threadid, etc.) since they all share the same file? or is it not even an issue?


    sorry for the long post, but thanks for any more info!

  4. #4 MattR
    No problem, thinking is good -- keeps the mind sharp.

    yeah, but the problem is with other people using 3.23. well, maybe in 6 months or something MySQL 4 will be closer to being officially stable and hosts, etc. will upgrade.

    That is true -- I don't have much experience with 3's FTS but I also think it can't limit the number of rows returned, which can make it really unsuitable if many rows match.

    BTW, what's your opinion on each word being optional by default? how does it work in Sybase?

    What do you mean each word optional? In your query? In Verity you have to explicitly say 'word<and>word2<or>word3'. I don't know what it does if you leave the and/or part out (probably and).


    what is EFTS? Enhanced Full-Text Search...?
    Yes, enhanced full-text search. Verity has (or had) two products, one which was a basic FTS (like MySQL's) and one which was enhanced, which let you do REALLY complex things (setting up topic groups, replication, etc.). Sybase used to sell both, although most people opted for the enhanced version so they dropped 'regular' from their line up (or Verity stopped producing the 'regular' FTS).

    does a FULLTEXT search system come with Sybase or is it an add-on?

    It is an add-on. Pricing depends on your configuration, but is REALLY cheap (< $1000).

    how does what you're using handle 4 million+ posts? easily? or is it close to its limit and really hard on the server? just curious.

    It seems to do fine, I think we're one of the largest EFTS installs (at least on Linux). I think there are a couple more clients with 20+ GB on EFTS so our measly 5GB or so is no contest.

    The neat thing about EFTS is that it is a separate daemon outside of ASE, so I can place it on its own box if we wanted to free up system resources (illustrated in the attached image). I can also replicate the indexes, so if we wanted 100% uptime or enhanced performance we could do so.

    The other thing is that I can back up the indexes and not have to rebuild them on the fly when loading the DB back in. MySQL is stupid in that it dumps the CREATE TABLE / INSERT statements (or skips the indexes with the BACKUP TABLE syntax) so that you would have to re-create your text indexes with every INSERT, which will be horribly slow. Sybase/MS SQL/Oracle/et al do bytewise backups which include all indexes, so all I have to do is load the DB (11GB load takes around 20 minutes) and we're online again, no reindexing needed.

    So that is something to consider: with MySQL you can't back up the FTS indexes natively (if you use something like a Veritas snapshot you can get everything backed up, but if you have that kind of $$ you're not using MySQL), so you'll have a LOT of downtime while MySQL rebuilds those indexes.

    Although on the MySQL dump DB page there is this comment:
    In older versions, mysqldump dump table structure and data but nothing about indexes. Recent versions dump index information too.

    But I can see nothing in the MySQL docs which corroborates that statement.

    • MySQL's FULLTEXT can't do near/sentence/paragraph queries, either (yet).
    • i'm not sure how MySQL 4's FULLTEXT would handle a "George W" exact phrase search. i'm thinking the "W" would be excluded 'cause it's too short. but i understand what you mean.
    • again, one of the problems (?) is that search words are optional without a + prefix. so if you searched for "MySQL Sybase statistics" (not an exact phrase), the posts won't necessarily contain all 3 words. although, i guess if they're sorted by relevancy, the most relevant posts should be listed first, in theory.


    That's a problem with the FTS -- the word being too short thing. That's where a vB-based system might work better since you could hard code certain 'must index' phrases/words, so that "George W" is treated as a single word. Although if your FTS engine is smart it doesn't have a 'min word' length -- it will determine whether or not W is relevant (e.g. if you have several George W phrases in the post) and it will index accordingly. MySQL's is immature in that respect.

    Re 'optional': I'd say you could state on your search page 'All words will be ANDed together unless you use AND/OR'. That would remove ambiguity, and your host language (PHP?) would do something like $user_query = str_replace( ' or ', ' +', strtolower( $user_query ) );

    Or you could force the users to enter the +/- in the query themselves.
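
    A rough sketch of the first approach (assumes MySQL 4's boolean mode; the variable names are made up):

    PHP Code:
    // 'all words ANDed together unless you use OR' -- sketch only
    $user_query = strtolower(trim($user_query));
    if (strpos(" $user_query ", ' or ') === false) {
        // no OR anywhere: prefix every word with + so all are required
        $boolean = '+' . preg_replace('/\s+/', ' +', $user_query);
    } else {
        // OR present: drop the ORs and leave the words optional,
        // which is boolean mode's default behavior
        $boolean = str_replace(' or ', ' ', $user_query);
    }
    $sql = "SELECT postid FROM post
            WHERE MATCH(post) AGAINST('" . addslashes($boolean) . "' IN BOOLEAN MODE)";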

    do you think FULLTEXT could handle millions of posts OK? it does mention "a 1GB table" on this page, which would be about 1,000,000 posts, so i guess it's designed for that. it just seems like so much data. i guess it's better than vB's method, though.

    Well the problem, since you said it lives in the .MYI file, is when you hit 2GB in a single file (e.g. many many posts). You'll have to reconfigure your server to enable large files in that instance.

    But I would look long and hard at your chosen architecture (MySQL). If you're expecting millions upon millions of rows (and hence thousands of users online at a time) I would suggest a more capable RDBMS (Postgres, MS SQL, Sybase, Oracle, etc.). You'll find that you can achieve MORE with these RDBMSs on the same hardware (well, maybe not the Oracle box) due to less demanding/complex apps (you can package things up in stored procs, subselects, etc.), enhancements (subselects, pre-compiled stored procs, replication, etc.), and other DBA-happy things (online backups, better locking, etc.).

    what do you mean? the FULLTEXT indexes or were you talking in regard to vB's "word" table method?

    this brings me to 1 last question. if you were talking about how MySQL stores the FULLTEXT indexes, it's right in the .MYI file with other indexes. this made me wonder, if a FULLTEXT index is so large, does it hurt the other small indexes' performance (postid, threadid, etc.) since they all share the same file? or is it not even an issue?


    I was wondering where they stored the index information for FTS, e.g. like you said in the .MYI file. However they keep you in the dark as to the actual datastructure so I can't complete the other half of that equation.

    I wouldn't think it would be a problem. Enterprise RDBMSs don't store each table/index in a separate file, but in a large 'file' called a DEVICE. For instance, FanHome is on six 2GB file-system devices, each of which is basically a large 2GB file. Nothing to worry about unless MySQL is dumb in the way it opens/locks/closes the file (e.g. when a text index is being written it locks the .MYI file, so all indexes are locked as well).
    Last edited by MattR; Jul 28, 2002 at 10:05.

  5. #5 DR_LaRRY_PEpPeR
    hey, thanks very much for the reply again! i read your post right away, but didn't get around to replying. i don't want you to think that i didn't appreciate your post!


    Originally posted by MattR
    That is true -- I don't have much experience with 3's FTS but I also think it can't limit the number of rows returned, which can make it really unsuitable if many rows match.
    ... LIMIT n? or do you mean that if you have like 300,000 posts that it may return 100,000? that's way too many, but i think that would be the same in 4 too.

    What do you mean each word optional? In your query? In Verity you have to explicitly say 'word<and>word2<or>word3'. I don't know what it does if you leave the and/or part out (probably and).
    i mean that each word is optional. the equivalent of using or is the default.

    Well the problem, since you said it lives in the .MYI file, is when you hit 2GB in a single file (e.g. many many posts). You'll have to reconfigure your server to enable large files in that instance.
    i was thinking that the data file (.MYD) would be larger than the index file? if so, it would, of course, hit the OS limit first.

    i'm wondering if a FULLTEXT index on, say, the post column is as large (or larger) than the post column in the data file? any idea? i would think that there would be some way that they save space rather than totally duplicating each post record. but i guess it also has to keep track of how many times a post has a particular word, and how those words relate to other rows, etc., which would take some space.

  6. #6 MattR
    No problem. I enjoy such things since it keeps me thinking and typically I learn something new as well, which is usually not the case with 'What is wrong with my SQL query?' when it's usually a missing comma.

    ... LIMIT n? or do you mean that if you have like 300,000 posts that it may return 100,000? that's way too many, but i think that would be the same in 4 too.

    I think in MySQL 3's version it ignores the limit clause, so if 100,000 posts match it will dump all 100,000 rows to your client which makes it virtually unusable for large systems. From what I understand they made this work better in version 4, although what I would do would be to set it at LIMIT 1000 and grab postids only, then store that in a search table kind of like this:
    Code:
    CREATE TABLE search(
      searchid INT      NOT NULL AUTO_INCREMENT,
      userid   INT      NOT NULL,
      date     DATETIME NOT NULL,
      PRIMARY KEY( searchid )
    );

    CREATE TABLE search_results(
      searchid INT NOT NULL,
      postid   INT NOT NULL,
      PRIMARY KEY( searchid, postid )
    );
    That way pagination is a lot faster, since full-text queries can be somewhat expensive and you don't want to repeat the search for every pageful of results.
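
    Usage would go something like this (a sketch; $userid, $terms, and $offset are assumed to be set and already escaped):

    PHP Code:
    // run the expensive full-text search once, cache the matching ids
    mysql_query("INSERT INTO search (userid, date) VALUES ($userid, NOW())");
    $searchid = mysql_insert_id();
    mysql_query("INSERT INTO search_results (searchid, postid)
                 SELECT $searchid, postid FROM post
                 WHERE MATCH(post) AGAINST('$terms') LIMIT 1000");

    // every page after that is a cheap join on the cached ids
    $result = mysql_query("SELECT p.postid, p.post
                           FROM search_results sr, post p
                           WHERE sr.searchid = $searchid
                             AND p.postid = sr.postid
                           LIMIT $offset, 25");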

    i was thinking that the data file (.MYD) would be larger than the index file? if so, it would, of course, hit the OS limit first.

    That's probably an accurate assumption, but I figured you may have something like 50MB of indexes (generally there is some amount of duplication due to different queries/covering indexes) and the extra bit from the full-text might bump it over the 2GB limit -- e.g.
    you have 1.5GB text data, 500MB other
    you have 600MB of regular indexes
    a 1.5GB text index wouldn't fit.

    But you're right you'd want to test with some data how big exactly the text indexes are for a million, 2 million, etc. rows. I don't know what would be the best way to stress-test -- perhaps ask someone who runs a forum now and has a couple million if they wouldn't mind letting you poke at a post table dump (hidden/mods only posts removed of course).

  7. #7 DR_LaRRY_PEpPeR
    Originally posted by MattR
    I enjoy such things since it keeps me thinking and typically I learn something new as well, which is usually not the case with 'What is wrong with my SQL query?' when it's usually a missing comma.
    yep, me too.

    I think in MySQL 3's version it ignores the limit clause, so if 100,000 posts match it will dump all 100,000 rows to your client which makes it virtually unusable for large systems. From what I understand they made this work better in version 4, although what I would do would be to set it at LIMIT 1000 and grab postids only, then store that in a search table kind of like this:
    Code:
    CREATE TABLE search(
      searchid INT      NOT NULL AUTO_INCREMENT,
      userid   INT      NOT NULL,
      date     DATETIME NOT NULL,
      PRIMARY KEY( searchid )
    );

    CREATE TABLE search_results(
      searchid INT NOT NULL,
      postid   INT NOT NULL,
      PRIMARY KEY( searchid, postid )
    );
    That way pagination is a lot faster, since full-text queries can be somewhat expensive and you don't want to repeat the search for every pageful of results.
    i just tried LIMIT in 3.23 and it actually works fine. just like you said though, i was thinking about using LIMIT 1000 or so to keep a ton of results from being returned. and i was also going to use a search_results table too to hold the results so that the FULLTEXT processing isn't needed for a repeated search. we're thinking alike.

    That's probably an accurate assumption, but I figured you may have something like 50MB of indexes (generally there is some amount of duplication due to different queries/covering indexes) and the extra bit from the full-text might bump it over the 2GB limit -- e.g.
    you have 1.5GB text data, 500MB other
    you have 600MB of regular indexes
    a 1.5GB text index wouldn't fit.

    But you're right you'd want to test with some data how big exactly the text indexes are for a million, 2 million, etc. rows. I don't know what would be the best way to stress-test -- perhaps ask someone who runs a forum now and has a couple million if they wouldn't mind letting you poke at a post table dump (hidden/mods only posts removed of course).
    i was just doing some testing earlier with 25 different test "posts" that i inserted very many times to create 250,000 rows. i don't know how long the typical forum post is (in bytes), but my 25 were between 447 and 1175 bytes, with the average being 645.

    i was thinking that MySQL would "notice" that the posts were the same 25 repeated over and over. this doesn't appear to be the case, however, since the index is very large. i also compared MySQL 3.23.51 and 4.0.2 to see how long they took to build the FULLTEXT index (on Win2000, dual 800MHz PIIIs). so here's what i found with 250,000 posts:

    the data file (.MYD) was 156MB, and before the FULLTEXT index, the index file (.MYI) was 10MB on both 3.23 and 4.

    3.23 built the FULLTEXT index in about 23:30. the index file was 196MB (bigger than the data file!).

    4 built the FULLTEXT index in 3:40 (much faster!). the index file was 106MB (90MB smaller!).

    so they've definitely improved FULLTEXT speed in 4. from the 4.0.0 changelog:

    Implemented "repair by sort" for FULLTEXT indexes. REPAIR TABLE, ALTER TABLE, and OPTIMIZE TABLE for tables with FULLTEXT indexes are now up to 100 times faster.
    i didn't see 100 times faster, but it sure is a lot faster than 3.23. it appears that you would be able to restore a million posts in about 15 minutes compared to 90 in 3.23.

    interesting stuff. i'm glad i experimented.


    (BTW, i have a good estimation that vB's etc. searchword/searchindex table system would be about the same size as 3.23's FULLTEXT index.)

  8. #8 DR_LaRRY_PEpPeR
    coming back to this... hopefully you'll see this again Matt.

    i'm guessing MySQL 4.0 is going to be "officially" stable soon now. that's good, since hopefully most hosts will upgrade to it soon after, and i won't have to worry about the slowness and doing workarounds for 3.23's FULLTEXT search.

    the thing i was wondering: do you have a rough estimate of how long it should take to search 1, 2, 5 million etc. posts? i'm trying to get an idea of how fast or slow MySQL 4 is at this.

    like how long would it take in Sybase, SQL Server, Oracle, etc. with FULLTEXT-type search features?

    5, 10, 30 seconds per n posts or what?

    thanks again.

  9. #9 MattR
    Sybase ASE uses a third-party product called 'Verity' -- Oracle uses it as well. At least with ASE it is a separate executable program -- that means it can be placed on its own machine to improve performance.

    I have it set up on a small machine (PIII850, 256MB RAM) -- it performs decently for such a small machine. I'll see if I can't get some benchmarks tonight or sometime this weekend.

    I don't have any MySQL data. I would be interested on any findings you have, and I would be willing to perform tests on like data if you have a bunch of text laying around. I have 5 million rows (10GB) on ASE but would not be able to share it without some sort of an NDA.

  10. #10 DR_LaRRY_PEpPeR
    thanks again for your reply. i really appreciate your comments and help on this stuff!


    Originally posted by MattR
    Sybase ASE uses a third-party product called 'Verity' -- Oracle uses it as well. At least with ASE it is a separate executable program -- that means it can be placed on its own machine to improve performance.
    right, i remember you saying that. i still was wondering how it performs -- 3rd-party or not.


    I have it set up on a small machine (PIII850, 256MB RAM) -- it performs decently for such a small machine. I'll see if I can't get some benchmarks tonight or sometime this weekend.


    that system is about like mine that i mentioned above -- 2x PIII 800, 384 RAM. of course only one CPU gets used by one MySQL thread...

    cool if you can get benchmarks.


    I don't have any MySQL data. I would be interested on any findings you have, and I would be willing to perform tests on like data if you have a bunch of text laying around. I have 5 million rows (10GB) on ASE but would not be able to share it without some sort of an NDA.
    NDA? do you mean to like send me some of your posts to test with? nah, don't worry about that. i couldn't get much on my 56k anyway. i wouldn't want to download more than about 100M of zipped dump. although i did download 1.7GB of Red Hat ISOs a couple months ago on 56k. but yeah...

    what would be cool is if you could install MySQL 4 and try it with some of your Sybase data/posts. i don't know how hard it is to get it into MySQL, though.

    i said above how i was trying to test with 250k posts, which was only 25 made up "posts" inserted 10k times -- not very real-world. it seemed to be kinda slow, but i imagine that's because it was finding so many posts. if you searched for a word(s) that didn't exist, it returned very quickly.

    i was trying to figure out how to make a bunch of random "real-world" posts to test with. so i think i might find some large pieces of text and put them in a text file, then put them in a PHP array and grab random ones till i make a "post" 500-1100 characters long (don't know how long the average forum post is, but that seems good). it seems like the best i can do with fake "posts."

  11. #11 mr tinkles
    great, data warehousing, content management, third-party sw... i find this amusing...

    I have a huge DB, so i gotta use something else to search?

    how do you think the 'other guys' accomplish it?

    if you truly want to search for a small needle in a huge haystack, you have some choices:

    make bigger/better needles -- easier to find, but maybe not the original needle you were looking for...

    make smaller haystacks -- lose some hay, which may include the needle...

    burn the haystack -- easy to find the sooted needles

    crunchy fried mice!, lol, sorry.

    there are tips and tricks, but it basically comes down to: you have 'data', and you have something to search for.

    if (!DB handles search)
    {
        do 'other guys' solution
    }
    else
    {
        return (search results)
    }

    OR

    if ('other guys' solution)
    {
        // hope it works
    }
    else
    {
        return (search results)
    }

  12. #12 DR_LaRRY_PEpPeR
    i've been doing some testing. i will try to, in the next couple days, post what i've learned about the FULLTEXT searching if you're interested (well, i'll post it anyway). i must say i'm pretty impressed with it! although i'm going to need to do certain things like limit the posts it returns to something like 1000 and implement a work-around for things like +required +words (even in 4.0; kinda like what i referred to above for 3.23) to keep the performance and results good.

    without going into details yet, i got a million made-up posts in a table and it's returning 1000 relevance-sorted results for any search in under 10 seconds most of the time when it has to read the index from disk. after the disk reads, the same searches take less than 1-2 seconds 99% of the time.

    i have 3 million posts in a table now to try with later. i'm trying to build the FULLTEXT index but it's failed 3 times 'cause it says there's no space left on one of the partitions. oops. there looks to be plenty of space though... i moved some things around and hopefully it will work this time. i hope it'll perform as well with 3 mil. *crosses fingers*

    BTW Matt, i assume your 5m rows are FanHome's posts? what is the average length of them -- AVG(LENGTH(post))? just wondering.


    edit: FULLTEXT creation on 3m failed again. i'll figure out some way to get it to work, even if i have to delete 500,000 rows or something.
    Last edited by DR_LaRRY_PEpPeR; Feb 22, 2003 at 18:06.

  13. #13 DR_LaRRY_PEpPeR
    i have a report about the FULLTEXT stuff now. i don't know if you'll reply Matt, but i'd love to hear any comments. or if you have any benchmarks from other DBs, etc.

    i finally got 3 million posts in a table. i didn't realize how much space was needed for temporary files when building the index, which is why i was getting "No space left" errors...

    i was using MySQL 3.23.55 and 4.0.11. hard drives are 15GB IBM 75GXP 7200 RPM ATA-100.

    i only tested with 250,000 posts again in 3.23 because it took almost 10 hours (!) to build the index. i tried it a second time after increasing myisam_sort_buffer_size from the default 8M to 64M (or was it 32?), and it finished in 7-something hours. i don't know if it was faster because of the larger buffer or if it's just because i wasn't doing as much other stuff while it was running. i wasn't gonna wait who knows how long for it to do a million or more.

    MySQL 4 indexed the same 250k in < 4 minutes; big difference!

    where i mentioned testing with 250k above, it only took 3.23 ~25 mins to index. i guess that's because there were only 25 different "posts" and much fewer total/unique words to index.

    250k table was 336MB in 3.23 (161 .MYD; 175 .MYI). 4's .MYI was 123MB.

    and with 4.0 only: 1 mil. posts = 1.1GB (640/492 .MYD/.MYI) and 3 mil. posts = 3.36GB (1.86/1.5 .MYD/.MYI). 1 mil. took 20 mins. to index and 3 mil. completed in 1h 5m (12-15 of that was making a copy of the MYD file). do those times sound OK?

    one more thing about the temp file sizes used during FULLTEXT index (re)creation -- they're HUGE! indexing 1 mil. used two 1.36GB files. 3 mil. failed because it hit the 4GB file limit on the FAT32 temp partition. after doing a little research, i lowered the ft_max_word_len_for_sort variable from the default 20 to 15. this let the 3 mil. complete with "only" two 3.57GB files.
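
    (for reference, i believe the my.cnf line for that looks like this, with the old set-variable syntax:)

    Code:
    # my.cnf, under [mysqld] -- 3.23/4.0-era syntax
    set-variable = ft_max_word_len_for_sort=15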

    i made the posts with some text i gathered from different places plus some from the dictionary. up to a million posts, the words file was 300-something K with about 14k unique words. after a million, i added some more for ~488K and 20k unique. after 2.25 million i brought it up to 524K with 24.9k unique.

    with so few unique words for that many posts, there are obviously a lot more duplicates than in real-world posts, which will make more posts be found when searching (e.g. slower). but that's good for testing the "worst case," right?

    the words were picked randomly from the file until the post was n characters long, where n is a random number between about 250 and 1050. that made the posts an average of 645 chars long:

    Code:
    mysql> SELECT COUNT(*), AVG(LENGTH(post)) AS postlen FROM post;
    +----------+----------+
    | COUNT(*) | postlen  |
    +----------+----------+
    |  3000000 | 645.1696 |
    +----------+----------+
    1 row in set (2 min 19.62 sec)


    like i said, i don't know what the average length of real posts is, but that seems good.

    making/inserting posts seems to be a little slow. maybe that's because it has to compute the relevance with so few unique words. with the FULLTEXT index in place (i inserted most of the posts w/o the index for speed), inserting a single post takes anywhere from 300-800ms (it's about the same with 250k in 3.23 as 3 mil. in 4.0). of course it's a bit faster with fewer posts.

    i tried enabling delay_key_write on the table in 4.0 (where indexes aren't written after each query) and using a 64MB key_buffer. this made inserts of a single post about 10x faster. but i quickly discovered that i don't think i want to use delay_key_write in the real world. besides the greater chance of index corruption from the unflushed keys, i noticed that the post's [large fulltext] key blocks are held in the key_buffer (which i knew, but i thought they always were even if written right away). this means that key blocks from/for reads are forced out of the buffer. so key reads could be slower because 1) fewer hits in the cache and 2) the unwritten blocks may need to be written out to make room for the new blocks (e.g. writes to do reads!). my theory anyway, without knowing too much about it.
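
    (for reference, turning it on is just a table option -- the key buffer itself i set in my.cnf with set-variable = key_buffer=64M:)

    Code:
    ALTER TABLE post DELAY_KEY_WRITE = 1;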


    OK, now that you know the background, let's get to the actual searching. i did most search tests in 4.0 with 1 and 3 mil. posts (not really much speed diff that i noticed with 1 vs 3). with the little bit that i tried with 250k in 3.23, i noticed it was a bit slower than with millions in 4.0 on searches that return a lot of results.

    up until i did this testing, i thought MySQL 4 would be best so i could use IN BOOLEAN MODE searches. well, 4 still is best for the speed improvements, but i don't think it's a good idea to use IN BOOLEAN MODE (except maybe in some certain cases).

    note: in example queries, i'm just going to use 1, 2, 3, etc. to represent different words instead of actual words.

    as i guess you know, when not using IN BOOLEAN MODE, all words are optional, so posts won't necessarily, and probably won't, have all words that were searched for. posts are just automagically sorted with the highest relevance ones first. previously, i thought i was going to add a + in front of all words and use IN BOOLEAN MODE so all words would be found unless the search terms were specified otherwise. i thought i'd work around 3.23's lack of boolean searches with WHERE MATCH(post) AGAINST('1 2') AND post LIKE '%1%' AND post LIKE '%2%'.

    after testing, i'm sure i'm now gonna let all words be optional unless specified otherwise and not use a boolean search. the reason is that MATCH(post) AGAINST('1 2 3') LIMIT 1000 performs very well, with results automagically sorted by relevance -- which should return the most relevant posts even if they don't have every word, right?

    the problem with boolean searches is that results aren't automagically sorted by relevance. even if a post has all words (e.g. +1 +2 +3 in the query), its relevance will still differ from other posts because of how many times the word(s) appear, the length of the post, etc. so i'd still like them to be sorted by relevance. since it doesn't happen automatically, i'd have to use something like

    Code:
    SELECT postid, MATCH(post) AGAINST('1 2 3') AS rel FROM post WHERE
    	MATCH(post) AGAINST('+1 +2 +3' IN BOOLEAN MODE)
    	ORDER BY rel DESC LIMIT 1000
    but that uses filesort, which wouldn't be that bad if it only found a few posts, but much worse for many.

    now i was faced with how to deal with searches like 1 +2 3 where some words are required. well, i didn't know this before, but i discovered that you can specify a word multiple times in AGAINST() and it will give that word higher relevance! so i could look for words with a + prefix in PHP and repeat them in the query to bring posts that have the required word(s) "to the top" with a non-boolean search. i then have 2 options to check whether the required word is in a post:

    1) do the WHERE MATCH(post) AGAINST('1 2 2 3') AND post LIKE '%2%' LIMIT 1000 query. the problem with this, i discovered, is that if there are less than 1000 posts that match the LIKE, the query will run until it's found all posts that contain 1 or 3 and failing at the LIKE each time. this will of course become worse with more individual words in the search.

    2) use these queries

    Code:
    INSERT INTO searchresult (postid, rel) SELECT postid, MATCH(post) AGAINST('1 2 3') AS rel FROM post
    	WHERE MATCH(post) AGAINST('1 2 2 3') LIMIT 1000;
    
    SELECT p.postid FROM searchresult sr, post p WHERE p.postid=sr.postid AND p.post LIKE '%2%'
    	ORDER BY sr.rel DESC
    the first query should only look at 1000 posts even if only 1 contains word 2. the one(s) with word 2 should be "at the top" since it was listed twice. then in the second query, just scan those explicit postids that were found making sure they have word 2. i think this method should be faster than the first in most cases.
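
    building that from the user's input in PHP would go something like this (a sketch; $search is the raw query):

    PHP Code:
    // repeat each +word in AGAINST() so posts containing it float to
    // the top, and remember it for the LIKE check in the second query
    $against  = array();
    $required = array();
    foreach (explode(' ', $search) as $word) {
        if ($word == '') {
            continue;
        }
        if ($word[0] == '+') {
            $word       = substr($word, 1);
            $required[] = $word;
            $against[]  = $word;  // listed twice = higher relevance
        }
        $against[] = $word;
    }
    $against = addslashes(implode(' ', $against));

    $likes = '';
    foreach ($required as $word) {
        $likes .= " AND p.post LIKE '%" . addslashes($word) . "%'";
    }
    // run the INSERT INTO searchresult ... AGAINST('$against') LIMIT 1000
    // query first, then append $likes to the WHERE of the second query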

    i'm not exactly sure what to do with multiple words that are all required. hmm... that would get slower the more words there are. maybe i'd use the second method above but with a larger LIMIT to increase the possibility that posts with all words would be found. then again, maybe the posts (if any) with all words would have the highest relevance anyway? i'm not sure if MySQL gives "more individual words found" higher relevance than "more instances of one word." i would think/hope that it would. or... then again, maybe i should use IN BOOLEAN MODE in 4. but i don't know that it would be any faster -- again, especially as there are more words. from what i saw, it seems to have to check the same way i am with the LIKE. and the boolean search has the filesort problem in the case that lots of posts would have all the words. *sigh*


    i don't know if you know this, but MySQL will always pick the FULLTEXT index when you're using MATCH() in the WHERE, unless you also have WHERE unique_col=const in which case it will use the unique_col index. e.g.

    Code:
    -- this will use the postid index
    ... WHERE MATCH(post) ... AND postid=123

    -- this will use the FULLTEXT index, which is CRAZY!
    ... WHERE MATCH(post) ... AND postid IN (123, 456)
    the second will find all posts that match the search and then see if it matches the postid! this is almost certainly much slower than the other way around.

    the reason i mention this is for searching for posts by a certain user or something. instead of putting the MATCH() in the WHERE, which would unnecessarily find posts by all other users, this query would be much faster:

    Code:
    SELECT postid, MATCH(post) AGAINST('1 2') as rel FROM post
    	WHERE userid=123 HAVING rel > 0
    	ORDER BY rel DESC
    i face the same, but harder, choice of how to search an individual forum, for example. do i MATCH() in the WHERE and then see if the forumid matches, or scan all posts in the forum and check their relevance with HAVING? (BTW, i'm not storing the forumid in the post table, so it would have to go through the thread table.) i guess i'll have to check the forum's # of posts vs the total post # and figure out which would be faster... or something.
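
    (the "scan all posts in the forum and check their relevance" version would look something like this -- forumid/threadid columns assumed:)

    Code:
    SELECT p.postid, MATCH(p.post) AGAINST('1 2') AS rel
    	FROM thread t, post p
    	WHERE t.forumid=5 AND p.threadid=t.threadid
    	HAVING rel > 0 ORDER BY rel DESC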


    oh, i forgot to mention the search times for plain, non-boolean searches, but they're about what i said in my last post. even on the 3 million posts, 98% of searches take < 10-12 secs. to return 1000 postids when it has to read from disk. < 0.5-1.5s for most after that. i think that's pretty good, considering my small amount of RAM and IDE HD. as long as people do just plain searches, which i'm guessing most searches are (?), it should perform fairly well.

    one last thing: i noticed that with regular searches where all words are optional and you use LIMIT n, it doesn't really make much difference speed-wise if you search for 2 words or 20. that was kinda surprising.


    sorry for the long post. i wanted to "scribble down" what i could remember about my findings. i learned more than i knew a couple weeks ago. maybe you learned a couple things about FULLTEXT searching too.

    like i said, i'd love to hear any comments from MattR or anybody that sees this!
    Last edited by DR_LaRRY_PEpPeR; Mar 2, 2003 at 20:07.

  14. #14 archigamer
    nice post, you should hand it to a college professor that teaches database administration

  15. #15 MattR
    Verrryyy interestink. Ok, now you made me set up NFS so that I can get extra disk space and work on setting up a suitable test.

    Couple of things:
    Performing a frequency analysis on the data to determine what words to search for (e.g. seeing how long it takes to search for the most/least/average-frequency word) would be worth a bit more effort -- unless most vs. least common shows no performance delta.

    I would think dups would be better since the index can account for that and would be more compact. Maybe I'm wrong?

  16. #16 DR_LaRRY_PEpPeR
    Quote Originally Posted by archigamer
    nice post, you should hand it to a college professor that teaches database administration
    ha ha, thanks.

    it's fun to experiment with this stuff and learn new things.


    Quote Originally Posted by MattR
    Verrryyy interestink. Ok, now you made me set up NFS so that I can get extra disk space and work on setting up a suitable test.
    cool. so you'll get some benchmarks or something? it'd be cool to see how MySQL 4 does with your test data like i said. but that's probably too much trouble for you...


    Performing a frequency analysis on the data to determine what words to search for (e.g. seeing how long it takes to search for the most/least/average-frequency word) would be worth a bit more effort -- unless most vs. least common shows no performance delta.
    well, since there aren't that many unique words, almost anything that returns something returns lots. however, things that return less than 1000 results from the 3 mil. posts are very fast. and of course ones that return nothing are instant. searches that would return the most without the LIMIT -- multiple common words -- still complete in 10-12 secs. or less when reading from disk.


    I would think dups would be better since the index can account for that and would be more compact. Maybe I'm wrong?
    that's what i'd think too, but judging from my previous tests above a couple months ago with only 25 different "posts" inserted 10k times for 250k rows, the index was basically as large as this round (though it took much less time to build)...

    if there's anything else you'd like to know, i'll try to find out/try it.



    anyway, forgot to say last post (did you guys actually read all of it?), i don't think MySQL (especially 4) will have much trouble handling 5 million or more posts during regular stuff. it's the darn searching that i'm concerned about! but i'm glad 4 looks good so far with my workarounds.

    i need to test whether, in multi-word searches, it gives higher relevance to "more unique words found" instead of "more instances of a single word."

    i'm also not sure if/how i'd do wildcard* searches. i guess i could use boolean mode for those if i did them since it should be faster than a scan with LIKE.


    i also wish there was a way to make FULLTEXT creation faster in 3.23 since it looks painfully slow once you get up to 100k posts. but hopefully 4 will be officially stable soon and i won't need to really worry about 3.23. but, still...

    i dumped a 100k post table to see how long it would take to restore and what could make it faster. in 3.23, it looks like key_buffer is the main thing. with the default 8M, the restore was maybe 75-80% done after an hour. i stopped it then (probably would've took 2h or so) and upped the key_buffer to 32M. this let it restore in less than an hour.

    in case you didn't know, the key_buffer matters for these INSERTs because they have LOCK TABLE/UNLOCK TABLE around them so the indexes aren't flushed after each query (like delay_key_write). so if it can hold more key blocks in the buffer = less writes = faster.

    it was inserting about 1,400 rows per INSERT and when it first started (with 32M key_buffer) an INSERT would take 12-15s or less. by the time it got to the end, they were taking more than 1m 30s. so you can see it really slows with more rows.

    i'm just speculating here, but i don't think key_buffer matters for restoring a --opt dump in 4.0+. that's because it adds ALTER TABLE table DISABLE KEYS and ALTER ... ENABLE KEYS around the INSERTs which disables indexes to make the INSERTs faster (so keys wouldn't even go into the key_buffer). the ALTER ... ENABLE KEYS then rebuilds the table and creates the index(es). so for this, myisam_sort_buffer_size would i guess be the determining variable.
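
    (i.e. the dump wraps the bulk INSERTs roughly like this:)

    Code:
    ALTER TABLE post DISABLE KEYS;
    -- the big multi-row INSERT statements go here
    ALTER TABLE post ENABLE KEYS;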

    of course the fastest way to backup/restore in either version is with LOCK TABLES + FLUSH TABLES and copy the files raw (e.g. mysqlhotcopy).
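
    e.g. something like this (which is basically what mysqlhotcopy automates):

    Code:
    LOCK TABLES post READ;
    FLUSH TABLES post;
    -- now copy the table's .frm/.MYD/.MYI files at the file system level
    UNLOCK TABLES;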


    one last thing about the differences between 3.23 and 4.0 when creating (e.g. ALTER/REPAIR TABLE, not reg. INSERTs) FULLTEXT indexes. PROCESSLIST with 3.23 shows "Repair with keycache" whereas 4.0 shows "Repair by sorting," which is supposed to be "up to 100 times faster."

    3.23 doesn't use the huge temporary files that i mentioned when creating the index. i guess this is because of the Repair with keycache -- which seems to work directly on the .MYI file? i don't know. i don't see any copy or temp files during this.

    from what i understand, 4.0 uses the slower Repair with keycache method for words longer than ft_max_word_len_for_sort after it has done the shorter ones with the fast Repair by sorting.


    sorry if that's useless info, but creation time is important too, not just searching time. and i just wanted to post it in case it helps anyone.
    Last edited by DR_LaRRY_PEpPeR; Mar 4, 2003 at 04:16.

  17. #17 MattR
    So, how long does it take to insert a row into a full-text indexed table? That needs to be considered as well. On our peak days we were certainly having upwards of several hundred posts inserted per second. Each one taking 12 seconds to go in would kill it.

  18. #18 DR_LaRRY_PEpPeR
    people were posting several hundred per second?! i can't believe that. that'd be hundreds of thousands of posts per day. even with more than 1000 online, i think 100 or a few more in a minute would be a lot?

    anyway, the 12 seconds in my last post (if that's what your "12" is referring to?) was about 3.23 taking that long to INSERT about 1400 rows when restoring the dump first started (with the FULLTEXT index in place). when it got up to 100,000 (in 3.23), it was taking almost 2 mins to insert the same ~1400 rows.

    for inserting single posts, i referred to those times in the first long post a couple days ago. inserting into a table with FULLTEXT index in 3.23 with 250k rows or in 4.0 with 3m rows takes anywhere from 0.3 to 1s for a 900 character long post (major disk access during this). i said that seemed a bit slow, although i'm thinking it's because there are so few unique words and it has to compute those words' relevance in the new post...?

    using delay_key_write on the post table makes INSERTs 10x+ faster, which is nice. but i don't know if i like that those new key blocks take space in the key_buffer.

    i guess INSERT DELAYED is another possibility, since it returns very quickly and during heavy posting multiple rows could be written at once. the problem with DELAYED is i'm afraid it might take too long to insert if there's non-stop SELECTs running the post table.

  19. #19 MattR
    100s a sec is probably too much. Although it did get pretty hairy -- but obviously it was not sustained for any substantial length of time. For example when a team would be playing the users would have game threads in which they would post a play-by-play. So you have a bunch of users typing in posts quite rapidly. Multiply that by the large numbers of games going on in virtually every sport and it adds up.

    More accurately it would be tens per second at peak. Averaged over the life of a day I think our rate was one post every 5 to 7 seconds. But generally the traffic patterns were that there were large numbers of posts in a short time span, and then several hours would go by (from say 1AM EST to 6AM EST) in which posts would rack up one every 30 seconds or more.

    It would be interesting if you could get a few concurrent threads going inserting at random to simulate real-world posting habits. Nothing major, but say insert 100 total posts in a minute from 10 random threads (e.g. create a PHP page that does 10 inserts with a random delay from 0-6 seconds in between inserts then kick that off 10 times) to see what kind of delays we're talking about (if any).

    That's probably a little above and beyond considering the work you've already done, so if you want me to sod off that's understandable.
    Last edited by MattR; Mar 5, 2003 at 02:35.

  20. #20 DR_LaRRY_PEpPeR
    Quote Originally Posted by MattR
    100s a sec is probably too much.



    Quote Originally Posted by MattR
    That's probably a little above and beyond considering the work you've already done, so if you want me to sod off that's understandable.
    not at all. i love testing this stuff. i'll try that what you said tomorrow or something. good idea.

    i think you've gone above and beyond with your comments/advice in this thread! i really appreciate it. hopefully my previous couple posts weren't too terribly long for you to read.


    i realized after i posted before that i wouldn't be able to use INSERT DELAYED anyway because i need to be able to use mysql_insert_id().


    also forgot to say, the other day when i was inserting posts (2 or 3 scripts w/ 500 row INSERTs), i was doing searches at the same time to see how much the INSERTs slowed it down (not much at all) and i think i might've found a bug in the FULLTEXT code or something. i was doing a search for the same word each time and every 10 secs. or so i would get "error 127 from table handler" which means "Record-file is crashed." however, after a couple seconds, it would work again, BUT with at least another row found for that word. so something was going wrong when searching for a word that's being inserted at the same time (delay_key_write was on, which could've revealed this bug, i guess).

    yesterday i noticed the error log full of these entries:

    Code:
    030228 17:21:45  ft_read_next: Got error 127 when reading table ./forum/post
    030228 17:21:45  ft_read_next: Got error 127 when reading table ./forum/post
    030228 17:21:46  ft_read_next: Got error 127 when reading table ./forum/post
    030228 17:21:47  ft_read_next: Got error 127 when reading table ./forum/post
    030228 17:21:48  ft_read_next: Got error 127 when reading table ./forum/post
    030228 17:24:02  ft_read_first: Got error 127 when reading table ./forum/post
    030228 17:24:03  ft_read_first: Got error 127 when reading table ./forum/post
    030228 17:24:04  ft_read_first: Got error 127 when reading table ./forum/post
    030228 17:24:05  ft_read_first: Got error 127 when reading table ./forum/post
    030228 17:24:06  ft_read_first: Got error 127 when reading table ./forum/post
    030228 17:29:25  ft_read_next: Got error 127 when reading table ./forum/post
    030228 17:52:41  ft_read_next: Got error 127 when reading table ./forum/post
    definitely something went wrong in the FULLTEXT calls... weird.
    Last edited by DR_LaRRY_PEpPeR; Mar 5, 2003 at 03:56.

  21. #21 DR_LaRRY_PEpPeR
    all right, i tried your suggestion, with the usual 3.23 + 250k and 4.0 + 3m.

    i made the PHP script do 10 inserts of 900 character posts, with a sleep(mt_rand(1, 6)) in the for () loop. i started 10 simultaneously with ApacheBench (ab -n10 -c10 ...). at the end, the PHP script inserted the time for each query into a table so i could do an AVG(time), MIN(time), MAX(time) query on the 100 inserts' times.
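
    the guts of the script were basically this (a reconstructed sketch -- make_post() is a made-up helper that returns a 900 character post):

    PHP Code:
    // classic PHP 4 microtime() wrapper
    function microtime_float() {
        list($usec, $sec) = explode(' ', microtime());
        return (float) $usec + (float) $sec;
    }

    $times = array();
    for ($i = 0; $i < 10; $i++) {
        sleep(mt_rand(1, 6));
        $post  = addslashes(make_post());  // made-up helper
        $start = microtime_float();
        mysql_query("INSERT INTO post (post) VALUES ('$post')");
        $times[] = microtime_float() - $start;
    }

    // log each query's time so AVG()/MIN()/MAX() can be run on them later
    foreach ($times as $t) {
        mysql_query("INSERT INTO insert_times (time) VALUES ($t)");
    }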

    the 100 actually got done in 40-50 secs. most of the time instead of 60. up to 4-6 INSERTs were running at the same time in PROCESSLIST.

    when i first started inserting into the 3m post table, the queries were taking 4 seconds on average (with a max of about 8)! i thought, ouch, that's slow! however, i had just finished making a copy of the table's 3GB+ of files so none of the other files were cached by the OS anymore.

    after a few rounds, the average was down to under 1 second. after a few more, the average was down to 0.4-0.6s (min of ~0.05/max of 1.5-2s).

    after yet more rounds, the best averages were about 0.3-0.5 (min of ~0.035/max of 1-1.6s).

    the fastest single, non-concurrent inserts were consistently 0.035-0.05s, which was the lowest i've seen.

    i don't remember exactly, but 3.23 w/ 250k posts had about the same results.

    BTW, default settings were used -- way-too-small 8M key_buffer and no delay_key_write.

  22. #22 DR_LaRRY_PEpPeR
    it looks like Yahoo! guy Jeremy Zawodny has discovered the wonders of full-text search. i posted a comment there with a link to this thread...


    Matt, you mentioned "setting a suitable test up." so are you still going to have some results with other DBs (Sybase, i guess)? those would be great to see. not bugging ya, just wondering.


    got a little bit more stuff too:

    forgot to mention in previous posts that even if inserting posts seems a little slow, vBulletin's method is probably about the same (or worse) overall, right? instead of one INSERT like mine, they have to manually split the post's words apart, look up the wordid of each word, and insert the wordid + postid combinations into the searchindex table, etc. -- roughly the flow sketched below.
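    (schematically, per word of a new post -- the names are modeled on vB's searchword/searchindex tables, not their exact schema:)

    Code:
    CREATE TABLE searchword (
        wordid INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        word   VARCHAR(50) NOT NULL,
        UNIQUE (word)
    );
    CREATE TABLE searchindex (
        wordid INT UNSIGNED NOT NULL,
        postid INT UNSIGNED NOT NULL,
        PRIMARY KEY (wordid, postid)
    );

    -- then, for every word in a new post:
    INSERT IGNORE INTO searchword (word) VALUES ('fulltext');
    SELECT wordid FROM searchword WHERE word = 'fulltext';
    INSERT IGNORE INTO searchindex (wordid, postid) VALUES (123, 456789);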


    also was doing some testing yesterday to see how MySQL ranks multi-word searches: do n occurrences of a single word get the same relevance as 1 occurrence each of n words? thankfully, the answer seems to be "no."

    making these posts (remember, each number represents a word...)

    • 1 1 1 1
    • 2 2 2 2
    • 3 3 3 3
    • 2 3 1


    and using MATCH(post) AGAINST('1 2 3') gave highest relevance to the post that contained all 3 words.
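    here's roughly what that test looks like in SQL, with real words standing in for the numbers (single digits are shorter than ft_min_word_len, so the indexer would toss them):

    Code:
    CREATE TABLE reltest (
        post TEXT NOT NULL,
        FULLTEXT (post)
    );
    INSERT INTO reltest VALUES
        ('alpha alpha alpha alpha'),
        ('beta beta beta beta'),
        ('gamma gamma gamma gamma'),
        ('beta gamma alpha'),
        ('some other words entirely');  -- filler so no word hits the >50% stopword cutoff

    SELECT post, MATCH(post) AGAINST('alpha beta gamma') AS relevance
    FROM reltest ORDER BY relevance DESC;
    -- the post with all 3 distinct words ranks highest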


    finally, i discovered another possible bug yesterday when doing searches while inserting posts. about 3 out of 4 searches on particular words ran fine (speed-wise) while the inserts were being done. the other 25% or so, however, would hang -- mysqld using all CPU with "FULLTEXT initialization" in PROCESSLIST -- for about 30 seconds before returning. these were searches that took < 1 sec. otherwise. just changing the search slightly, while keeping some of the same words, could trigger or avoid the behavior (no pattern to it, but the same searches were always either fast or slow). again, weird. nothing in the error log this time, though...
    Last edited by DR_LaRRY_PEpPeR; Mar 11, 2003 at 04:19.

  23. #23
    Database Jedi MattR's Avatar
    Join Date
    Jan 2001
    Location
    buried in the database shell (Washington, DC)
    Posts
    1,107
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Sorry, been on vacation the past week. Once I get over the jetlag I'll post what I found.

  24. #24
    Making a better wheel silver trophy DR_LaRRY_PEpPeR's Avatar
    Join Date
    Jul 2001
    Location
    Missouri
    Posts
    3,428
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    well... while waiting for those findings, i'll post something else i just found out.

    Matt, i assume you're familiar with Infopop's hosted OpenTopic forum? the one that has "no growth limits," costs, what, $1,000s per month, and uses an Oracle backend?

    i was just looking around their site last night and then went to Ars Technica's OpenTopic forum to see how many posts they had (i counted 4,582,315 from the forum list), and i was wondering what OpenTopic's search feature was like and how it performed on Ars. well, a thread titled Search Being Rebuilt just happened to be the last thread in a forum. i was like, how does a search index need to be "rebuilt" in Oracle? is it using a searchindex-type table like vB?

    when i got to the bottom of page 3, i couldn't believe what i read! they are using MySQL's full-text search! can you believe that? Oracle for everything else, and MySQL for searching. sounds like it might be a new thing that they're switching to, but still. and they were even using it with MySQL 3.23, i guess, until they were told to upgrade to 4.0.

    i did a couple searches and it was kinda slow... like more than 30 secs. or so. maybe they're not using the workarounds i've discovered.


    regarding my setup... i've been thinking about maybe taking any quoted post text out of posts and storing it in a separate "quotes" column or such, then reassembling the quote(s) into the correct place on display. i don't think either part would be too hard to do. this would of course prevent text that was already in a previous post from being indexed again, making the part that has to be parsed and indexed smaller, which would make the index smaller (and INSERTs faster), which should make searches faster (and seemingly more accurate, with less duplicated text).

    understand what i mean? sound like a good idea?

  25. #25
    Database Jedi MattR's Avatar
    Join Date
    Jan 2001
    Location
    buried in the database shell (Washington, DC)
    Posts
    1,107
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by DR_LaRRY_PEpPeR
    when i got to the bottom of page 3, i couldn't believe what i read! they are using MySQL's full-text search! can you believe that? Oracle for everything else, and MySQL for searching. sounds like it might be a new thing that they're switching to, but still. and they were even using it with MySQL 3.23, i guess, until they were told to upgrade to 4.0.
    Interesting. I could think of a few reasons for using MySQL, cost being one of them. Oracle's interMedia (the part of the DBMS that has the full-text dealie) is/was a very expensive option to license. Infopop may be doing 'Oracle-on-the-cheap' to give the impression of a large-scale implementation, even though it could be an older version of Oracle that may not even be able to run interMedia.

    Also, it could be very complex to set up and maintain, whereas MySQL's is relatively easy to poke at.

    I'm wondering, though, if putting up with the replication delay of posts from Oracle to MySQL (e.g. stale indexes) is worth it. interMedia supports way more search capabilities than MySQL -- for instance, if a post has a PDF attached, interMedia could index that as well.

    Quote Originally Posted by DR_LaRRY_PEpPeR
    regarding my setup... i've been thinking about maybe taking any quoted post text out of posts and storing it in a separate "quotes" column or such, then reassembling the quote(s) into the correct place on display. i don't think either part would be too hard to do. this would of course prevent text that was already in a previous post from being indexed again, making the part that has to be parsed and indexed smaller, which would make the index smaller (and INSERTs faster), which should make searches faster (and seemingly more accurate, with less duplicated text).

    understand what i mean? sound like a good idea?
    I think I see what you are doing... So you'd have to store the quoted post_id somewhere and then join it back to the post table to get the matching posts?

    e.g.
    post( post_id, text )
    quoted_post( post_id, quoted_from_id, text )
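    In concrete (My)SQL terms, maybe something like this -- just a sketch, with guessed-at types:

    Code:
    CREATE TABLE post (
        post_id INT UNSIGNED NOT NULL PRIMARY KEY,
        text    TEXT NOT NULL,
        FULLTEXT (text)  -- quoted text no longer pollutes this index
    );
    CREATE TABLE quoted_post (
        post_id        INT UNSIGNED NOT NULL,  -- post the quote appears in
        quoted_from_id INT UNSIGNED NOT NULL,  -- post it was quoted from
        text           TEXT NOT NULL           -- the quoted text itself
    );

    -- display: pull the body plus its quotes, then splice the quotes
    -- back in wherever the body's placeholders sit
    SELECT p.text AS body, q.quoted_from_id, q.text AS quoted
    FROM post p
    LEFT JOIN quoted_post q ON q.post_id = p.post_id
    WHERE p.post_id = 12345;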

    The problem would be setting up the quoting mechanism so that you knew what the original post was. If I make a quote by hand or one that covers several posts you'd have a bit of a problem I would think.

    Then, of course, you'd need to have a way of stripping out the quote tags, inserting some system-specific method of indicating where a quote should go, and then dynamically reassembling on the fly. And you'd have to make it work for many quotes.

    I would think that would be very expensive to perform... thereby killing your database's cache, since you can't linearly cache a thread (if you use a clustered index you can organize posts by thread and post time so your DBMS can perform a sequential read).

    In any case, I think the search gain would be overwhelmed by the loss in post display speed.

