I need to create a large-scale search engine. If anyone has any ideas, pointers or tips, they would be greatly appreciated. If any samples are available, that would be nice :P. It's essentially for an internal application that needs to extract data from a number of tables in a MySQL database. Ideally it would be something like vBulletin's search functionality.
Full-text search is what most forums do, isn't it? I believe vB caches searches too, though I'm not 100% sure. HEAP tables are always fast, so if a phrase has been searched recently, serving it from a cache will be a lot faster and less server-intensive. For example, have a phrases table holding the list of searched phrases, plus a third table, phrases2posts, with the columns id, phraseid and postid, where phraseid references the phrases table and postid references the posts table. I'm not sure what kind of cache time frame you'd go by, but a cron job would probably be your best bet for truncating that table.
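A rough sketch of that phrase cache, for illustration only. It is written in Python with an in-memory SQLite database standing in for MySQL, and a plain LIKE query standing in for MySQL's full-text search. The table names phrases and phrases2posts follow the description above; the cached_at column, the 15-minute lifetime and the function names are assumptions added for the example, not anything vBulletin is known to do.

```python
import sqlite3
import time

# Assumed cache lifetime; the thread does not settle on a time frame.
CACHE_TTL_SECONDS = 15 * 60


def open_db():
    """Create the posts table plus the two cache tables described above."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE posts (id INTEGER PRIMARY KEY, body TEXT NOT NULL);
        CREATE TABLE phrases (
            id INTEGER PRIMARY KEY,
            phrase TEXT UNIQUE NOT NULL,
            cached_at REAL NOT NULL      -- hypothetical column used for expiry
        );
        CREATE TABLE phrases2posts (
            id INTEGER PRIMARY KEY,
            phraseid INTEGER NOT NULL REFERENCES phrases(id),
            postid INTEGER NOT NULL REFERENCES posts(id)
        );
    """)
    return db


def search(db, phrase, now=None):
    """Return (post_ids, served_from_cache) for a phrase."""
    now = time.time() if now is None else now
    key = phrase.strip().lower()
    row = db.execute(
        "SELECT id, cached_at FROM phrases WHERE phrase = ?", (key,)
    ).fetchone()
    if row and now - row[1] < CACHE_TTL_SECONDS:
        hits = db.execute(
            "SELECT postid FROM phrases2posts WHERE phraseid = ? ORDER BY postid",
            (row[0],),
        )
        return [r[0] for r in hits], True
    if row:
        # Stale entry: drop it before recomputing.
        db.execute("DELETE FROM phrases2posts WHERE phraseid = ?", (row[0],))
        db.execute("DELETE FROM phrases WHERE id = ?", (row[0],))
    # The expensive query. LIKE is only a stand-in for a real full-text search.
    post_ids = [
        r[0]
        for r in db.execute(
            "SELECT id FROM posts WHERE lower(body) LIKE ? ORDER BY id",
            ("%" + key + "%",),
        )
    ]
    cur = db.execute(
        "INSERT INTO phrases (phrase, cached_at) VALUES (?, ?)", (key, now)
    )
    db.executemany(
        "INSERT INTO phrases2posts (phraseid, postid) VALUES (?, ?)",
        [(cur.lastrowid, pid) for pid in post_ids],
    )
    return post_ids, False


def purge_expired(db, now=None):
    """What the cron job would run: remove cache rows older than the lifetime."""
    cutoff = (time.time() if now is None else now) - CACHE_TTL_SECONDS
    db.execute(
        "DELETE FROM phrases2posts WHERE phraseid IN "
        "(SELECT id FROM phrases WHERE cached_at < ?)",
        (cutoff,),
    )
    db.execute("DELETE FROM phrases WHERE cached_at < ?", (cutoff,))
```

The first call to search() for a phrase runs the slow query and stores the matching post ids; repeat calls within the lifetime only read phrases2posts. One thing the sketch does not handle is new posts arriving while a phrase is cached, so the lifetime also decides how stale the results are allowed to get.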
Post (or blog) your conclusions, eh? It'd be interesting to know how you end up doing it.
I need to create a large scale search engine, if anyone has any ideas, pointers or tips they would be greatly appreciated?
With MySQL you have to match words exactly and cannot search for misspellings. Take a look at Xapian, which, unlike most full-text engines, has incremental indexing. That is, you don't have to rebuild the indexes when you add more data.
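To show what incremental indexing means in practice, here is a minimal in-memory inverted index in Python. It is only a sketch of the general idea and not Xapian's API; the class and method names are made up for the example. Adding or replacing one document touches only that document's terms, so nothing else is rebuilt.

```python
import re
from collections import defaultdict


class IncrementalIndex:
    """Toy inverted index: term -> set of document ids (hypothetical names)."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of documents containing it
        self.doc_terms = {}               # doc id -> terms, needed for removal

    @staticmethod
    def tokenize(text):
        # Lowercased runs of letters and digits; no stemming or spelling help.
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def add(self, doc_id, text):
        """Index one document; re-adding an id replaces its old entry."""
        self.remove(doc_id)
        terms = self.tokenize(text)
        for term in terms:
            self.postings[term].add(doc_id)
        self.doc_terms[doc_id] = terms

    def remove(self, doc_id):
        """Drop one document without touching any other document's entries."""
        for term in self.doc_terms.pop(doc_id, set()):
            self.postings[term].discard(doc_id)
            if not self.postings[term]:
                del self.postings[term]

    def search(self, query):
        """Return sorted ids of documents containing every query term."""
        terms = self.tokenize(query)
        if not terms:
            return []
        matches = [self.postings.get(term, set()) for term in terms]
        return sorted(set.intersection(*matches))
```

Like the exact matching described above, this only finds whole words as typed; tolerance for misspellings would need something extra on top.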
I posted this thread hastily when I first found out about the task. Now that I have some further details, I feel I'm going to get myself into a bit of a mess with the way the servers are currently configured. I am not thrilled at the idea of suggesting Xapian, as it would be out of my hands and left to the hosting administrators to configure.
Currently we have a very unusual server configuration, which means a lot of messing about with cached content and makes me a little dubious about implementing Xapian. I would prefer something in PHP that I can manage myself and that does not require contacting the administrators.
I will get back to this thread later this week when I have to delve a little deeper, as the search engine is not a priority right now. If you can suggest something that might help, that would be great. Thanks for the suggestions so far, Nathan & Marcus.