I plan on building a fulltext search for my website coupled with the porter2 stemming algorithm and a synonym dictionary to expand the results.
I would also like to include spelling suggestions - I think this can be done php's pspell_suggest function or even levenshtein or similar_text.
I've been reading around and some people have suggested that using php for this on a large database would be a massive memory drain - I haven't been able to find much more on it. Is this true? If so what alternatives could I use?
I can not answer that question from a hardware point of view (server limits) - though I think it's an intresting job and I am convinced it is possible!
One just needs to reduce the memory usage with clever coding - and a decent server will help too ofcourse..
I mean - if nothing helps, use that c-compiler from facebook - it's opensource and converts your php to c++ (if Im not mistaken)
However - here are some hints - I assume you know all that already - but if not - hope it helps..
From coding some advanced webstats tools, I got some insight in huge datasets..
since there are mostly integers to find, it is not that relevant for your job..
but I found some good clues on "how to get memory usage down" on huge data sets
tree-structurize your search..
=> main query looks in a smaller table for some raw info, where to look. Than narrow your search to those pointers
imagine something like
"if a part of the term is "sh", than look in those 3 thousand rows only"
so to speak a contents directory or a search index that delivers an integer row id to use in a second level of querys..
you will have a smaller data set to search that way.. - fulltext on a huge table is not fast and not lightweight
but fulltext on a limited set of data (even 3000 rows), that is additionally limited and filtered by integer IDs is much faster and less a hog.
additionally the second query-level, those that can be run one after the other in a loop, are split-parts of the full result, those parts can be cached and reused for a short time
20 seconds will already reduce load a lot depending on your project!
=> Do not use JOINS
you can even create huge queries with hundreds of:
"AND foo = 'bar' "
On large tables, the above will be much faster than a join with a scond table..
but joins and huge tables = lot of memory, lot of time, lot of cpu..
it's not sexy to skip joins but a good idea for your task..
=> always niclude a integer id in the query (can only be done with a search index liek described above)
create the final sql-string in a loop that ads a condition for each what-ever-you-search and run at the end only one query in (preferably only one) table at a time
to query two tables in one query is also possible and also for quite large tables still an option though
it does really (almost) not matter for the memory question how many conditions you have - but if you do not adress each row manually firts, SQL has to look thourgh all rows to sort,order,search something.
if you have 3000 times OR id = 123, you will only have 3000 rows to do a fulltext - not millions of rows..
integer filtering on a key field is made to be fast ans lightweight..
rely on that feature..
well - my 2 cents..
Hope you or some else can use it
good luck.. :-)
Ok say I have indexed all my pages into a table that had the columns id, title, headings & body and each column is weighted differently.
And the term "Apple" is searched.
Are you suggesting that I have something that says the term "Apple" fits in the "Fruit" category and that contains these page id's, so search only the page id's in the "Fruit" category?
That could be difficult to implement but maybe it's necessary for faster results.
Thanks for the tips Hensel - I totally forgot about facebook's php compiler
yes and no my idea was even more abstracted - I try to elaborate with your example..
I think for a text search, something like this would be benefitial..
apple starts with Ap and contains "pp" - both are specific, recognizable parts of the term, that can narrow down the search result
an Index, that can deliver rows, where one of those fits, would help..
so "Ap" is found in rows 1500-2200 and 3215-3400 etc...
pp can be found found in rows x - y
even if the row list looks like this
with the hint from above
=> "create the final sql-string in a loop that ads a condition for each what-ever-you-search and run at the end only one query in (preferably only one) table at a time"
you will save a lot of memory..
the bigger your table, the more you save..
each time a new term is entered in the database, a specially made function, analysis the term and slpits it in to "searchable" parts like "Ap" or "pp"
than you add the row number of the new term to the index-table where it fits..
Either you define those "parts" or you get some text-analyzer that can do that..
look at phpclasses.org - Im sure you will find anything that suits this job..
Alltogether it may look too complexe for the task - but imagine millions of rows and a memmory error on each search..
if that happens, a complexe but working solution is preferable..
However - those hints are just methods to reduce memory usage and other problems you "might" encounter..
as I said, Im not even sure, at what point you will reach such problems
but IF they come, try those methods..
Im trying to put all in a relation:
if your search is just for one website, you will have some thousand rows if it's a big site - you will NEVER have any problems with fulltext on that size
only if you have scanned millions of websites, or hundreds of books, etc..
for example I made for a client a full text search on about 20 different versions of the bible in 10 different languages..
those are hundredthousands of verses (1 row in db for each verse) - I didnt use an index as described - and I have no issues with searching..
so if your project is NOT much bigger, skip the above, ignore the rants about php and code your fulltext search - it will work..
I used as pressumption, that you mean a search engine in the size of google or so - there you might find memory trouble and need those tricks :-)
you said "on a large database would be a massive memory drain"
in my understanding, that issue is irrelevant below let's say 2million rows.. (estimation-alert)
<irony> ok ok WP can have such memory problems already with 8 plugins and no content at all - but since we dont do it the same way than the WP guys... </irony>
if the WP fans say php is not suited for big jobs because WP has those issues, it's just some wannabe-smart-ppl who have no clue what they talk about..
Thanks Hensel, I will probably stick just the full text search for now. I will only have small site to begin with but I would like to allow account for future growth but In reality it would probably take many years for the site to grow to above 100,000 rows if it's successful - I'm just being overly cautious.
But if things start to slow down I know where look, thanks Hensel.