  1. #1
    SitePoint Addict Phil-man
    Join Date
    Nov 2000
    Posts
    291
    Hi:

    I have an archive of articles published on a web site. Certain information about each article (the title, the date it was published, the category to which it belongs, etc.) is stored in a MySQL database. A single page written in PHP displays an index of the articles, while another PHP page displays the appropriate article that is selected by the user.

    The CONTENT of the articles, however, is not stored in the database but is instead stored in text files and pulled in via PHP's "include" function. The names of the text files correspond to the value of an ID field in the MySQL database. My multipart question is...

    1) Is there a way to include the text of the "included" text files in a search of the database?

    2) If not, is there a simple, automated way to create an index of keywords from each article and store them in a single, new column in the database?

    3) Am I going about this entire thing the wrong way? Keep in mind that I'm confined to PHP and MySQL, at least for now. I was trying to avoid storing the entire content of the articles (complete with HTML) in the database.

    Thanks for any help!

  2. #2
    Dumb PHP codin' cat
    Join Date
    Aug 2000
    Location
    San Diego, CA
    Posts
    5,460
    I think opening each text file with PHP and searching through it for keywords would take far more time than just storing the article, HTML and all, in the database.
    Please don't PM me with questions.
    Use the forums, that is what they are here for.

  3. #3
    Grumpy Mole Man Skunk
    Join Date
    Jan 2001
    Location
    Lawrence, Kansas
    Posts
    2,066
    I've personally never seen anything wrong with storing entire articles in the database. That's how most places seem to do it (SitePoint and these forums included). However, I'm no expert on these things.

    I've read a few articles on basic search engine type stuff, and it shouldn't be at all hard to build up an index of the words in the articles. In fact, the easiest way runs as follows:

    1. Remove all HTML tags from the articles - just use a regular expression to get rid of anything matching <text>.
    2. Convert the entire article to lower case.
    3. Remove all punctuation, new lines etc.

    You should now have a big block of 'words' separated by spaces, with nothing else in there.

    4. Remove all 'common' words - stuff like the, and, they, it, etc.

    And there's your searchable index. Bung it in a database field or just store it in another flat file - it should make searching pretty easy.
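
The four steps above can be sketched in PHP roughly as follows. This is only a minimal sketch under some assumptions: the function name `build_keyword_index`, the short `STOP_WORDS` list, the entity decoding, and the de-duplication at the end are all added here for illustration, and it only copes with plain ASCII text.

```php
<?php
// Hypothetical stop-word list for illustration; a real one would be much longer.
const STOP_WORDS = ['the', 'and', 'they', 'it', 'a', 'an', 'of', 'to', 'in', 'is'];

/**
 * Build a space-separated keyword index from an article's HTML.
 * (Hypothetical helper; the name and details are assumptions.)
 */
function build_keyword_index(string $html): string
{
    // 1. Remove HTML tags. Each tag becomes a space so that words on
    //    either side of a tag do not run together.
    $text = preg_replace('/<[^>]*>/', ' ', $html);

    // Extra step (assumption): decode entities such as &amp; so that
    // "amp" does not end up in the index as a word.
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

    // 2. Convert the entire article to lower case.
    $text = strtolower($text);

    // 3. Replace punctuation, new lines and anything else that is not
    //    a letter or digit with a single space (ASCII only).
    $text = preg_replace('/[^a-z0-9]+/', ' ', $text);

    // Split the block of text into individual words.
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    // 4. Remove the common words.
    $words = array_diff($words, STOP_WORDS);

    // Extra step (assumption): keep each word once to shrink the index.
    $words = array_unique($words);

    return implode(' ', $words);
}

echo build_keyword_index("<p>The <b>Quick</b> brown fox, and the lazy dog.</p>"), "\n";
// prints "quick brown fox lazy dog"
```

The returned string could then go into a new text column next to the article's ID, or into a flat file, and be searched like any other field.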

    There's a great example of a Perl script that uses this technique to index flat HTML files - check out the script in action here or visit the script's web site here.

