SitePoint Sponsor

User Tag List

Results 1 to 8 of 8
  1. #1
    SitePoint Enthusiast
    Join Date
    Apr 2013
    Posts
    31
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Article database - best practices?

    I intend to create a MySQL database of a discontinued magazine that was issued for about 50 years with 4-6 issues per year. So the amount of data is set once and for all; when I've fed it all in there will be no more writing to the database. I could invent all the wheels on my own, but I'd rather hear some more experienced opinions before I go too far. So here are my thoughts on it. I'd be happy to get some feedback on them.

    There will be a fairly simple web interface with two possibilities - browsing by selecting year and issue or searching in a single search field. I need a fulltext index for the articles as well as the titles and subtitles, and ordinary single-word indices for the keywords and authors' first and last names.

    The main table articles should contain one article per row and I started out by creating the columns year, issue, pages, title, subtitle and text. Then the tables keywords and authors as well as tables for connecting these to the main table.

    First of all, I suppose most people would agree that I should keep authors and keywords in separate tables. But that will give me queries like the following:

    Code:
    SELECT year, issue, title, subtitle 
    FROM articles AS a
    INNER JOIN articles_keywords AS ak
    ON a.id = ak.articleid
    INNER JOIN keywords AS k
    ON k.id = ak.keywordid
    INNER JOIN articles_authors AS aau
    ON a.id = aau.articleid
    INNER JOIN authors AS au
    ON au.id = aau.authorid
    WHERE MATCH(articletext,title,subtitle) AGAINST(:ft_1 IN BOOLEAN MODE) OR keyword = :x_1 OR au.firstName = :x_1 OR au.lastName = :x_1)
    – in its most simple form, i.e. when there's only one word in the search field and no boolean operators (some version of AND, OR and NOT should be possible as well).

    Are there any acceptable ways to get rid of the JOINs and simplify queries in cases like this? (Or maybe the joins won't affect speed as much as I imagine?) I see two possibilities: to move keywords and authors to the main table (which I guess is out of the question) or to make sure that the keywords and authors end up in the fulltext index for the articles themselves. Hidden paragraphs/divs/whatever? Meta tags? Other solutions?

    The answer to that question is perhaps dependent on the answer to my main concern – how to store the articles. Some of them are extremely short and consists in a single paragraph, but others will need a lot more markup such as various levels of headers, plenty of paragraph tags, lists, tables and images. Should I store the articles with HTML and all in the database (or XML, which I've discarded as an unnecessary extra step) or in a file system? I definitely need a version with markup somewhere, and if I keep it outside the database I'm going to need two copies of the articles - one with markup, one without markup for the index. The second solution will waste more space than the first, but if the speed is affected positively by keeping markup out of the index, I suppose it's worth it. Otherwise I don't see any real problems with having the markup in the index. The chances of false positives (where the query happens to be identical to a tag name or attribute name/value) are pretty slim, especially since the articles aren't written in English. Any other considerations or solutions?

    As for the images, I intend to keep those outside the db entirely.

    Ideas? Suggestions? Thanks.

  2. #2
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,215
    Mentioned
    58 Post(s)
    Tagged
    3 Thread(s)
    keywords and authors should remain in their own separate tables, with many-to-many relationship tables as you have anticipated in the sample query you posted

    the sample query will not work exactly as is, you would need subqueries to return GROUP_CONCAT strings for keywords and authors

    as for storing the articles, use two TEXT columns, one without markup for searching and one with markup for display (space is not really an issue because you won't have bazillions of articles)
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  3. #3
    SitePoint Enthusiast
    Join Date
    Apr 2013
    Posts
    31
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Thanks. I'm not sure how to implement the subqueries, though, but I'll get there eventually.

    As for the text version with markup, are there any obvious pros and cons putting it into a TEXT column compared to having it in a file system and just store the paths in the db?

  4. #4
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,215
    Mentioned
    58 Post(s)
    Tagged
    3 Thread(s)
    yeah, i guess that's an option too
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  5. #5
    Utopia, Inc. silver trophy
    ScallioXTX's Avatar
    Join Date
    Aug 2008
    Location
    The Netherlands
    Posts
    9,031
    Mentioned
    152 Post(s)
    Tagged
    2 Thread(s)
    Quote Originally Posted by atoroqo View Post
    As for the text version with markup, are there any obvious pros and cons putting it into a TEXT column compared to having it in a file system and just store the paths in the db?
    You can't use MySQL's fulltext functionality on the files, you'd have to add some sort of search functionality in your code, and they're mostly not trivial to write. Unless you have a very strong reason not to, I'd stick them in the database in a TEXT field.

    Also, you want to have a look at Spinx Search: http://www.ibm.com/developerworks/li...-sphinxsearch/
    Takes some time to set up, but works well (better than MySQL fulltext IMHO, which is a tad too simple for any real searching).
    Rémon - Hosting Advisor

    Minimal Bookmarks Tree
    My Google Chrome extension: browsing bookmarks made easy

  6. #6
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,215
    Mentioned
    58 Post(s)
    Tagged
    3 Thread(s)
    Quote Originally Posted by ScallioXTX View Post
    You can't use MySQL's fulltext functionality on the files
    no, but my suggestion was to have a non-marked-up copy of the article in the database, the only issue then being whether to retrieve the marked-up version from an adjacent column on the row, or from an external file

    plus, if you do the fulltext searching outside mysql, how would you combine ose results with additional conditions on other attributes, like category, or date_published, or author, etc.
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  7. #7
    Utopia, Inc. silver trophy
    ScallioXTX's Avatar
    Join Date
    Aug 2008
    Location
    The Netherlands
    Posts
    9,031
    Mentioned
    152 Post(s)
    Tagged
    2 Thread(s)
    Quote Originally Posted by r937 View Post
    no, but my suggestion was to have a non-marked-up copy of the article in the database, the only issue then being whether to retrieve the marked-up version from an adjacent column on the row, or from an external file
    Sorry, I must have missed that.

    Quote Originally Posted by r937 View Post
    plus, if you do the fulltext searching outside mysql, how would you combine ose results with additional conditions on other attributes, like category, or date_published, or author, etc.
    My point exactly
    Rémon - Hosting Advisor

    Minimal Bookmarks Tree
    My Google Chrome extension: browsing bookmarks made easy

  8. #8
    SitePoint Enthusiast
    Join Date
    Apr 2013
    Posts
    31
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by ScallioXTX View Post
    Also, you want to have a look at Spinx Search: http://www.ibm.com/developerworks/li...-sphinxsearch/
    Takes some time to set up, but works well (better than MySQL fulltext IMHO, which is a tad too simple for any real searching).
    Thanks, it sounds like something worth looking into. But I would either have to convince my present host to install it or switch to another web hotel. So I'll go with the MySQL fulltext for now and see how far it'll take me.

    (I'm glad the two of you sorted out the other question without me. )


Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •