View Poll Results: Where should articles be stored?

Voters: 14
  • The whole article in the database

    8 57.14%
  • The database knows the file location

    6 42.86%
  1. #1
    SitePoint Evangelist jkh1978's Avatar
    Join Date
    May 2001
    Location
    Northern Virginia
    Posts
    445
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    file or database?

    A buddy and I are writing a sports-related website. We want people to be able to post articles. Here's the debate:

    Option 1. Write an HTML form that has the article author type in the title, date, body of the article, etc. These will then be stored in separate fields in a database.

    Option 2. Write an HTML form to upload a document (could be a text file, RTF, Word doc, etc.). The filename (and location) will be stored in the database.

    Which of these options makes more sense? Also, feel free to comment on why your choice is better or why the other is incorrect.
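    For concreteness, here is a minimal, MySQL-flavored sketch of the kind of table Option 1 implies (the table and column names are only illustrative):
    Code:
    CREATE TABLE articles (
      article_id   INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
      title        VARCHAR(200) NOT NULL,
      author       VARCHAR(100) NOT NULL,
      article_date DATE         NOT NULL,
      body         TEXT         NOT NULL   -- the article text itself
    );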

  2. #2
    morphine for a wooden leg randem's Avatar
    Join Date
    Jun 2002
    Location
    .chicago.il.us
    Posts
    957
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    It's simple mathematics... Here are two scenarios:

    Scenario A, all in database
    Let's assume that each article is, on average, 900 words, and works out to about 6KB. As you add that into the database, it goes right into the same row with the rest of the fields related to the article you're adding. Not counting the other data being inserted for each record, the articles alone will have a huge impact on performance.

    100 articles add up to over 600KB of text data -- the most cumbersome data type in a database. Now if we add all the other data being inserted as well -- we'll assume 100 bytes per record -- you've got another 10KB, which seems insignificant next to the 600KB of text.

    Scenario B, database points to file
    Now, rather than holding the entire text, what if we only hold pointers to the files? We can be extremely generous and give the path field a size of 255, which is more than you could possibly use in Windows...

    255 bytes times 100 articles is about 25KB. Add to that the 10KB of other data, and you've got a 35KB database to search.

    And so...
    Sure, 600KB doesn't sound like too much to search. But that doesn't count the extra data in the file... headers and such that your particular database uses to know what kinds of data it holds. And that 600KB was only for 100 records... what if you have 1,000?

    If I had to choose between searching 600KB and searching 35KB, I'd most certainly take the latter. Especially for a CMS where you're going to be searching it EVERY TIME a page is asked for.
    ----Adopt-a-Sig----
    Your message here!

  3. #3
    SitePoint Evangelist jkh1978's Avatar
    Join Date
    May 2001
    Location
    Northern Virginia
    Posts
    445
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    My arguments for A:

    Since you are only storing the text and not the formatting, the site will have a consistent look and feel. Also, if someone stores a Word document, don't I also need to have a plug-in installed to view that document? In a database, I could have an index on the text field so that I could write a search engine to search for keywords... I do not know how I would pull off a search engine using text files.
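    For example, assuming MySQL (which is what we have available) and an articles table like the one sketched in the first post, the keyword index and search might look roughly like this -- keeping in mind that MySQL's FULLTEXT indexes require MyISAM tables and, by default, ignore words shorter than four characters:
    Code:
    -- build a full-text index over the searchable columns (illustrative names)
    ALTER TABLE articles ADD FULLTEXT ft_article (title, body);

    -- keyword search; the column list must match the FULLTEXT index
    SELECT article_id, title
      FROM articles
     WHERE MATCH(title, body) AGAINST('playoffs');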

  4. #4
    Hosting Team Leader cpradio's Avatar
    Join Date
    Jun 2002
    Location
    Ohio
    Posts
    4,807
    Mentioned
    141 Post(s)
    Tagged
    0 Thread(s)
    Why allow someone the chance to upload a malicious script to your server? Trust no one, and force the user to have to type/copy the information into a textarea.
    Be sure to congratulate xMog on earning April's Member of the Month
    Go ahead and blame me, I still won't lose any sleep over it
    My Blog | My Technical Notes

  5. #5
    morphine for a wooden leg randem's Avatar
    Join Date
    Jun 2002
    Location
    .chicago.il.us
    Posts
    957
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    For what it's worth...

    I also strongly disagree with allowing people to add Word files, PDF files, RTF files, text files, ad nauseam, to the site. As someone else mentioned, the text should be copied and pasted into a TEXTAREA.

    Regarding making the site searchable, there are better, more appropriate solutions for site search, such as htdig on Unix or Index Server on Windows.

    Storing entire articles in a database with the intent of making it also function as a site search engine is quite simply a kludge. It's duct tape for a web site, and anyone who willingly does it has no business calling themselves a professional web developer.
    ----Adopt-a-Sig----
    Your message here!

  6. #6
    Hosting Team Leader cpradio's Avatar
    Join Date
    Jun 2002
    Location
    Ohio
    Posts
    4,807
    Mentioned
    141 Post(s)
    Tagged
    0 Thread(s)
    I have to fully back that up.
    Be sure to congratulate xMog on earning April's Member of the Month
    Go ahead and blame me, I still won't lose any sleep over it
    My Blog | My Technical Notes

  7. #7
    Database Jedi MattR's Avatar
    Join Date
    Jan 2001
    Location
    buried in the database shell (Washington, DC)
    Posts
    1,107
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    I haven't seen enough evidence to convince me flat-file storage is superior to RDBMS storage in every instance, nor am I, of course, convinced that RDBMS storage is superior in every instance either. Still, I am inclined to believe that RDBMS storage is advantageous over flat-file storage for most applications.

    A few advantages of storing in an RDBMS:
    1. Extensive full-text indexing by RDBMS
    2. Centralized storage of content
    3. Subject to RDBMS-security features
    4. Performance enhancing/concurrency features of RDBMS are employed
    5. Easier for 'novice'/sites which don't offer full control


    Extensive full-text indexing by RDBMS
    This is probably the 'killer' item, at least to me, in this issue. Pretty much every commercial RDBMS offers a very powerful full-text search engine built in (or for pay). MS SQL Server, Sybase, Oracle, etc. all have extensive full-text search capabilities, far exceeding htdig (I've never used Index Server, although I would suspect it uses the same mechanism as SQL Server) and most other external search engines. With the built-in indexing you can easily store queries and run reporting (e.g. "most popular searches"), and group/score/summarize/etc. more easily than with a 3rd-party application spidering your site every so often.

    They also can be configured to update their indexes *immediately* as data changes (or set to deferred update) so that your indexes are not stale. I haven't used htdig in almost a year, but it had to be kicked off to re-index EVERYTHING every time we posted a new HTML page. Not a big deal since we only had 100 HTML documents, but if you have many, many MB (or GB) of documents that is unacceptable. Frequent changes would necessitate frequent re-indexing, which is performed in a less optimal fashion (e.g. the htdig approach).

    Centralized Storage of Content
    Typically CMS-type applications are heavily read-only (e.g. slashdot front-page articles are pretty static, HTML templates are static, articles are static, etc.). An RDBMS can easily and effortlessly (sort of) read data all day. The bottleneck is typically in application servers fetching and dumping the data to users. Such applications benefit from 'blade'-type solutions where a medium-sized RDBMS server feeds many smaller (e.g. commodity) boxes with content (more on that in the 'disadvantages' section). It is more difficult to keep more than one server in sync with HTML content if you are already utilizing a centralized RDBMS for other things (e.g. user logins, etc.). Or to put it another way, you'd need to code (or buy) distribution schemes to take the single form entry and push it to several servers. With the RDBMS, it's a simple INSERT statement in your code, and you're already using an RDBMS for other things, so there is no added complexity.

    Many RDBMS' have a form of replication and other technologies for fault tolerance. You can easily have your app server round-robin to a master set of switches which then delegate your query to any number of RDBMS servers. You can suffer multiple machine failures before availability and performance are affected.

    Not only is the data centralized, but it is also packaged up nice and tidy. You don't have to worry about 1,000 text files piling up in a directory, nor do you have to worry about a logical directory structure (should you create a directory for each 'group' of articles and place the corresponding articles in there? etc.).

    Subject to RDBMS-security features
    Many RDBMS' offer encryption of data devices, automatically (once you turn it on, of course). Lacking that, you can package up encryption/decryption routines in stored procedures which can be called by authorized users. This provides an additional level of security above simple filesystem-based protection (if you get a web-server security exploit, hackers could read sensitive documents which are web-readable, etc.). Typically your RDBMS is shielded from the outside world via a firewall, so hackers would have to go through several edge boxen before reaching the RDBMS server anyway.

    Performance enhancing/concurrency features of RDBMS are employed
    RDBMS have highly advanced methods of positioning data on drives and caching it in RAM. This ultimately leads to high-performing reads for items cached in memory. Not only that, but you have complete control over what goes in there (and for how long), so if you can forecast demand you can easily tune your system.

    Woe to the users of UBB-Classic who have high-traffic sites! It is a commonly known problem to have corrupted/lost files with UBB-Classic (the Perl flat-file based method). The 'why' isn't exactly well known, but one issue is weak concurrency control (Perl must lock the file to open it, then write to the file, save, and unlock; routinely this can break if many Perl scripts are trying to access the same file). For a CMS this may not be a problem since most operations are reads, but if you're storing user comments or the like you *could* encounter major difficulties.

    Easier for 'novice'/sites which don't offer full control
    I think that to allow uploaded files you need to CHMOD some directories or the like. At the least you need to know how to code file I/O, which is a little more difficult than issuing a SQL INSERT INTO statement.
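    For instance, sticking with the illustrative articles table sketched earlier in the thread, adding a new article is a single statement:
    Code:
    INSERT INTO articles (title, author, article_date, body)
    VALUES ('Season preview', 'J. Smith', '2002-07-16', 'Full article text goes here...');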

    Disadvantages
    1. Centralized storage of content
    2. Management/Administration needed


    Centralized Storage of Content
    Depending on your loads and application, centralizing all of your content can be a significant bottleneck. It's far easier and cheaper (provided you have already written distribution scripts for your app server(s), have load-balancing hardware, etc.) to add commodity Linux boxen to a pool (a la blades) than it is to buy a bigger 8-way, 16-way, etc. UltraSparc box.

    Management/Administration Needed
    Setting up and maintaining an RDBMS takes work, no doubt about it. A filesystem-based approach requires little/no work for smaller sites -- simply keep track of what is in a directory and voila! That's it!

    Conclusion


    At least to me, I will always reach for the RDBMS first for any type of application in which I need to store/retrieve data. The advantages are far greater for 99.9% of applications than the few disadvantages.

    Personally, if I knew I was making a mostly read-only CMS which was going to receive 100 million hits a day, I would set it up to store the stories or whatnot inside an RDBMS, then have scripts kick off on my web servers which pull the data out of the RDBMS and dump it into flat files on the web boxen, so that I can have the best of both worlds.

  8. #8
    SitePoint Evangelist jkh1978's Avatar
    Join Date
    May 2001
    Location
    Northern Virginia
    Posts
    445
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Thanks Matt

    Thanks Matt for your insightful response.

    An issue we have here is that if we store in a database, we remove the formatting from the content. If we store in a file, the author can format it however (s)he wants. I believe that it is best to keep formatting separate from content, and have stylesheets keep all the articles looking the same. Are there any comments on the original problem or this question?

    Thanks!

  9. #9
    SQL Consultant r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,014
    Mentioned
    53 Post(s)
    Tagged
    2 Thread(s)
    > An issue we have here is that if we store in a database,
    > we remove the formatting from the content.

    that's not an issue, that's a feature!!

    > I believe that it is best to keep formatting seperate from
    > content, and have stylesheets keep all the articles looking the same.

    excellent thinking, and yes, you can do that

    what you need is a way to parse whatever was entered into the textarea, strip out stuff like FONT tags, yet keep the structural html tags like P and Hn

    the standardized approach is also preferred because of its wider accessibility -- not everybody can, or wants to, read proprietary formats like pdf and doc

    i've seen intranets where use of proprietary formats is mandated (and where the internal network has the screaming bandwidth to support bloated formats) but on the web it's considered a no-no
    r937.com | rudy.ca | Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  10. #10
    morphine for a wooden leg randem's Avatar
    Join Date
    Jun 2002
    Location
    .chicago.il.us
    Posts
    957
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)


    Comments? Sure I'll add a couple more...

    As you mentioned, formatting belongs in a file, not in the database.

    Furthermore, in rebuttal to Matt's dissertation on the merits of the infallible database engine towards solving the problems of world hunger and AIDS, I offer this:

    Indexing engines ignore formatting characters. They ignore HTML tags... They ignore (most importantly) noise words.

    Tossing an 800-word article about cats into a database is totally wasteful. You'll no doubt end up with 30-40 instances of the word "the", along with "a", "an", "it", and so on. And then there's the issue of the word "cat" appearing 40 times in the text.

    If you're doing a search for the term "cat" in the database, you're most likely going to search for what?
    Code:
    SELECT * FROM Articles WHERE Text LIKE '%cat%'
    So whether the word "cat" appears once, or one hundred times, you're only going to find it once. Furthermore, you're filling up your database with all the extra instances of it.

    And what happens if you are selling furniture, and people are searching for a specific "table"? If there are HTML tags in your database, they're going to turn up a lot of false hits on that one!

    You can't tell me that the convenience of centralized content by way of database justifies the 8-CPU Sun machine required to search through all the noise words and HTML tags, while I'm running the same site on a P-II 233 with Red Hat Linux and 1000's of text files neatly filed into one directory.

    I stand firmly on what I said before... anyone who uses a database to do the job of an indexing server is a moron. And that also goes for people who are keeping mailing lists in their databases rather than using a list server. There are reasons that these things are invented, people...

    ---
    Disclaimer:
    The tone of this post in no way reflects upon my opinion of any other participant in this thread or in this forum. If you are offended by my harsh opinions, do not take it as an insult from me, but rather as an indication that perhaps you could learn a better way to do something.
    ----Adopt-a-Sig----
    Your message here!

  11. #11
    SitePoint Evangelist jkh1978's Avatar
    Join Date
    May 2001
    Location
    Northern Virginia
    Posts
    445
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Part of the disagreement in designing my site is whether formatting should be stored with the content or not.

    randem believes it should be, r937 thinks it should be separate.

    randem does not believe that an index on a text field is a good idea, MattR thinks it is. I've written code for it before, based on a SitePoint article I cannot find now (I do not know if it was moved to Kevin Yank's excellent book).
    I may or may not be able to run separate Unix programs, but I know that I have the index features of MySQL.

    Seems to be two schools of thought on this... with both sides believing that they are correct. I am on the side with the majority posted so far, with randem on the side of my friend.

  12. #12
    Database Jedi MattR's Avatar
    Join Date
    Jan 2001
    Location
    buried in the database shell (Washington, DC)
    Posts
    1,107
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    randem, you appear to be blissfully unaware of the RDBMS full-text indexing advances which have taken place in, oh, the last 5 to 6 years (if not more). Not only that, but you ignored pretty much all of my 'pros' and focused on one con (cost, debunked ;)) below.

    Blow-by-blow:
    Indexing engines ignore formatting characters. They ignore HTML tags... They ignore (most importantly) noise words.
    Database full-text indexes can be configured (if not already, depends on the package) to ignore any number of words, regular expressions, etc. Some are 'content aware' so that if you specify a record is a 'HTML' (or PDF, Word, etc.) it can apply special filters for that.

    They all ignore 'noise' words (if, and, the, etc.) by default (or you can change the weighting, etc.).

    Tossing an 800-word article about cats into a database is totally wasteful.
    Define wasteful to me. Explain what is 'wasted' with it sitting in a table somewhere vs. a filesystem. Last time I checked, database systems use a file (in the end that is what most devices are) to store data. How is it 'wasteful' to put it in one file vs. another? You have never explained how.

    So whether the word "cat" appears once, or one hundred times, you're only going to find it once. Furthermore, you're filling up your database with all the extra instances of it.
    No, that is only if you use character indexes, which are not the same as full-text indexes. Character indexes are the indexes we all know and love, e.g. WHERE username = 'Bob Ross' will allow the RDBMS to efficiently find a record based upon my username. Plus, using a leading-character wild-card like that (LIKE '%cat%') will defeat 'regular' indexes and lead to a table scan anyway.
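    To make the distinction concrete, here is a rough MySQL-flavored sketch (assuming the illustrative articles table and FULLTEXT index from earlier in the thread; note that with default settings MySQL ignores words shorter than four characters, so 'cat' would need ft_min_word_len lowered):
    Code:
    -- a leading-wildcard LIKE defeats a normal character index: full table scan
    SELECT article_id FROM articles WHERE body LIKE '%cat%';

    -- a FULLTEXT index is a separate, word-based structure built for this job
    SELECT article_id FROM articles WHERE MATCH(title, body) AGAINST('cat');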

    What I've been discussing, and you're apparently not aware of, is full-text indexing. A full-text index is built by a daemon (much like ht:dig) that companies such as Verity provide to RDBMS vendors to index and query the diverse range of text, PDF, Word, etc. documents that they have amassed in their databases. Large companies like centralized storage and administration, so storing so-called 'unstructured' data like that in an RDBMS is not a foreign concept to them. These engines can natively slurp up any text you give them, and you can perform all sorts of wonderful tricks with them.

    An example of the rich syntax of the Verity engine is here:
    http://www.sybase.com/advanced_syntax

    Further, documents with more 'cats' in them can (again depending on how you set it up) generate a much higher relevance score than documents without.

    You can't tell me that the convenience of centralized content by way of database justifies the 8-CPU Sun machine required to search through all the noise words and HTML tags, while I'm running the same site on a P-II 233 with Red Hat Linux and 1000's of text files neatly filed into one directory.
    Searching 1,000 records in Verity would be no more costly in the RDBMS than using something like HT:DIG. Remember, RDBMS full-text indexes are separate data structures apart from the text itself. When you create a Verity (or whatever your full-text engine is) data store, it will rip through all the rows (or however many you want) in the table(s) you specify and create its own data indexes, much the same way HT:DIG (and virtually every search engine) does. When you run a query, such as "'Bob Ross' NEAR painting", it will CERTAINLY NOT touch the data rows in your table. It will look in its own data structures for rows which match, then give the RDBMS the row IDs which match. The RDBMS will then fetch them and you can do with them what you want, like any SQL result set.

    How do you define 'neatly'? I do not think 0001.txt, 0002.txt, ... 1000.txt is at all neat. What happens when you reach 10,000 items? 100,000? There is a cost incurred with each new file created. FAT32, NTFS, EXT2, etc. handle directory structures quite differently, and you will find yourself hitting a point at which you have 'too many' files in a particular directory. Not that the OS will say 'too many', but you'll notice significant performance degradation, and the likelihood of file corruption increases.

    I stand firmly on what I said before... anyone who uses a database to do the job of an indexing server is a moron.
    I stand firmly that you are ignorant of RDBMS full-text indexing technology and are basing your claim -- which you have yet to explain at all -- on outdated or plain wrong theory.

    Let's say you have a news site like slashdot. You remember there was a Linux Journal article less than a year ago which had a neat Steve Balmer quote about the Department of Justice. You can't remember what it was, but it was something like "'The DOJ is crazy', Balmer noted in his keynote address". You'd want to run a search that limits the articles to ones by Linux Journal and that has Balmer near DOJ (since there have been a lot of Microsoft vs. DOJ articles in the news, the likelihood of you finding that exact quote with Balmer AND DOJ alone is slim).

    With the RDBMS-based method you can easily say (pseudo-code):
    Code:
    SELECT article_text
      FROM articles
     WHERE publisher = 'Linux Journal'
       AND article_text MATCHES( "Microsoft AND ( Balmer NEAR DOJ )" )
       AND article_date > "16 July 2001"
    How would you search for an article like that in, say, ht:dig utilizing the filesystem based method?
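    (For comparison, here is a rough MySQL equivalent of the pseudo-code above, using its column names and assuming a FULLTEXT index on article_text. MySQL's boolean mode has no NEAR operator, so the proximity part is only approximated:)
    Code:
    SELECT article_text
      FROM articles
     WHERE publisher = 'Linux Journal'
       AND MATCH(article_text) AGAINST('+Microsoft +Balmer +DOJ' IN BOOLEAN MODE)
       AND article_date > '2001-07-16';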

    The prime advantage, I think, that RDBMS methods have is the metadata you store along with the article in the database. I can store categories, authors (if I created a many-to-many table to link authors and articles, I wouldn't even have to full-text search; I could simply give the user a dropdown to select the author he or she is looking for), etc. and tie them into the 'unstructured' text data to enrich and broaden ANY search.

    The ability to easily store and retrieve past searches, sort/group query results, etc. is just icing on the cake and further differentiates the RDBMS full-text search method from the 'flat file' approach.

    Am I saying the flat-file approach is inappropriate for any use? Certainly not. If I had 30,000 HTML files somewhere, I certainly wouldn't expect someone to painstakingly insert each and every one of them into an RDBMS. Further, if you only need simple search algorithms or have WILDLY dissimilar data, then the RDBMS search results would approximate the conventional approach, since you can extract no useful linkages or knowledge between the documents which would help 'pare down' (e.g. separate the wheat from the chaff) the results.
    Last edited by MattR; Jul 16, 2002 at 11:25.

  13. #13
    morphine for a wooden leg randem's Avatar
    Join Date
    Jun 2002
    Location
    .chicago.il.us
    Posts
    957
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Matt... perhaps you'd like to provide a step-by-step example? Or a tutorial? Or at least a few links to such things?

    If what you're saying is truly so easy and so powerful as it sounds, and at the same time so cost effective, then by all means I would be very interested in learning about it and using such technology for my own web sites.

    Please... provide me with some information.
    ----Adopt-a-Sig----
    Your message here!

  14. #14
    Database Jedi MattR's Avatar
    Join Date
    Jan 2001
    Location
    buried in the database shell (Washington, DC)
    Posts
    1,107
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Not sure if you are being sarcastic or not (such a short reply ) but the short answer is:
    "It all depends".

    What RDBMS engine are you using? Most 'for pay' databases include it as part of their engine and have documentation on how to set it up and begin writing queries to snag it.

    MS SQL Server:
    http://msdn.microsoft.com/library/de...d5dbsearch.asp

    Sybase:
    http://manuals.sybase.com/onlinebook...eric__BookView

    Oracle:
    http://technet.oracle.com/products/text/content.html

    Postgres:
    http://openfts.sourceforge.net/

    MySQL:
    http://www.mysql.com/doc/F/u/Fulltext_Search.html

    etc.

  15. #15
    SitePoint Wizard Goof's Avatar
    Join Date
    Feb 2001
    Location
    Pittsburgh, PA
    Posts
    1,154
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Since articles will hardly ever be edited, I would propose two databases and the use of static files. Let me explain:

    Database 1: Store all of the descriptive information (date, author, path to file, etc.)

    Files: Write a STATIC html file based on the submitted article (this way you aren't dynamically pulling the article from anywhere - you just point to the .html file on the server).

    Database 2: Remove all formatting (all HTML tags) from the static file and store this in a separate database.

    This allows us to use static files, which will increase web server efficiency; a database to store article-relevant information (so you can sort by authors, dates, or whatever else you want to); and another separate database to store article body text that you can search against (but which won't hinder the performance of database #1).

    In this scenario, the only loser is the hard drive, since you'll be storing two copies of the article (one in the static file and one in database #2). I'm considering taking this approach with a site that I'm currently building, because I don't expect the content to change very much, but I still want the site to be searchable (and Index Server is just too inefficient).
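    Roughly, the two stores this implies might look like the following (sketched here as two tables for brevity; the names are illustrative only):
    Code:
    -- descriptive info plus the path to the static .html file
    CREATE TABLE article_meta (
      article_id   INT NOT NULL PRIMARY KEY,
      author       VARCHAR(100),
      article_date DATE,
      file_path    VARCHAR(255)
    );

    -- tag-stripped copy of the body, used only for searching
    CREATE TABLE article_search (
      article_id INT NOT NULL PRIMARY KEY,   -- same id as in article_meta
      body_text  TEXT
    );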

    Just my thoughts,
    Goof
    Nathan Rutman
    A slightly offbeat creative.

  16. #16
    SitePoint Evangelist jkh1978's Avatar
    Join Date
    May 2001
    Location
    Northern Virginia
    Posts
    445
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    The reason we are having this discussion is whether users should submit articles by typing them into an HTML form or by uploading a file. Your solution brings up another possibility... but it doesn't settle the argument. My partner believes that allowing users to upload any type of file is the way to go.

  17. #17
    morphine for a wooden leg randem's Avatar
    Join Date
    Jun 2002
    Location
    .chicago.il.us
    Posts
    957
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    The topic got a bit side-tracked, but regardless, I think that just about EVERYONE who has posted in this thread has agreed that you should not let users upload "any ol' file".
    ----Adopt-a-Sig----
    Your message here!

  18. #18
    SitePoint Wizard Goof's Avatar
    Join Date
    Feb 2001
    Location
    Pittsburgh, PA
    Posts
    1,154
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Originally posted by jkh1978
    The reason we are having this discussion is whether users should submit articles by typing them into an HTML form or by uploading a file. Your solution brings up another possibility... but it doesn't settle the argument. My partner believes that allowing users to upload any type of file is the way to go.
    If your partner wants people to hack into your website, then sure. I would say take static text from a form and save that to a file. This prevents the user from uploading anything he wants, but still allows you to use the benefits of a static file (which is the only reason I could see that your partner would argue for this method).

    The only way I could see letting the user upload a file would be if it was going to be e-mailed to an editor so they could look at it. You don't want the user to upload a file that goes live without anyone looking at it.

    Edit:


    Are you saying that this is more of a design issue than a performance issue between you and your partner? Design-wise this should be a split-second decision. You definitely want the format to be consistent throughout the site. You don't want one user to use blue text for the body of his article while another user uses green. This should be a no-brainer! It's your site, not theirs.


    Hope that clears things up,
    Goof
    Last edited by Goof; Jul 16, 2002 at 18:29.
    Nathan Rutman
    A slightly offbeat creative.

  19. #19
    SitePoint Zealot Drew630's Avatar
    Join Date
    Nov 2001
    Location
    Maryland
    Posts
    175
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    I'm here

    I guess it is about time that I put in my 2 cents worth, since jkh1978 and I are the ones building the site.

    Now, first, before anyone challenges my competency level, I want to clear up some vague points here. In no way, shape or form did I ever suggest that users would submit any file they wanted to the database. That is sheer lunacy (and while I have been accused of it, my lunacy has never been proven). My article idea is this: I would like to offer a user two options, either to enter their article through a web-based form and then submit it to us via email, so we can evaluate its merit, or to upload a file and email it to us for the same reason. There is no direct uploading to the DB. After an article is reviewed, we decide whether it is posted or not. I'm still working on the best solution for the web-based posting.

    I have read all of the valid points, and there are plenty of people smarter than I am regarding DB storage and search engines. I freely admit that. And I also see the search-engine shortcomings of having a file-extension dependency for the articles. But I would like to think that our site will grow over time and, with it, the number of articles that we receive by either form of submission. My worry lies in the following points:

    1.) The DB will grow to an enormous size, thus slowing it down, if we store full text in the DB.

    2.) If, for some ungodly reason, the DB would happen to go down and be irreparable, I find it more comforting to know that the article content is stored in a safe place and not lost. (I know, the same could happen to the files, but in my opinion it is easier to back up files than a growing DB.)

    The other major issue is formatting. While a DB does have the capability to hold formatted data, I agree that
    a.) data and format should be split, and
    b.) each page should have a consistent look and feel for a more professional appearance

    Here is where jkh and I might not totally agree. My main concern comes from my artistic and journalistic background, where I understand that authors use words, formats and spacing not only to record their thoughts, but because the way they are used can be instrumental in conveying a message. While some people may not agree with me on this, I do feel strongly about it, and I wish to offer authors a level of reassurance that we do not intend to butcher their articles. I realize that this can probably be achieved using separate files for formatting, but my hesitation for total acceptance hinges on the simple fact that we haven't tried anything yet.

    So that is my noteworthy voice. As with most of our disagreements, jkh and I probably misunderstood each other, and we'll probably end up with a suitable compromise. We just wish to hear what more knowledgeable people have to say.

    Thanks to all for the opinions. It gives us much more to discuss.

    Drew
    ~Drew

    There Is No Greater Joy Than Soaring High On The Wings Of Your Dreams, Except Maybe The Joy Of Watching A Dreamer Who Has Nowhere To Land But In The Ocean Of Reality.

  20. #20
    Your Lord and Master, Foamy Hierophant's Avatar
    Join Date
    Aug 1999
    Location
    Lancaster, Ca. USA
    Posts
    12,305
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    I fully agree with everything that MattR has said.

    I would like to add one more point here, mainly about Oracle, though other RDBMS might handle it the same way. When you are using Oracle, it will not store a BLOB inside the database itself. It creates a neat directory of files, all indexed to their appropriate row and all neatly indexed for full-text searching and any other goodies you want to throw in there. The only time it will read them into memory is when they are needed.

    P.S. I also feel that "SELECT * FROM some_table" is lazy coding and not a good way to do things, unless of course you really need every field in that table, which isn't always the case, especially in the beginning of a search algorithm.
    Wayne Luke
    ------------


  21. #21
    SQL Consultant r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,014
    Mentioned
    53 Post(s)
    Tagged
    2 Thread(s)
    drew, you should have described it in that amount of detail to begin with

    your review process, to evaluate the merit of an article, is the missing piece of the puzzle

    let your users send you whatever format they want

    you concede that you want to open the document, if only just to read it, and therefore you can and should also assume the responsibility of immediately converting it to standard html (dare i say xml? i dare not) designed to be used together with your publication's "house" stylesheets

    what remains are workflow issues, such as whether to convert before reviewing

    store the text, including the html tags, in the database as character fields (or memo fields or whatever), and rest assured your database will not be enormous

    note that the html tags being stored with the text are structure, not format

    the format, the consistent look and feel that you want, is achieved by the stylesheets

    do a search at http://evolt.org/ for "content management system"
    r937.com | rudy.ca | Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  22. #22
    SitePoint Evangelist jkh1978's Avatar
    Join Date
    May 2001
    Location
    Northern Virginia
    Posts
    445
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Thanks Drew for stating it so well.

    Because we were fighting over the architecture of the system, we did not discuss whether an editor would be reviewing each submission.

  23. #23
    Database Jedi MattR's Avatar
    Join Date
    Jan 2001
    Location
    buried in the database shell (Washington, DC)
    Posts
    1,107
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Thanks for the support Wayne.

    Also, Microsoft SQL Server and Sybase handle text data in a similar fashion. Instead of storing lots of text on the row, they store a 4-byte pointer (or 8-byte if you are on a 64-bit system) to your text page chain. It's no less efficient than storing it in several files, and I would contend that with many files the filesystem approach would be the less efficient one due to filesystem constraints. (Read about NTFS MFT performance considerations here: http://www.pcguide.com/ref/hdd/file/ntfs/archMFT-c.html -- and know that FAT32 is much less efficient for directories with many files.)

    MySQL and Postgres do store text data on the row, so that would need to be taken into consideration when choosing an RDBMS.

    Our Sybase ASE database at FanHome is almost 12GB -- it's just as fast (well, I'd assume it is a bit slower, but nothing the naked eye can tell in browser load) as it was when it was 200MB. We're talking one or two disk seeks (if the page is not already in the memory cache) to find an article by unique identifier in a 4+ million row table. You will have no problems storing lots of data in a capable RDBMS.

    Backing up the database is just as easy as backing up the filesystem, provided you have a consistent regimen of backups. If you fail to back up your data, you will encounter data loss. Recovery might be a bit more difficult, since traditionally full backups are done daily (or weekly) and incrementals during the day, so if a *single* article were corrupted it would be a chore to have to load the entire database from a dump. So the filesystem could be easier to recover, provided you back up each file individually.
    Last edited by MattR; Jul 17, 2002 at 10:48.

  24. #24
    SitePoint Zealot Drew630's Avatar
    Join Date
    Nov 2001
    Location
    Maryland
    Posts
    175
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    r937 -

    My reason for not posting earlier was that I had no idea of the thread's existence. When I posted, it was my first read-through.

    Damn Verizon has still not completed our DSL at work, so I have had no internet there for about 2 1/2 weeks. Makes it hard to use online manuals in this regard....

    But anyway........

    Drew
    ~Drew

    There Is No Greater Joy Than Soaring High On The Wings Of Your Dreams, Except Maybe The Joy Of Watching A Dreamer Who Has Nowhere To Land But In The Ocean Of Reality.

  25. #25
    Perl/Mason Guru Flawless_koder's Avatar
    Join Date
    Feb 2002
    Location
    Gatwick, UK
    Posts
    1,206
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    RDBMS every time....

    The advantages are overwhelming!

    Flawless
    ---=| If you're going to buy a pet - get a Shetland Giraffe |=---

