Storing text in text files?

Hey guys,

Do you think sites like W3Schools use server-side scripting to pull the majority of their text content from a database with text fields, or from text files? Or do you think they have individual PHP files with the text already formatted as HTML?

Does this question make sense?

How about Wikipedia?

IMHO, in this age of quite powerful RDBMSes, I don’t think many sites use the file system rather than a database, so neither w3schools nor wikipedia is likely storing its content in files.

So, for example, is it correct to say that the text on one of W3Schools’ pages is stored in a few fields of a database, and that every time someone accesses that page, the server must run the necessary queries?

I’d be fairly certain both sites are using a database driven CMS for their content.

Yes, but the queries don’t have to run every time if a caching system is used.

Caching system on the client? Or the server? Or both?

How far do you think they would go in a single database field? A whole paragraph of text? A whole page?

Hi,

Most CMS-driven sites are built on a database. If you try it with the filesystem, it is difficult to manage a site with huge amounts of content.

So do you think they will store like 5 paragraphs of text (like a plot of a movie) all in one field in a table of a database? Or do you think they would use multiple fields? I don’t know how big a field can get for this purpose.

Yes, you can easily store 5 paragraphs of text in a database field.

In fact, a single MySQL LONGTEXT column can store up to 4 GB, as long as your hardware supports it.

It’s also quite possible they’ve split the content into several different fields.

We can’t really say without knowing how their system works. :slight_smile:
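
To make the size question concrete, here is a minimal sketch (Python with SQLite purely for illustration, since the thread is about PHP and MySQL; the table and column names are made up) showing that several paragraphs round-trip through a single TEXT field without any special handling:

```python
import sqlite3

# In-memory database for illustration; a real CMS would use a persistent server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (slug TEXT PRIMARY KEY, body TEXT)")

# Five paragraphs of a hypothetical movie plot, stored in ONE field.
plot = "\n\n".join(f"Paragraph {i} of the plot..." for i in range(1, 6))
conn.execute("INSERT INTO pages VALUES (?, ?)", ("movie-plot", plot))
conn.commit()

# Read it back: the paragraph structure survives intact.
(body,) = conn.execute(
    "SELECT body FROM pages WHERE slug = ?", ("movie-plot",)
).fetchone()
print(body.count("\n\n") + 1)  # 5
```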

Barebones CMS is one of a handful of CMS products that do NOT use a database and relies, instead, on the file system. Here is a performance analysis of it that I put together:

http://barebonescms.com/documentation/performance_analysis/

If a large site is in question, such as wikipedia, it’s incredibly STUPID to use the filesystem to store its data.
I’ll elaborate:

  1. Huge websites tend to have a lot of traffic, which means they require a server cluster or some sort of load balancer. They usually have one cluster responsible for handling requests and another in charge of the database. Such a decentralized environment cannot rely on the local filesystem.

  2. Huge websites have a ton of data. Now, people usually want to read that data (as with wikipedia), but they also want to search it and do other interesting things that require data manipulation.
    Using a filesystem would require you to find one record among billions - it’s not as “instant” as it sounds, which is why smart people invented indexes. Databases excel here because they store data in structures designed to make the most of those indexes.
    In short, and we could argue here for ages but we won’t - on large data sets (terabytes and more), a database will find the data faster than the operating system will.

  3. Optimization of huge websites - the best way to optimize something is to reduce calls to slow operations, the slowest being anything involving the hard disk. The hard disk in most machines is a mechanical unit, and there’s time involved in moving the head to the right track and sector.
    There are methods employed to reduce these movements; I won’t go into details here - basically, if you have a page displaying content that rarely changes (like w3schools pages), you don’t want to ask the database for that content every time someone connects.
    You also don’t want to cache that information to the hard drive, as you’ll again be doing something that involves physical movement and hence introduces lag.
    That’s why it’s good to cache the data in memory. If you look at w3schools content, it’s generally really small, not exceeding 100 KB (and even that is generous).
    The logical move is to cache that data set (let’s call it an object) directly in memory, which is cheap as peanuts these days and so much faster than the hard drive that it’s not even funny.
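
Point 2 above, that indexes are what make lookups in huge data sets fast, can be sketched like this (SQLite via Python just for illustration; the table and index names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO articles (title, body) VALUES (?, ?)",
    ((f"title-{i}", f"body {i}") for i in range(10000)),
)

# Without an index the engine must scan every row; with one it can seek directly.
conn.execute("CREATE INDEX idx_title ON articles (title)")
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT body FROM articles WHERE title = ?", ("title-4242",)
).fetchone()
print(plan[-1])  # e.g. "SEARCH articles USING INDEX idx_title (title=?)"
```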

Since the “slow” operations involved in serving a site such as w3schools would be the following:
1 - get the request
2 - connect to database (or look for the thing in the filesystem)
3 - obtain the data in question
4 - send it

why not just get the request, pull the data from the memory (RAM) and send it? We just avoided the call to the database, we avoided asking OS to find the cache on HDD - basically what happened is that we avoided any sort of expensive call to obtain relevant info.
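
A minimal sketch of that idea (Python for illustration; render_page is a made-up stand-in for the expensive database query):

```python
# A minimal in-memory page cache.
db_calls = 0

def render_page(slug):
    """Stand-in for the expensive work: query the database, build HTML."""
    global db_calls
    db_calls += 1
    return f"<html>content for {slug}</html>"

cache = {}

def get_page(slug):
    # Only the very first request pays the expensive cost;
    # everything after that is served straight from RAM.
    if slug not in cache:
        cache[slug] = render_page(slug)
    return cache[slug]

for _ in range(1000):
    get_page("html_intro")
print(db_calls)  # 1
```

One request hits the "database"; the other 999 never leave memory, which is exactly the expensive call being avoided above.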

Barebones CMS is one of a handful of CMS products that do NOT use a database and relies, instead, on the file system. Here is a performance analysis of it that I put together:

I’d be interested in the following test:

Your CMS finding 25 random entries among a data set of 600 gigabytes and displaying them. The test you provided doesn’t prove anything about your product’s scaling abilities, hence I find it misleading. I’d also be interested in how your CMS scales with searching if all you rely on is the filesystem, not to mention using aggregate functions to obtain the various interesting data necessary for building reports.

And yet MediaWiki does use a static file cache :lol:
Dude, you really need to research more before making pronouncements like this. You’re starting to sound like Kalon.

  1. Huge websites tend to have a lot of traffic, which means they require a server cluster or some sort of load balancer. They usually have one cluster responsible for handling requests and another in charge of the database. Such a decentralized environment cannot rely on the local filesystem.

A Samba share and more than a couple of other distributed file systems would like to have a word with you…

The largest sites use a hybrid approach which relies on the fact that there are more reads than writes in almost any CMS (just compare view to post stats on this very forum, or any forum). The static file is cached after creation and remains until the next update occurs. With many CMSs, especially news sites, the static file may not be updated more than once an hour, or even less often.

Slashdot does this - it’s why your post may not appear on another computer for up to 2 minutes, and why comment counts on the front page are never very accurate.

MediaWiki, the engine behind wikipedia, uses this sort of caching. Any guest version of a “page” is stored in a filesystem cache such that neither PHP nor the database needs to be active for the page’s delivery.

And as I mentioned at the start of this post, even if you are using load balancing there are file-mirroring applications to address the distribution problem, and properly implemented, the PHP software doesn’t need to be aware they are in place at all.

Database reads are a bottleneck. Few sites experience the kind of traffic necessary to drive this point home, though, so to some extent worrying about it is over-engineering. Static files have their place - the webserver can always serve up static files faster than any PHP script.

But in databases’ defense - they will be faster than a filesystem at pulling disparate pieces of information together to form a page. This, and the ability to create derived information (like, say, computing ledger totals on a table column, or positional calculations using trig to show locations within X miles), is where they rule the roost, and will continue to do so.
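
The ledger-totals example is exactly what SQL aggregate functions are for; a tiny sketch (SQLite via Python for illustration, with an invented table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ledger (account TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO ledger VALUES (?, ?)",
    [("ads", 120.0), ("ads", 80.0), ("subs", 45.5)],
)

# One aggregate query replaces reading and summing many separate files.
rows = conn.execute(
    "SELECT account, SUM(amount) FROM ledger GROUP BY account ORDER BY account"
).fetchall()
print(rows)  # [('ads', 200.0), ('subs', 45.5)]
```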

why not just get the request, pull the data from the memory (RAM) and send it? We just avoided the call to the database, we avoided asking OS to find the cache on HDD - basically what happened is that we avoided any sort of expensive call to obtain relevant info.

Most webservers either do this with static files already, or have extensions that allow for it.

And yet MediaWiki does use a static file cache
Dude, you really need to research more before making pronouncements like this. You’re starting to sound like Kalon.

I don’t really care how MediaWiki works with its cache; if we’re talking about caching over 10k entries, I won’t resort to a static file cache stored on the hard drive. We could go over Samba and distributed file systems, but if you’re at the level where you need load balancers, clustered servers and so on, and if speed is critical - you will have the money for extra RAM.
To cut things short: all you’re talking about is storing the cache on the hard drive, and I’m talking about storing it in RAM. There’s no question which is more efficient and what the limitations are. So “dude”, before looking down on me from your throne, work on avoiding your arrogance a bit.

I’m perfectly aware of how things work, the question is when to use the right tool. If I have loads of unused RAM - I’ll use it for cache. If I don’t - I’ll use the hard drive if necessary. Simple as that.

If there’s loads of data with various attributes that are accounted for when displaying results, there’s no question of storing the data in the filesystem - that’s what databases are for.

Been there, done that, got the t-shirt even. Using RAM instead of the disk is all fine and well, but if you do it right PHP doesn’t need to know about it.

And you made the first pronouncement there sparky :rolleyes:

So anyone who uses a file cache is stupid, yes? If you can’t grasp how foolish your statement looks in light of the fact that your cited example, WIKIPEDIA, uses a file cache, there isn’t much help for you.

PHP sees a db program or the file system, and it doesn’t, and more importantly shouldn’t, care whether the db program or the OS pulled the information from disk or RAM. If it does, then there’s a problem, because such over-engineering collapses in a hurry, in my experience, when something goes wrong or when it comes time to scale things.

Feel free to use FILESYSTEM as cache, I’m still standing by my statement. It’s incredibly STUPID, if you have RAM at your disposal. If you can’t see why, then why even bother with IT as your choice of profession?
I’m going to excuse myself from further discussion, you strike me as one of those stubborn nevergonnachangemymind guys so I’ll stop wasting words. If you’re not able to grasp the simple concept of ram > hard disk, we have nothing to discuss.

You don’t understand separation of concerns. It isn’t normally the job of a PHP script to worry about RAM vs. disk. That’s the concern of the database program and the OS, and even to some degree the PHP interpreter rather than the script itself.

Perhaps we are talking past each other. When I say “file cache” I’m talking about taking the completed page, writing it to a directory where the webserver can see it, and then ending execution of the PHP script normally. The next user to request that resource hits the webserver and that program sees the cache file and delivers it rather than invoking PHP at all.

That, my friend, is the fastest way possible - no PHP at all. If the resource is requested frequently enough, the webserver or OS will keep the file in its memory cache, and if it’s REALLY getting hammered the hottest parts of a small, frequently requested file may even end up in the processor’s caches - though that’s the hardware’s doing, not the OS’s. But all of these events occur outside PHP.

You think I’m talking about writing a cache file and then using file_get_contents or include to pull it in on each iteration of the script. I’m not, and you are right if a bit heavy handed in pronouncing that approach to be silly. This sort of caching is best served by APC or similar mechanisms. But again, I’m not talking about that, I’m talking about using a static file cache to avoid the need to invoke the PHP Interpreter AT ALL.
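
The static-file-cache pattern described above, write the completed page once and let later requests bypass the interpreter entirely, can be sketched as (Python for illustration; in a real deployment the cache directory would be the webserver’s docroot):

```python
import os
import tempfile

cache_dir = tempfile.mkdtemp()  # stands in for a webserver-visible docroot

def publish(slug, html):
    # The script writes the completed page where the webserver can see it,
    # then exits; later requests for it never reach the interpreter at all.
    with open(os.path.join(cache_dir, slug + ".html"), "w") as f:
        f.write(html)

def invalidate(slug):
    # On content update, delete the static copy so the next request
    # regenerates it; until then the webserver serves the file directly.
    path = os.path.join(cache_dir, slug + ".html")
    if os.path.exists(path):
        os.remove(path)

publish("front-page", "<html>rendered once</html>")
with open(os.path.join(cache_dir, "front-page.html")) as f:
    print(f.read())  # <html>rendered once</html>
invalidate("front-page")
print(os.path.exists(os.path.join(cache_dir, "front-page.html")))  # False
```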

So get off your horse Blue. You aren’t the smartest person here. Not by a long shot. Neither am I. If it goes to a vote I’ll go for r937, Salanthe or ScallioTX* - in any event, none of them stoop to this sort of bickering, and I’m trying to learn to avoid it, God help me I am. But pronouncements like the one you made to start this whole chain off really irk me.

Read more carefully and be more open. I will try to do the same.

  • Apologies to anyone inclined to be slighted by being omitted. This is just off the top of my head.