Input on DB Planning and Application Interaction please :)

Sorry for the lengthy post… going to be processing a lot of data and once I flip the switch… it’s going to be a lot harder to switch it around! So, I want to know how other people would handle certain things. THANKS!

MySQL 5.0.92
Table Types: MyISAM

PHP 5.3.4

Looking for input… want to get the troubleshooting done up front!

I am working on an application that will be doing 20,000 API calls per hour and returning about 1,000,000 articles per hour that need to be logged. (using "article" as a generic term)

So… 24,000,000 articles per day… just to start… this will grow to probably at least double in the first 3 months. So… 720,000,000 to 1,440,000,000 rows per month within 3 months.

These articles come with a varying array of data types / uses.

It should be noted I do bulk inserts into MySQL using the VALUES (),(),() method. I gather things in PHP arrays and insert in groups of 50 or more. (will do larger once stable)
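
For clarity, a batch looks something like this (table and column names here are placeholders for illustration, not my final schema):

INSERT INTO article_stats
  (article_id, stat_1, stat_2, api_datetime, insert_datetime)
VALUES
  (101, 5, 12, '2011-06-01 13:00:00', NOW()),
  (102, 7,  3, '2011-06-01 13:00:00', NOW()),
  (103, 2,  9, '2011-06-01 13:00:00', NOW());
-- one statement per batch of 50+ rows gathered in the PHP array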

Also, you can mostly assume I know how to add appropriate INDEXes, but feel free to offer advice.

Main purpose: Tracking statistics of each article. The articles come along with statistics which are logged to a main statistics DB table.

Every instance of an article gets logged to the main tracking DB every time it is fetched, even if that's multiple times an hour (which could happen because an article may appear in more than one API feed). Each fetch needs to be logged to track which feed it came from, and the most accurate statistics data will be parsed out at a later time. (using the most accurate stats for that time period)

So… 720,000,000 rows monthly just to start for sure! Later on I will be aggregating the data and even serving it off to other servers through an in-house API.

I am not so concerned about getting data sets out of it, as I already have a 100M-row table I do similar work on using appropriate keys, but I am a bit concerned about the sheer table size as the months go on. 8.6 billion rows in 12 months is a lot… and that is the minimum!!!

How would you handle this many rows monthly of statistics data? (Separate tables, etc.?)

The row size is about 200 B and fixed length for max efficiency. Also, once data goes in… that’s it mostly. No updates or deletions. Row example from table article_stats: article id, stat, stat, stat, stat, stat, stat, api date, insert date
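
As a rough sketch, the table is along these lines (the stat column names and types are placeholders on my part; the point is that every column is fixed-width so MyISAM keeps the FIXED row format):

CREATE TABLE article_stats (
  article_id      INT UNSIGNED NOT NULL,
  stat_1          INT UNSIGNED NOT NULL,
  stat_2          INT UNSIGNED NOT NULL,
  stat_3          INT UNSIGNED NOT NULL,
  stat_4          INT UNSIGNED NOT NULL,
  stat_5          INT UNSIGNED NOT NULL,
  stat_6          INT UNSIGNED NOT NULL,
  api_datetime    DATETIME NOT NULL,
  insert_datetime DATETIME NOT NULL,
  KEY idx_article_api (article_id, api_datetime)
) ENGINE=MyISAM ROW_FORMAT=FIXED;

No VARCHAR/TEXT columns, so every row is the same length on disk.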

Secondary purpose: Tracking meta data and changes to the meta data. Each article comes along with meta data… some of which is quite bulky and does not need to be logged on every "hit" as the statistics do. However, I do want to track changes in the meta data… so I am trying to build my DB schema while taking that into account.

It is important to note that… while each article could be unique… in a lot of cases they aren't. For a certain portion… only their statistics changes are unique each hour. So, the 1,000,000 rows per hour of statistics data may only result in 100,000 new unique articles being added to the DB in the meta data tables. (I have no clue yet what the actual ratio is… it will constantly be in flux)

DB Planning

I have created article_stats (example above… fixed row length) for the stats (the monster table)

For the meta data I have the following plan…

articles (fixed row length - ONE row per article limit)

Mostly static data… article id, length, published datetime BUT also updates category AND title on duplicate articles

Uses ON DUPLICATE KEY UPDATE (Unique Key: article id)
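
Roughly (columns simplified, values just for illustration):

INSERT INTO articles
  (article_id, length, published_datetime, category, title)
VALUES
  (101, 320, '2011-05-30 08:00:00', 'news', 'Some title')
ON DUPLICATE KEY UPDATE
  category = VALUES(category),
  title    = VALUES(title);
-- article_id is the unique key, so repeat fetches just refresh
-- category/title instead of adding a second row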

article_meta (fixed row length)

Data that can change and whose changes I want to track… article id, category, title (article id can't change… it's used as the key identifier. Each row also has a datetime)

Uses ON DUPLICATE KEY UPDATE (Unique Key: article id, category, title)

A new row is only added if the category or title has changed!
(Thereby tracking changes.)
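
In other words, the unique key covers everything except the datetime, so an unchanged combination collapses onto the existing row and only a real change produces a new one. Something like this (the no-op update clause is just one way of doing the "IGNORE"):

INSERT INTO article_meta
  (article_id, category, title, api_datetime)
VALUES
  (101, 'news', 'Some title', '2011-06-01 13:00:00')
ON DUPLICATE KEY UPDATE
  api_datetime = api_datetime;
-- UNIQUE KEY (article_id, category, title): a changed category or title
-- misses the key and inserts a fresh row, which is the change history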

article_meta_lookup (variable row length)

Okay, this one is a special case… because the content type is too large (?) to use the ON DUPLICATE KEY UPDATE trick…

article id, type, value, api datetime, insert datetime

The 2 types of data are ‘description’ or ‘tags’ with values for description up to 5000 char (but average 600) and tags up to 600 char.

Again, I only want to record new articles or articles whose description OR tags have changed. To do this… I run 2 SELECTS in the application logic…

WHERE article = '$id' AND type = '$type' AND value = '$value' LIMIT 1

Then, if it does not exist… I add it to an array in PHP for inserting.
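
So per article and type it is basically this (literal values here just for illustration; in the app they are escaped variables):

SELECT 1
FROM article_meta_lookup
WHERE article = 101
  AND type = 'description'
  AND value = 'Current description text...'
LIMIT 1;
-- if nothing comes back, the row goes into the PHP array and is
-- written out with the next bulk INSERT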

Initially, I had article_meta_desc (article id, description, api datetime, insert datetime) and article_meta_tags (article id, tags, api datetime, insert datetime) instead of this one table, but the data types seemed so similar I combined them.

Would you keep the description and tag types in their own tables or combine them as I am leaning toward?

There are 2+ other tables to track meta data changes, but the ones above demonstrate their basic logic (mostly using ON DUPLICATE KEY UPDATE to “IGNORE” unchanged entries), so giving me insight about the above would give me insight on the others. (I only put the other data in additional tables because it seems very unrelated to the above mentioned data or less likely to change at the same time.)

So, to summarize the DB load for data gathering (mostly INSERTS) per hour (at launch!)…

article_stats: 1,000,000

articles (static data mostly): 1 to 1,000,000 (ON DUP UPDATE only updates static row)

article_meta (category, title): 1 - 1,000,000 (ON DUP UPDATE “IGNORE” unchanged rows)

article_meta_lookup (description, tags):

2 - 2,000,000 SELECTS (Could prob be combined to 1 - 1M… see the sketch just below)
2 - 2,000,000 INSERTS (Inserts rows not found by select)
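
The combined version of the lookup check would be something like this (a sketch, not something I have running yet):

SELECT DISTINCT type
FROM article_meta_lookup
WHERE article = 101
  AND (
        (type = 'description' AND value = 'Current description text...')
     OR (type = 'tags' AND value = 'tag1, tag2, tag3')
      );
-- whichever type is missing from the result set still needs inserting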

I am a little concerned about throwing this many selects into the mix… should I expect an issue here time-wise? (as in… slowing down my application's progress… should I perhaps try to handle these later? I could just bulk insert the XML for each entry into a queue DB and process later)

For "reads"… the level will be minimal, actually, compared to the "writes", and I don't expect much of an issue with reads for the foreseeable future… other than dealing with the size of the data store.

Conclusion

Yeah… that's a lot to chew on. I know some of this falls under YMMV, but any helpful input you can provide on how you would do things would be useful to me!

Thanks!

-freelancer

I wouldn't know where to start on the specifics, but here's a 10,000-foot view.

First, make sure you're on good hardware. This is going to require some serious power and storage space. It also needs to be able to scale, because once you're a year in, it will be extremely difficult to migrate.

Second, unless you require full-text indexing, use InnoDB. The table locks alone are going to kill you with MyISAM. If you require full-text indexing, I would use a hybrid InnoDB/MyISAM setup. Use InnoDB for normal database functions and use a MyISAM table specifically for the large indexed data sets.
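
For example, I'd keep the write-heavy tables on ENGINE=InnoDB and break the searchable text out into a small MyISAM table that exists only to carry the FULLTEXT index. Something like this (article_search is just a name I made up for illustration):

CREATE TABLE article_search (
  article_id INT UNSIGNED NOT NULL,
  value      TEXT NOT NULL,
  FULLTEXT KEY ft_value (value)
) ENGINE=MyISAM;
-- everything else (article_stats, articles, article_meta, ...)
-- gets created with ENGINE=InnoDB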

Third, you most likely will need a thoughtful partitioning/sharding layout. I can’t see a database of this size being usable unless you use extensive partitioning.
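
As a rough sketch of monthly range partitioning on the big stats table (note: native partitioning only arrived in MySQL 5.1, so the 5.0.92 you listed would need an upgrade first, and the column names here are just the ones from your post):

ALTER TABLE article_stats
  PARTITION BY RANGE (TO_DAYS(api_datetime)) (
    PARTITION p201106  VALUES LESS THAN (TO_DAYS('2011-07-01')),
    PARTITION p201107  VALUES LESS THAN (TO_DAYS('2011-08-01')),
    PARTITION p_future VALUES LESS THAN MAXVALUE
  );
-- queries filtering on api_datetime only touch the relevant partitions,
-- and old months can be dropped or archived as whole partitions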

You’ve really got several different parts here to consider, and I think that if you muck up any of them, the entire usage will drastically suffer once you get into a larger data set. You’ve got the straight hardware and disk setup, the database design, and the application design. Each one of these presents multiple areas for bottlenecks. I would tackle the tasks initially in this order, and then adjust as you get further into the project.

jestep… thanks for the (only) reply! lol

I think I've made some decisions and will get this thing rolling (for better or worse). I'll be keeping a close eye on it and hopefully will be able to switch things up if needed before it gets too unwieldy.

I've decided to keep data from this crawl in its own table… so it will top out at 8.6 billion rows per year. (it has some unique columns anyway) Also, I'm going to research partitioning appropriately.

One thing I am still up in the air about… the question regarding article_meta_lookup (that is likely the only table that may need full-text searches).

Anyway, if you have any input on a combined table as described or separate tables as originally planned, I would appreciate it.

Thanks!

-freelancer