Instead of setting it up to update on every X visits, set it up to update every X hours.
We have used a similar approach with success on an API service that received thousands of hits per second; there we bundled it with the system that temporarily blocks keys for abuse. Since that system already stored the keys, IPs and hits per hour, it was simply updated to keep storing the information instead of removing it after an hour.
Then twice per day, the information gets processed and inserted into the database.
What is important here is to make certain the memcache instance(s) the servers have access to are large enough, so the data is not evicted before it gets processed.
Please note that in your case you could also map-reduce the access log and get the same result, i.e. views per page. Doing it with memcache would most probably be easier, though, if you have a cluster of web servers.
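The count-in-memcache idea above can be sketched roughly like this. `CounterStore` is a stand-in for a real memcached client (only an `incr`-style operation is assumed); `record_hit` and `flush_to_db` are hypothetical names for the two halves of the pattern:

```python
class CounterStore:
    """Stand-in for a memcached client; only incr-style counters are used."""

    def __init__(self):
        self._data = {}

    def incr(self, key, delta=1):
        # memcached's incr is atomic; a dict is enough to show the pattern.
        self._data[key] = self._data.get(key, 0) + delta
        return self._data[key]

    def pop_all(self):
        # Drain every counter so the next window starts from zero.
        counts, self._data = self._data, {}
        return counts

store = CounterStore()

def record_hit(page_id):
    # Called on every page view; O(1), no database round trip.
    store.incr(f"views:{page_id}")

def flush_to_db(db_rows):
    # Run twice a day from cron: move the accumulated counts into SQL.
    for key, hits in store.pop_all().items():
        page_id = key.split(":", 1)[1]
        db_rows.append((page_id, hits))

# Usage: simulate some traffic, then a scheduled flush.
for _ in range(3):
    record_hit(42)
record_hit(7)

rows = []
flush_to_db(rows)
```

With a real memcached cluster, every web server calls `incr` against the same keys, so the cron flush sees the combined counts from the whole cluster.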
In some cases you cannot use Google Analytics, or similar systems.
In addition, even if you use it for tracking, there are many business reports that would be difficult to create if you don't also have the information stored in a way that allows you to generate the reports management or marketing ask for.
So you have about 150,000 unique pages each one getting a certain number of hits every day? Or do you mean 150,000 hits total for all pages and the number of pages is actually smaller? How many hits per second are we talking about?
Have you measured that the MySQL updates are the bottleneck? There are some ways to make updates faster, like using MyISAM tables, tuning the innodb_flush_log_at_trx_commit variable, or updating many rows in a transaction. You could also use the Memory storage engine as a temporary table for storing current hits and then periodically move the data to the proper tables for permanent storage.
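A sketch of that staging-table idea, with sqlite3 standing in for MySQL's MEMORY engine (table and column names are my assumptions, not anything from your schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hits_staging (page_id INTEGER, hits INTEGER);
    CREATE TABLE page_views (page_id INTEGER PRIMARY KEY, views INTEGER NOT NULL);
""")

def record_hit(page_id):
    # Cheap append-style insert on every view (a MEMORY table in real MySQL).
    conn.execute("INSERT INTO hits_staging VALUES (?, 1)", (page_id,))

def flush_staging():
    # Periodic job: fold staged hits into the permanent table in one transaction.
    with conn:
        conn.execute("""INSERT OR IGNORE INTO page_views (page_id, views)
                        SELECT DISTINCT page_id, 0 FROM hits_staging""")
        conn.execute("""UPDATE page_views
                        SET views = views + (SELECT SUM(hits) FROM hits_staging
                                             WHERE hits_staging.page_id = page_views.page_id)
                        WHERE page_id IN (SELECT page_id FROM hits_staging)""")
        conn.execute("DELETE FROM hits_staging")

# Usage: three staged hits collapse into two permanent rows.
for pid in (1, 1, 2):
    record_hit(pid)
flush_staging()
```

The point is that the hot path only ever does blind inserts into the staging table; the contended read-modify-write happens once per flush, not once per visit.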
The content pages get about 300,000 views per day. Tack on player counting and add another 250,000 views, with the view updates spread across two tables: Video Pages (115,000 rows) and Library Pages (30,000 rows). All rows have a chance of getting visited.
Right now we’ve been just doing +1 updates to a database server dedicated strictly to counting views, but are contemplating shutting that down to save money.
*should point out we’ve been slow this season. Our content pages can spike to 1.5M views in a day.
You have to realize the essential unreliability of memcache: it loses data by design, whether due to TTL, lack of memory, or anything else.
Memcache can only be used to replicate data that already exists somewhere else. Otherwise you'll be losing your views, by the tens.
Thus memcache is absolutely not the tool you're looking for. If you want to stick to an external solution, a NoSQL database like Redis or Mongo would be more reliable. However, I believe MySQL is still good for this number of writes, if used properly.
There are basically two problems with writing data.
The first one is the ACID compliance of InnoDB tables, which makes each write take a lot of time. If you have tables dedicated to storing views and nothing else, I'd suggest using the MyISAM engine for them, which will speed up your write queries roughly 70-fold on average if you are currently using InnoDB with innodb_flush_log_at_trx_commit set to 1 (the default). Even just setting it to 2 will be a great relief to your servers, but it depends.
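For reference, a my.cnf fragment for that tuning (the comments describe the standard trade-off; whether you can accept it depends on how precious each counted view is):

```ini
# my.cnf sketch: relax per-commit flushing on the server holding the counter tables.
# 1 (default) = fsync the log on every commit, fully durable but slow.
# 2          = write to the OS cache per commit, fsync roughly once per second;
#              a crash can lose up to ~1 second of counted views.
[mysqld]
innodb_flush_log_at_trx_commit = 2
```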
Another problem is more generic: index updates. Each write triggers an index update, which for huge tables can be slow. To overcome this, usually the same approach you proposed with memcache is used: collect your views in temp tables, and then flush them into the main tables once in a while. The difference from memcache is that you always keep everything you wrote.
This is about 17.3 visits per second: a busy site, but not so busy that you can't count visits efficiently. One way to speed things up is to use the MySQL solutions I and @colshrapnel were talking about above. If you are updating an InnoDB table in MySQL's standard configuration, these writes can be resource-consuming, since each one triggers an fsync, a physical write to disk. You probably don't need such rock-solid data retention in this case.
Another solution might be to simply use plain files for counting visits. Files are usually much faster for writes than the database, and writing to them takes advantage of the OS buffer, meaning the buffer is physically written only every X seconds, which improves performance a lot. For example, you can have files with the row ID in their names and on each page visit do this:
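A minimal sketch of that per-visit append (the directory location is an assumption; in production it would be any directory the web user can write to):

```python
import os
import tempfile

# Assumption: one counter file per row ID, in a scratch directory.
COUNTER_DIR = os.path.join(tempfile.gettempdir(), "view-counters")
os.makedirs(COUNTER_DIR, exist_ok=True)

def record_hit(row_id):
    # Append a single byte per visit. The OS buffers the write, so the
    # hot path never waits on a database round trip or an fsync.
    with open(os.path.join(COUNTER_DIR, str(row_id)), "ab") as f:
        f.write(b"+")

record_hit(115001)  # one visit to row 115001
```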
Then a cron job can periodically inspect the length of these files, issue an update for each row to increment the value, and clear the files. If you do the cumulative update of all rows in a single transaction, it will perform quite well even for InnoDB tables.
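The cron side of that could look like this sketch, with sqlite standing in for MySQL and an assumed `pages(id, views)` table; the trick is that each counter file's length *is* the visit count since the last flush:

```python
import os
import sqlite3
import tempfile

# Same assumed layout as the visit handler: one counter file per row ID.
COUNTER_DIR = os.path.join(tempfile.gettempdir(), "view-counters")
os.makedirs(COUNTER_DIR, exist_ok=True)

def flush_counters(conn):
    # File length == number of visits buffered since the last flush.
    with conn:  # every UPDATE in one transaction; fine even on InnoDB
        for name in os.listdir(COUNTER_DIR):
            if not name.isdigit():
                continue  # skip anything that isn't a row-ID file
            path = os.path.join(COUNTER_DIR, name)
            hits = os.path.getsize(path)
            if hits:
                conn.execute("UPDATE pages SET views = views + ? WHERE id = ?",
                             (hits, int(name)))
                open(path, "wb").close()  # truncate, ready for new visits

# Demo: two pages, three buffered visits, one flush.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (id INTEGER PRIMARY KEY, views INTEGER)")
conn.executemany("INSERT INTO pages VALUES (?, 0)", [(1,), (2,)])
with open(os.path.join(COUNTER_DIR, "1"), "wb") as f:
    f.write(b"++")  # two visits to row 1
with open(os.path.join(COUNTER_DIR, "2"), "wb") as f:
    f.write(b"+")   # one visit to row 2
flush_counters(conn)
```

Truncating each file only after its UPDATE is queued means a crash mid-flush can double-count a window at worst, never silently drop one, which is usually the right trade-off for view counters.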
I once wanted to implement a mechanism that logs each visit of a busy site into a database and was looking for a fast solution. Inserting data into a MyISAM table was much faster than into InnoDB. However, the fastest solution turned out to be writing each visit to a text file on a separate line, then having a cron job run every 5 minutes and transfer that data to the database.
Storing the information in the database should not be a problem with that few views. Your problem is not storing it in the database, but the way it is done.
When you update a row (from what I understand, one dedicated to that page), it is locked (with MyISAM, the entire table is locked); depending on your settings, you may or may not allow concurrent reads of the same data. If you get more than one hit to the same page at the same time, each one has to wait for the previous lock to be released before doing its update.
Create linked tables where you just insert each hit. Then you can map-reduce that once a day if you want, for example down to hits per day, per hour, etc.
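A sketch of that insert-then-aggregate approach, again with sqlite standing in for MySQL (`hit_log` and `daily_views` are assumed names): the hot path is a blind INSERT with no lock contention, and the daily rollup is where the counting happens.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hit_log (page_id INTEGER, hit_at TEXT);
    CREATE TABLE daily_views (day TEXT, page_id INTEGER, views INTEGER);
""")

def record_hit(page_id, ts):
    # Plain INSERT: unlike UPDATE ... + 1, concurrent hits never wait
    # on each other's row or table locks.
    conn.execute("INSERT INTO hit_log VALUES (?, ?)", (page_id, ts))

def rollup():
    # Once a day: reduce raw hits to per-day counts, then clear the log.
    with conn:
        conn.execute("""
            INSERT INTO daily_views (day, page_id, views)
            SELECT date(hit_at), page_id, COUNT(*)
            FROM hit_log
            GROUP BY date(hit_at), page_id
        """)
        conn.execute("DELETE FROM hit_log")

# Usage: three raw hits become two daily summary rows.
record_hit(7, "2013-05-01 10:00:00")
record_hit(7, "2013-05-01 11:30:00")
record_hit(9, "2013-05-01 12:00:00")
rollup()
```

Keeping the raw log until rollup also means you can later regroup by hour, referrer, etc., which the +1-counter approach can never reconstruct.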
It depends how important the data is, i.e. if you lose a day of data, is it mission critical? In our case it was not, as it was just used for trend statistics on the API usage.
Though, so far I do not believe we have lost even one day of data (we have been running this system for over two years). The key, as you mention as well, is that you need to be certain you have more than enough memory on the memcache servers.
If the data is mission critical and you receive too much data, then NoSQL is a good solution. Though I would use Cassandra in that case, set up with multiple nodes (mainly as its fault tolerance is better).
In terms of memcache and how critical the info is: it's not critical. If memcache crashes, memory is cleared, or the server restarts, at worst we miss reporting up to 100 views for some content, which is nothing.
The Library table is MyISAM, so it will experience table locks during an update, while the Video table is InnoDB and will experience row locking for the update.
Do you realize that memcache loses data not only on restart? That it's intended to lose data, by design? And that you'd likely get no views at all for the less popular videos that are viewed just once in a while?
Don’t know if it would work better, but try having it write the needed info to an XML file for each day, then have a cron job start a new file at the start of each new day and do a bulk insert into the database based on the data in the XML file (I’ve never worked with XML files, so I can’t say if it’s viable).
When it comes to keeping count, there will be no TTL.
On my actual memcached pages and archives, the TTL is sometimes 6 minutes, while other times it’s 29 days, with keys only getting deleted and recreated when a row is edited or something is added. If the key doesn’t exist, the page makes a request to the database and creates the key again.
I’m going to have two dedicated InnoDB tables for collecting the view updates, and they will sync their data over to the primary tables only once per day.