In the first part of this series, we looked at a relational databases and how NoSQL is different in comparison to relational databases. In this part we will look at “Google App Engine Datastore” which is one of options for storing your data with Google App Engine. Among other options first one is Google Cloud SQL – which is a relational DB in cloud, based on MySQL. Second option is Google Cloud Storage which is a storage service for storing files and objects of sizes upto terabytes.
Google App Engine Datastore
The Datastore is an infinitely scalable, schemaless object store, right at your disposal. It handles data quite a bit different than a RDBMS, which is why it provides solutions for many of RDBMS shortcomings. At the same time it comes with it’s own set of restrictions and different ways of modelling and building the data access layer of your application. Let’s look at some features of the Google App Engine Datastore.
The Datastore doesn’t require a fixed schema for your data. It’s an object store, you can throw objects at it and it’ll store them. Let’s talk in relation to a practical example. Let’s say we need to store business card information for an application. First name, last name and email address are mandatory fields, and there are optional fields like mobile number, LinkedIn URL, Twitter handle. Now when we are storing an entity of type “Business” we of course need to make sure to supply the mandatory fields, but optional fields can be stored only when they are available. So one entity might have twitter handle stored and another one might have twitter handle and mobile number. Let’s look at a JSON representation of object to understand the how entity looks like:
"firstName" : "John", "LastName" : "Taylor", "Email" : "email@example.com", "twitter_id" : "john_t" }
"firstName" : "Tom", "LastName" : "Rogers", "Email" : "firstname.lastname@example.org", "twitter_id" : "trogers", "mobileNumber" : "567 555 1256" }
Let’s compare how this would be a modeled in a relational DB. A table with columns of all required and all optional fields will be created. For entities which don’t have any value for an optional field, a “null” will be filled. Any new field to be added to entity would mean change in table structure and populating value of that field for all entities.
You can store as much data in the datastore as you desire (leaving per GB cost aside for a moment), none of your queries will slow down. Fetching five entities from 50 is no different than fetching five entities from 50 million, performance wise. Query runtime will only increase with the size of your result-set and not the data-set to be scanned.
If you recall the discussion we had in part 1 of this series, you’ll quickly ask yourself how the Datastore is able to shard automatically, when RDBMS can’t. This amazing property is due to the way data is modeled inside the Datastore. Instead of spreading attributes across several relations, all information about a single entity is kept in one place. All entities are then ordered by their unique id. A simple algorithm can now split this list of entities by their ids and store the resulting shards on separate machines. The same algorithm can now be used to route every request to the appropriate machine.
Strong consistency vs. eventual consistency
The Datastore will only guarantee strong consistency for reads and “ancestor queries”, every other query will be eventually consistent. So there is a slight chance, that a user might not see the most up-to-date version of an entity, when it was very recently updated. This is not a big deal for a lot of use-cases (“Gosh! This Tweet only showed up now, when it was posted two seconds ago!”). There are ways to trade performance for strong consistency, but I won’t go into them here (take this as a starting point).
Sacrificing perfect consistency means there is no need to immediately synchronize all machines, a RDBMS would have to wait until every machine finishes updating the data. The Datastore also won’t check referential integrity, because that would mean having to read data from other machines to ensure a valid update. Another benefit of being able to scale horizontally is, that the Datastore is impressively fault-tolerant. Many machines in your data center could fail, and your data would most likely still be served without hesitation.
Modeling your data
In the datastore, you’ll have to model your data based on the queries you’ll want to run in your application. Because all the data for one entity has to be kept in one place, the Datastore will have to build an index for every query you need, before such queries can be served. This implies, that Ad-Hoc queries won’t be an option with the Datastore. App Engine provides (great) tools like MapReduce to crunch to huge datasets and perform analytical tasks, but such tasks will take significantly more time to implement than the elegant SQL statement you might be used to.
Choosing between SQL and NoSQL datastore
To summarize the discussion so far
“The Datastore provides a way to persist ‘dumb’ data, which the application turns into information, RDBMS provide a way to persist structured data, which the application can make use of directly”
Remember the discussion we had in part 1 around RDBMS needing information about the structure of your data to provide application-independent services like data aggregation on the database-level. The Datastore won’t do that. It only cares about the pieces of data it needs to build indexes, the rest of your entity is seen as a sealed blob of bytes.
Let’s look at some scenarios and use cases and try to evaluate if the RDBMS is a better solution or a NoSQL
- Can a single server provide the performance we need? Maybe by utilizing caching? In this case RDBMS is the way to go. Look for example into CloudSQL, AppEngine’s purely relational offering.
- Do you plan on growing a multi-application environment working on the same dataset? Depending on your volume, RDBMS might be what you need, because it separates the database layer very strictly from the application.
- Do you need Ad-Hoc queries? In terms of query flexibility, SQL is the clear winner.
- Do you require perfect consistency? Even though there are ways to achieve strong consistency in the Datastore, it’s not what it was designed for. Again, RDBMS is the better choice.
- Are you expecting millions of reads & writes per second? The Datastore provides automatic scaling to infinity and beyond, and it’s right there for you to use with AppEngine.
- Do you need a simple, scalable way to persist entities with variable attributes? Even though you’ll have to handle consistency and data aggregation yourself, the schemaless Datastore should be what you need. And it’s integrated right into AppEngine, so it’s the ideal choice for quick prototypes with changing entities.
Choosing the right database is a vast topic, but with two great options at hand (CloudSQL and the Datastore), you at least won’t have to steer away from App Engine. I hope this article made the decision easier for you, and I wish you all the best with whatever great application you have in mind.
- SQLvsNoSQL - BattleoftheBackends : A very informative and somewhat entertaining presentation at Google I/O 2012
- ThegreatAppEnginedocumentation : Provides lots of information about everything Datastore related
- MoredetailsonNoSQL - Datamodeling
- Google App Engine: Databse Strategies
Choosing a right datastore for your application and understanding NoSQL, both are big topics. What we have covered in this two part series is a just the tip of an iceberg. Any missed out mentions of a major or emerging player in this space is purely inadvertent. We at CloudSpring hope to cover more of this great subject in near future, keep watching.