This is the first article in an occasional series on interesting database technologies outside the (No)SQL mainstream. I will introduce you to the core concepts of these DBMS, share some thoughts on them and tell you where to find more information.
Most of this is not intended for immediate use in your next project: rather, I want to provide inspiration and communicate interesting new takes on the problems in this field. But if, someday, one of those underdogs becomes the status quo, you can tell everyone that you knew it before it was cool …
All jokes aside, I hope you’ll enjoy these. Let’s get started.
Datomic is the latest brain-child of Rich Hickey, the creator of Clojure. It was released earlier this year and is basically a new type of DBMS that incorporates his ideas about how today’s databases should work. It’s an elastically scalable, fact-based, time-sensitive database with support for ACID transactions and JOINs.
Here are the core aspects this interesting piece of technology revolves around:
- A novel architecture. Peers, Apps and the Transactor.
- A fact-based data-model.
- A powerful, declarative query language, “Datalog”.
The Datomic team wants its DBMS to provide the first “real” record implementation. Records in the pre-computer age preserved information about the past, whereas in today’s databases old data is simply overwritten with new. Datomic changes that: it preserves all information and tells facts apart by making time an integral part of the system.
1. A novel architecture
The single most revolutionary thing about Datomic would be its architecture. Datomic puts the brain of your app back into the client. In a traditional setup, the server handles everything from queries and transactions to actually storing the data. With increasing load, more servers are added and the dataset is sharded across them. As most of today’s NoSQL databases show, this method works very well, but comes at the cost of some “brain”, as Mr. Hickey argues. The loss of consistency and/or query power is a well-known tradeoff for scale.
To achieve distributed storage, but with a powerful query language and consistent transactions, Datomic leverages existing scalable databases as simple distributed storage services. All the complex data processing is handled by the application itself, almost as in a native desktop application (if you can remember one of those).
This brings us to the first cornerstone of the Datomic infrastructure:
The Peer Application
A peer is created by embedding the Datomic library into your client code. From then on, every instance of your application will be able to:
- communicate with the Transactor and storage services
- run Datalog queries, access data and handle caching of the working set
Every peer manages its own working-set of data in memory and synchronizes with a “Live Index” of the global dataset. This allows the application to run very flexible queries without the need for roundtrips (more on that under “Criticism”).
But so far, we’ve only got back query power. To also re-enable consistent transactions, Datomic takes a step further: it makes the storage service read-only and forces all writes through a new kind of architectural component, the “Transactor”.
The Transactor will:
- handle ACID transactions
- synchronously write to redundant storage
- communicate changes to Peers
- index your dataset in the background
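To make this division of labour a bit more tangible, here is a toy model in Python. It is only a sketch and not Datomic's API: the `Storage`, `Peer` and `Transactor` classes and all of their method names are made up for illustration, a lock stands in for the serialized write path, and a simple counter stands in for the transaction timestamp.

```python
import threading


class Storage:
    """Hypothetical stand-in for the storage service: an append-only log."""

    def __init__(self):
        self.log = []


class Peer:
    """Hypothetical peer: holds a working set in memory and queries it locally."""

    def __init__(self, storage):
        self.cache = list(storage.log)  # local copy of the facts seen so far

    def on_change(self, facts):
        # Called by the transactor after a successful commit.
        self.cache.extend(facts)

    def query(self, predicate):
        # Reads are answered from the local cache, never by the transactor.
        return [fact for fact in self.cache if predicate(fact)]


class Transactor:
    """Hypothetical single writer: serializes writes, then notifies peers."""

    def __init__(self, storage):
        self.storage = storage
        self.peers = []
        self._lock = threading.Lock()
        self._next_tx = 1

    def connect(self, peer):
        self.peers.append(peer)

    def transact(self, facts):
        with self._lock:  # only one transaction at a time
            tx = self._next_tx
            self._next_tx += 1
            stamped = [(e, a, v, tx) for (e, a, v) in facts]
            self.storage.log.extend(stamped)  # write to storage first
            for peer in self.peers:  # then communicate the change
                peer.on_change(stamped)
            return tx


storage = Storage()
transactor = Transactor(storage)
peer_a, peer_b = Peer(storage), Peer(storage)
transactor.connect(peer_a)
transactor.connect(peer_b)

transactor.transact([("john", ":balance", 12000)])
balances = peer_b.query(lambda fact: fact[1] == ":balance")
```

Even in this toy version the trade-off is visible: queries scale with the number of peers, while every write funnels through the single `transact` method and has to be pushed out to each connected peer.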
It seems as if the Datomic team banished everything that made relational DBMS hard to scale into a separate module, and tried not to worry too hard about it. For example, the Datomic Rationale states:
“When reads are separated from writes, writes are never held up by queries. In the Datomic architecture, the transactor is dedicated to transactions, and need not service reads at all!”
“Putting query engines in peers makes query capability as elastic as the applications themselves. In addition, putting query engines into the applications themselves means they never wait on each other’s queries.”
The first statement is true for some read operations, but I couldn’t find a hint as to how the Transactor handles reads inside transactions. The documentation does mention that “Each peer and transactor manages its own local cache of data segments, in memory”, but that would require perfect cache synchronisation. Otherwise consistency is only guaranteed for one peer, which quite frankly is pointless. Hopefully, this management overhead won’t neutralize the promising ACID capabilities and the performance gains of in-memory operations.
The second quote makes initial sense, but still raises a few concerns:
What happens when one Transactor faces too much load? The Datomic team would like to avoid sharding, but wouldn’t exactly that be necessary at some point? Also, even if we pretend that the number of transactions wouldn’t increase with more peers, the time it takes to transmit changes to all of them sure does.
In conclusion, the Transactor could be an amazing thing to have with smaller datasets, but under heavier load it may become a performance bottleneck or a single point of failure.
Storage services handle the distributed storage of data. Some possibilities:
- Transactor-local storage (free, useful for playing with Datomic on a single machine)
- SQL databases (require Datomic Pro)
- DynamoDB (requires Datomic Pro)
- Infinispan Memory Cluster (requires Datomic Pro)
… plus a few more. Storage service support could be one big reason to try out Datomic, but unfortunately, only the temporary local storage is available for Datomic Free users (aka users who aren’t willing to pay $3,000+ for a brand new DBMS).
All in all, the Datomic architecture comes with loads of innovative ideas and potential benefits, but its real-world applicability remains to be proven.
2. A fact-based data-model
Datomic doesn’t model data as documents, objects or rows in a table. Instead, data is represented as immutable facts called “Datoms”. They are made up of four pieces:
- Entity
- Attribute
- Value
- Transaction timestamp
Datoms are highly reminiscent of the subject-predicate-object scheme used in RDF Triplestores.
Anything can be a datom:
“John’s balance is $12,000” → [john :balance 12000 <timestamp>]
Attribute definitions are the only type of schema imposed on the dataset.
In a relational database, this would be represented as a 12000 in the “balance” cell of the “john” row (data is place-oriented). If, a month later, John’s balance changes to 6,000, this specific cell is wiped and the new value is put in. The fact that John had 12,000 in his account a month ago is gone forever.
One of the main reasons for the creation of Datomic was the feeling that today’s hardware is finally able to keep true records of data, something no other popular DBMS to date does. Datomic never updates data; it simply writes the new facts and keeps the old ones. This paves the way for lots of interesting, time-sensitive queries.
Because of its immutable, fact-based nature, Datomic would handle John’s new balance by simply inserting a new fact:
[john :balance 6000 <timestamp2>]
Facts are never lost. If John is interested in his current balance, he queries for the most recent Datom. But nothing would prevent him from querying his complete balance history anytime he wants. Besides, Datoms are, as the name implies, atomic, the highest possible form of normalization. You can express your data model in as many entities as you want; the bigger picture is automatically constructed via implicit JOINs.
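The difference to place-oriented storage is easy to show with a small Python sketch. This is a toy model under my own assumptions, not Datomic code: `Datom`, `FactStore` and their methods are hypothetical names, and an incrementing counter stands in for the transaction timestamp.

```python
from collections import namedtuple

# Hypothetical four-part fact, mirroring [john :balance 12000 <timestamp>].
Datom = namedtuple("Datom", ["entity", "attribute", "value", "tx"])


class FactStore:
    """Toy append-only store: facts are added, never updated in place."""

    def __init__(self):
        self._datoms = []
        self._tx = 0  # stand-in for the transaction timestamp

    def add(self, entity, attribute, value):
        self._tx += 1
        self._datoms.append(Datom(entity, attribute, value, self._tx))
        return self._tx

    def history(self, entity, attribute):
        # Every fact ever recorded for this entity and attribute, oldest first.
        return [d for d in self._datoms
                if d.entity == entity and d.attribute == attribute]

    def current(self, entity, attribute):
        # The "current" value is simply the most recent fact.
        facts = self.history(entity, attribute)
        return facts[-1].value if facts else None


store = FactStore()
store.add("john", ":balance", 12000)
store.add("john", ":balance", 6000)

current_balance = store.current("john", ":balance")
balance_history = [d.value for d in store.history("john", ":balance")]
```

Here `current_balance` is 6000, while `balance_history` still contains both 12000 and 6000: the old fact was never overwritten.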
Some benefits of this approach:
- Support for sparse, irregular or hierarchical data
  - Attribute values can be references to other entities
- Native support for multi-valued attributes
- No enforced schema
- No need to store data history separately; time is an integral part of Datomic
Such flexibility allows Datomic to function as almost anything, for example a complete graph API.
Some drawbacks of this approach:
- Not suited for large, dynamic data
  - As updates are always written as new datoms with a more recent timestamp, large, dynamic blobs of data would soon fill up quite a bit of space
- Flexible schemas tend to invite carelessness
  - Some planning should still be done on the data model
- Attribute conflicts
  - Namespacing should be employed right from the beginning
3. Datalog, the finder of lost facts
Datomic queries are made up of “WHERE”, “FIND” and “IN” clauses, and a set of rules to apply to facts. The query processor then finds all matching facts in the database, taking into account implicit information.
Rules are “fact-templates”, which all facts in the database are matched against.
Explicit rules could be something like:
[?entity :age 42]
Implicit rules look like variable bindings:
[?entity :age ?a]
and can be combined with LISP/Clojure-like expressions:
[ (< ?a 30) ]
A rule-set to match customers of age > 40 who bought product p would look like this:
[?customer :age ?a] [ (> ?a 40) ] [?customer :bought p]
The rule-sets are then embedded into the basic query-skeleton:
[:find <variables> :where <rules>]
<variables> is just the set of variables you want to have included in your results. We are only interested in the customer, not in his age, so our customer query would look like this:
[:find ?customer :where [?customer :age ?a] [(> ?a 40)] [?customer :bought p]]
We now run this by handing the query and a database value to the Peer library’s query function, “q”.
This syntax will probably take some getting used to (unless you’re familiar with Clojure), but I find it to be very readable and, as the Datomic Rationale promises, “meaning is evident”.
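If you want to see what matching rules against facts boils down to, here is a toy matcher in Python. It is a rough sketch of the idea, not Datomic's query engine: `is_var`, `match_pattern` and `query` are names I made up, facts are plain three-part tuples without timestamps, and a lambda stands in for the `[(> ?a 40)]` expression.

```python
def is_var(term):
    # Variables are written like in the queries above: "?customer", "?a".
    return isinstance(term, str) and term.startswith("?")


def match_pattern(pattern, fact, bindings):
    """Match one fact template against one fact.

    Returns the extended variable bindings, or None if they don't match.
    """
    result = dict(bindings)
    for term, value in zip(pattern, fact):
        if is_var(term):
            if term in result and result[term] != value:
                return None  # variable already bound to something else
            result[term] = value
        elif term != value:
            return None  # constant in the template differs from the fact
    return result


def query(find, where, facts):
    """Apply the clauses left to right; callables filter the bindings."""
    candidates = [{}]
    for clause in where:
        if callable(clause):
            candidates = [b for b in candidates if clause(b)]
        else:
            candidates = [
                extended
                for b in candidates
                for fact in facts
                for extended in [match_pattern(clause, fact, b)]
                if extended is not None
            ]
    return {tuple(b[var] for var in find) for b in candidates}


facts = [
    ("alice", ":age", 45), ("alice", ":bought", "p"),
    ("bob", ":age", 52), ("bob", ":bought", "q"),
    ("carol", ":age", 33), ("carol", ":bought", "p"),
]

result = query(
    find=["?customer"],
    where=[
        ("?customer", ":age", "?a"),
        lambda b: b["?a"] > 40,
        ("?customer", ":bought", "p"),
    ],
    facts=facts,
)
```

With this made-up data, `result` contains only alice: bob is old enough but bought something else, and carol bought p but is too young. Reusing `?customer` in two templates is what produces the implicit join.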
Querying the past
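To run your queries on your fact-history, no change in the query string is required. You hand the same query a database value as of an earlier point in time (the Peer library’s “asOf”) instead of the current one.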
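The idea behind this can be sketched in Python. Again a toy model under my own assumptions rather than Datomic's implementation: the `as_of` and `current_value` helpers are hypothetical, and facts are tuples whose last element is a transaction counter instead of a real timestamp.

```python
def as_of(datoms, tx):
    """View of the database as it stood at transaction `tx`."""
    return [d for d in datoms if d[3] <= tx]


def current_value(datoms, entity, attribute):
    """The same 'query' for every view: pick the most recent matching fact."""
    matching = [d for d in datoms if d[0] == entity and d[1] == attribute]
    return max(matching, key=lambda d: d[3])[2] if matching else None


datoms = [
    ("john", ":balance", 12000, 1),
    ("john", ":balance", 6000, 2),
]

now = current_value(datoms, "john", ":balance")
last_month = current_value(as_of(datoms, 1), "john", ":balance")
```

The query function itself is untouched. Only the database value passed to it changes, so `now` is 6000 and `last_month` is 12000.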
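You can also simulate your query on hypothetical, new data, kind of like a predictive query about the future. The Peer library’s “with” lets you query a database value that includes the new facts without ever committing them.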
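A speculative query of this kind can be sketched in Python as well. This is a toy illustration and not the Datomic API: `with_facts` and `current_value` are hypothetical helpers, and the key assumption is that the database is treated as a value, so adding facts produces a new value and leaves the original alone.

```python
def with_facts(datoms, new_facts):
    """Return a new database value that also contains `new_facts`.

    The input list is not modified, so nothing is committed.
    """
    next_tx = max((d[3] for d in datoms), default=0) + 1
    return datoms + [(e, a, v, next_tx) for (e, a, v) in new_facts]


def current_value(datoms, entity, attribute):
    """Pick the most recent fact for an entity and attribute."""
    matching = [d for d in datoms if d[0] == entity and d[1] == attribute]
    return max(matching, key=lambda d: d[3])[2] if matching else None


datoms = [
    ("john", ":balance", 12000, 1),
    ("john", ":balance", 6000, 2),
]

# What would John's balance be if he received 3,000 tomorrow?
speculative = with_facts(datoms, [("john", ":balance", 9000)])

what_if = current_value(speculative, "john", ":balance")
actual = current_value(datoms, "john", ":balance")
```

Here `what_if` is 9000 while `actual` stays at 6000, because the hypothetical fact only ever existed in the speculative copy.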
Datomic is certainly not here to kill every other DBMS, but it’s an interesting match for some applications. The one that came to my mind first was analytics:
- Facts are immutable. Non-ACID writes should be fast, and analytics systems usually won’t require strong consistency.
- Facts are time-sensitive. This is quite interesting for analytics.
Status messages, tweets or prices could also be stored much more naturally with regard to their dynamic nature. Also, since this DBMS was explicitly constructed to provide “real” records, everything record related should be an obvious fit.
In general, Datomic’s time-sensitive layer provides an interesting twist to your existing dataset. It can be used as a normal DBMS, with an additional dimension of insight. Imagine your e-commerce database including the complete price history of every item and the engagement history of every customer. Wouldn’t it be fascinating to get a quick answer to otherwise complex questions? “When did this product become popular? – Oh, it was after the $10 price drop.” “When did this customer start using the site every day? – That was two months ago, here is a graph of his daily time-on-page increase.”
Revolutionary ideas, like the ones Datomic is based on, should always be appreciated, but analysed from multiple angles. The technology is too young to make a final judgement but some early criticism includes:
- Separation of data and processing. All required data has to be moved to the client application before it can be processed or queried. This might pose a problem once you get to larger datasets. The local cache will also inevitably constrain working-set growth. Once this upper bound is passed, round-trips to the server backend will be necessary again, with even bigger performance penalties. The Datomic team expects its approach to “work for most common use cases”, but this can’t be verified at this early stage.
- The Transactor component as a single point of failure and a bottleneck.
- Sharding might not be necessary for the read-only infrastructure, but the Transactor will need some type of sharding mechanism once it has to deal with heavy loads.
- The Datomic Pro pricing is quite restrictive, and considering that Datomic Free is no good for any kind of production use, building some experimental projects will be pretty hard for the average developer.
Alex Popescu wrote a great post on Datomic, focusing on these critical points and some more.
I personally think that many of the concepts and ideas behind Datomic, especially making time a first-class citizen, are great and hold a lot of potential. But I can’t see myself using it in the near future: I’d like to verify some of the team’s performance claims for myself, and a Pro license is out of reach of my finances.
Otherwise, some of Datomic’s advanced features, like fulltext search, multiple data-sources (besides the distributed storage service) and the possibility to use the system for local data-processing only, could potentially be useful.
Please share any thoughts you might have in the comments!
- Datomic: Initial Analysis (comparing Datomic to other DBMS)