Modeling Data Relationships in MongoDB

It’s important to consider the data needs of your application right from the very start of development. But if your app will be using NoSQL and you come from a RDBMS/SQL background, then you might think looking at data in terms of NoSQL might be difficult. This article will help you by showing you how some of the basic data modeling concepts apply in the realm of NoSQL.

I’ll be using MongoDB for our discussion as it is one of the leading open-source NoSQL databases due to its simplicity, performance, scalability, and active user base. Of course the article assumes that you are aware of basic MongoDB concepts (like collections and documents). If not, I suggest you read some of the previous articles here at SitePoint to get started with MongoDB.

Understanding Relationships

Relationships show how your MongoDB documents are related to each other. To understand the different ways in which we can organize our documents, let’s look at the possible relationships.

1-to-1 Relationship (1:1)

A 1:1 relationship exists when one object of an entity is related to one and only one object of another entity. For example, one user can have one and only one birth date. So if we have a document which stores user information and another document which stores birth dates, there will be a 1:1 relationship between them.

mongo-model-01

1-to-Many Relationship (1:N)

Such relationship exists when one object of an entity can be related to many objects of another entity. For example, there can be 1:N relationship between a user and his contact numbers since it is possible for one user to have more than one number.

mongo-model-02

Many-to-Many Relationship (M:N)

Such relationship exists when one object of an entity is related to many objects of another entity and vice versa. If we correlate it to the case of users and items bought by them, one user can purchase more than one items as well as one item can be purchased by more than one user.

mongo-model-03

Modeling 1-to-1 Relationships (1:1)

Consider the following example where we need to store address information for each user (for now let’s assume that there’s a single address for each user). In this case, we can design an embedded document having the following structure:

{
    "_id": ObjectId("5146bb52d8524270060001f4"),
    "user_name": "Mark Benzamin" 
    "dob": "12 Jan 1991",
    "address": {
       "flat_name": "305, Sunrise Park",
       "street": "Cold Pool Street",
       "city": "Los Angeles" 
    }
}

We have the address entity embedded within the user entity making all of the information present in a single document. This means we can find and retrieve everything with a single query.

<?php
// query to find user 'Mark Benzamin' and his address 
$cursor = $collection->find(
    array("user_name" => "Mark Benzamin"),
    array("user_name" => 1,"address" => 1)
 );

Embedding documents is roughly analogous to de-normalization and is useful when there is a “contains” relationship between two entities. That is, one document can be stored within another, thus placing the related pieces of information within a single document. As all the information is available in one document, such approach has better read performance because a query operation within a document is less expensive for the server and we find and retrieve the related data in the same query.

In contrast, a normalized approach would call for two documents (ideally in separate collections), one to store basic user information and another to store address information. The second document would contain a user_id field indicate the user to which the address belongs.

{
    "_id": ObjectId("5146bb52d8524270060001f4"),
    "user_name": "Mark Benzamin",
    "dob": "12 Jan 1991"
}
{
    "_id": ObjectId("5146bb52d852427006000de4"),
    "user_id": ObjectId("5146bb52d8524270060001f4"), 
    "flat_name": "305, Sunrise Park",
    "street": "Cold Pool Street",
    "city": "Los Angeles" 
}

We now need to execute two queries to fetch the same data:

<?php
// query to find user information
$user = $collection->findOne(
    array("user_name" => "Mark Benzamin"),
    array("_id" => 1, "user_name" => 1)
);

// query to find address corresponding to user
$address = $collection->findOne(
    array("user_id" => $user["_id"]),
    array("flat_name" => 1, "street" => 1, "city" => 1)
);

The first query fetches the _id of the user which is then used in the second query to retrieve his address information.

The embedding approach makes more sense than the referencing approach in this case since we are frequently retrieving user_name and address together. What approach you should use ultimately depends on how you logically connect your entities and what data you need to retrieve from the database.

Modeling Embedded 1-to-Many Relationships (1:N)

Now let’s consider the case where one user can have multiple addresses. If all of the addresses should be retrieved along with the basic user information, it would be ideal to embed the address entities inside the user entity.

{
    "_id": ObjectId("5146bb52d8524270060001f4"),
    "user_name": "Mark Benzamin"
    "address": [
        {
            "flat_name": "305, Sunrise Park",
            "street": "Cold Pool Street",
            "city": "Los Angeles"
        },
        {
            "flat_name": "703, Sunset Park",
            "street": "Hot Fudge Street",
            "city": "Chicago" 
        }
    ]
}

We’re still able to fetch all of the required information with a single query. The referencing/normalized approach would have us design three documents (one user, two addresses) and two queries to accomplish the same task.
Besides efficiency and convenience, we should use the embedded approach in instances where we need atomicity in operations. Since any updates happen within the same document, atomicity is always insured.

Modeling Referenced 1-to-Many Relationships (1: N)

Bear in mind that embedded documents can continue to grow in size over the life of an application which can badly impact write performance. There’s also a limit of 16MB on the maximum size of each document. A normalized approach is preferred if the embedded documents would be too large, when the embedding approach would result in a lot of duplicate data, or if you need to model complex or hierarchical relationships between documents.

Consider the example of maintaining posts made by a user. Let’s assume that we want the user’s name and his profile picture with every post (similar to a Facebook post where we can see a name and profile picture with every post). The denormalized approach would store the user information in each post document:

{
    "_id": ObjectId("5146bb52d8524270060001f7"),
    "post_text": "This is my demo post 1", 
    "post_likes_count": 12, 
    "user": {
        "user_name": "Mark Benzamin",
        "profile_pic": "markbenzamin.jpg"
    }
}

{
   "_id": ObjectId("5146bb52d8524270060001f8"),
   "post_text": "This is my demo post 2", 
   "post_likes_count": 32, 
    "user": {
        "user_name": "Mark Benzamin",
        "profile_pic": "markbenzamin.jpg"
    }
}

We can see this approach stores redundant information in each post document. Looking a bit forward, if the user name or profile picture was ever changed, we would have to update the appropriate field in all of the corresponding posts.
So the ideal approach would be to normalize the information and connect it via references.

{
    "_id": ObjectId("5146bb52d852427006000121"),
    "user_name": "Mark Benzamin",
    "profile_pic": "markbenzamin.jpg"
}
{
    "_id": ObjectId("5146bb52d8524270060001f7"),
    "post_text": "This is my demo post 1", 
    "post_likes_count": 12, 
    "user_id": ObjectId("5146bb52d852427006000121")
}

{
    "_id": ObjectId("5146bb52d8524270060001f8"),
    "post_text": "This is my demo post 2", 
    "post_likes_count": 32, 
    "user_id": ObjectId("5146bb52d852427006000121")
}

The user_id field in the post document contains a reference to the user document. Thus, we can fetch posts made by the user using two queries as follows:

<?php
$user = $collection->findOne(
    array("user_name" => "Mark Benzamin"),
    array("_id" => 1, "user_name" => 1, "profile_pic" => 1)
); 

$posts = $collection->find(
    array("user_id" => $user["_id"])
);

Modeling Many-to-Many Relationships (M:N)

Let’s take our previous example of storing users and the items they purchased (ideally in separate collections) and design referenced documents to illustrate the M:N relationship. Assume the collection which store documents for user information is as follows such that each document contains reference IDs for the list of items purchased by the user.

{                                 
    "_id": "user1",
    "items_purchased": {
        "0": "item1",
        "1": "item2" 
    } 
}

{
    "_id": "user2",
    "items_purchased": {
        "0": "item2",
        "1": "item3" 
    } 
}

Similarly, assume another collection storing the documents for the available items. These documents will in turn store reference IDs for the list of users who have purchased it.

{                                 
    "_id": "item1",
    "purchased_by": {
        "0": "user1"
    } 
}

{                                 
    "_id": "item2",
    "purchased_by": {
        "0": "user1",
        "1": "user2" 
    } 
}

{                                 
    "_id": "item3",
    "purchased_by": {
        "0": "user2"
    } 
}

To fetch all the items purchased by a user, we would write the following query:

<?php
// query to find items purchased by a user
$items = $collection->find(
    array("_id" => "user1"),
    array("items_purchased" => 1)
);

The above query will return the IDs of all the items purchased by user1. We can later use these to fetch the corresponding item information.

Alternatively, if we wanted to fetch the users who have purchased a specific item, we would write the following:

<?php
// query to find users who have purchased an item
$users = $collection->find(
    array("_id" => "item1"),
    array("purchased_by" => 1)
);

The above query returns the IDs of all the users who have purchased item1. We can later use these IDs to fetch the corresponding user information.

This example demonstrated M:N relationships which are very useful in some cases. However, you should keep in mind that many times such relationships can be handled using 1:N relationships along with some smart queries. This reduces the amount of data to be maintained in both documents.

Conclusion

That’s it for this article. We’ve learned about some basic modeling concepts which will surely help you head start your own data modeling: 1-to-1, 1-to-many, and many-to-many relationships, and also a little about data normalization and de-normalization. You should be able to easily apply these concepts to your own application’s modeling needs. If you have any questions or comments related to the article, feel free to share in the comments section below.

Free book: Jump Start HTML5 Basics

Grab a free copy of one our latest ebooks! Packed with hints and tips on HTML5's most powerful new features.

  • http://www.highintegritydesign.com NorthK

    Ashish, thanks for a great article. I’m wondering how one would accomplish atomicity/transactions for updates involving multiple collections (referenced data/normalized data) in NoSQL?

    • Nicolas

      @NorthK MongoDB and NoSQL in general are not fully ACID compliant. You most often have to choose between speed and atomicity. While NoSQL provide speed, Relational database provide atomicity. But MongoDB suggest a “Two phase commit” to provide a pseudo transation safe.
      http://docs.mongodb.org/manual/tutorial/perform-two-phase-commits/
      best regards

  • http://twitter/aasiutin Aleksey Asiutin

    Nice article!
    As for Many-to-Many relations, I suppose we can remove “purchased_by” key from products collection and make queries “fetch all users, who bought item1″ from users table. But how does it affect on performance, if there will be a lot of data?