Using JOINs in MongoDB NoSQL Databases

By Craig Buckler
We teamed up with SiteGround
To bring you the latest from the web and tried-and-true hosting, recommended for designers and developers. SitePoint Readers Get Up To 65% OFF Now

Using $lookUp with NoSQL

Thanks to Julian Motz for kindly helping to peer review this article.


One of the biggest differences between SQL and NoSQL databases is JOIN. In relational databases, the SQL JOIN clause allows you to combine rows from two or more tables using a common field between them. For example, if you have tables of books and publishers, you can write SQL commands such as:

SELECT book.title, publisher.name
FROM book
LEFT JOIN book.publisher_id ON publisher.id;

In other words, the book table has a publisher_id field which references the id field in the publisher table.

This is practical, since a single publisher could offer thousands of books. If we ever need to update a publisher’s details, we can change a single record. Data redundancy is minimized, since we don’t need to repeat the publisher information for every book. The technique is known as normalization.

SQL databases offer a range of normalization and constraint features to ensure relationships are maintained.

NoSQL == No JOIN?

Not always …

Document-oriented databases such as MongoDB are designed to store denormalized data. Ideally, there should be no relationship between collections. If the same data is required in two or more documents, it must be repeated.

This can be frustrating, since there are few situations where you never need relational data. Fortunately, MongoDB 3.2 introduces a new $lookup operator which can perform a LEFT-OUTER-JOIN-like operation on two or more collections. But there’s a catch …

MongoDB Aggregation

$lookup is only permitted in aggregation operations. Think of these as a pipeline of operators which query, filter and group a result. The output of one operator is used as the input for the next.

Aggregation is more difficult to understand than simpler find queries and will generally run slower. However, they are powerful and an invaluable option for complex search operations.

Aggregation is best explained with an example. Presume we’re creating a social media platform with a user collection. It stores every user’s details in separate documents. For example:

{
  "_id": ObjectID("45b83bda421238c76f5c1969"),
  "name": "User One",
  "email: "userone@email.com",
  "country": "UK",
  "dob": ISODate("1999-09-13T00:00:00.000Z")
}

We can add as many fields as necessary, but all MongoDB documents require an _id field which has a unique value. The _id is similar to an SQL primary key, and will be inserted automatically if necessary.

Our social network now requires a post collection, which stores numerous insightful updates from users. The documents store the text, date, a rating and a reference to the user who wrote it in a user_id field:

{
  "_id": ObjectID("17c9812acff9ac0bba018cc1"),
  "user_id": ObjectID("45b83bda421238c76f5c1969"),
  "date: ISODate("2016-09-05T03:05:00.123Z"),
  "text": "My life story so far",
  "rating": "important"
}

We now want to show the last twenty posts with an “important” rating from all users in reverse chronological order. Each returned document should contain the text, the time of the post and the associated user’s name and country.

The MongoDB aggregate query is passed an array of pipeline operators which define each operation in order. First, we need to extract all documents from the post collection which have the correct rating using the $match filter:

{ "$match": { "rating": "important" } }

We must now sort the matched items into reverse date order using the $sort operator:

{ "$sort": { "date": -1 } }

Since we only require twenty posts, we can apply a $limit stage so MongoDB only needs to process data we want:

{ "$limit": 20 }

We can now join data from the user collection using the new $lookup operator. It requires an object with four parameters:

  • localField: the lookup field in the input document
  • from: the collection to join
  • foreignField: the field to lookup in the from collection
  • as: the name of the output field.

Our operator is therefore:

{ "$lookup": {
  "localField": "user_id",
  "from": "user",
  "foreignField": "_id",
  "as": "userinfo"
} }

This will create a new field in our output named userinfo. It contains an array where each value is the matching the user document:

"userinfo": [
  { "name": "User One", ... }
]

We have a one-to-one relationship between the post.user_id and user._id, since a post can only have one author. Therefore, our userinfo array will only ever contain one item. We can use the $unwind operator to deconstruct it into a sub-document:

{ "$unwind": "$userinfo" }

The output will now be converted to a more practical format which can have further operators applied:

"userinfo": {
  "name": "User One",
  "email: "userone@email.com",
  …
}

Finally, we can return the text, the time of the post, the user’s name and country using a $project stage in the pipeline:

{ "$project": {
  "text": 1,
  "date": 1,
  "userinfo.name": 1,
  "userinfo.country": 1
} }

Putting It All Together

Our final aggregate query matches posts, sorts into order, limits to the latest twenty items, joins user data, flattens the user array and returns necessary fields only. The full command:

db.post.aggregate([
  { "$match": { "rating": "important" } },
  { "$sort": { "date": -1 } },
  { "$limit": 20 },
  { "$lookup": {
    "localField": "user_id",
    "from": "user",
    "foreignField": "_id",
    "as": "userinfo"
  } },
  { "$unwind": "$userinfo" },
  { "$project": {
    "text": 1,
    "date": 1,
    "userinfo.name": 1,
    "userinfo.country": 1
  } }
]);

The result is a collection of up to twenty documents. For example:

[
  {
    "text": "The latest post",
    "date: ISODate("2016-09-27T00:00:00.000Z"),
    "userinfo": {
      "name": "User One",
      "country": "UK"
    }
  },
  {
    "text": "Another post",
    "date: ISODate("2016-09-26T00:00:00.000Z"),
    "userinfo": {
      "name": "User One",
      "country": "UK"
    }
  }
  ...
]

Great! I Can Finally Switch to NoSQL!

MongoDB $lookup is useful and powerful, but even this basic example requires a complex aggregate query. It’s not a substitute for the more powerful JOIN clause offered in SQL. Neither does MongoDB offer constraints; if a user document is deleted, orphan post documents would remain.

Ideally, the $lookup operator should be required infrequently. If you need it a lot, you’re possibly using the wrong data store …

If you have relational data, use a relational (SQL) database!

That said, $lookup is a welcome addition to MongoDB 3.2. It can overcome some of the more frustrating issues when using small amounts of relational data in a NoSQL database.

We teamed up with SiteGround
To bring you the latest from the web and tried-and-true hosting, recommended for designers and developers. SitePoint Readers Get Up To 65% OFF Now
  • Gerald Sangudi

    Hi Craig, what if I told you that you could have the best of both worlds. NoSQL data, with the flexibility and richness / denormalization of JSON, as well as querying with SQL. For your example above, here is the query in N1QL, the SQL query language from Couchbase:

    SELECT book.title, publisher.name
    FROM book
    LEFT JOIN publisher ON KEYS book.publisher_id;

    This performs a LEFT OUTER JOIN and returns the results as JSON documents containing { “name”: “”Pendant”, “title”: “Paradise Lost” }.

    I hope you check it out and share your thoughts in a follow up article.

    • Craig Buckler

      Thanks Gerald. NoSQL databases, methods and facilities differ greatly across applications. I’ve not looked at Couchbase before and I understand RethinkDB also offers JOIN-like options. I’ll take a look…

    • Jay Gopalakrishnan

      Nice article, Craig. From our own experience within our MongoDB customer base (we provide native BI/Reporting on NoSQL databases), we are starting to see increasing $lookup usage. As you highlight, it can be slightly limiting beyond simple use cases, given that it is for left outer joins on unshared collections on the same database. To second Gerald, Couchbase N1QL is arguably the best query language on NoSQL we’ve worked with, with its SQL like familiarity and inner/left outer join support.

  • Mustafa Ekim

    I think that $lookup can go mainstream. Joins let us build normalized data which is good for most cases. I use mongodb not because it favors denormalized data but because it allows me to store my java objects as they are without any orm and because its schema free design let me develop quicker. Joins are a must and I am glad to see that doing joins on application level now has an alternative. I am sure it will perform much better compared to app level

    • Craig Buckler

      Thanks Mustafa. While I agree that schema-free feels liberating, a solid DB design may save you significant effort later. The most trouble I’ve experienced with MongoDB is when I’ve required some relational data. If your data is naturally relational, a SQL database will normally be the best way forward.

  • Rob Gillan

    Hey Craig – just wanted to point out what I believe is an error: you refer to it as “rating” in your initial document but then ask for matches on “status”: “important”.

    If I’m not mistaken, then it made the tutorial a little harder to follow. Cheers. :)

    • Craig Buckler

      Well spotted Rob! Sorry about that. Any SitePoint editors around today to change “rating”: “important” to “status”: “important” in the third code clip (post document)?

      • Rob Gillan

        No apologies necessary! Glad to help – thanks for taking the time to write the article.

      • Ralph Mason

        Done. :-)

  • Priyanka

    Is it possible to add subquery in lookup like in MySQL
    select a.column1, a.column2, b.column3 from TABLE_A a
    JOIN (select column1, column3, column4 from TABLE_B) b on a.column1=b.column1;

    • Craig Buckler

      Yes, but it doesn’t quite work in the same way. Aggregation is a pipeline where any step can add, remove, consolidate or adjust data. You’d possibly want to run the sub-query first then match against that – but it really depends on your data.

  • justin

    I’ve found that I’ve needed to restructure my database or code numerous times for optimal results. If you are using both SQL and NOSQL, you can definitely just use one or the other, but obviously requires effort.

  • ItsMe

    What if i do not have MongoDB 3.2 and to upgrade the DB is not in my authority and no body is willing to? Is there any similar solution for the lower version?

    • Craig Buckler

      Upgrading is the only option – you’re software stack should be kept reasonably current so there aren’t many excuses. It just takes a few minutes. Without 3.2 you’ll need to manually associate records which is no fun.

      • ItsMe

        Hi Craig, there isn’t anybody that i know who is knowledgeable enough to do that plus there are a ton of legacy code which is running. I am a bit apprehensive about it. Can you recommend any tutorial to do that, in case i succeed in convincing folks about it?

        • Craig Buckler

          I don’t know what OS you’re using but on Ubuntu Server and similar distros it’s simply a matter of:

          sudo apt-get update
          sudo apt-get upgrade

          MongoDB backward compatibility is unlikely to be a problem – the DB is just a few years old. Bedsides, you can always revert back if necessary.

  • The moment you start making join of documents in MongoDB it’s time to stop and think if you chose the right DB…

    • Craig Buckler

      You could make that point but it’s rarely black and white. If your data is naturally relational then absolutely use a SQL database. Similarly, if your data is a collection of unrelated documents, NoSQL could be a better option. Unfortunately, few projects start or remain at either end of those extremes.

      MongoDB $lookup is ideal for those situations where you have a little relational data which doesn’t need to be queried regularly, e.g. the ID of the user who wrote an article. You can store their name in the article collection for fast reference as well as the ID should you ever need further details such as their email address.

      • Looking from that point, I must agree with you.

        Congratulations on your post by the way.

    • ARASH GHOREYSHI

      is there really a situation that there is no relation between data ? would you give an example ?

  • Craig Buckler

    First, make sure your MongoDB library is up-to-date. Next, create a simple aggregate query which works in the mongo CLI or whatever tools you’re using. Now add that query to your code and test.

    • ItsMe

      I have been suggested to not to use this, instead do it through an ETL tool as it takes for ever to return required data through pymongo. Thanks though

  • Sumit (StrTg Prprtn)

    How can use multiple collections in joins in mondb? i am very confused.

    • Craig Buckler

      You can use as many $lookups as you like to aggregate data from two or more collections.

      • Sumit

        any example for using it with more than two collections? i tried but was not successful

  • Hasan Paknia

    Does mongodb support multilevel joins? comment -> post -> user?

    • Craig Buckler

      Yes. A $lookup effectively brings in data from another document. That data can then be used within another $lookup to bring in further data. In your example, you could take a comment, find it’s post then find the author of that post.