MongoDB Slides Until ClassTest
A database is usually defined as a collection of data, and the system that handles the data, transactions, problems, and issues of the database is known as a Database Management System (DBMS).
Databases were first conceived in the 1960s in order to satisfy the need for storing and finding data.
The relational database has been the foundation of enterprise data management for over thirty
years.
But the way we build and run applications today, coupled with unrelenting growth in new data sources and user loads, is pushing relational databases beyond their limits. This is compelling more and more organizations to migrate to alternatives.
Introduction
"NoSQL database" is a new breed of database management system which doesn't use the
relational model — it uses a wide variety of data models. The actual data model that it uses
depends on the database.
When people use the term “NoSQL database”, they typically use it to refer to any non-relational
database. Some say the term “NoSQL” stands for “non SQL” while others say it stands for “not only
SQL.”
NoSQL database is a highly scalable and flexible database management system.
It allows the user to store and process unstructured data and flexible-structured data.
NoSQL systems don’t generally provide the same level of data consistency as SQL databases. While SQL databases have traditionally sacrificed scalability and performance for the ACID properties, NoSQL databases prioritize high speed and scalability.
Although NoSQL databases have been around since the 1960s, it wasn't until the early 2000s that
the NoSQL approach started to pick up steam, and a whole new generation of NoSQL systems was
born.
With this growth in new data sources and user loads, defining the schema in advance became nearly impossible. NoSQL databases allow developers to store huge amounts of unstructured data, giving them a lot of flexibility.
NoSQL characteristics
1. Column-oriented databases
Every time you look something up in a row-oriented database, every row is scanned, regardless of which columns you require. Let’s say we only want a list of birthdays in September: the database will still scan every full row even though only the birthday column is needed.
Column databases store each column separately, with the related row numbers. Every entity (person) is divided over multiple tables, allowing for quicker scans when only a small number of columns are involved.
A column database maps the data to the row numbers; in that way counting becomes quicker, so it’s
easy to see how many people like archery, for instance. Storing the columns separately also allows for
optimized compression because there’s only one data type per table.
The column-oriented database shines when performing analytics and reporting: summating values and
counting entries. A row-oriented database is often the operational database of choice for actual
transactions (such as sales). Overnight batch jobs bring the column-oriented database up to date,
supporting lightning-speed lookups and aggregations using MapReduce algorithms for reports.
Examples of column-family stores are Apache HBase, Facebook’s Cassandra, Hypertable, and the
grandfather of wide-column stores, Google BigTable.
2. Key-Value Store
The database stores data as a collection of key/value pairs: each item contains a key and a value. A value can typically only be retrieved by referencing its key. A simple phone directory is a classic example of a key-value database.
Some popular use cases of key-value databases:
For storing user session data
Maintaining schema-less user profiles
However, key-value databases are not the ideal choice when:
We have to query the database by a specific data value.
We need relationships between data values.
3. Document stores
A document store does assume a certain document structure that can be specified with a schema. Document stores appear the most natural among the NoSQL database types because they’re designed to store everyday documents as is.
JSON and BSON are close cousins, as their nearly identical names imply, but you wouldn’t know it by
looking at them side-by-side. JSON, or JavaScript Object Notation, is the wildly popular standard for
data interchange on the web, on which BSON (Binary JSON) is based.
Since JSON is very common, MongoDB’s inventors chose JSON for representing data structures in the document model. However, there are several issues that make JSON less than ideal for usage inside of a database.
1. JSON is a text-based format, and text parsing is very slow
2. JSON’s readable format is far from space-efficient, another database concern
3. JSON only supports a limited number of basic data types
In order to make MongoDB JSON-first, but still high-performance and general-purpose, BSON was
invented to bridge the gap: a binary representation to store data in JSON format, optimized for speed,
space, and flexibility.
BSON’s binary structure encodes type and length information, which allows it to be parsed much more
quickly. Since its initial formulation, BSON has been extended to add some optional non-JSON-native
data types, like dates and binary data, without which MongoDB would have been missing some
valuable support.
MongoDB stores data in BSON format both internally, and over the network, but that doesn’t mean
you can’t think of MongoDB as a JSON database. Anything you can represent in JSON can be natively
stored in MongoDB, and retrieved just as easily in JSON.
4. Graph databases
Graph databases are basically built upon the Entity – Attribute – Value model. Entities are also known
as nodes, which have properties. It is a very flexible way to describe how data relates to other data.
We will mainly focus our study on MongoDB. So let us give you a brief introduction of MongoDB.
Document Database
A record in MongoDB is a document, which is a data structure composed of field and value pairs. MongoDB documents are similar to JSON objects. The values of fields may include other documents, arrays, and arrays of documents.
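For instance, a single document might look like the following (field names and values here are purely illustrative):
{
   name: { first: "Ada", last: "Lovelace" },
   contribs: [ "Analytical Engine", "mathematics" ],
   views: 120
}
Here name is an embedded document and contribs is an array, as described above.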
o Support for embedded data models requires fewer input and output operations, which reduces I/O activity on the database system.
o Indexes support faster queries.
o Its query language is rich and supports text search, aggregation features, and CRUD operations.
Note :
Horizontal scaling means we scale by adding additional machines to our existing pool of resources.
Vertical scaling means we scale by adding more computing power, such as CPU and RAM, to an existing machine.
Flexible Schemas : MongoDB is a document database in which one collection holds different documents. The number of fields, content, and size of the documents can differ from one document to another.
Structure of a single object is clear
No complex joins
Deep query-ability. MongoDB supports dynamic queries on documents using a document-based
query language that's nearly as powerful as SQL
Ease of scale-out: MongoDB is easy to scale
Conversion / mapping of application objects to database objects not needed
Uses internal memory for storing the (windowed) working set, enabling faster access of data
Big Data
Content Management and Delivery
Mobile and Social Infrastructure
User Data Management
Data Hub
While MongoDB incorporates great features to deal with many of the challenges in big data, it comes
with some limitations, such as:
1. To use joins, you have to manually add code, which may cause slower execution and less-than-
optimum performance.
2. Lack of joins also means that MongoDB requires a lot of memory as all files have to be mapped
from disk to memory.
3. Document sizes cannot be bigger than 16MB.
4. The nesting functionality is limited and cannot exceed 100 levels.
Atlas makes installing MongoDB as easy as clicking a button and answering 4 questions. Once that is
complete you will have a MongoDB cluster running a few minutes later. Creating users and allocating
limited permissions is easy and done through a nice UI.
Atlas also handles growing/shrinking your cluster when the need arises, and patching/upgrading your MongoDB cluster when a new version is released. MongoDB Atlas belongs to the "MongoDB Hosting" category of the tech stack.
MongoDB Atlas is generally recommended to every company that has a significant need for a NoSQL database and does not want to manage its own infrastructure. Using MongoDB Atlas can significantly reduce management time and cost, which saves valuable resources for other tasks. It also suits smaller companies, as MongoDB Atlas scales up and down very quickly.
On the other hand, MongoDB Compass is detailed as "A GUI for MongoDB". Visually explore your data.
Run ad hoc queries in seconds. Interact with your data with full CRUD functionality. View and optimize
your query performance.
MongoDB Compass can be primarily classified under "Database Tools".
MongoDB is better placed in large projects with great scalability. It also allows you to work quite comfortably with projects based on programming languages such as JavaScript, Angular, TypeScript, and C#. Wherever MongoDB is used, MongoDB Compass may be used as a tool.
No discussion on Big Data is complete without bringing up Hadoop and MongoDB, two of the most
prominent software programs that are available today.
What is Hadoop?
Hadoop is an open-source set of programs that you can use and modify for your big data processes. It
is made up of 4 modules, each of which performs a specific task related to big data analytics.
These platforms include:
Distributed File-System
MapReduce
Hadoop Common
Hadoop YARN
Here for your consideration are six reasons why Hadoop may be the best fit for your company and its
need to capitalize on big data.
1. You can quickly store and process large amounts of varied data generated from the internet of
things and social media.
2. The Distributed File System gives Hadoop high computing power necessary for fast data
computation.
3. Hadoop protects against hardware failure by redirecting jobs to other nodes and automatically
storing multiple copies of data.
4. You can store a wide variety of structured or unstructured data (including images and videos)
without having to preprocess it.
5. The open-source framework runs on commodity servers, which are more cost-effective than
dedicated storage.
6. Adding nodes enables a system to scale to handle increasing data sets. This is done with little
administration.
Limitations of Hadoop
1. Due to its design, MapReduce is suitable for simple requests that can be processed as independent units, but it is not as effective with interactive and iterative tasks. Unlike independent tasks that need a simple sort and shuffle, iterative tasks require multiple map and reduce phases to complete. As a result, numerous files are created between the map and reduce phases, making it inefficient at advanced analytics.
2. Only a few entry-level programmers have the Java skills necessary to work with MapReduce. This has seen providers rushing to put SQL on top of Hadoop, because programmers skilled in SQL are easier to find.
Both Hadoop and MongoDB offer more advantages compared to the traditional relational database
management systems (RDBMS), including parallel processing, scalability, ability to handle aggregated
data in large volumes, MapReduce architecture, and cost-effectiveness due to being open source.
More so, they process data across nodes or clusters, saving on hardware costs.
However, in the context of comparing them to RDBMS, each platform has some strengths over the
other. We discuss them in detail below:
RDBMS Replacement
MongoDB: A flexible platform that can make a suitable replacement for RDBMS.
Hadoop: Hadoop cannot replace RDBMS but rather supplements it by helping to archive data.
Memory Handling
MongoDB: MongoDB is a C++ based database, which makes it better at memory handling.
Hadoop: Hadoop is a Java-based collection of software that provides a framework for storage, retrieval, and processing. Hadoop optimizes space better than MongoDB.
Data Import and Storage
MongoDB: Data is stored as JSON, BSON, or binary, and all fields can be queried, indexed, aggregated, or replicated at once. Additionally, data in MongoDB has to be in JSON or CSV formats to be imported.
Hadoop: Hadoop accepts various formats of data, thus eliminating the need for data transformation during processing.
Big Data Handling
MongoDB: MongoDB was not built with big data in mind.
Hadoop: Hadoop, on the other hand, was built for that sole purpose. As such, it is great at batch processing and running long ETL jobs. Implementing MapReduce on Hadoop is more efficient than in MongoDB, again making it a better choice for analysis of large data sets.
Real-time Data Processing
MongoDB: MongoDB handles real-time data analysis better and is also a good option for client-side data delivery due to its readily available data. Additionally, MongoDB’s geospatial indexing makes it ideal for gathering and analyzing GPS or geographical data in real time.
Hadoop: Hadoop is not very good at real-time data handling, but if you run Hadoop SQL-like queries on Hive, you can make data queries with a lot more speed and with more effectiveness than JSON.
Each company and individual comes with its own unique needs and challenges, so there’s no such thing
as a one-size-fits-all solution. When determining something like Hadoop vs. MongoDB, you have to
make your choice based on your unique situation.
You could take a look and see which big companies use which platform and try to follow their example.
For instance, eBay, SAP, Adobe, LinkedIn, McAfee, MetLife, and Foursquare use MongoDB. On the
other hand, Microsoft, Cloudera, IBM, Intel, Teradata, Amazon, Map R Technologies are counted
among notable Hadoop users.
Ultimately, both Hadoop and MongoDB are popular choices for handling big data. However, although
they have many similarities (e.g., open-source, NoSQL, schema-free, and Map-reduce), their approach
to data processing and storage is different. It is precisely the difference that finally helps us to
determine the best choice between Hadoop vs. MongoDB.
Cassandra vs MongoDB
You probably came across Cassandra and MongoDB when searching for a NoSQL database. Still, these
two popular NoSQL choices have much less in common than expected.
When making a comparison between two database systems, it is usually inferred there are shared
similarities as well. Although they do exist, in regards to Cassandra and MongoDB, these similarities are
limited.
Data Availability
MongoDB: A single master directs multiple slave nodes. If the master node goes down, one of the slave nodes takes over its role. Although automatic failover does ensure recovery, it may take up to a minute for a slave to become the master; during this time, the database isn’t able to respond to requests.
Cassandra: It utilizes multiple masters inside a cluster. With multiple masters present, there is no fear of any downtime; the redundant model ensures high availability at all times.
Scalability
MongoDB: Only the master node can write and accept input, while the slave nodes are only used for reads. As MongoDB has a single master node, it is limited in terms of write scalability.
Cassandra: Having multiple master nodes increases Cassandra’s write capabilities. It allows this database to coordinate numerous writes at the same time, all coming from its masters. Therefore, the more master nodes there are in a cluster, the better the write speed (scalability).
Data Model
MongoDB: MongoDB’s data model is categorized as object- and document-oriented. This means it can represent any kind of object structure, which can have properties or even be nested for multiple levels. If you need a rich data model, MongoDB may be the better solution.
Cassandra: Cassandra has a table structure using rows and columns, i.e. it is column-oriented. Still, it is more flexible than relational databases since each row is not required to have the same columns. Upon creation, these columns are assigned one of the available Cassandra data types, ultimately relying more on data structure.
Query Language
MongoDB: MongoDB uses queries structured as JSON fragments and does not have a dedicated query language yet. If you or your team is used to SQL, this will be something to get used to; however, it is easy enough to manage.
Cassandra: Cassandra has its own query language called CQL (Cassandra Query Language). Its syntax is similar to SQL but still has some limitations. Essentially, the database has a different way of storing and recovering data due to it being non-relational.
How are Queries Different?
Selecting records from the employee table:
MongoDB: db.employee.find()
Cassandra: SELECT * FROM employee;
Inserting records into the employee table:
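The insert statements themselves were cut off in these notes; representative (illustrative) versions would be:
MongoDB: db.employee.insert( { empid: 1, name: "Ravi" } )
Cassandra: INSERT INTO employee (empid, name) VALUES (1, 'Ravi');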
Mainly, the database model (or schema) makes a big difference in performance
quality as some are more suitable for MongoDB while others may work better
with Cassandra.
What’s more, the load characteristic of the application your database needs to
support also plays a crucial role. If you are expecting heavy load input, Cassandra,
with its multiple master nodes, will give better results. With heavy load output
both MongoDB and Cassandra will show good performance.
Run the msi installer after the download completes. The following screen will appear.
In the following 'Service Configuration' dialog, we are going to uncheck 'Install MongoD as a Service' (checked by default) so that we can start the MongoDB instance ourselves rather than have it run as a service all the time.
Then installation will start. After installation the message ‘Completed the MongoDB 5.0.9……. Set up
wizard’ will appear in a dialog. Click on Finish button on that dialog to exit set up.
Then set the Path environmental variable to include the MongoDB installed path as shown below.
C:\WINDOWS\system32>mongod.exe --dbpath=e:\mongoDbData
If we don’t set the path of the bin folder in path environmental variable then we should follow the
steps:
First navigate to your MongoDB Bin folder
To start MongoDB, run mongod.exe from the Command Prompt.
It will start MongoDB main process and “The waiting for connections” message will appear in the
console.
If you want to connect mongodb through shell, use below commands in another command prompt
(don’t close earlier command prompt of window which is used for running mongoDB).
Remember
Now extract the files from the downloaded archive in c:\Program Files\mongosh-1.5.1-win32-x64.
Ensure that the extracted MongoDB Shell binary is in the desired location in your filesystem, then add
that location to your PATH environment variable.
To confirm that your PATH environment variable is correctly configured to find mongosh, open a
command prompt and enter the mongosh --help command. If your PATH is configured correctly, a list
of valid commands displays.
Prerequisites : To use the MongoDB Shell, we must have a MongoDB deployment to connect to.
So run the mongodb local instance in a command prompt using following command :
C:\WINDOWS\system32>mongod.exe --dbpath=e:\mongoDbData
Run mongosh without any command-line options to connect to a MongoDB instance running on
your localhost with default port 27017 in another command window:
mongosh
This is equivalent to the command: mongosh "mongodb://localhost:27017"
When we issue show dbs in the prompt we will get following output :
The test shown at the bottom of the result is the default database created for us by mongosh.
local database
This is a reserved database used to store the metadata of the replication process and other related data. It does not itself take part in replication, meaning that the collections in the local database will not replicate from the primary node of MongoDB to the secondary nodes.
admin database
It plays a vital role in authentication and authorization of MongoDB database users. This database
is used for administrative purpose too.
There are different security mechanisms to enable security in MongoDB. If you have enabled
security in MongoDB for authentication and authorization of MongoDB database user then
this admin database comes into the picture.
config database
Notes
Capped collections are fixed-size circular collections that follow the insertion order to support high
performance for create, read, and delete operations. By circular, it means that when the fixed size
allocated to the collection is exhausted, it will start deleting the oldest document in the collection
without providing any explicit commands.
Capped collections restrict updates to the documents if the update results in increased document
size. Since capped collections store documents in the order of the disk storage, it ensures that the
document size does not increase the size allocated on the disk. Capped collections are best for storing
log information, cache data, or any other high volume data.
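A minimal sketch of creating such a collection (the name, size in bytes, and document limit are illustrative):
db.createCollection("log", { capped: true, size: 100000, max: 5000 })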
Use custom Collation - New in version 3.4.
Collation allows users to specify language-specific rules for string comparison, such as rules for
lettercase and accent marks. You can specify collation for a collection or a view, an index, or specific
operations that support collation.
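For example, a collection could be created with a language-specific collation (a sketch; the locale value is illustrative):
db.createCollection("names", { collation: { locale: "fr" } })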
Click on 'CREATE DATABASE' and enter database name as authDB and first collection name as users
(In MongoDB, a collection is the equivalent of an RDBMS table):
Creating Document
A 'Document' is a record in a MongoDB collection. It is the basic unit of data in MongoDB. Documents
are written in BSON (Binary JSON) format. BSON is similar to JSON but has a more type-rich format. To
create a document click on 'authDB' (above screenshot) database. It will show the collection list
where first collection users will appear.
Insert Documents
Compass provides two ways to insert documents into your collections: JSON Mode and a Field-by-Field
Editor.
JSON Mode (New in Compass 1.20) : Allows you to write or paste JSON documents in the editor.
Use this mode to insert multiple documents at once as an array.
Field-by-Field Editor : Provides a more interactive experience to create documents, allowing you to
select individual field values and types. This mode only supports inserting one document at a time.
In JSON format, type or paste the document(s) you want to insert into the collection. To insert multiple
documents, enter a comma-separated array of JSON documents.
EXAMPLE
The following array inserts 2 documents into the collection (the documents shown here are illustrative, as the original example was truncated):
[
  { "name": "Asha", "age": 28 },
  { "name": "Bikram", "age": 34 }
]
Modify Documents
You can edit existing documents in your collection. When you edit a document, Compass performs
a findAndModify operation to update the document.
Limitations
Modifying documents is not permitted in MongoDB Compass Readonly Edition.
You cannot use the MongoDB Compass GUI to modify documents in a sharded collection. As an
alternative, you can use the Embedded MongoDB Shell in Compass to modify a sharded collection.
To learn more about updates on sharded collections, see Sharded Collection Behavior.
Procedure
Select the appropriate tab based on whether you are viewing your documents in List, JSON, or Table
view:
To modify a document, hover over the document and click the pencil icon as shown in picture
below.
After you click the pencil icon, the document enters edit mode
You can now make changes to the fields, values, or data types of values and click on Update
To exit the edit mode and cancel all pending changes to the document, click the Cancel button.
You can insert new documents by cloning the schema and values of an existing document in a
collection. Select the appropriate tab based on whether you are viewing your documents in List, JSON,
or Table view
To clone a document, hover over the desired document and click the Clone button.
When you click the Clone button, Compass opens the document insertion dialog with the same schema
and values as the cloned document. You can edit any of these fields and values before you insert the
new document. To learn more about inserting documents, see Insert Documents.
Delete Documents
After you click the delete button, the document is flagged for deletion. Compass asks for confirmation
that you want to remove the document:
Mongo Shell
Mongo Shell is a JavaScript based command line interface to connect to MongoDB and to perform
various operations. Mongo shell comes with MongoDB installation by default.
Prerequisites
The MongoDB server must be installed and running before you can connect to it from
the mongo shell.
Once you have verified that the mongod server is running, open a terminal window (or a
command prompt for Windows) and run mongo.
Connecting MongoDB
Run mongod.exe from a command prompt as shown below. Don’t close the command window.
Run mongo.exe from another command prompt to execute mongo shell. In the prompt we can
type shell command to execute.
Querying a collection
To find the documents in the collection “users” in database authDB issue the command
db.getCollection("users").find()
The db.collection.find() method returns a cursor to the results; however, in the mongo shell, if the
returned cursor is not assigned to a variable using the var keyword, then the cursor is automatically
iterated up to 20 times to print up to the first 20 documents that match the query. The mongo shell
will prompt Type it to iterate another 20 times.
To format the printed result, you can add the .pretty() to the operation, as in the following:
db.myCollection.find().pretty()
Inserting a document
If you end a line with an open parenthesis ('('), an open brace ('{'), or an open bracket ('['), then the subsequent lines start with ellipsis ("...") until you enter the corresponding closing parenthesis (')'), the closing brace ('}') or the closing bracket (']'). The mongo shell waits for the closing parenthesis, closing brace, or closing bracket before evaluating the code, as in the following example:
> if ( x > 0 ) {
... count++;
... print (x);
... }
You can exit the line continuation mode if you enter two blank lines, as in the following example:
> if (x > 0
...
...
>
mongosh is currently available as a Beta release. The new MongoDB Shell, mongosh, offers numerous advantages over the mongo shell, such as:
Improved syntax highlighting.
Improved command history.
Improved logging.
During the beta stage, mongosh supports a subset of the mongo shell methods. Achieving feature parity between mongosh and the mongo shell is an ongoing effort.
To maintain backwards compatibility, the methods that mongosh supports use the same syntax as the corresponding methods in the mongo shell. To see the complete list of methods supported by mongosh, see MongoDB Shell Methods.
MongoDB support for VS Code is provided by the MongoDB for VS Code extension. To install the
MongoDB for VS Code extension, open the Extensions view by pressing Ctrl+Shift+X and search for
'MongoDB' to filter the results. Select the MongoDB for VS Code extension.
Note: To connect to a deployment using a connection string, we must have a MongoDB cluster running
on our machine or have one in the cloud using Atlas.
So first start local mongoDB instance from command prompt using following command :
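C:\WINDOWS\system32>mongod.exe --dbpath=e:\mongoDbData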
Open the MongoDB interactive panel by clicking on the leaf icon on the left sidebar menu, then click on
Add Connection (1 in diagram) to connect to a database instance. Then click Connect (2) and then
enter connection string for our database in the text bar at the top of the window (3 in the diagram). In
this case it is mongodb://localhost:27017/authDB. This database is already created.
Upon a successful connection, you should see the following changes:
To perform queries and other database operations on our new database, we can create
a Playground in VS Code to do these. Click on the green create playground button in VS Code as shown
in above diagram to create a playground.
It will open a new editor tab which should look like below. This a default template supplied to help
writing code for mongoDb in the VSCode.
Click on the play button at the top-right side to run the code. A new panel “Playground Result” should open with our results like below:
Notes on writeConcern
Simply put, a write concern is an indication of ‘durability’ passed along with write operations to
MongoDB. To clarify, let us look at the syntax:
{ w: <value>, j: <boolean>, wtimeout: <number> }
Where,
w can be an integer or "majority"; it represents the number of members that must acknowledge the write. The default value is 1. "majority" states that acknowledgment is requested from a majority of the voting nodes.
j indicates that a write be acknowledged after it is written to the on-disk journal as opposed to just the system memory. Unspecified by default.
wtimeout specifies a timeout for applying the write concern. Unspecified by default.
Setting Write Concern on Replica Sets without a wtimeout can cause Writes to Block Indefinitely.
Note that “If you do not specify the wtimeout option and the level of write concern is
unachievable, the write operation will block indefinitely."
Example:
db.inventory.insert(
   { sku: "abcdxyz", qty : 100, category: "Clothing" },
   { writeConcern: { w: 2, j: true, wtimeout: 5000 } }
)
The insert’s write concern can be read as follows: acknowledge this write when at least 2 members of the replica set have written it to their journals within 5000 msecs, or return an error.
Available Write Concerns
Write Concern: w=0
Meaning: Unacknowledged
Description: Requests no acknowledgment of the write operation. However, w: 0 may return information about socket exceptions and networking errors to the application. Data can be rolled back if the primary steps down before the write operations have replicated to any of the secondaries.
A WriteResult object for single insert that contains the status of the operation
A BulkWriteResult object for bulk inserts that contains the status of the operation.
Example :
1. Insert a single invoice document in the invoice collection. The document does not contain the _id field:
db.invoice.insert( { inv_no: "I00001", inv_date: "10/10/2012" } );
If we don't specify the _id parameter, then MongoDB assigns a unique ObjectId for this document.
Output:
> db.invoice.insert( { inv_no: "I00001", inv_date: "10/10/2012" } );
WriteResult({ "nInserted" : 1 })
Successful Results : Upon success, the WriteResult object contains information on the number of
documents inserted as shown above.
Write Concern Errors : If the insert() method encounters write concern errors, the results include
the WriteResult.writeConcernError field:
WriteResult({
"nInserted" : 1,
"writeConcernError" : {
"code" : 64,
"errmsg" : "waiting for replication timed out at shard-a"
}
})
Errors Unrelated to Write Concern : If the insert() method encounters a non-write concern error, the results include the WriteResult.writeError field, for example (illustrative output for an attempted insert with a duplicate _id):
WriteResult({
   "nInserted" : 0,
   "writeError" : {
      "code" : 11000,
      "errmsg" : "E11000 duplicate key error ..."
   }
})
The document inserted earlier can be verified with find():
> db.invoice.find();
{ "_id" : ObjectId("567554d2f61afaaed2aae48f"), "inv_no" : "I00001", "inv_date" : "10/10/2012" }
Note :
_id is a 12-byte hexadecimal number, unique for every document in a collection. The 12 bytes are divided as follows − _id: ObjectId(4 bytes timestamp, 3 bytes machine id, 2 bytes process id, 3 bytes incrementer)
Create Collection - If the collection does not exist, then the insert() method will create the
collection.
2. Insert a single invoice document in invoice collection specifying the _id field:
Output:
> db.invoice.insert( { _id: 901,inv_no: "I00001", inv_date: "10/10/2012" } );
WriteResult({ "nInserted" : 1 })
If you specify the _id field, the value must be unique within the collection. For operations with
write concern, if you try to create a document with a duplicate _id value, mongod returns a
duplicate key exception.
Other Methods to Add Documents : You can also add new documents to a collection using methods
that have an upsert option. If the option is set to true, these methods will either modify existing
documents or add a new document when no matching documents exist for the query.
BulkWriteResult.writeConcernError - Document that describe error related to write concern and contains
the field:
BulkWriteResult.writeConcernError.code - integer value identifying the cause of the write concern error.
BulkWriteResult.writeConcernError.errInfo - A document identifying the write concern setting related to
the error.
BulkWriteResult.writeConcernError.errmsg - A description of the cause of the write concern error.
The following example performs an unordered insert of three documents. With unordered inserts, if an
error occurs during an insert of one of the documents, MongoDB continues to insert the remaining
documents in the array.
db.products.insert(
[
{ _id: 20, item: "lamp", qty: 50, type: "desk" },
{ _id: 21, item: "lamp", qty: 20, type: "floor" },
{ _id: 22, item: "bulk", qty: 100 }
],
{ ordered: false } )
Syntax : db.collection.insertOne(<document>,
{ writeConcern: <document> }
)
Parameter
<document> The document or record that is to be stored in the database
writeConcern: Optional.
Return Value: It returns the _id of the document inserted into the database.
Example
Following example creates a new collection named empDetails and inserts a document using the
insertOne() method.
> db.createCollection("empDetails")
{ "ok" : 1 }
> db.empDetails.insertOne(
   { First_Name: "Radhika", Last_Name: "Sharma", Age: 26 }   // field names/values illustrative; the original document was truncated here
)
db.collection.insertMany() inserts multiple documents into a collection.
Return Value: It returns the _ids of the documents inserted into the database.
Example
This query selects the documents in the users collection that match the condition age is greater than
18. To specify the greater than condition, query criteria uses the greater than (i.e. $gt) query selection
operator. The query returns at most 5 matching documents (or more precisely, a cursor to those
documents). The matching documents will return with only the _id, name and address fields.
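A query matching that description (reconstructed here, since the statement itself was omitted from these notes) would be:
db.users.find( { age: { $gt: 18 } }, { name: 1, address: 1 } ).limit(5)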
Query Behavior
MongoDB provides a db.collection.findOne() method as a special case of find() that returns a single
document.
Projections, which are the second argument to the find() method, may either specify a list of fields to
return or list fields to exclude in the result documents.
Important: Except for excluding the _id field in inclusive projections, you cannot mix exclusive and
inclusive projections.
Projection Examples
Exclude One Field From a Result Set : db.records.find( { "user_id": { $lt: 42 } }, { "history": 0 } )
This query selects documents in the records collection that match the condition { "user_id": { $lt: 42
} }, and uses the projection { "history": 0 } to exclude the history field from the documents.
Return Two fields and the _id Field : db.records.find( { "user_id": { $lt: 42 } }, { "name": 1, "email": 1 } )
This query selects documents in the records collection that match the query { "user_id": { $lt: 42 }
} and uses the projection { "name": 1, "email": 1 } to return just the _id field (implicitly included), name
field, and the email field in the documents in the result set.
By default, the _id field is included in the results. To suppress the _id field from the result set,
specify _id: 0 in the projection document.
For fields that contain arrays, MongoDB provides the following projection operators: $elemMatch,
$slice, and $.
For related projection functionality in the aggregation framework pipeline, use the $project
pipeline stage.
For example, in the mongo shell, the following read operation queries the inventory collection for
documents that have type equal to ’food’ and automatically print up to the first 20 matching
documents:
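The query being described (reconstructed; the collection and value are taken from the surrounding text) is:
db.inventory.find( { type: 'food' } )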
To manually iterate the cursor to access the documents, see Iterate a Cursor in the mongo Shell
Cursor Behaviors
Starting in MongoDB 5.0 (and 4.4.8), cursors created within a client session can close when the
corresponding server session ends with the killSessions command, if the session times out, or if the
client has exhausted the cursor.
By default, server sessions have an expiration timeout of 30 minutes. To change the value, set
the localLogicalSessionTimeoutMinutes parameter when starting up mongod.
Cursors that aren't opened under a session automatically close after 10 minutes of inactivity, or if the client has exhausted the cursor. To override this behavior in mongosh, you can use the cursor.noCursorTimeout() method:
var myCursor = db.users.find().noCursorTimeout();
After setting the noCursorTimeout option, you must either close the cursor manually with cursor.close() or exhaust the cursor's results.
Cursor Isolation :
MongoDB cursors can return the same document more than once in some situations.
Because the cursor is not isolated during its lifetime, intervening write operations on a document
may result in a cursor that returns a document more than once if that document has changed.
As a cursor returns documents other operations may interleave with the query. If some of these
operations change the indexed field on the index used by the query; then the cursor will return the
same document more than once.
Cursor Batches
The MongoDB server returns the query results in batches. Batch size will not exceed the maximum
BSON document size. For most queries, the first batch returns 101 documents or just enough
documents to exceed 1 megabyte. Subsequent batch size is 4 megabytes.
To override the default size of the batch, we can use batchSize() and limit(). batchSize() specifies the number of documents to return in each batch of the response from the MongoDB instance. In most cases, modifying the batch size will not affect the user or the application, as mongosh and most drivers return results as if MongoDB returned a single batch.
Example : db.inventory.find().batchSize(10) sets the batch size for the results of a query (i.e. find())
to 10. The batchSize() method does not change the output in mongosh, which, by default, iterates over
the first 20 documents.
For queries that include a sort operation without an index, the server must load all the documents in
memory to perform the sort before returning any results. As you iterate through the cursor and reach
the end of the returned batch, if there are more results, cursor.next() will perform a getmore operation
to retrieve the next batch. To see how many documents remain in the batch as you iterate the cursor,
you can use the objsLeftInBatch() method, as in the following example:
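A minimal sketch of that pattern (the collection name is assumed):
var myCursor = db.inventory.find();
myCursor.objsLeftInBatch();   // number of documents left in the current batch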
Cursor Information
The db.serverStatus() method returns a document that includes a metrics field. The metrics field
contains a cursor field with the following information:
number of timed out cursors since the last server restart
number of open cursors with the option DBQuery.Option.noTimeout set to prevent timeout
after a period of inactivity
number of “pinned” open cursors
total number of open cursors
When we call the db.serverStatus() method and accesses the metrics field from the results and then
the cursor field from the metrics field it results following document:
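The returned sub-document has roughly the following shape (values here are illustrative):
db.serverStatus().metrics.cursor
{
   "timedOut" : NumberLong(0),
   "open" : {
      "noTimeout" : NumberLong(0),
      "pinned" : NumberLong(0),
      "total" : NumberLong(0)
   }
}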
MongoDB treats some data types as equivalent for comparison purposes. For instance, numeric types
undergo conversion before comparison. For most data types, however, comparison operators only
perform comparisons on documents where the BSON type of the target field matches the type of the
query operand. Consider the following collection:
{ "_id": "apples", "qty": 5 }
{ "_id": "bananas", "qty": 7 }
{ "_id": "oranges", "qty": { "in stock": 8, "ordered": 12 } }
{ "_id": "avocados", "qty": "fourteen" }
The following query uses $gt to return documents where the value of qty is greater than 4:
db.collection.find( { qty: { $gt: 4 } } )
The query returns the following documents:
{ "_id": "apples", "qty": 5 }
{ "_id": "bananas", "qty": 7 }
The document with _id equal to "oranges" is not returned because its qty value is of type object.
The find() method returns a cursor to the results. In the mongosh shell, if the returned cursor is not assigned to a variable using the var keyword, the cursor is automatically iterated to access up to the first 20 documents that match the query. You can set the DBQuery.shellBatchSize variable to change the number of automatically iterated documents.
To manually iterate over the results, assign the returned cursor to a variable with the var keyword, as
shown in the following sections.
With next() Method : This cursor method next() is used to access the next document in the cursor:
var myCursor = db.bios.find( );
var myDocument = myCursor.hasNext() ? myCursor.next() : null;
if (myDocument) {
   var myName = myDocument.name;
}
We can also use the cursor method next() to access the documents, as shown below:
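A common sketch of this pattern:
var myCursor = db.bios.find();
while (myCursor.hasNext()) {
   printjson(myCursor.next());   // print each document returned by the cursor
}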
Iterator Index
In mongosh, you can use the toArray() method to iterate the cursor and return the documents in an
array, as in the following:
var myCursor = db.inventory.find( { type: 2 } );
var documentArray = myCursor.toArray();
var myDocument = documentArray[3];
The toArray() method loads into RAM all documents returned by the cursor; the toArray() method exhausts the cursor.
Additionally, some Drivers provide access to the documents by using an index on the cursor
(i.e. cursor[index]). This is a shortcut for first calling the toArray() method and then using an index on
the resulting array.
Apart from the find() method, there is the findOne() method, which returns only one document.
Syntax : >db.COLLECTION_NAME.findOne()
To display the results in a formatted way, you can use pretty() method.
Syntax : >db.COLLECTION_NAME.find().pretty()
Example
Following example retrieves all the documents from the collection named mycol and arranges them in
an easy-to-read format.
> db.mycol.find().pretty()
{
"_id" : ObjectId("5dd4e2cc0821d3b44607534c"),
"title" : "MongoDB Overview",
"description" : "MongoDB is no SQL database",
"by" : "tutorials point",
"url" : "http://www.tutorialspoint.com",
"tags" : [
"mongodb",
"database",
"NoSQL"
],
"likes" : 100
}
{
"_id" : ObjectId("5dd4e2cc0821d3b44607534d"),
"title" : "NoSQL Database",
"description" : "NoSQL database doesn't have tables",
"by" : "tutorials point",
"url" : "http://www.tutorialspoint.com",
"tags" : [
"mongodb",
"database",
"NoSQL"
],
"likes" : 20,
"comments" : [
{
"user" : "user1",
"message" : "My first comment",
"dateCreated" : ISODate("2013-12-09T21:05:00Z"),
"like" : 0
AND in MongoDB
Syntax : To query documents based on the AND condition, you need to use $and keyword. Following is
the basic syntax of AND –
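In outline (the keys and values are placeholders):
>db.mycol.find({ $and: [ { <key1>: <value1> }, { <key2>: <value2> } ] }).pretty()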
To query documents based on the OR condition, you need to use $or keyword.
Example : Following example will show all the tutorials written by 'tutorials point' or whose title is
'MongoDB Overview'.
>db.mycol.find({$or:[{"by":"tutorials point"},{"title": "MongoDB Overview"}]}).pretty()
{
"_id": ObjectId(7df78ad8902c),
"title": "MongoDB Overview",
"description": "MongoDB is no sql database",
"by": "tutorials point",
"url": "http://www.tutorialspoint.com",
"tags": ["mongodb", "database", "NoSQL"],
"likes": "100"
}
>
Using AND and OR Together
The following example will show the documents that have likes greater than 10 and whose title is
either 'MongoDB Overview' or by is 'tutorials point'. Equivalent SQL where clause is 'where likes>10
AND (by = 'tutorials point' OR title = 'MongoDB Overview')'
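A query matching that description (reconstructed, since the statement was omitted here) would be:
>db.mycol.find({ "likes": { $gt: 10 }, $or: [ { "by": "tutorials point" }, { "title": "MongoDB Overview" } ] }).pretty()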
To specify an equality match on the whole embedded document, use the query document { <field>:
<value>} where <value> is the document to match. Equality matches on an embedded document
require an exact match of the specified <value>, including the field order.
In the following example, the query matches all documents where the value of the field producer is an
embedded document that contains only the field company with the value ’ABC123’ and the field
address with the value ’123 Street’, in the exact order:
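A sketch of such a query (the collection name is assumed):
db.inventory.find( { producer: { company: "ABC123", address: "123 Street" } } )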
Use the dot notation to match by specific fields in an embedded document. Equality matches for
specific fields in an embedded document will select documents in the collection where the embedded
document contains the specified fields with the specified values. The embedded document can contain
additional fields.
In the following example, the query uses the dot notation to match all documents where the value of
the field producer is an embedded document that contains a field company with the value ’ABC123’
and may contain other fields:
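A sketch, using the same assumed collection:
db.inventory.find( { "producer.company": "ABC123" } )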
Use the dot notation to return specific fields inside an embedded document. For example, the
inventory collection contains the following document:
{
"_id" : 3, "type" : "food", "item" : "aaa",
"classification": { dept: "grocery", category: "chocolate" }
}
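A projection returning only the category field inside classification might look like this (a sketch):
db.inventory.find( { _id: 3 }, { "classification.category": 1, _id: 0 } )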
Arrays
When the field holds an array, you can query for an exact array match or for specific values in the
array. If the array holds embedded documents, you can query for specific fields in the embedded
documents using dot notation. If you specify multiple conditions using the $elemMatch operator, the
array must contain at least one element that satisfies all the conditions.
If you specify multiple conditions without using the $elemMatch operator, then some combination of
the array elements, not necessarily a single element, must satisfy all the conditions; i.e. different
elements in the array can satisfy different parts of the conditions. Consider an inventory collection that
contains the following documents:
{ _id: 5, type: "food", item: "aaa", ratings: [ 5, 8, 9 ] }
{ _id: 6, type: "food", item: "bbb", ratings: [ 5, 9 ] }
{ _id: 7, type: "food", item: "ccc", ratings: [ 9, 5, 8 ] }
To specify equality match on an array, use the query document { <field>: <value> } where <value> is
the array to match. Equality matches on the array require that the array field match exactly the
specified <value>, including the element order.
The following example queries for all documents where the field ratings is an array that holds exactly
three elements, 5, 8, and 9, in this order:
db.inventory.find( { ratings: [ 5, 8, 9 ] } )
The operation returns the following document:
{ "_id" : 5, "type" : "food", "item" : "aaa", "ratings" : [ 5, 8, 9 ] }
Equality matches can specify a single element in the array to match. These specifications match if the
array contains at least one element with the specified value.
The following example queries for all documents where ratings is an array that contains 5 as one of its
elements:
db.inventory.find( { ratings: 5 } )
Equality matches can specify equality matches for an element at a particular index or position of the
array using the dot notation. In the following example, the query uses the dot notation to match all
documents where the ratings array contains 5 as the first element:
db.inventory.find( { 'ratings.0': 5 } )
The following operation queries the bios collection and returns the last field in the name embedded
document and the first two elements in the contribs array:
db.bios.find({ }, { _id: 0, 'name.last': 1, contribs: { $slice: 2 } } )
Starting in MongoDB 4.4, you can also specify embedded fields using the nested form, e.g.
db.bios.find( { }, { _id: 0, name: { last: 1 }, contribs: { $slice: 2 } } )
The $elemMatch operator limits the contents of an <array> field from the query results to contain only
the first element matching the $elemMatch condition.
For Single Element Satisfies the Criteria use $elemMatch operator to specify multiple criteria on the
elements of an array such that at least one array element satisfies all the specified criteria. The
following example queries for documents where the ratings array contains at least one element that is
greater than ($gt) 5 and less than ($lt) 9:
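The query being described (reconstructed from the text) is:
db.inventory.find( { ratings: { $elemMatch: { $gt: 5, $lt: 9 } } } )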
The operation returns the following documents, whose ratings array contains the element 8 which
meets the criteria:
{ "_id" : 5, "type" : "food", "item" : "aaa", "ratings" : [ 5, 8, 9 ] }
{ "_id" : 7, "type" : "food", "item" : "ccc", "ratings" : [ 9, 5, 8 ] }
The following example queries for documents where the ratings array contains elements that in some
combination satisfy the query conditions; e.g., one element can satisfy the greater than 5 condition
and another element can satisfy the less than 9 condition, or a single element can satisfy both:
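The corresponding query without $elemMatch (reconstructed from the text) is:
db.inventory.find( { ratings: { $gt: 5, $lt: 9 } } )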
The operation returns the following documents:
{ "_id" : 5, "type" : "food", "item" : "aaa", "ratings" : [ 5, 8, 9 ] }
{ "_id" : 6, "type" : "food", "item" : "bbb", "ratings" : [ 5, 9 ] }
{ "_id" : 7, "type" : "food", "item" : "ccc", "ratings" : [ 9, 5, 8 ] }
The document with the ratings: [ 5, 9 ] matches the query since the element 9 is greater than 5 (the first condition) and the element 5 is less than 9 (the second condition).
Array of Embedded Documents
If you know the array index of the embedded document, you can specify the document using the
embedded document’s position using the dot notation. The following example selects all documents
where the memos contains an array whose first element (i.e. index is 0) is a document that contains
the field by whose value is ’shipping’:
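A sketch of that query, assuming a memos array of embedded documents as in the surrounding examples:
db.inventory.find( { 'memos.0.by': 'shipping' } )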
If you do not know the index position of the document in the array, concatenate the name of the field
that contains the array, with a dot (.) and the name of the field in the embedded document.
The following example selects all documents where the memos field contains an array that contains at
least one embedded document that contains the field by with the value ’shipping’:
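A sketch of that query (same assumed collection):
db.inventory.find( { 'memos.by': 'shipping' } )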
For Single Element Satisfies the Criteria use $elemMatch operator to specify multiple criteria on an
array of embedded documents such that at least one embedded document satisfies all the specified
criteria.
The following example queries for documents where the memos array has at least one embedded
document that contains both the field memo equal to ’on time’ and the field by equal to ’shipping’:
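A sketch of that query (same assumed collection):
db.inventory.find( { memos: { $elemMatch: { memo: 'on time', by: 'shipping' } } } )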
The following example queries for documents where the memos array contains elements that in some combination satisfy the query conditions; e.g. one element satisfies the field memo equal to ’on time’ and another element satisfies the field by equal to ’shipping’, or a single element satisfies both:
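A sketch of that query (same assumed collection):
db.inventory.find( { 'memos.memo': 'on time', 'memos.by': 'shipping' } )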
db.collection.find() : Selects documents in a collection or view and returns a cursor to the selected documents.
Syntax : db.collection.find(query, projection)
Returns: When the find() method “returns documents,” the method is actually returning a cursor to
the documents.
Example
Consider the collection mycol has the following data −
{_id : ObjectId("507f191e810c19729de860e1"), title: "MongoDB Overview"},
{_id : ObjectId("507f191e810c19729de860e2"), title: "NoSQL Overview"},
{_id : ObjectId("507f191e810c19729de860e3"), title: "Tutorials Point Overview"}
Following example will display the title of the document while querying the document.
>db.mycol.find({},{"title":1,_id:0})
{"title":"MongoDB Overview"}
{"title":"NoSQL Overview"}
{"title":"Tutorials Point Overview"}
>
Please note the _id field is always displayed while executing the find() method; if you don't want this field, then you need to set it as 0.
The projection parameter
It determines which fields are returned in the matching documents. The projection parameter takes a
document of the form: { <field1>: <value>, <field2>: <value> ... }
Projection Description
<field>: <1 or true> Specifies the inclusion of a field.
<field>: <0 or false> Specifies the exclusion of a field.
"<field>.$": <1 or true> With the use of the $ array projection operator, you can specify the
projection to return the first element that match the query condition on
the array field; e.g. "arrayField.$" : 1. (Not available for views.)
<field>: <array projection> Using the array projection operators $elemMatch, $slice, specifies the
array element(s) to include, thereby excluding those elements that do
not meet the expressions. (Not available for views.)
<field>: <$meta expression> Using the $meta operator expression, specifies the inclusion of
available per-document metadata. (Not available for views.)
<field>: <aggregation expression> Specifies the value of the projected field.
Starting in MongoDB 4.4, with the use of aggregation expressions and
syntax, including the use of literals and aggregation variables, you can
project new fields or project existing fields with new values. For
example,
If you specify a non-numeric, non-boolean literal (such as a literal
string or an array or an operator expression) for the projection
value, the field is projected with the new value; e.g.:
o { field: [ 1, 2, 3, "$someExistingField" ] }
o { field: "New String Value" }
o { field: { status: "Active", total: { $sum: "$existingArray" } }
}
To project a literal value for a field, use the $literal aggregation
expression; e.g.:
o { field: { $literal: 5 } }
o { field: { $literal: true } }
o { field: { $literal: { fieldWithValue0: 0, fieldWithValue1: 1 }
}}
In versions 4.2 and earlier, any specification value (with the exception of the previously unsupported document value) is treated as either true or false to indicate the inclusion or exclusion of the field.
For fields that contain arrays, MongoDB provides the following projection operators: $elemMatch,
$slice, and $.
Then the following operation uses the $slice projection operator to return just the first two elements in
the ratings array.
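A sketch of that operation, assuming documents with a ratings array as in the earlier examples:
db.inventory.find( {}, { ratings: { $slice: 2 } } )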
$elemMatch, $slice, and $ are the only way to project portions of an array. For instance, you cannot
project a portion of an array using the array index; e.g. { "ratings.0": 1 } projection will not project the
array with the first element.
Syntax : >db.COLLECTION_NAME.find().limit(NUMBER)
Example : Consider the collection mycol used in the earlier examples.
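The sample data and command were omitted from these notes; a representative use of limit() against that collection would be:
>db.mycol.find({},{"title":1,_id:0}).limit(2)
This returns at most two of the matching documents.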
To sort documents in MongoDB, you need to use sort() method. The method accepts a document
containing a list of fields along with their sorting order. To specify sorting order 1 and -1 are used. 1 is
used for ascending order while -1 is used for descending order.
Syntax : >db.COLLECTION_NAME.find().sort({KEY:1})
Example
Consider the collection mycol has the following data.
{_id : ObjectId("507f191e810c19729de860e1"), title: "MongoDB Overview"}
{_id : ObjectId("507f191e810c19729de860e2"), title: "NoSQL Overview"}
{_id : ObjectId("507f191e810c19729de860e3"), title: "Tutorials Point Overview"}
Following example will display the documents sorted by title in the descending order.
>db.mycol.find({},{"title":1,_id:0}).sort({"title":-1})
{"title":"Tutorials Point Overview"}
{"title":"NoSQL Overview"}
{"title":"MongoDB Overview"}
>
Please note, if you don't specify the sorting preference, then sort() method will display the documents
in ascending order.
We can use the following syntax to sort documents in MongoDB by multiple fields:
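In outline (field names are placeholders; 1 for ascending, -1 for descending):
>db.COLLECTION_NAME.find().sort({ KEY1: 1, KEY2: -1 })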
Following example will set the new title 'New MongoDB Tutorial' of the documents whose title is
'MongoDB Overview'.
>db.mycol.update({'title':'MongoDB Overview'},{$set:{'title':'New MongoDB Tutorial'}})
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
Choose the condition which you want to use to decide which document needs to be updated. Here
we want to update the document which has title = ‘MongoDB Overview’
Use the set command to modify the Field Name
Choose which Field Name you want to modify and enter the new value accordingly.
The update operation returns a WriteResult object which contains the status of the operation. A
successful update of the document returns the above object. The nMatched field specifies the
number of existing documents matched for the update, and nModified specifies the number of
existing documents modified.
>db.mycol.find()
{ "_id" : ObjectId(5983548781331adf45ec5), "title":"New MongoDB Tutorial"}
{ "_id" : ObjectId(5983548781331adf45ec6), "title":"NoSQL Overview"}
{ "_id" : ObjectId(5983548781331adf45ec7), "title":"Tutorials Point Overview"}
>
>db.Employee.update({"Employeeid" : 1},{$set: { "EmployeeName" : "NewMartin"}});
If the command is executed successfully, the following Output will be shown
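Assuming exactly one Employee document matched, the output would typically be:
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })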
To update a field within an embedded document, use the dot notation. When using the dot notation,
enclose the whole dotted field name in quotes. The following updates the model field within the
embedded details document.
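A sketch, assuming a products collection whose documents embed a details sub-document with a model field:
db.products.update(
   { _id: 100 },
   { $set: { "details.model": "14Q3" } }
)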
The update operation returns a WriteResult object which contains the status of the operation. A
successful update of the document returns the following object:
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
To replace the entire content of a document except for the _id field, pass an entirely new document as
the second argument to update(). The replacement document can have different fields from the
original document. In the replacement document, you can omit the _id field since the _id field is
immutable. If you do include the _id field, it must be the same value as the existing value.
The following operation replaces the document with item equal to "BE10". The newly replaced
document will only contain the _id field and the fields in the replacement document.
db.inventory.update( { item: "BE10" },
{
item: "BE05",
stock: [ { size: "S", qty: 20 }, { size: "M", qty: 5 } ], category: "apparel"
}
)
upsert Option
By default, if no document matches the update query, the update() method does nothing. However, by
specifying upsert: true, the update() method either updates matching document or documents, or
inserts a new document using the update specification if no matching document exists.
When you specify upsert: true for an update operation to replace a document and no matching
documents are found, MongoDB creates a new document using the equality conditions in the update
conditions document, and replaces this document, except for the _id field if specified, with the update
document.
When you specify upsert : true for an update operation that modifies specific fields and no matching
documents are found, MongoDB creates a new document using the equality conditions in the update
conditions document, and applies the modification as specified in the update document. The following
update operation either updates specific fields of a matching document or adds a new document if no
matching document exists.
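A sketch of such an operation on the earlier mycol collection (the title used in the filter is assumed not to exist yet):
>db.mycol.update(
   { 'title': 'MongoDB Upsert Demo' },
   { $set: { 'by': 'Tutorials Point' } },
   { upsert: true }
)
Because no document matches the filter, a new document containing both the filter field and the $set field is inserted, and the WriteResult reports nUpserted as 1.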
Insert : If the document does not contain an _id field, then the save() method calls the insert() method.
During the operation, the mongo shell will create an ObjectId and assign it to the _id field.
NOTE
Most MongoDB driver clients will include the _id field and generate an ObjectId before sending the
insert operation to MongoDB; however, if the client sends a document without an _id field,
the mongod will add the _id field and generate the ObjectId.
Update : If the document contains an _id field, then the save() method is equivalent to an update with
the upsert option set to true and the query predicate on the _id field.
Examples
Here save() method performs an insert since the document passed to the method does not contain
the _id field:
db.products.save( { item: "book", qty: 40 } )
During the insert, the shell will create the _id field with a unique ObjectId value, as verified by the
inserted document:
{ "_id" : ObjectId("50691737d386d8fadbd6b01d"), "item" : "book", "qty" : 40 }
Save a New Document Specifying an _id Field : Here save() performs an update with upsert:true since
the document contains an _id field:
db.products.save( { _id: 100, item: "water", qty: 30 } )
Because the _id field holds a value that does not exist in the collection, the update operation results in
an insertion of the document. The results of these operations are identical to an update() method with
the upsert option set to true.
The following example saves a document with an explicit _id; it replaces the existing document with that _id, or inserts a new one if none exists (as the WriteResult below shows, here a new document was inserted).
>db.mycol.save(
{
"_id" : ObjectId("507f191e810c19729de860ea"), "title":"Tutorials Point New Topic",
"by":"Tutorials Point"
}
)
WriteResult({"nMatched" : 0, "nUpserted" : 1, "nModified" : 0,
"_id" : ObjectId("507f191e810c19729de860ea")
})
Parameters:
filter: It specifies the selection criteria for the update. The type of this parameter is document. If it
is an empty document, i.e. {}, then this method will update all the documents of the collection with
the update document.
update: The type of this parameter is document or pipeline, and it contains the modifications that will
be applied to the documents. It can be an update document (containing only update operator expressions) or
an aggregation pipeline (containing only aggregation stages, e.g. $addFields, $project, $replaceRoot).
Options :
{
upsert: <boolean>,
writeConcern: <document>,
collation: <document>,
arrayFilters: [ <filterdocument1>, ... ],
hint: <document|string> // Available starting in MongoDB 4.2.1
}
Optional Parameters:
upsert: The value of this parameter is either true or false. If the value is true, then the method
updates the documents that match the given filter, or, if no document in the collection matches the
filter, inserts a new document (i.e. the update document) into the collection. The type of this
parameter is a Boolean and its default value is false.
writeConcern: It is only used when you do not want to use the default write concern. The type of
this parameter is document.
collation: It specifies the use of the collation for operations. It allows users to specify the language-
specific rules for string comparison like rules for letter case and accent marks. The type of this
parameter is document.
arrayFilters: It is an array of filter documents that indicates which array elements to modify for an
update operation on an array field. The type of this parameter is an array.
hint: It is a document or field that specifies the index to use to support the filter. It can take an
index specification document or the index name string and if you specify an index that does not
exist, then it will give an error.
Example
> db.empDetails.updateOne({First_Name: 'Radhika'},
{ $set: { Age: '30',e_mail: 'radhika_newemail@gmail.com'}})
The operation returns
{ "acknowledged" : true, "matchedCount" : 1, "modifiedCount" : 1 }
after a successful update.
If no matches were found, the operation instead returns:
{ "acknowledged" : true, "matchedCount" : 0, "modifiedCount" : 0 }
Note : Setting upsert: true would insert the document if no match was found.
You can see the updated values if you retrieve the contents of the document using the find method as
shown below −
> db.empDetails.find()
{ "_id" : ObjectId("5dd6636870fb13eec3963bf5"), "First_Name" : "Radhika", "Last_Name" : "Sharma",
"Age" : "00", "e_mail" : "radhika_newemail@gmail.com", "phone" : "9000012345" }
{ "_id" : ObjectId("5dd6636870fb13eec3963bf6"), "First_Name" : "Rachel", "Last_Name" :
"Christopher", "Age" : "00", "e_mail" : "Rachel_Christopher.123@gmail.com", "phone" : "9000054321"
}
{ "_id" : ObjectId("5dd6636870fb13eec3963bf7"), "First_Name" : "Fathima", "Last_Name" : "Sheik",
"Age" : "24", "e_mail" : "Fathima_Sheik.123@gmail.com", "phone" : "9000054321" }
>
MongoDB - Delete Document
Example
Sample data is some product data for an online shop of laptops, as demonstrated below :
[
{
"_id": 1,
"name": "HP EliteBook Model 1",
"price": 38842.0,
"quantity": 1,
"brand": "HP",
"attributes": [
{ "attribute_name": "cpu", "attribute_value": "Intel Core i7" },
{ "attribute_name": "memory", "attribute_value": "8GB" },
{ "attribute_name": "storage", "attribute_value": "256GB" }
]
},
{
"_id": 2,
"name": "Lenovo IdeaPad Model 2",
"price": 9405.0,
"quantity": 2,
"brand": "Lenovo",
"attributes": [
{ "attribute_name": "cpu", "attribute_value": "Intel Core i5" },
{ "attribute_name": "memory", "attribute_value": "8GB" },
{ "attribute_name": "storage", "attribute_value": "256GB" }
]
},
………
]
As we see, the laptop documents in the laptops collection have an attributes field that is an array of
embedded documents. It is more complex to query and update a field like this than simple non-nested
fields.
First, let’s find all the laptops whose CPU is Intel Core i7. This should be pretty simple because
the CPU attribute value is unambiguous:
db.laptops.find(
{
"attributes.attribute_value": "Intel Core i7"
}
)
Note that the nested field is queried with dot notation and must be put in quotes. We should get the
result we want because only one nested document in the attributes array can match that value.
Now let’s find all the laptops that have a memory of 16GB. Intuitively, you may want to use a query like
this:
db.laptops.find(
{
"attributes.attribute_name": "memory",
"attributes.attribute_value": "16GB"
}
)
When the above query is executed, it seems all the laptops whose memory is 16GB are returned:
[
{
_id: 9,
name: 'HP EliteBook Model 9',
price: 22209,
quantity: 9,
brand: 'HP',
attributes: [
{ attribute_name: 'cpu', attribute_value: 'Intel Core i7' },
{ attribute_name: 'memory', attribute_value: '16GB' },
{ attribute_name: 'storage', attribute_value: '512GB' }
]
},
{
_id: 11,
name: 'HP ZBook Model 11',
price: 45175,
quantity: 1,
brand: 'HP',
attributes: [
{
attribute_name: 'cpu',
attribute_value: 'Intel Core i7'
},
{ attribute_name: 'memory', attribute_value: '16GB' },
{ attribute_name: 'storage', attribute_value: '512GB' }
]
},
......
]
However, this is where many beginners of MongoDB make mistakes and where some bugs are
introduced into your code. If you enter “it” in mongosh to show more results or just scroll down close to
the bottom of the result page with an IDE and check carefully, you will find something strange:
...,
{
  _id: 144,
  name: 'HP ZBook Model 144',
  price: 14759,
  quantity: 2,
  brand: 'HP',
  attributes: [
    { attribute_name: 'cpu', attribute_value: 'Intel Core i5' },
    { attribute_name: 'memory', attribute_value: '16GB' },
    { attribute_name: 'storage', attribute_value: '16GB' }
  ]
},
{
  _id: 145,
  name: 'HP ZBook Model 145',
  price: 53855,
  quantity: 3,
  brand: 'HP',
  attributes: [
    { attribute_name: 'cpu', attribute_value: 'Intel Core i7' },
    { attribute_name: 'memory', attribute_value: '8GB' },
    { attribute_name: 'storage', attribute_value: '16GB' }
  ]
},
...
We get the laptops whose storage is 16GB as well with the query above! This is because the above query finds the documents where the attributes array has at least one embedded document that contains the field attribute_name equal to memory and at least one embedded document (but not necessarily the same embedded document) that contains the field attribute_value equal to 16GB.
Since all the attributes arrays have an embedded document whose attribute_name is memory, the
query above gave us erroneous results. What we want is for both conditions to be satisfied by the
same embedded document. To achieve this, we cannot query by dot notation as shown above but
need to use the $elemMatch operator:
db.laptops.find(
  {
    attributes: {
      $elemMatch: {
        attribute_name: 'memory', attribute_value: '16GB'
      }
    }
  }
)
This time the laptops whose storage is 16GB but memory is not 16GB will not be returned. If you don’t believe it, you can count the number of documents that are returned by both queries:
db.laptops.find(
{
"attributes.attribute_name": "memory",
"attributes.attribute_value": "16GB"
}
).count()
// Returns 77
db.laptops.find(
{
attributes: {
$elemMatch: {
attribute_name: 'memory', attribute_value: '16GB'
}
}
}
).count()
// Returns 75
Now let's find the laptops whose memory is 16GB and storage is 1TB. This requires two $elemMatch conditions combined with $and:
db.laptops.find(
{
$and: [
{
attributes: {
$elemMatch: {
attribute_name: 'memory',
attribute_value: '16GB'
}
}
},
{
attributes: {
$elemMatch: {
attribute_name: 'storage',
attribute_value: '1TB'
}
}
}]
}
)
This query will return the result we want:
[
{
_id: 22,
name: 'HP EliteBook Model 22',
price: 32425,
quantity: 1,
brand: 'HP',
attributes: [
{ attribute_name: 'cpu', attribute_value: 'Intel Core i7' },
{ attribute_name: 'memory', attribute_value: '16GB' },
{ attribute_name: 'storage', attribute_value: '1TB' }
]
},
{
_id: 107,
name: 'HP EliteBook Model 107',
price: 35450,
quantity: 6,
brand: 'HP',
attributes: [
{ attribute_name: 'cpu', attribute_value: 'Intel Core i7' },
{ attribute_name: 'memory', attribute_value: '16GB' },
{ attribute_name: 'storage', attribute_value: '1TB' }
]
},
{
_id: 129,
name: 'Lenovo Legion Model 129',
price: 29495,
quantity: 7,
brand: 'Lenovo',
attributes: [
{ attribute_name: 'cpu', attribute_value: 'Intel Core i7' },
{ attribute_name: 'memory', attribute_value: '16GB' },
{ attribute_name: 'storage', attribute_value: '1TB' }
]
}
]
Note that even though the $and operator is the default one in MongoDB, it is mandatory here because we are querying the same field in two different conditions. Otherwise, we will only query by the second condition and will get incorrect results:
db.laptops.find(
  {
    attributes: {
      $elemMatch: {
        attribute_name: 'memory',
        attribute_value: '16GB'
      }
    },
    attributes: {
      $elemMatch: {
        attribute_name: 'storage',
        attribute_value: '1TB'
      }
    }
  }
)
You will get one incorrect result in this case. If you try with other conditions, you will get more incorrect results:
...
{
  _id: 67,
  name: 'HP ZBook Studio Model 67',
  price: 54575,
  quantity: 6,
  brand: 'HP',
  attributes: [
    { attribute_name: 'cpu', attribute_value: 'Intel Core i7' },
    { attribute_name: 'memory', attribute_value: '32GB' },
    { attribute_name: 'storage', attribute_value: '1TB' }
  ]
},
...
Let’s update the memory to 16GB for the laptop with _id equal to 1:
db.laptops.updateOne(
{ _id: 1 },
{ $set: { "attributes.1.attribute_value": "16GB" } }
)
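Hard-coding the array index (attributes.1) only works if memory is always the second element. A sketch of the same update written with the positional $ operator, which avoids that assumption:
db.laptops.updateOne(
  { _id: 1, "attributes.attribute_name": "memory" },     // match the embedded document to change
  { $set: { "attributes.$.attribute_value": "16GB" } }    // $ refers to the first array element matched by the query
)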
Currently mongosh supports a subset of the mongo shell methods. Achieving feature parity
between mongosh and the mongo shell is an ongoing effort.
To maintain backwards compatibility, the methods that mongosh supports use the same syntax as the
corresponding methods in the mongo shell.
We can use the MongoDB Shell to connect to MongoDB version 4.0 or greater.
Prerequisites
The MongoDB server must be installed and running before you can connect to it from
the mongo shell or mongoDB shell.
Once you have verified that the mongod server is running, open a terminal window (or a
command prompt for Windows) and run mongo or mongosh.
Run mongod.exe from a command prompt as shown below. Don’t close the command window.
Run mongo.exe from another command prompt to execute mongo shell. In the prompt we can
type shell command to execute.
You should start MongoDB before starting the shell because the shell automatically attempts to connect to
a MongoDB server on startup. As the shell is a full-featured JavaScript interpreter, it is capable of
running arbitrary JavaScript programs.
As mongosh is built on top of Node.js the entire Node.js API is available inside mongosh. This is a big
step forward from the legacy mongo shell, where the API available to developers was a limited
JavaScript subset. We can customize mongosh to suit developer needs like any other modern tool.
The mongosh console is line oriented. However, you can also use an editor to work with multiline
functions. There are two options:
1. Use the edit command with an external editor.
2. Use the .editor command, a built-in editor.
Here we will discuss only how to use the built-in editor.
Run the .editor command to execute multi-line commands. Press Ctrl+d to run a command
or Ctrl+c to cancel.
The MongoDB shell is a JavaScript and Node.js REPL, so you can also execute JavaScript code in it.
List of databases
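A minimal sketch of the usual commands for this (the database name is illustrative):
> show dbs        // list all databases on the server
> use authDB      // switch the current database
> db              // print the name of the current database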
Querying a collection
To find the documents in the collection "users" in database authDB, issue the command
db.getCollection("users").find(). The db.collection.find() method returns a cursor to the results.
cursor.pretty()
It configures the cursor to display results in a format that is easy to read. The pretty() method has the
prototype form: db.collection.find(<query>).pretty()
The pretty() method:
  Does not change the output format in mongosh.
  Changes the output format in the legacy mongo shell.
Note: In most cases, mongosh methods work the same way as the legacy mongo shell methods. However, some legacy methods are unavailable in mongosh.
db.books.save({
"_id" : ObjectId("54f612b6029b47909a90ce8d"),
"title" : "A Tale of Two Cities",
"text" : "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age
of foolishness...",
"authorship" : "Charles Dickens"})
By using cursor.pretty() you can set the cursor to return data in a format that is easier to read:
db.books.find().pretty()
{
"_id" : ObjectId("54f612b6029b47909a90ce8d"),
"title" : "A Tale of Two Cities",
"text" : "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of
foolishness...",
"authorship" : "Charles Dickens"
}
Inserting a document : Use db.collection.insertOne(theDocument) command as shown below:
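A sketch reusing the books collection from above (the document values are illustrative):
db.books.insertOne({
  "title": "Oliver Twist",
  "authorship": "Charles Dickens"
})
The shell acknowledges the write and reports the generated _id, e.g. { acknowledged: true, insertedId: ObjectId("...") }.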
When you run an aggregation, MongoDB Shell outputs the results directly to the terminal.
Following are the MongoDB collection methods that are used in different scenarios.
1. db.collection.aggregate(pipeline, option)
The aggregate method calculates aggregate values for the data in a collection or a view and returns
computed results. It collects values from various documents, groups them together, and then
performs different types of operations on the grouped data, such as sum, average, minimum, and maximum,
to return a computed result. It is similar to the aggregate functions of SQL.
Aggregation pipeline
The aggregation pipeline consists of stages, and each stage transforms the documents. In other
words, the aggregation pipeline is a multi-stage pipeline: in each stage,
the documents are taken as input and a resultant set of documents is produced;
in the next stage (if available) the resultant documents are taken as input and produce their own output;
this process goes on till the last stage.
(Figure: pictorial representation of the aggregation pipeline.)
The commonly used stages in such a pipeline are:
$match: It is used for filtering the documents; it can reduce the number of documents that are
given as input to the next stage.
$project: It is used to select some specific fields from a collection.
$group: It is used to group documents based on some value.
$sort: It is used to sort the documents, that is, rearrange them.
$skip: It is used to skip n number of documents and pass on the remaining documents.
$limit: It is used to pass on the first n documents, thus limiting the output.
$unwind: It is used to unwind documents that use arrays, i.e. it deconstructs an array field
in the documents to return a document for each element.
$out: It is used to write the resulting documents to a new collection.
Expressions: An expression refers to the name of a field in the input documents, e.g. in { $group : { _id : "$id",
total: { $sum: "$fare" } } } both $id and $fare are expressions.
Note:
In $group, _id is a mandatory field.
$out must be the last stage in the pipeline.
$sum: 1 will count the number of documents, and $sum: "$fare" will give the sum of the total fare
generated per id.
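As a sketch, the stages above can be combined into one pipeline; it assumes a hypothetical trips collection whose documents carry id and fare fields (mirroring the expression example above):
db.trips.aggregate([
  { $match: { fare: { $gt: 0 } } },            // keep only documents with a positive fare
  { $group: { _id: "$id",                      // group by id
              total: { $sum: "$fare" },        // total fare per id
              count: { $sum: 1 } } },          // number of documents per id
  { $sort: { total: -1 } },                    // highest total first
  { $out: "fare_totals" }                      // write the results to a new collection (must be last)
])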
2. db.collection.bulkWrite()
MongoDB provides clients the ability to perform write operations in bulk. Bulk write operations affect
a single collection. Array of write operations are executed by this operation. Operations are executed
in a specific order by default.
MongoDB allows applications to determine the acceptable level of acknowledgement required for bulk
write operations.
The db.collection.bulkWrite() method provides the ability to perform bulk insert, update, and remove operations. MongoDB also supports bulk insert through db.collection.insertMany().
Syntax:
db.collection.bulkWrite(
  [ <op. 1>, <op. 2>, .. ],
  { writeConcern : <document>, ordered : <boolean> }
)
Ordered vs Unordered Operations
With an ordered list of operations, MongoDB executes the operations serially. If an error occurs during
the processing of one of the write operations, MongoDB will return without processing any remaining
write operations in the list.
With an unordered list of operations, MongoDB can execute the operations in parallel, but this
behavior is not guaranteed. If an error occurs during the processing of one of the write operations,
MongoDB will continue to process remaining write operations in the list.
Example: first, insert some sample documents into a pets collection:
db.pets.insertMany([
{ _id: 1, name: "Wag", type: "Dog", weight: 20 },
{ _id: 2, name: "Bark", type: "Dog", weight: 10 },
{ _id: 3, name: "Meow", type: "Cat" },
{ _id: 4, name: "Scratch", type: "Cat" },
{ _id: 5, name: "Bruce", type: "Bat" }
])
We can now use db.collection.bulkWrite() to perform a bulk write operation against that collection.
db.pets.bulkWrite([
{ insertOne: { "document": { "_id": 6, "name": "Bubbles", "type": "Fish" }}},
{ updateOne : {
"filter" : { "_id" : 2 },
"update" : { $set : { "weight" : 15 } }
} },
{ deleteOne : { "filter" : { "_id" : 5 } } },
{ replaceOne : {
"filter" : { "_id" : 4 },
"replacement" : { "name" : "Bite", "type" : "Dog", "weight": 5 }
}}
])
Result:
{
  "acknowledged" : true,
  "deletedCount" : 1,
  "insertedCount" : 1,
  "matchedCount" : 2,
  "upsertedCount" : 0,
  "insertedIds" : {
    "0" : 6
  },
  "upsertedIds" : {
  }
}
In this case, we inserted one document, updated another document, deleted another, and replaced another document. The db.collection.bulkWrite() method returns the following:
  A boolean acknowledged as true if the operation ran with write concern, or false if write concern was disabled.
  A count for each write operation.
  An array containing an _id for each successfully inserted or upserted document.
Another example
The following performs multiple write operations against the characters collection:
try {
db.characters.bulkWrite(
[
{ insertOne : { "document" : { "_id" : 4, "char" : "Dithras", "class" : "barbarian", "lvl" : 4 } } },
{ insertOne : { "document" : { "_id" : 5, "char" : "Taeln", "class" : "fighter", "lvl" : 3 } } },
{ updateOne : { "filter" : { "char" : "Eldon" },
"update" : { $set : { "status" : "Critical Injury" } } }
},
{ deleteOne : { "filter" : { "char" : "Brisbane" } } },
{ replaceOne :
{
"filter" : { "char" : "Meldane" },
"replacement" : { "char" : "Tanys", "class" : "oracle", "lvl" : 4 }
}
}
]
);
}
catch (e) { print(e); }
Some MongoDB Shell collection methods and their syntax as used in bulkWrite()
insertOne: It inserts a single document into the collection.
db.collection.bulkWrite( [
{ insertOne : { "document" : <document> } }
])
updateOne: It updates only one document that matches the filter in the collection.
db.collection.bulkWrite( [
{ updateOne :
{
"filter": <document>,
"update": <document or pipeline>,
"upsert": <boolean>,
"collation": <document>,
"arrayFilters": [ <filterdocument1>, ... ],
"hint": <document|string>
}
}
])
updateMany: It updates all the documents that match the filter in the collection.
db.collection.bulkWrite( [
{ updateMany :{
"filter" : <document>,
"update" : <document or pipeline>,
"upsert" : <boolean>,
"collation": <document>,
"arrayFilters": [ <filterdocument1>, ... ],
"hint": <document|string>
}
}
])
replaceOne: It replaces a single document that matches the filter in the collection.
db.collection.bulkWrite([
{ replaceOne :
{
"filter" : <doc.>,
"replacement" : <doc.>,
"upsert" : <boolean>,
"collation": <document>,
"hint": <document|string>
}
}
])
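The delete operations used in the earlier examples follow the same pattern; roughly, their syntax is:
db.collection.bulkWrite( [
  { deleteOne :  { "filter" : <document>, "collation" : <document> } },
  { deleteMany : { "filter" : <document>, "collation" : <document> } }
] )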
3. db.collection.count(query, option)
The db.collection.count() method is used to return the count of documents that would match a find()
query. The db.collection.count() method does not perform the find() operation but instead counts and
returns the number of results that match a query.
Note:
This method is equivalent to db.collection.find().count().
You cannot use this method in transactions.
On a sharded cluster, if you use this method without a query predicate, then it will return an
inaccurate count if orphaned documents exist or if a chunk migration is in progress. To avoid
such situations, use the db.collection.aggregate() method instead.
Count the number of the documents in the restaurants collection with the field matching the cuisine is
American:
>db.restaurants.find({"cuisine" : "American "}).count()
Output:
6183
Example : Count all documents that match a query using more than one criterion
Count the number of the documents in the collection restaurants filtering with the field cuisine is equal
to Italian and zipcode is 10075:
db.restaurants.find( { "cuisine": "Italian", "address.zipcode": "10075" } ).count();
Output:
4. db.collection.countDocuments(query, options)
The countDocuments() method returns the number of documents that match the query for a collection
or view. It does not use the collection's metadata to return the count, so it returns the actual count
of matching documents.
The db.collection.find method returns a cursor. The cursor.count() method on the cursor counts
the number of documents referenced by a cursor. This is same as the db.collection.count().
Both these methods (cursor.count() and db.collection.count()) are deprecated as of MongoDB
v4.0 in favor of the new APIs countDocuments() and estimatedDocumentCount().
Avoid using the db.collection.count() method without a query predicate since without the query
predicate, the method returns results based on the collection’s metadata, which may result in an
approximate count.
db.collection.countDocuments(query) returns the count of documents that match the query for a
collection or view. This is the method you need to use to count the number of documents in your
collection.
Most of the time all of the above return exactly the same thing. However, only countDocuments()
returns the actual count of the documents; the other methods return counts based upon the collection's
metadata.
Examples
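As a sketch, reusing the restaurants collection from above:
db.restaurants.countDocuments( { "cuisine": "Italian", "address.zipcode": "10075" } )   // exact count matching a query
db.restaurants.countDocuments( {} )                                                     // exact count of all documents
db.restaurants.estimatedDocumentCount()                                                 // fast, approximate count from metadata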
5. db.collection.createIndex()
The createIndex() method builds an index on a collection if an index with the same specification does not already exist.
Keys: For an ascending index on a field we specify a value of 1, and for a descending index
we specify a value of -1.
Example
db.collection.createIndex( { tut_Date: 1 } )
The example below will create an index named category_tutorial. The example creates the index
with a collation that specifies the locale fr and comparison strength 2.
db.collection.createIndex(
{ category: 1 },
{ name: "category_tutorial", collation: { locale: "fr", strength: 2 } }
)
6. db.collection.createIndexes()
The createIndexes() method creates one or more indexes on a collection, based on the fields of the
documents. If an index already exists, this method does not recreate it.
Keypatterns: It is an array that contains the index-specification documents. Each document holds field-
value pairs, where the field is the index key and the value describes the type of index for that field.
For an ascending index on a field we specify a value of 1, and for a descending index we
specify a value of -1.
Example
In the example below we consider an employee collection and create an index on the field Employid:
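A sketch of that command (the field name Employid is assumed):
db.employee.createIndexes( [ { Employid: 1 } ] )
On success the method typically returns a document along the lines of { numIndexesBefore: 1, numIndexesAfter: 2, ok: 1 }.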
Now, the example below creates two indexes on the products collection:
  Index on the manufacturer field in ascending order.
  Index on the category field in ascending order.
It uses a collation which specifies the locale fr and comparison strength 2.
db.products.createIndexes( [ { "manufacturer": 1 }, { "category": 1 } ],
  { collation: { locale: "fr", strength: 2 } } )
Create a compound index on multiple fields: an ascending index on name and a descending index on department:
db.employee.createIndexes([{ name: 1, department: -1 }])
Here, we create the index on multiple fields, i.e. ascending on the name field and descending on the department field.
db.employee.createIndexes([{ joinYear: 1 }], { unique: true })
Here, we create an ascending unique index on the joinYear field by setting the value of the unique
parameter to true.
Introduction
In SQL databases you must determine and declare a table’s schema before inserting data. But
MongoDB supports a flexible schema, which means there is no need to define a structure for the
data before insertion.
MongoDB is a document based database. A set of documents forms a collection. Documents
within the same collection are not required to have the same set of fields or structure, and the data
type for a field can differ across documents within a collection.
To change the structure of the documents in a collection, such as add new fields, remove existing
fields, or change the field values to a new type, we simply update the documents to the new
structure.
This flexibility facilitates easy mapping of documents to an entity or an object. The documents in a
collection may have substantial variation in structure but each one will match the data fields of the
represented entity. In practice, however, the documents in a collection share a similar structure. This
will give us a better chance to set some document validation rules as a way of improving data integrity
during insert and update operations.
Two factors drive the design:
  Application-specific data access patterns (i.e. queries, updates, and processing of the data).
  Finding out the questions that our users will have is paramount to designing our entities.
  The inherent structure of the data itself.
Document Structure
The key decision in designing data models for MongoDB applications revolves around the structure of
documents and how the application represents relationships between data. There are 2 ways in which
the relationships between the data can be established in MongoDB:
1. Embedded Documents 2. Reference Documents
Embedded Documents
Embedded documents capture relationships between data by storing related data in a single document
structure. MongoDB documents make it possible to embed document structures in a field or array
within a document. These denormalized data models allow applications to retrieve and manipulate
related data in a single database operation.
Consider 2 collections student and address. Let us see how the address can be embedded into student
collection.
student collection:
{
  _id: 123,
  name: "Student1"
}
address collection (contains the addresses of the students):
{
  _studentId: 123,
  street: "123 Street",
  city: "Bangalore",
  state: "KA"
}
{
  _studentId: 123,
  street: "456 Street",
  city: "Punjab",
  state: "HR"
}
Embedding multiple addresses can also be done. See the example below:
{
  _id: 123,
  name: "Student1",
  addresses: [
    {
      street: "123 Street",
      city: "Bangalore",
      state: "KA"
    },
    {
      street: "456 Street",
      city: "Punjab",
      state: "HR"
    }
  ]
}
Strength of embedding
We can retrieve all relevant information in a single query. So better performance for read
operations
Avoid implementing joins in application code or using $lookup
Update related information as a single atomic write operation. By default, all CRUD operations on a
single document are ACID compliant
Weaknesses of Embedding
Restricted document size. All documents in MongoDB are constrained to the BSON size of 16
megabytes. Therefore, overall document size together with embedded data should not surpass this
limit. Otherwise, for some storage engines such as MMAPv1, data may outgrow the allocated space,
resulting in data fragmentation and degraded write performance.
Large documents mean more overhead if most fields are not relevant. You can increase query
performance by limiting the size of the documents that you are sending over the wire for each
query.
Data duplication: multiple copies of the same data make it harder to query the replicated data, and
it may take longer to filter embedded documents, which can undo the core advantage of embedding.
Referenced Documents
References store the relationships between data by including links or references from one document to
another. Applications can resolve these references to access the related data. Broadly, these
are normalized data models.
This is one of the ways to implement the relationship between data stored in different collections. In
this, a reference to the data in one collection will be used in connecting the data between the
collections. Consider 2 collections books and authors as shown below:
{
  title: "Java in action",
  author: "author1",
  language: "English",
  publisher: {
    name: "My publications",
    founded: 1990,
    location: "SF"
  }
}
{
  title: "Hibernate in action",
  author: "author2",
  language: "English",
  publisher: {
    name: "My publications",
    founded: 1990,
    location: "SF"
  }
}
In this example, the publisher data is repeated. In order to avoid this repetition, we can add references of the books to the publisher data instead of using the entire publisher data in every book entry, as shown below. This can be done the other way round as well, where one can reference the publisher id in the books data, as per your choice.
{
  name: "My Publications",
  founded: 1980,
  location: "CA",
  books: [111222333, 444555666, ..]
}
{
  _id: 111222333,
  title: "Java in action",
  author: "author1",
  language: "English"
}
{
  _id: 444555666,
  title: "Hibernate in action",
  author: "author2",
  language: "English"
}
Normalized data models describe relationships using references between documents. In general, use normalized data models:
  when embedding would result in duplication of data but would not provide sufficient read
  performance advantages to outweigh the implications of the duplication.
  to represent more complex many-to-many relationships.
  to model large hierarchical data sets.
Strengths of Referencing
Data consistency is better. Same piece of information is not repeated as embedded document in
various other documents (say same author in various books). Hence chances of data inconsistency
are pretty low.
Improved data integrity. Due to normalization, it is easy to update data regardless of operation
duration length and therefore ensure correct data for every document without causing any
confusion. For updating an author we need not update author in several book documents.
By splitting up data, we will have smaller documents. Less likely to reach 16-MB-per-document
limit
Improved cache utilization. Frequently accessed canonical documents stay in the cache, rather than
cache space being taken up by embedded copies that are accessed only a few times.
Improved flexibility especially with a large set of subdocuments. Infrequently accessed
information not needed on every query.
Faster writes.
Weaknesses of Referencing
Multiple lookups: Since we have to look up a number of documents that match the criteria, read time
increases when retrieving from disk. Besides, this may result in cache misses.
Many queries are issued to achieve some operation hence normalized data models require more
round trips to the server to complete a specific operation.
Example of Data Modelling in MongoDB
Let us consider a simple example of building a student database in a college. Assume there are 3
models – Student, Address and Course. In a typical RDBMS database, these 3 models will be translated
into 3 tables as shown below:
Hence, from this model, if a student's details have to be added, then entries should be made in all the 3 tables.
Let us see how the same data can be modelled in MongoDB. The schema design will have only one collection, Student, with the structure shown below. Data related to all the 3 models will be stored under one collection:
{
  _id: 123,
  firstName: 'Test',
  lastName: 'Student',
  address: [{
    City: 'Bangalore',
    State: 'Karnataka',
    Country: 'India'
  }],
  Course: 'MCA'
}
NOTE: Field names in a collection, like firstName and lastName in the above example, also use memory, maybe 10-20 bytes or so. But when the dataset is very large, this can add up to a lot of memory. Hence it is advised, for large datasets, to use short field names to store data in collections, like fname instead of firstName.
Embedded Vs References
Effective data models support your application needs. The key consideration for the structure of your
documents is the decision to embed or to use references.
Proper MongoDB schema design is the most critical part of deploying a scalable, fast, and
affordable database. It's one of the most common questions developers have pertaining to
MongoDB. And the answer is, it depends. This is because document databases have a rich
vocabulary that is capable of expressing data relationships in more ways than SQL.
In an RDBMS, developers model their schema independently of queries.
In this example, you can see that the user data is split into separate tables and it can be JOINED
together using foreign keys in the user_id column of the Professions and Cars table. Now, let's take a
look at how we might model this same data in MongoDB.
Now, MongoDB schema design works a lot differently than relational schema design. With MongoDB
schema design, there is:
a) No formal process b) No algorithms c) No rules
The only thing that matters is that you design a schema that will work well for your application. Two
different apps that use the same exact data might have very different schemas if the applications are
used differently. When designing a schema, we want to take into consideration the following:
  Store the data efficiently
  Provide good query performance
  Require a reasonable amount of hardware
Let's take a look at how we might model the relational User model in MongoDB.
{
  "first_name": "Paul",
  "surname": "Miller",
  "cell": "447557505611",
  "city": "London",
  "location": [45.123, 47.232],
  "profession": ["banking", "finance", "trader"],
  "cars": [
    { "model": "Bentley", "year": 1973 },
    { "model": "Rolls Royce", "year": 1965 }
  ]
}
Here, instead of splitting our data up into separate collections or documents, we take advantage of MongoDB's document based design to embed data into arrays and objects within the User object. Now we can make one simple query to pull all that data together for our application.
Type of Relationships
Now we will discuss some interesting patterns and relationships and how we model them with real-
world examples.
The most important consideration we make for our schema is how the data is going to be used by the
system. So for the same exact data as the examples listed below, we might have a completely different
schema than the one that outlined here. In each example, we will outline the requirements for each
application and why a given schema was used for that example.
In this process we are going to establish a couple of handy rules to help our schema design.
1. One-to-One
Let's take a look at our User document. This example has some one-to-one data in it. For example,
here one user can only have one name. So, this would be an example of a one-to-one relationship. We
can model all one-to-one data as key-value pairs in our database.
{
  "_id": "ObjectId('AAA')",
  "name": "Joe Karlsson",
  "company": "MongoDB",
  "twitter": "@JoeKarlsson1",
  "twitch": "joe_karlsson",
  "tiktok": "joekarlsson",
  "website": "joekarlsson.com"
}
We should prefer key-value pairs embedded in the document. For example, an employee can work in one and only one department.
Subset Pattern : A potential problem with the embedded document pattern is that it can lead to large
documents that contain fields that the application does not need. This unnecessary data can cause
extra load on the server and slow down read operations. Instead, you can use the subset pattern to
retrieve the subset of data which is accessed the most frequently in a single database call.
Consider an application that shows information on movies. The database contains a movie collection
with the following schema:
{
"_id": 1,
"title": "The Arrival of a Train", "year": 1896, "runtime": 1,
"released": ISODate("01-25-1896"),
"poster": "http://ia.media-imdb.com/images/M/MV5BMjEyNDk5MDYzOV5BMl5BanBnXkFtZTgwNjIx
MTEwMzE@._V1_SX300.jpg",
"plot": "A group of people are standing in a straight line along the platform of a railway station,
waiting for a train, which is seen coming at some distance. When the train stops at the
platform, ...",
"fullplot": "A group of people are standing in a straight line along the platform of a railway station,
waiting for a train, which is seen coming at some distance. When the train stops at the
platform, the line dissolves. The doors of the railway-cars open, and people on the platform
help passengers to get off.",
"lastupdated": ISODate("2015-08-15T10:06:53"),
"type": "movie", The movie collection contains several
"directors": [ "Auguste Lumière", "Louis Lumière" ], fields that the application does not
"imdb": { need to show a simple overview of a
"rating": 7.3, movie, such as fullplot and rating
"votes": 5043, information.
"id": 12
},
"countries": [ "France" ],
"genres": [ "Documentary", "Short" ],
"tomatoes": {
"viewer": {
"rating": 3.7,
"numReviews": 59
Applying the subset pattern, the data is split into two collections. The movie collection contains basic information on a movie. This is the data that the application loads by default:
// movie collection
{
"_id": 1,
"title": "The Arrival of a Train", "year": 1896, "runtime": 1,
"released": ISODate("1896-01-25"),
"type": "movie",
"directors": [ "Auguste Lumière", "Louis Lumière" ],
"countries": [ "France" ],
"genres": [ "Documentary", "Short" ],
}
The movie_details collection contains additional, less frequently-accessed data for each movie:
// movie_details collection
{
"_id": 156,
"movie_id": 1, // reference to the movie collection
"poster": "http://ia.media-
imdb.com/images/M/MV5BMjEyNDk5MDYzOV5BMl5BanBnXkFtZTgwNjIxMTEwMzE@._V1_SX300.
jpg",
"plot": "A group of people are standing in a straight line along the platform of a railway station,
waiting for a train, which is seen coming at some distance. When the ….",
"fullplot": "A group of people are standing in a straight line along the platform of a railway station,
waiting for a train, which is seen coming at some distance. When the train stops at the platform,
the line dissolves. The doors of the railway-cars open, and people….",
"lastupdated": ISODate("2015-08-15T10:06:53"),
"imdb": {
"rating": 7.3, This method improves read performance because it requires
"votes": 5043, the application to read less data to fulfill its most common
"id": 12 request. The application can make an additional database
}, call to fetch the less-frequently accessed data if needed.
"tomatoes": { TIP
"viewer": { When considering where to split your data, the most
"rating": 3.7, frequently-accessed portion of the data should go in the
"numReviews": 59 collection that the application loads first.
},
"lastUpdated": ISODate("2020-01-29T00:02:53")
Smaller documents result in improved read performance and make more memory available for the
application. However, it is important to understand your application and the way it loads data. If you
split your data into multiple collections improperly, your application will often need to make multiple
trips to the database and rely on JOIN operations to retrieve all of the data that it needs.
In addition, splitting your data into many small collections may increase required database
maintenance, as it may become difficult to track what data is stored in which collection.
2. One-to-Many
While the most common way to represent a one-to-one relationship in a document database is
through an embedded document, there are several ways to model one-to-many relationships in a
document schema. When considering your options for how to best model these, though, there are
three properties of the given relationship you should consider:
Cardinality: Cardinality is the measure of the number of individual elements in a given set. For
example, if a class has 30 students, you could say that class has a cardinality of 30. In a one-to-
many relationship, the size of “many” will affect how you might model the data.
Independent access: Some related data will rarely, if ever, be accessed separately from the main
object. Whether or not we will ever access a related document alone will also affect how we might
model the data.
Whether the relationship between data is strictly a one-to-many relationship: For example
consider the courses a student attends at a university. From the student’s perspective, they can
participate in multiple courses, so it may seem like a one-to-many relationship. However, university
courses are rarely attended by a single student; more often, multiple students will attend the same
class. In cases like this, the relationship in question is not really a one-to-many relationship, but a
many-to-many relationship, and thus you’d take a different approach to model this relationship
than you would a one-to-many relationship.
For example, we might need to store several addresses associated with a given user. It's unlikely that a
user for our application would have more than a couple of different addresses. For relationships like
this, we would define this as a one-to-few relationship.
{
  "_id": "ObjectId('AAA')",
  "name": "Joe Karlsson",
  "company": "MongoDB",
  "twitter": "@JoeKarlsson1",
  "twitch": "joe_karlsson",
  "tiktok": "joekarlsson",
  ...
}
Prefer embedding for one-to-few relationships.
A potential problem with the embedded document pattern is that it can lead to large documents. In
this case, you can use the subset pattern to only access data which is required by the application,
instead of the entire set of embedded data. Consider an e-commerce site that has a list of reviews for a
product:
{
"_id": 1,
"name": "Super Widget",
"description": "This is the most useful item in your toolbox.",
"price": { "value": NumberDecimal("119.99"), "currency": "USD" },
"reviews": [
{
"review_id": 786,
"review_author": "Kristina", "review_text": "This is indeed an amazing widget.",
"published_date": ISODate("2019-02-18")
},
{
"review_id": 785,
"review_author": "Trina", "review_text": "Nice product. Slow shipping.",
"published_date": ISODate("2019-02-17")
    },
    ...
    {
      "review_id": 1,
      "review_author": "Hans", "review_text": "Meh, it's okay.",
      "published_date": ISODate("2017-12-06")
    }
  ]
}
The reviews are sorted in reverse chronological order. When a user visits a product page, the application loads the ten most recent reviews.
Instead of storing all of the reviews with the product, you can split the collection into two collections:
The product collection stores information on each product, including the product’s ten most recent
reviews:
{
"_id": 1, "name": "Super Widget",
Using smaller documents containing more frequently-accessed data reduces the overall size of the
working set. These smaller documents result in improved read performance for the data that the
application accesses most frequently.
However, the subset pattern results in data duplication. In the example, reviews are maintained in both the product collection and the review collection, and the two copies must be kept consistent.
In addition to product reviews, the subset pattern can also be a good fit to store:
Comments on a blog post, when you only want to show the most recent or highest-rated
comments by default.
Cast members in a movie, when you only want to show cast members with the largest roles by
default.
Let's say that we are building a product page for an e-commerce website, and we are going to have to
design a schema that will be able to show product information. In our system, we save information
about all the many parts that make up each product for repair services. How would you design a
schema to save all this data, but still make your product page performant? You might want to consider
a one-to-many schema since your one product is made up of many parts.
Now, with a schema that could potentially be saving thousands of sub parts, we probably do not need
to have all of the data for the parts on every single request, but it's still important that this relationship
is maintained in our schema. So, we might have a Products collection with data about each product in
our e-commerce store, and in order to keep that part data linked, we can keep an array of Object IDs
that link to a document that has information about the part. These parts can be saved in the same
collection or in a separate collection, if needed. Let's take a look at how this would look.
Child references (i.e. child ids kept in the parent document, as sketched below) work well when there are
too many related objects to embed them directly inside the parent document, but the number is still
within known bounds.
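A sketch of such a schema (collection names, field names, and values are illustrative):
// parts collection – one document per part
{
  "_id": ObjectID("AAAA"),
  "partno": "123-aff-456",
  "name": "#4 grommet",
  "qty": 94,
  "price": 3.99
}
// products collection – the parts array holds child references (ObjectIDs of part documents)
{
  "name": "Smoke shifter 3000",
  "manufacturer": "Acme Corp",
  "catalog_number": 1234,
  "parts": [ ObjectID("AAAA"), ObjectID("F17C"), ObjectID("D2AA") ]
}
The application first fetches the product, then looks up only the parts it actually needs, for example with db.parts.find( { _id: { $in: product.parts } } ).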
What if we have a schema where there could be potentially millions of subdocuments, or more?
There are cases when the number of associated documents might be unbounded and will continue to
grow with time. Let's imagine that you have been asked to create a server logging application. Each
server could potentially save a massive amount of data, depending on how verbose you're logging and
how long you store server logs for.
With MongoDB, tracking data within an unbounded array is dangerous, since we could potentially hit
that 16-MB-per-document limit. Any given host could generate enough messages to overflow the 16-
MB document size, even if only ObjectIDs are stored in an array. So, we need to rethink how we can
track this relationship without coming up against any hard limits.
So, instead of tracking the relationship between the host and the log messages in the host document,
let each log message store the host that it is associated with. By storing the host reference in the
log message, we no longer need to worry about an unbounded array messing with our application! This is
known as parent referencing, i.e. keeping a reference to the parent in the child document.
Hosts:
{
  "_id": ObjectID("AAAB"),
  "name": "goofy.example.com",
  "ipaddr": "127.66.66.66"
}
Log Message:
{
  "time": ISODate("2014-03-28T09:42:41.382Z"),
  "message": "cpu is on fire!",
  "host": ObjectID("AAAB")
}
Rule 4: Arrays should not grow without bound. If there are more than a couple of hundred documents on the "many" side, don't embed them; if there are more than a few thousand documents on the "many" side, don't use an array of ObjectID references. High-cardinality arrays are a compelling reason not to embed.
Another example : Imagine that the university’s student council has a message board where any
student can post whatever messages they want, including questions about courses, travel stories, job
postings, study materials, or just free chat. A sample message in this example consists of a subject
and a message body:
{
"_id": ObjectId("61741c9cbc9ec583c836174c"),
Now consider using child references instead of embedding full documents as in the previous example.
The individual messages would be stored in a separate collection, and the student’s document could
then have the following structure:
{
In such scenarios, a common way to connect one object to another is through parent references. Here
it is not the student document that refers to individual messages; rather, a reference in the message’s
document points towards the student who wrote it.
To use parent references, you would need to modify the message document schema to contain a
reference to the student who authored the message:
{
"_id": ObjectId("61741c9cbc9ec583c836174c"),
"subject": "Books on kinematics and dynamics",
"message": "Hello! Could you recommend a good introductory books covering the topics of
kinematics and dynamics? Thanks!",
"posted_on": ISODate("2021-07-23T16:03:21Z"), Student id – parent reference
"posted_by": ObjectId("612d1e835ebee16872a109a4")
}
Notice the new posted_by field contains the object identifier of the student’s document. Now, the
student’s document won’t contain any information about the messages they’ve posted:
{
"_id": ObjectId("612d1e835ebee16872a109a4"),
"first_name": "Sammy",
"last_name": "Shark",
"emails": [ { "email": "sammy@digitalocean.com", "type": "work" },
{ "email": "sammy@example.com", "type": "home" }
],
"courses": [ ObjectId("61741c9cbc9ec583c836170a"), ObjectId("61741c9cbc9ec583c836170b") ]
}
When using parent references, creating an index on the field referencing the parent document can
significantly increase the query performance each time you filter against the parent document
identifier.
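For example, assuming the messages are stored in a messages collection:
db.messages.createIndex( { posted_by: 1 } )                                   // index the parent-reference field
db.messages.find( { posted_by: ObjectId("612d1e835ebee16872a109a4") } )       // all messages by one student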
In this type of situation it’s generally advised that we store related documents separately and use
parent references to connect them to the parent document.
3. Many-to-Many
This is another very common schema pattern that we see all the time in relational and MongoDB
schema designs. For this pattern, let's imagine that we are building a to-do application. In our app, a
user may have many tasks and a task may have many users assigned to it.
In order to preserve these relationships between users and tasks, there will need to be references from
the one user to the many tasks and references from the one task to the many users. Let's look at how
this could work for a to-do list application.
Users:
{
"_id": ObjectID("AAF1"),
"name": "Kate Monster",
"tasks": [ObjectID("ADF9"), ObjectID("AE02"), ObjectID("AE73")]
}
Tasks:
{
  "_id": ObjectID("ADF9"),
  "description": "Write blog post about MongoDB schema design",
  "due_date": ISODate("2014-04-01"),
  "owners": [ObjectID("AAF1"), ObjectID("BB3G")]
}
We can see that each user has a sub-array of linked tasks, and each task has a sub-array of owners for each item in our to-do app.
Note: There is no firm rule when to use child references, parent references or both based on
cardinality of the relation. We might choose a different approach at either a lower or higher cardinality
if it’s what best suits the application in question. After all, we will always want to structure our data to
suit the manner in which your application queries and updates it.
Each MongoDB document contains a certain amount of overhead. This overhead is normally
insignificant but becomes significant if all documents are just a few bytes, as might be the case if the
documents in your collection only have one or two fields.
Consider the following suggestions and strategies for optimizing storage utilization for these
collections:
MongoDB clients automatically add an _id field to each document and generate a unique 12-
byte ObjectId for the _id field. Furthermore, MongoDB always indexes the _id field. For smaller
documents this may account for a significant amount of space.
To optimize storage use, users can specify a value for the _id field explicitly when inserting documents
into the collection. The value in the _id field serves as a primary key for documents in the collection, so
it must be unique.
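A sketch of this idea (the collection and key format are illustrative): if the application already has a value that is guaranteed unique, it can be stored directly as _id instead of carrying both that value and a generated ObjectId:
db.ratings.insertOne( { _id: "user123:movie456", score: 4 } )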
MongoDB stores all field names in every document. Consider a collection of small documents that
resemble the following:
{ last_name : "Smith", best_score: 3.9 }
If we shorten the field named last_name to lname and the field named best_score to score, as follows,
you could save 9 bytes per document.
{ lname : "Smith", score : 3.9 }
NOTE
Shortening field names reduces expressiveness and does not provide considerable benefit for larger
documents and where document overhead is not of significant concern. Shorter field names do not
reduce the size of indexes, because indexes have a predefined structure.
In general, it is not necessary to use short field names.
When designing a data model, consider how applications will use your database. For instance, if your
application only uses recently inserted documents, consider using Capped Collections. Or if your
application needs are mainly read operations to a collection, adding indexes to support common
queries can improve performance.
While designing data model we should consider various operational factors that impact the
performance of MongoDB. These factors are operational or address requirements that arise outside of
the application but impact the performance of MongoDB based applications. When developing a data
model, analyze all of your application’s read operations and write operations in conjunction with the
following considerations.
Document Growth
Some updates to documents can increase the size of documents. These updates include pushing
elements to an array (i.e. $push) and adding new fields to a document. When using the MMAPv1
storage engine, document growth can be a consideration for your data model. For MMAPv1, if the
document size exceeds the allocated space for that document, MongoDB will relocate the
document on disk.
Atomicity
In MongoDB, a write operation is atomic on the level of a single document, even if the operation
modifies multiple embedded documents within a single document.
A data model that embeds related data in a single document facilitates these kinds of atomic
operations. For data models that store references between related pieces of data, the application
must issue separate read and write operations to retrieve and modify these related pieces of data.
Ensure that the application stores all fields with atomic dependency requirements in the same
document. If the application can tolerate non-atomic updates for two pieces of data, you can store
these data in separate documents.
o When a single write operation (e.g. db.collection.updateMany()) modifies multiple documents,
the modification of each document is atomic, but the operation as a whole is not atomic.
o When performing multi-document write operations, whether through a single write operation
or multiple write operations, other operations may interleave.
For situations that require atomicity of reads and writes to multiple documents (in a single or multiple
collections), MongoDB supports multi-document transactions:
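A minimal mongosh sketch of such a transaction (the bank database, accounts collection, and balances are assumptions; transactions also require a replica set or sharded cluster):
const session = db.getMongo().startSession();
session.startTransaction();
try {
  const accounts = session.getDatabase("bank").accounts;
  accounts.updateOne({ _id: "A" }, { $inc: { balance: -100 } });
  accounts.updateOne({ _id: "B" }, { $inc: { balance: 100 } });
  session.commitTransaction();    // both updates become visible atomically
} catch (e) {
  session.abortTransaction();     // on error, neither update is applied
  throw e;
} finally {
  session.endSession();
}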
Sharding
MongoDB uses sharding to provide horizontal scaling by partitioning a collection within a database
to distribute the collection’s documents across a number of mongod instances or shards. These
clusters support deployments with large data sets and high-throughput operations.
To distribute data and application traffic in a sharded collection, MongoDB uses the shard key.
Selecting the proper shard key has significant implications for performance, and can enable or
prevent query isolation and increased write capacity. It is important to consider carefully the field
or fields to use as the shard key.
Indexes
Use indexes to improve performance for common queries. Build indexes on fields that appear often
in queries and for all operations that return sorted results. MongoDB automatically creates a
unique index on the _id field.
In certain situations, you might choose to store related information in several collections rather
than in a single collection. Consider a sample collection logs that stores log documents for various
environment and applications. The logs collection contains documents of the following form:
{ log: "dev", ts: ..., info: ... }
{ log: "debug", ts: ..., info: ... }
If the total number of documents is low, you may group documents into collections by type. For logs, consider maintaining distinct log collections, such as logs_dev and logs_debug. The logs_dev collection would contain only the documents related to the dev environment.
Generally, having a large number of collections has no significant performance penalty and results
in very good performance. Distinct collections are very important for high-throughput batch
processing.
When using models that have a large number of collections, consider the following behaviors:
Data modeling decisions should take data lifecycle management into consideration.
The Time to Live or TTL feature of collections expire documents after a period of time. Consider
using the TTL feature if your application requires some data to persist in the database for a limited
period of time.
Additionally, if your application only uses recently inserted documents, consider Capped
Collections. Capped collections provide first-in-first-out (FIFO) management of inserted documents
and efficiently support operations that insert and read documents based on insertion order.
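Hedged sketches of both options (the collection and field names are illustrative):

// TTL: documents in "sessions" expire one hour after their createdAt value
db.sessions.createIndex( { createdAt: 1 }, { expireAfterSeconds: 3600 } )

// Capped collection: a fixed 10 MB of recent entries, read back in insertion order
db.createCollection( "recent_events", { capped: true, size: 10485760 } )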
GridFS
GridFS is a specification for storing and retrieving files that exceed the BSON-document size limit of
16MB. Instead of storing a file in a single document, GridFS divides a file into parts, or chunks, and
stores each of those chunks as a separate document. By default, GridFS uses a chunk size of 255 kB.
GridFS uses two collections to store files. One collection stores the file chunks, and the other stores file
metadata. When you query a GridFS store for a file, the driver or client will reassemble the chunks as
needed. You can perform range queries on files stored through GridFS. You also can access information
from arbitrary sections of files, which allows you to “skip” into the middle of a video or audio file.
GridFS is useful not only for storing files that exceed 16MB but also for storing any files for which you
want access without having to load the entire file into memory.
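As a hedged illustration, the mongofiles command-line tool is one of several ways to work with GridFS; the database and file names below are placeholders:

mongofiles -d media put lecture-video.mp4    # store the file as chunk and metadata documents
mongofiles -d media list                     # list files stored in the media database
mongofiles -d media get lecture-video.mp4    # reassemble the chunks back into a file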
Pattern
The Parent References pattern stores each tree node in a document; in addition to the tree node, the
document stores the id of the node’s parent. The following example models a hierarchy of categories using Parent References, storing the reference to the parent category in the field parent:
db.categories.insertMany( [
   { _id: "MongoDB", parent: "Databases" },
   { _id: "dbm", parent: "Databases" },
   { _id: "Databases", parent: "Programming" },
   { _id: "Languages", parent: "Programming" },
   { _id: "Programming", parent: "Books" },
   { _id: "Books", parent: null }
] )
The above tree structure is represented in these documents.
o The query to retrieve the parent of a node is fast and straightforward:
db.categories.findOne( { _id: "MongoDB" } ).parent
o We can create an index on the field parent to enable fast search by the parent node:
db.categories.createIndex( { parent: 1 } )
o We can query by the parent field to find its immediate children nodes:
db.categories.find( { parent: "Databases" } )
o To retrieve subtrees, see $graphLookup.
In this data model a tree-like structure in MongoDB documents is described by storing references of
child nodes (in an array) in the parent-nodes.
Pattern
The Child References pattern stores each tree node in a document; in addition to the tree node,
document stores in an array the id(s) of the node’s children. The following example models the tree
using Child References, storing the reference to the node’s children in the field children:
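A sketch of the corresponding documents for the same category tree (derived from the Parent References data above):

db.categories.insertMany( [
   { _id: "MongoDB", children: [] },
   { _id: "dbm", children: [] },
   { _id: "Databases", children: [ "MongoDB", "dbm" ] },
   { _id: "Languages", children: [] },
   { _id: "Programming", children: [ "Databases", "Languages" ] },
   { _id: "Books", children: [ "Programming" ] }
] )
// An index on children enables fast lookup of a node's parent:
// db.categories.createIndex( { children: 1 } )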
This data model describes a tree-like structure in MongoDB documents using references to parent nodes and an array that stores all ancestors.
Pattern : The Array of Ancestors pattern stores each tree node in a document; in addition to the tree
node, document stores in an array the id(s) of the node’s ancestors or path.
The following example models the above tree using Array of Ancestors. In addition to
the ancestors’ field, these documents also store the reference to the immediate parent category in
the parent field:
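A sketch of the corresponding documents for the same tree (derived from the Parent References data above):

db.categories.insertMany( [
   { _id: "MongoDB", ancestors: [ "Books", "Programming", "Databases" ], parent: "Databases" },
   { _id: "dbm", ancestors: [ "Books", "Programming", "Databases" ], parent: "Databases" },
   { _id: "Databases", ancestors: [ "Books", "Programming" ], parent: "Programming" },
   { _id: "Languages", ancestors: [ "Books", "Programming" ], parent: "Programming" },
   { _id: "Programming", ancestors: [ "Books" ], parent: "Books" },
   { _id: "Books", ancestors: [ ], parent: null }
] )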
o The query to retrieve the ancestors or path of a node is fast and straightforward:
db.categories.findOne( { _id: "MongoDB" } ).ancestors
o We can create an index on the field ancestors to enable fast search by the ancestors nodes:
db.categories.createIndex( { ancestors: 1 } )
o We can query by the field ancestors to find all its descendants:
db.categories.find( { ancestors: "Programming" } )
The Array of Ancestors pattern provides a fast and efficient solution to find the descendants and the
ancestors of a node by creating an index on the elements of the ancestors field. This makes Array of
Ancestors a good choice for working with subtrees.
The Array of Ancestors pattern is slightly slower than the Materialized Paths pattern but is more
straightforward to use.
This data model describes a tree-like structure in MongoDB documents by storing full relationship paths between documents.
Pattern : The Materialized Paths pattern stores each tree node in a document; in addition to the tree
node, document stores as a string the id(s) of the node’s ancestors or path. Although the Materialized
Paths pattern requires additional steps of working with strings and regular expressions, the pattern
also provides more flexibility in working with the path, such as finding nodes by partial paths.
The following example models the tree using Materialized Paths, storing the path in the field path; the
path string uses the comma , as a delimiter:
db.categories.insertMany( [
{ _id: "Books", path: null },
{ _id: "Programming", path: ",Books," },
{ _id: "Databases", path: ",Books,Programming," },
{ _id: "Languages", path: ",Books,Programming," },
{ _id: "MongoDB", path: ",Books,Programming,Databases," },
{ _id: "dbm", path: ",Books,Programming,Databases," },
{ _id: "Java", path: ",Books,Programming,Languages," },
{ _id: "C++", path: ",Books,Programming,Languages," },
{ _id: "C", path: ",Books,Programming,Languages," },
{ _id: "Functional", path: ",Books,Programming,Languages,Java," },
{ _id: "Spring", path: ",Books,Programming,Languages,Java," },
{ _id: "MicroServices", path: ",Books,Programming,Languages,Java," },
] );
// add index on "path" field.
db.categories.createIndex( { path: 1 } );
For these queries an index may provide some performance improvement if the index is significantly
smaller than the entire collection.
Query
We can query to retrieve the whole tree, sorting by the field path
db.categories.find().sort( { path: 1 } );
//Output:
{ "_id" : "Books", "path" : null }
{ "_id" : "Programming", "path" : ",Books," }
{ "_id" : "Databases", "path" : ",Books,Programming," }
{ "_id" : "Languages", "path" : ",Books,Programming," }
{ "_id" : "MongoDB", "path" : ",Books,Programming,Databases," }
{ "_id" : "dbm", "path" : ",Books,Programming,Databases," }
{ "_id" : "Java", "path" : ",Books,Programming,Languages," }
{ "_id" : "C++", "path" : ",Books,Programming,Languages," }
{ "_id" : "C", "path" : ",Books,Programming,Languages," }
{ "_id" : "Functional", "path" : ",Books,Programming,Languages,Java," }
{ "_id" : "Spring", "path" : ",Books,Programming,Languages,Java," }
{ "_id" : "MicroServices", "path" : ",Books,Programming,Languages,Java," }
Use regular expressions on the path field to find the descendants of Databases.
Note : A regular expression is a “prefix expression” if it starts with a caret (^) or a left anchor (\A),
followed by a string of simple symbols. The ^ is used to make sure that the string starts with a certain
character. For example, the regex /^abc.*/ will be optimized by matching only against the values from
the index that start with abc.
db.categories.find( { path: /,Databases,/ } );
The // delimiters enclose the search criteria. Hence, specifying /,Databases,/ means: find those documents whose path contains this string.
// Output:
{ "_id" : "MongoDB", "path" : ",Books,Programming,Databases," }
{ "_id" : "dbm", "path" : ",Books,Programming,Databases," }
Query to find the most closely related nodes, with a score, for a given node. Here we are finding the nodes closest to “Spring”.
// Use it to search the tree with score; the highest score is the best match
db.categories.find(
   { $text: { $search: pathForNodeSpring } },
   { score: { $meta: "textScore" } }          // assigns a score to each match
).sort( { score: { $meta: "textScore" } } );
Here we are using text search based on the text index created earlier. “textScore” returns the score associated with the corresponding $text query for each matching document. The text score signifies how well the document matched the search term or terms.
// Output:
The highest score (2.5) indicates the siblings closest to the node “Spring”: here they are Functional, Spring and MicroServices. This is helpful when a reader doesn’t find a book on “Spring”, since the application can still refer to related books available in the same category.
Cons:
Modifications made to a node's path must be reflected in all of its descendant/child nodes.
The nested set model numbers the nodes according to a tree traversal that visits each node twice, assigning numbers in the order of visiting, at both visits. This leaves two numbers for each node, which are stored as two attributes.
To generate the nested set representation of a tree the tree is traversed in depth-first order (dotted
line). Depth-first search (DFS) is an algorithm for traversing or searching tree or graph data structures.
The algorithm starts at the root node (selecting some arbitrary node as the root node in the case of a
graph) and explores as far as possible along each branch before backtracking.
The set of nodes in a given subtree corresponds to those nodes whose left and right visitation numbers fall within the range of the numbers assigned to the root of the subtree. For example, for root E all subtree nodes must have left and right visitation numbers within 4 and 11. The nodes B, C and D satisfy this criterion, hence they belong to the subtree of E.
Pattern :
The Nested Sets pattern identifies each node in the tree as stops in a round-trip traversal of the
tree.
The application visits each node in the tree twice; first during the initial trip, and second during the
return trip. The Nested Sets pattern stores each tree node in a document; in addition to the tree
node, document stores the id of node’s parent, the node’s initial stop in the left field, and its return
stop in the right field.
The "Books" category, with the highest position in the hierarchy, encompasses all subordinating
categories. It is therefore given left and right domain values of 1 and 12, the latter value is the double
of the total number of nodes being represented.
db.categories.insertMany( [
   { _id: "Books", parent: 0, left: 1, right: 12 },
   { _id: "Programming", parent: "Books", left: 2, right: 11 },
   { _id: "Languages", parent: "Programming", left: 3, right: 4 },
   { _id: "Databases", parent: "Programming", left: 5, right: 10 },
   { _id: "MongoDB", parent: "Databases", left: 6, right: 7 },
   { _id: "dbm", parent: "Databases", left: 8, right: 9 }
] )
All the nodes that are descendants of a parent node (Databases) must have left and right visitation numbers within 5 and 10.
We can query to retrieve the descendants of a node:
var databaseCategory = db.categories.findOne( { _id: "Databases" } );
db.categories.find( { left: { $gt: databaseCategory.left }, right: { $lt: databaseCategory.right } } );
The Nested Sets pattern provides a fast and efficient solution for finding subtrees but is inefficient for
modifying the tree structure. As such, this pattern is best for static trees that do not change.
Although MongoDB supports multi-document transactions for replica sets (starting in version 4.0) and
sharded clusters (starting in version 4.2), for many scenarios, the denormalized data model will
continue to be optimal for your data and use cases.
For example, consider a situation where you need to maintain information on books, including the
number of copies available for checkout as well as the current checkout information. The available
copies of the book and the checkout information should be in sync. As such, embedding
the available field and the checkout field within the same document ensures that you can update the
two fields atomically.
{
_id: 123456789,
title: "MongoDB: The Definitive Guide",
author: [ "Kristina Chodorow", "Mike Dirolf" ],
published_date: ISODate("2010-09-24"),
pages: 216,
language: "English",
publisher_id: "oreilly",
available: 3,
checkout: [ { by: "joe", date: ISODate("2012-10-15") } ]
}
Then to update with new checkout information, you can use the db.collection.updateOne() method to
atomically update both the available field and the checkout field:
db.books.updateOne (
{ _id: 123456789, available: { $gt: 0 } },
{
$inc: { available: -1 },
$push: { checkout: { by: "abc", date: new Date() } }
}
)
The operation returns a document that contains information on the status of the operation:
{ "acknowledged" : true, "matchedCount" : 1, "modifiedCount" : 1 }
The matchedCount field shows that 1 document matched the update condition,
and modifiedCount shows that the operation updated 1 document. If no document matched the
update condition, then matchedCount and modifiedCount would be 0 and would indicate that you
could not check out the book.
If your application needs to perform queries on the content of a field that holds text you can perform
exact matches on the text or use $regex to use regular expression pattern matches. However, for many
operations on text, these methods do not satisfy application requirements.
Pattern
To add structures to your document to support keyword-based queries, create an array field in your
documents and add the keywords as strings in the array. You can then create a multi-key index on the
array and create queries that select values from the array.
EXAMPLE
Suppose you have a collection of library volumes for which you want to provide topic-based search. For each volume, you add an array topics and include as many keywords as needed for a given volume.
For the Moby-Dick volume you might have the following document:
{ title : "Moby-Dick" ,
author : "Herman Melville" ,
published : 1851 ,
ISBN : 0451526996 ,
topics : [ "whaling" , "allegory" , "revenge" , "American" ,
"novel" , "nautical" , "voyage" , "Cape Cod" ]
}
You then create a multi-key index on the topics array:
db.volumes.createIndex( { topics: 1 } )
The multi-key index creates separate index entries for each keyword in the topics array. For example
the index contains one entry for whaling and another for allegory.
You then query based on the keywords. For example:
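One possible query (a sketch using a keyword from the sample document above):

db.volumes.findOne( { topics: "voyage" }, { title: 1 } )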
NOTE : An array with a large number of elements, such as one with several hundreds or thousands of
keywords will incur greater indexing costs on insertion.
Database schemas occasionally need to be updated. For example, a schema designed to hold user
contact information may need to be updated to include new methods of communication as they
become popular, such as Twitter or Skype.
Those who have worked with an RDBMS understand the challenge of introducing changes to the underlying database schema of any application, especially after it has been deployed to a production environment. Typically that involves stopping the application, running the migrations, waiting for the changes to kick in and then restarting the application. There is no way to avoid this downtime. However, with NoSQL databases like MongoDB this problem can be avoided because the document model is very flexible. There can be various approaches to making schema changes. Here we will discuss the Schema Versioning pattern.
You can use MongoDB’s flexible schema model, which supports differently shaped documents in the
same collection, to gradually update your collection’s schema. As you update your schema model, the
Schema Versioning pattern allows you to track these updates with version numbers.
Your application code can use version numbers to identify and handle differently shaped documents
without downtime.
To implement the Schema Versioning pattern, add a schema_version (or similarly named) field to your
schema the first time that you modify your schema. Documents that use the new schema should have
a schema_version of 2 to indicate that they adhere to the second iteration of your schema. If you
update your schema again, increment the schema_version.
Your application code can use a document’s schema_version, or lack thereof, to conditionally handle
documents. Use the latest schema to store new information in the database.
Example
The following example iterates upon the schema for documents in the users collection.
In the first iteration of this schema, a record includes galactic_id, name, and phone fields:
// users collection
{
"_id": "<ObjectId>",
"galactic_id": 123,
"name": "Anakin Skywalker",
"phone": "503-555-0000",
}
In the next iteration, the schema is updated to include more information in a different shape:
// users collection
{
  "_id": "<ObjectId>",
  "galactic_id": 123,
  "name": "Darth Vader",
  "contact_method": {
    "work": "503-555-0210",
    "home": "503-555-0220"
  },
  "schema_version": 2
}
Adding a schema_version means that an application can identify documents shaped for the new schema and handle them accordingly. The application can still handle old documents if schema_version does not exist on the document.
After the document is returned from the database, the application checks to see whether the
document has a schema_version field.
If it does not have a schema_version field, the application passes the returned document to a
dedicated function that renders the phone field from the original schema.
If it does have a schema_version field, the application checks the schema version. In this example,
the schema_version is 2 and the application passes the returned document to a dedicated function
that renders the new contact_method.work and contact_method.home fields.
Using the schema_version field, application code can support any number of schema iterations in the
same collection by adding dedicated handler functions to the code.
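A hedged sketch of such application-side handling (the function name and return shape are illustrative):

// Hypothetical dispatcher: render contact details for either schema iteration
function renderContact(userDoc) {
    if (!userDoc.schema_version) {
        return { phone: userDoc.phone };                        // original schema
    }
    if (userDoc.schema_version === 2) {
        return { work: userDoc.contact_method.work,             // second iteration
                 home: userDoc.contact_method.home };
    }
    throw new Error("Unhandled schema_version: " + userDoc.schema_version);
}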
Use Cases
The Schema Versioning pattern is ideal for any one or a combination of the following cases:
Application downtime is not an option
Updating documents may take hours, days, or weeks of time to complete
Updating documents to the new schema version is not a requirement
The Schema Versioning pattern helps you better decide when and how data migrations will take place
relative to traditional, tabular databases.
Simple e-commerce systems are a good starting point for data modeling with document databases like
MongoDB. These examples easily demonstrate core concepts of application development with
MongoDB and contain several patterns that you can reuse in other problem domains. MongoDB lets you organize your data in "BSON documents," which you can think of as "typed JSON" documents.
The first step is to design the schema for the website. Consider an initial product schema:
{
  sku: "111445GB3",
  title: "Simsong One mobile phone",
  description: "The greatest Onedroid phone on the market .....",
  manufacture_details: {
    model_number: "A123X",
    release_date: new ISODate("2012-05-17T08:14:15.656Z")
  },
  ...
}
mongo
use ecommerce
db.products.insert({
  sku: "111445GB3",
  title: "Simsong One mobile phone",
  description: "The greatest Onedroid phone on the market .....",
  manufacture_details: {
    model_number: "A123X",
    release_date: new ISODate("2012-05-17T08:14:15.656Z")
  },
  shipping_details: {
    weight: 350,
    width: 10,
    height: 10,
    depth: 1
  },
  quantity: 99,
  pricing: {
    price: 1000
  }
})
The first command (mongo) starts the MongoDB shell and connects to the local MongoDB instance on localhost and port 27017. The next chooses the ecommerce database (use ecommerce) and the third inserts the product document into the products collection. Going forward, all commands assume you are in the MongoDB shell using the ecommerce database.
The products data model has a unique sku that identifies the product, a title, a description, a stock quantity, and pricing information about the item.
All products have categories. In the case of the Simsong One it's a 15G phone and also has a FM
receiver. As a result, this product falls into both the mobile/15G and the radio/fm categories. Add the
categories to the existing document, with the following update() operation:
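A sketch of that update (the exact category strings are illustrative; the later query anchored on ^mobile\/fm assumes the product carries a mobile/fm category):

db.products.update(
   { sku: "111445GB3" },
   { $set: { categories: [ "mobile/15G", "mobile/fm" ] } }
)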
To support efficient queries using the categories field, add an index on the categories field for
the products collection:
db.products.ensureIndex({categories:1 })
This returns all the products for a specific category using the index and an anchored regular expression. As long as the regular expression is case-sensitive and anchored, MongoDB will use the index to answer the query. For example, fetch all the products in the category that begins with mobile/fm:
db.products.find({categories: /^mobile\/fm/})
To be able to provide a list of all the products in a category, amend the data model with a collection of
documents for each category. In this collection, each document represents a category and contains the
path for that category in category tree. These documents would resemble the following.
{
title: "Mobiles containing a FM radio",
parent: "mobile",
path: "mobile/fm"
}
Insert the document into the categories collection and add indexes to this collection:
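A sketch of that insert and the supporting indexes (the choice of indexes on path and parent is an assumption):

db.categories.insert({
   title: "Mobiles containing a FM radio",
   parent: "mobile",
   path: "mobile/fm"
})
db.categories.ensureIndex({ path: 1 })
db.categories.ensureIndex({ parent: 1 })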
There are two paths in each category: this allows the application to use the same method to find all
categories for a specific category root as used for finding products by category. For example, to return
all sub-categories of the category "mobile", use the following query:
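A sketch of that query, using an anchored regular expression on path, the same technique as the product query above:

db.categories.find({ path: /^mobile/ })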
Using these path values, the application can use this method to access the category tree and extract
more sub-categories with a single index supported query. Furthermore, the application can pull all the
documents for a specific category using this path value.
The Cart
A cart in an e-commerce system allows users to reserve items from the inventory and keep them until they check out and pay for the items. The application must ensure that at any point in time there are not more items in carts than there are in stock, and that if the user abandons the cart, the application returns the items from the cart to the inventory without losing track of any objects. Take the following document, which models the cart:
{
  _id: "the_users_session_id",
  status: 'active',
  quantity: 2,
  total: 2000,
  products: []
}
The products array contains the list of products the customer intends to purchase. Use the
following insert() operation to create the cart:
db.carts.insert({
  _id: "the_users_session_id",
  status: 'active',
  quantity: 2,
  total: 2000,
  products: []
})
In addition to simply adding objects to carts, there are a number of cart related operations that the
application must be able to support:
users may add or remove objects from the cart.
users may abandon a cart and the application must return items in the cart to inventory.
The next sequence of operations allows the application to ensure that carts are up to date and that the application has enough inventory to cover them. Update the cart with the new quantity, using the following update() operation:
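A hedged sketch of that cart update (new_quantity is a hypothetical variable holding the desired item count, consistent with its use in the inventory update below):

db.carts.update(
   { _id: "the_users_session_id", status: 'active', "products.sku": "111445GB3" },
   { $set: { "products.$.quantity": new_quantity } }
)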
Now, remove the additional item from the inventory and update the number of items reserved in the shopping cart:
db.products.update({
sku: "111445GB3",
"in_carts.id": "the_users_session_id",
quantity: {
$gte: 1
}
}, {
$inc: { quantity: (-1)*quantity_delta },
$set: {
"in_carts.$.quantity": new_quantity, timestamp: new ISODate()
}
})
Ensure the application has enough inventory for the operation. If there is not sufficient inventory, the
application must rollback the last operation. The following operation checks for errors
using getLastError and rolls back the operation if it returns an error:
if(!db.runCommand({getLastError:1}).updatedExisting) {
db.carts.update({
_id: "the_users_session_id", "products.sku": "111445GB3"
}, {
$set : { "in_carts.$.quantity": old_quantity}
})
}
If a user abandons the purchase process or the shopping cart grows stale and times out, the
application must return cart content to the inventory. This operation requires a loop that finds all
expired or canceled carts and then returns the content of each cart to the inventory. Begin by finding
all sufficiently "stale" carts, and use an operation that resembles the following:
// 'timestamp' and expiration_threshold are assumed bookkeeping values tracking cart age
var staleCarts = db.carts.find({ status: 'active', timestamp: { $lt: expiration_threshold } });
staleCarts.forEach(function(cart) {
  cart.products.forEach(function(item) {
    db.products.update(
      { sku: item.sku, "in_carts.id": cart._id },
      { $inc: { quantity: item.quantity }, $pull: { in_carts: { id: cart._id } } }
    );
  });
  db.carts.update({ _id: cart._id }, { $set: { status: 'expired' } });
});
This operation walks all products in each cart, returns them to the inventory, and removes the cart identifiers from the in_carts array in the product documents. Once the application has returned all of the items to the inventory, the application sets the cart's status to expired.
Checkout
When the user clicks the "confirm" button in the checkout portion of the application, the application
creates an "order" document that reflects the entire order. Consider the following operation:
db.orders.insert({
created_on: new ISODate("2012-05-17T08:14:15.656Z"),
shipping: {
customer: "Peter P Peterson",
address: "Longroad 1343",
city: "Peterburg",
region: "",
state: "PE",
country: "Peteonia",
delivery_notes: "Leave at the gate",
tracking: {
company: "ups",
tracking_number: "22122X211SD",
status: "ontruck",
estimated_delivery: new ISODate("2012-05-17T08:14:15.656Z")
},
},
payment: {
  // payment details elided
},
products: [
  { quantity: 2, sku: "111445GB3", title: "Simsong mobile phone", unit_cost: 1000, currency: "USDA" }
]
})
For a relational database you might need to model this as a set of tables: for orders, shipping, tracking, and payment. Using MongoDB, one can create a single document that is self-contained, easy to understand, and maps simply into an object-oriented application. After inserting this document the
application must ensure inventory is up to date before completing the checkout. Begin by setting the
cart as finished, with the following operation:
db.carts.update({
_id: "the_users_session_id"
}, {
$set: {status:"complete"}
});
Use the following operation to remove the cart identifier from all product records:
db.products.update({
"in_carts.id": "the_users_session_id"
}, {
$pull: {in_carts: {id: "the_users_session_id"}}
}, false, true);
By using "multi-update," which is the last argument in the update() method, this operation will update
all matching documents in one set of operations.
Relationships represent the way documents are related to each other. They can be modeled through
Embedded and Referenced approaches. The relationship types can be One to One( 1:1), One to Many(
1:N), Many to One (N:1) and Many to Many (N:N).
We will start looking at Person document. The sample data for the person document is shown below:
Person Document
{
  "_id": ObjectId("52eecd85242f436000001"),
  "person": "Tom Hanks",
  "id": "987654321",
  "ssn": "345982341",
  ...
}
One to Many relationship between person and address is shown below using the embedded and
reference approach.
Embedded approach : Person to Address (One to Many)
{
  "_id": ObjectId("52ffc33cd85242f436000001"),
  "person": "Tom Hanks",
  "id": "987654321",
  "ssn": "345982341",
  "gender": "male",
  ...
}
Reference approach : Person to Address – References
Person Document
{
  "_id": ObjectId("52ffc33cd85242f436000001"),
  "person": "Tom Hanks",
  "id": "987654321",
  "ssn": "345982341",
  ...
}
A group document can have many to many relationship with person document. A sample group
document is shown below:
Group Document
{
"_id":ObjectId("22avxd85242f436000001"),
"group": "Group1",
"type": "Engineers"
}
Many to Many relationship between Person and Group is shown using embedded approach.
Person to Group using embedded
{
"_id":ObjectId("52ffc33cd85242f436000001"),
"person": "Tom Hanks",
"id": "987654321",
"ssn": "345982341",
"gender": "male"
"groups": [
{
"_id":ObjectId("22avxd85242f436000001"),
"group": "Group1",
"type": "Engineers"
      },
      ...
   ]
}
The Many to Many relationship between Person and Group using the reference approach keeps the documents separate:
Person Document
{
"_id":ObjectId("52ffc33cd85242f436000001"),
"person": "Tom Hanks",
"id": "987654321",
"ssn": "345982341",
"gender": "male"
}
Group Document 1
{
"_id":ObjectId("22avxd85242f436000001"),
"group": "Group1",
"type": "Engineers"
}
Group Document 2
{
"_id":ObjectId("35kfsd85242f436000001"),
"group": "Group2",
"type": "Managers"
}
Manager to a person can be a parent to child relationship. The relationship is shown using the
embedded approach.
Manager to Person using Embedded
{
"_id":ObjectId("52ffc33cd85242f436000001"),
"manager": "John Smith",
"id": "987652321",
"ssn": "245982341",
"gender": "male",
"persons":[
      { ... }
   ]
}
There are several approaches to modeling monetary data in MongoDB using the numeric and non-
numeric models.
Numeric Model : The numeric model may be appropriate if you need to query the database for exact,
mathematically valid matches or need to perform server-side arithmetic, e.g., $inc, $mul,
and aggregation framework arithmetic.
Using the Decimal BSON Type which is a decimal-based floating-point format capable of providing
exact precision. Available in MongoDB version 3.4 and later.
Using a Scale Factor to convert the monetary value to a 64-bit integer (long BSON type) by
multiplying by a power of 10 scale factor.
Using two fields for the monetary value: One field stores the exact monetary value as a non-
numeric string and another field stores a binary-based floating-point (double BSON type)
approximation of the value.
1. Numeric Model
The decimal BSON type uses the IEEE 754 decimal128 decimal-based floating-point numbering format.
Unlike binary-based floating-point formats (i.e., the double BSON type), decimal128 does not
approximate decimal values and is able to provide the exact precision required for working with
monetary data.
From the mongo shell, decimal values are assigned and queried using the NumberDecimal() constructor. The following example adds a document containing gas prices to a gasprices collection:
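A sketch of such an insert (the field values are illustrative):

db.gasprices.insert( { "_id": 1, "date": new Date(), "price": NumberDecimal("2.099"), "station": "Quikstop", "grade": "regular" } )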
A collection’s values can be transformed to the decimal type by performing a one-time transformation
or by modifying application logic to perform the transformation as it accesses records.
One-Time Collection Transformation : A collection can be transformed by iterating over all documents
in the collection, converting the monetary value to the decimal type, and writing the document back to
the collection.
NOTE : It is strongly advised to add the decimal value to the document as a new field and remove the
old field later once the new field’s values have been verified.
WARNING
Be sure to test decimal conversions in an isolated test environment. Once datafiles are created or
modified with MongoDB version 3.4 they will no longer be compatible with previous versions and there
is no support for downgrading datafiles containing decimals.
Consider the following collection which used the Scale Factor approach and saved the monetary value
as a 64-bit integer representing the number of cents:
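For illustration, the documents might look like the following before the transformation (the cent values correspond to the decimal prices verified later in this section):

{ "_id" : 1, "description" : "T-Shirt", "size" : "M", "price" : NumberLong("1999") }
{ "_id" : 2, "description" : "Jeans", "size" : "36", "price" : NumberLong("3999") }
{ "_id" : 3, "description" : "Shorts", "size" : "32", "price" : NumberLong("2999") }
...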
db.clothes.aggregate(
[
{ $match: { price: { $type: "long" }, priceDec: { $exists: 0 } } },
{
$addFields: {
priceDec: { $multiply: [ "$price", NumberDecimal( "0.01" ) ] }
}
}
]
).forEach( ( function( doc ) {
db.clothes.save( doc );
}))
The results of the aggregation pipeline can be verified using the db.clothes.find() query:
If you do not want to add a new field with the decimal value, the original field can be overwritten. The
following update() method first checks that price exists and that it is a long, then transforms
the long value to decimal and stores it in the price field:
db.clothes.update(
{ price: { $type: "long" } },
{ $mul: { price: NumberDecimal( "0.01" ) } },
{ multi: 1 }
)
The results can be verified using the db.clothes.find() query:
{ "_id" : 1, "description" : "T-Shirt", "size" : "M", "price" : NumberDecimal("19.99") }
{ "_id" : 2, "description" : "Jeans", "size" : "36", "price" : NumberDecimal("39.99") }
{ "_id" : 3, "description" : "Shorts", "size" : "32", "price" : NumberDecimal("29.99") }
{ "_id" : 4, "description" : "Cool T-Shirt", "size" : "L", "price" : NumberDecimal("24.95") }
{ "_id" : 5, "description" : "Designer Jeans", "size" : "30", "price" : NumberDecimal("80.00") }
Non-Numeric Transformation:
Consider the following collection which used the non-numeric model and saved the monetary value as
a string with the exact representation of the value:
{ "_id" : 1, "description" : "T-Shirt", "size" : "M", "price" : "19.99" }
{ "_id" : 2, "description" : "Jeans", "size" : "36", "price" : "39.99" }
{ "_id" : 3, "description" : "Shorts", "size" : "32", "price" : "29.99" }
{ "_id" : 4, "description" : "Cool T-Shirt", "size" : "L", "price" : "24.95" }
{ "_id" : 5, "description" : "Designer Jeans", "size" : "30", "price" : "80.00" }
The following function first checks that price exists and that it is a string, then transforms
the string value to a decimal value and stores it in the priceDec field:
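A sketch of that transformation, modeled on the long-to-decimal example above (assuming every price string is a valid decimal string):

db.clothes.find( { price: { $type: "string" }, priceDec: { $exists: false } } ).forEach( function( doc ) {
    doc.priceDec = NumberDecimal( doc.price );   // convert the exact string to a decimal value
    db.clothes.save( doc );
} )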
2. Non-Numeric Model
To model monetary data using the non-numeric model, store the value in two fields:
1. In one field, encode the exact monetary value as a non-numeric data type; e.g., BinData or
a string.
2. In the second field, store a double-precision floating point approximation of the exact value.
The following example uses the non-numeric model to store 9.99 USD for the price and 0.25 USD for
the fee:
{
price: { display: "9.99", approx: 9.9900000000000002, currency: "USD" },
fee: { display: "0.25", approx: 0.2499999999999999, currency: "USD" }
}
With some care, applications can perform range and sort queries on the field with the numeric
approximation. However, the use of the approximation field for the query and sort operations requires
that applications perform client-side post-processing to decode the non-numeric representation of the
exact value and then filter out the returned documents based on the exact monetary value.
Example
In the MongoDB shell, you can store both the current date and the current client’s offset from UTC.
You can reconstruct the original local time by applying the saved offset:
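The saving side of that example might look like this (the data collection name matches the reconstruction code below; getTimezoneOffset() returns minutes, hence the * 60000 when reconstructing):

var now = new Date();
db.data.save( { date: now, offset: now.getTimezoneOffset() } );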
var record = db.data.findOne();
var localNow = new Date( record.date.getTime() - ( record.offset * 60000 ) );
A common method to organize time-series data is to group the data into buckets where each bucket
represents a uniform unit of time such as a day or year. Bucketing organizes specific groups of data to
help:
Discover historical trends,
Forecast future trends, and
Optimize storage usage.
Consider a collection that stores temperature data obtained from a sensor. The sensor records the
temperature every minute and stores the data in a collection called temperatures:
// temperatures collection
{
"_id": 1, "sensor_id": 12345,
"timestamp": ISODate("2019-01-31T10:00:00.000Z"), "temperature": 40
}
{
"_id": 2, "sensor_id": 12345,
"timestamp": ISODate("2019-01-31T10:01:00.000Z"), "temperature": 40
}
{
"_id": 3, "sensor_id": 12345,
"timestamp": ISODate("2019-01-31T10:02:00.000Z"), "temperature": 41
}
...
This approach does not scale well in terms of data and index size. For example, if the application
requires indexes on the sensor_id and timestamp fields, every incoming reading from the sensor would
need to be indexed to improve performance.
You can leverage the document model to bucket the data into documents that hold the measurements
for a particular timespan. Consider the following updated schema which buckets the readings taken
every minute into hour-long groups:
{
  "_id": 1, "sensor_id": 12345,
  "start_date": ISODate("2019-01-31T10:00:00.000Z"),
  "end_date": ISODate("2019-01-31T10:59:59.000Z"),
  "measurements": [
    { "timestamp": ISODate("2019-01-31T10:00:00.000Z"), "temperature": 40 },
    { "timestamp": ISODate("2019-01-31T10:01:00.000Z"), "temperature": 40 },
    { "timestamp": ISODate("2019-01-31T10:02:00.000Z"), "temperature": 41 },
    ...
  ],
  "transaction_count": 42,
  "sum_temperature": 2413
}
This updated schema improves scalability and mirrors how the application actually uses the data. A
user likely wouldn’t query for a specific temperature reading. Instead, a user would likely query for
temperature behavior over the course of an hour or day. The Bucket pattern helps facilitate those
queries by grouping the data into uniform time periods.
The example document contains two computed fields: transaction_count and sum_temperature. If the
application frequently needs to retrieve the sum of temperatures for a given hour, computing a
running total of the sum can help save application resources. This Computed Pattern approach
eliminates the need to calculate the sum each time the data is requested.
The pre-aggregated sum_temperature and transaction_count values enable further computations such
as the average temperature (sum_temperature / transaction_count) for a particular bucket. It is much
more likely that users will query the application for the average temperature between 2:00 and 3:00
PM rather than querying for the specific temperature at 2:03 PM. Bucketing and pre-computing certain
values allows the application to more readily provide that information.
In addition to time-series data, the Bucket pattern is useful for Internet of Things projects where you
have multiple datasets coming from many different sources. It can be helpful to bucket that data into
groups (e.g. based on device type or system) to more easily retrieve and parse the data.
The Bucket pattern is also commonly used in financial applications to group transactions by type, date,
or customer.
If your reads significantly outnumber your writes, the computed pattern reduces the frequency of
having to perform computations. Instead of attaching the burden of computation to every read, the
application stores the computed value and recalculates it as needed. The application can either
recompute the value with every write that changes the computed value’s source data, or as part of a
periodic job.
NOTE : With periodic updates, the computed value is not guaranteed to be exact in any given read.
However, this approach may be worth the performance boost if exact accuracy isn’t a requirement.
Example
An application displays movie viewer and revenue information. Users often want to know how many people saw a certain movie and how much money that movie made. Consider the following screenings collection:
// screenings collection
{
  "theater": "Alger Cinema", "location": "Lakeview, OR",
  "movie_title": "Reservoir Dogs",
  "num_viewers": 344, "revenue": 3440
}
{
  "theater": "City Cinema", "location": "New York, NY",
  "movie_title": "Reservoir Dogs",
  "num_viewers": 1496, "revenue": 22440
}
{
  "theater": "Overland Park Cinema", "location": "Boise, ID",
  "movie_title": "Reservoir Dogs",
  "num_viewers": 760, "revenue": 7600
}
In this example, to total num_viewers and revenue, you must perform a read for theaters that screened a movie with the title “Reservoir Dogs” and sum the values of those fields.
To avoid performing that computation every time the information is requested, you can compute the
total values and store them in a movies collection with the movie record itself:
// movies collection
{
"title": "Reservoir Dogs",
"total_viewers": 2600, "total_revenue": 33480,
...
}
In a low-write environment, the computation could be done in conjunction with any update of the screenings data. In an environment with more regular writes, the computations could be done at defined intervals, every hour for example. The source data in screenings isn’t affected by writes to the movies collection, so you can run calculations at any time.
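One way to produce those totals (a sketch using the aggregation framework; writing the result back into the movies collection is omitted here):

db.screenings.aggregate([
   { $match: { movie_title: "Reservoir Dogs" } },
   { $group: { _id: "$movie_title",
               total_viewers: { $sum: "$num_viewers" },
               total_revenue: { $sum: "$revenue" } } }
])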
In addition to cases where summing is requested frequently, such as getting total revenue or viewers
in the movie database example, the computed pattern is a good fit wherever calculations need to be
run against data. For example:
A car company that runs massive aggregation queries on vehicle data, storing results to show for
the next few hours until the data is recomputed.
A consumer reporting company that compiles data from several different sources to create rank-
ordered lists like the “100 Best-Reviewed Gadgets”. The lists can be regenerated periodically while
the underlying data is updated independently.
MongoDB has a feature called schema validation that allows you to apply constraints on your
documents’ structure. Schema validation is built around JSON Schema, an open standard for JSON
document structure description and validation.
In the above example we added the document without any special validation on fields like name, height, etc. If documents are valid JSON, that is enough to insert them into the collection. However, this isn’t enough to keep the database logically consistent and meaningful, so we should build schema validation rules to make sure the documents in the peaks collection follow a few essential requirements.
To assign a JSON Schema validator document to the peaks collection we run the following command:
db.runCommand({
  "collMod": "collection_name",
  "validator": {
    $jsonSchema: {JSON_Schema_document}
  }
})
The runCommand method executes the collMod command, which modifies the specified collection by applying the validator attribute to it. The validator attribute is responsible for schema validation.
validator attribute accepts $jsonSchema operator : This operator defines a JSON Schema document
which will be used as the schema validator for the given collection.
Warning: In order to execute the collMod command, your MongoDB user must be granted the
appropriate privileges.
We can also assign a JSON Schema validator when you create a collection. To do so use following
syntax:
db.createCollection(
  "collection_name", {
    "validator": {
      $jsonSchema: {JSON_Schema_document}
    }
})
Here collection_name is the name of the collection to which we want to assign the validator document, and the validator option assigns a specified JSON Schema document as the collection’s validator.
Applying a JSON Schema validator from the start like this means every document we add to the
collection must satisfy the requirements set by the validator.
When we add validation rules to an existing collection, though, the new rules won’t affect existing
documents until you try to modify them.
The JSON schema document you pass to the validator attribute should outline every validation rule
we want to apply to the collection.
The bsonType property describes the data type that the validation engine will expect to find. For
the database document itself, the expected type is object. This means that you can only add
objects — in other words, complete, valid JSON documents surrounded by curly braces ({ and }) —
to this collection. If you were to try to insert some other kind of data type (like a standalone string,
integer, or an array), it would cause an error.
In MongoDB, every document is an object. However, JSON Schema is a standard used to describe
and validate all kinds of valid JSON documents, and a plain array or a string is valid JSON, too. When
working with MongoDB schema validation, you’ll find that you must always set the root
document’s bsonType value as object in the JSON Schema validator.
Next, the description property (optional) provides a short description of the documents found in
this collection.
The next property in the validation document is the required field. The required field can only
accept an array containing a list of document fields that must be present in every document in the
collection. In this example, ["name"] means that the documents only have to contain
the name field to be considered valid.
Following that is a properties object that describes the rules used to validate document fields. For
each field that you want to define rules for, include an embedded JSON Schema document named
after the field. Be aware that you can define schema rules for fields that aren’t listed in
the required array. This can be useful in cases where your data has fields that aren’t required, but
you’d still like for them to follow certain rules when they are present.
These embedded schema documents will follow a similar syntax as the main document. To apply this
JSON Schema to the peaks collection you created in the previous step, run the
following runCommand() method:
db.runCommand({
  "collMod": "peaks",
  "validator": {
    $jsonSchema: {
      "bsonType": "object",
      "description": "Document describing a mountain peak",
      "required": ["name"],
      "properties": {
        "name": {
          "bsonType": "string",
          "description": "Name must be a string and is required"
        }
      }
    }
  }
})
MongoDB will respond with a success message indicating that the collection was successfully modified:
Output
{ "ok" : 1 }
Following that, MongoDB will no longer allow us to insert documents into the peaks collection if they
don’t have a name field. To test this, try inserting the document you inserted in the previous step that
fully describes a mountain, aside from missing the name field:
db.peaks.insertOne({
  "height": 8611,
  "location": ["Pakistan", "China"],
  "ascents": {
    "first": { "year": 1954 },
    "first_winter": { "year": 1921 },
    "total": 306
  }
})
This time, the operation will trigger an error message indicating a failed document validation:
Output
WriteError({
  "index" : 0,
  "code" : 121,
  "errmsg" : "Document failed validation",
  ...
})
Note: Starting with MongoDB 5.0, when validation fails the error messages point towards the failed
constraint. In MongoDB 4.4 and earlier, the database provides no further details on the failure reason.
When no validation is applied, MongoDB will accept any value for this field, even values that don’t make any sense for it, like negative values, as long as the inserted document is written in valid JSON syntax. To work around this, you can extend the schema validation document from the previous step to include additional rules regarding the height field.
Start by ensuring that the height field is always present in newly-inserted documents and that it’s
always expressed as a number. Modify the schema validation with the following command:
db.runCommand({
  "collMod": "peaks",
  "validator": {
    $jsonSchema: {
      "bsonType": "object",
      "description": "Document describing a mountain peak",
      "required": ["name", "height"],
      "properties": {
        "name": { "bsonType": "string", "description": "Name must be a string and is required" },
        "height": { "bsonType": "number", "description": "Height must be a number and is required" }
      }
    }
  }
})
To prevent values that don’t make sense for the height field, such as negative numbers, you could add a few more properties to the schema validation document. Replace the current schema validation settings by running the following operation:
db.runCommand({
"collMod": "peaks",
"validator": {
$jsonSchema: {
"bsonType": "object",
"description": "Document describing a mountain peak",
"required": ["name", "height"],
"properties": {
"name": { "bsonType": "string", "description": "Name must be a string and is required" },
"height": { "bsonType": "number", "description": "Height must be a number between 100
and 10000 and is required", "minimum": 100, "maximum": 10000 }
},
    }
  }
})
The minimum and maximum attributes set constraints on values included in height fields, ensuring they can’t be lower than 100 or higher than 10000.
Now we can turn our attention to the location field to guarantee its data consistency.
As peaks span more than one country, it would make sense to store each peak’s location data as an array containing one or more country names instead of just a string value. As with the height values,
making sure each location field’s data type is consistent across every document can help with
summarizing data when using aggregation pipelines.
First, consider some examples of location values that users might enter, and weigh which ones would
be valid or invalid:
["Nepal", "China"]: this is a two-element array, and would be a valid value for a mountain spanning
two countries.
To ensure that MongoDB will correctly interpret each of these examples as valid or invalid, run the
following operation to create some new validation rules for the peaks collection:
db.runCommand({
"collMod": "peaks",
"validator": {
$jsonSchema: {
"bsonType": "object",
"description": "Document describing a mountain peak",
"required": ["name", "height", "location"],
"properties": {
"name": { "bsonType": "string", "description": "Name must be a string and is required" },
"height": {
"bsonType": "number",
"description": "Height must be a number between 100 and 10000 and is required",
"minimum": 100, "maximum": 10000
},
"location": {
"bsonType": "array",
"description": "Location must be an array of strings",
"minItems": 1,
"uniqueItems": true,
"items": { "bsonType": "string" }
}
},
}
}
})
The minItems property validates that the array must contain at least one element, and
the uniqueItems property is set to true to ensure that elements within each location array will be
unique. This will prevent values like ["Nepal", "Nepal"] from being accepted. Lastly,
the items subdocument defines the validation schema for each individual array item. Here, the only
expectation is that every item within a location array must be a string.
Step 5 — Validating Embedded Documents
At this point, your peaks collection has three fields — name, height and location — that are being kept
in check by schema validation. This step focuses on defining validation rules for the ascents field, which
describes successful attempts at summiting each peak.
In the example document from Step 1 that represents Mount Everest, the ascents field was structured
as follows:
{
  "name": "Everest",
  "height": 8848,
  "location": ["Nepal", "China"],
  "ascents": {
    "first": {
      "year": 1953
    },
    "first_winter": {
      "year": 1980
    },
    "total": 5656
  }
}
The ascents subdocument contains a total field whose value represents the total number of ascent attempts for the given mountain. It also contains information on the first winter ascent of the mountain as well as the first ascent overall.
For now, assume the information that you will always want to have in each document is the total number of ascent attempts: the ascents field must always be present and its value must always be a subdocument. This subdocument, in turn, must always contain a total attribute holding a number greater than or equal to zero.
Once again, replace the schema validation document for the peaks collection by running the
following runCommand() method:
db.runCommand({
"collMod": "peaks",
"validator": {
$jsonSchema: {
"bsonType": "object",
"description": "Document describing a mountain peak",
"required": ["name", "height", "location", "ascents"],
"properties": {
"name": { "bsonType": "string", "description": "Name must be a string and is required" },
"height": {
"bsonType": "number",
"description": "Height must be a number between 100 and 10000 and is required",
"minimum": 100, "maximum": 10000 },
"location": {
"bsonType": "array",
"description": "Location must be an array of strings",
"minItems": 1,
"uniqueItems": true,
"items": { "bsonType": "string" }
MongoDB’s schema validation feature should not be considered a replacement for data validation at
the application level, but it can further safeguard against violating data constraints that are essential to
keeping your data meaningful. Using schema validation can be a helpful tool for structuring one’s data
while retaining the flexibility of a schemaless approach to data storage.
We can use the validationLevel option to determine which operations MongoDB applies the validation
rules:
If the validationLevel is strict (the default), MongoDB applies validation rules to all inserts and
updates.
If the validationLevel is moderate, MongoDB applies validation rules to inserts and to updates to
existing documents that already fulfill the validation criteria. With the moderate level, updates to
existing documents that do not fulfill the validation criteria are not checked for validity.
db.contacts.insert([
{ "_id": 1, "name": "Anne", "phone": "+1 555 123 456", "city": "London", "status": "Complete" },
{ "_id": 2, "name": "Ivan", "city": "Vancouver" }
])
Issue the following command to add a validator to the contacts collection:
db.runCommand( {
  collMod: "contacts",
  validator: { $jsonSchema: {
    bsonType: "object",
    required: [ "phone" ],
    properties: {
      phone: { bsonType: "string", description: "phone must be a string and is required" }
    }
  } },
  validationLevel: "moderate"
} )
If you attempted to update the document with _id of 1, MongoDB would apply the validation rules
since the existing document matches the criteria.
In contrast, MongoDB will not apply validation rules to updates to the document with _id of 2, as it does not already meet the validation criteria (the phone field is missing).
The validationAction option determines how MongoDB handles documents that violate the validation
rules:
If the validationAction is error (the default), MongoDB rejects any insert or update that violates
the validation criteria.
If the validationAction is warn, MongoDB logs any violations but allows the insertion or update
to proceed.
For example, create a contacts2 collection with the following JSON Schema validator:
db.createCollection( "contacts2", {
validator: { $jsonSchema: {
bsonType: "object",
required: [ "phone" ],
properties: {
phone: { bsonType: "string", description: "must be a string and is required" },
email: { bsonType : "string", pattern : "@mongodb\.com$",
description: "must be a string and match the regular expression pattern" },
status: { enum: [ "Unknown", "Incomplete" ],
description: "can only be one of the enum values" }
}
} },
validationAction: "warn"
})
With the warn validationAction, MongoDB logs any violations but allows the insertion or update to
proceed.
For example, the following insert operation violates the validation rule:
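Reconstructed from the log entry shown below, the offending insert lacks the required phone field and uses a status outside the allowed enum values:

db.contacts2.insert( { name: "Amanda", status: "Updated" } )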
However, since the validationAction is warn only, MongoDB only logs the validation violation message
and allows the operation to proceed:
2017-12-01T12:31:23.738-05:00 W STORAGE [conn1] Document would fail validation collection:
example.contacts2 doc: { _id: ObjectId('5a2191ebacbbfc2bdc4dcffc'), name: "Amanda", status:
"Updated" }
For deployments that have enabled access control, to bypass document validation, the authenticated
user must have bypassDocumentValidation action. The built-in roles dbAdmin and restore provide this
action.
Restrictions
We can't specify the following query operators in a validator object:
o $expr with $function expressions
o $near
o $nearSphere
o $text
o $where
We can't specify schema validation for:
o Collections in the admin, local, and config databases
o System collections
Let’s consider an application that tracks customer orders. The orders have a base total and a GST (tax) rate.
The orders collection contains these fields to track the total price: total, GST, totalWithGST.
We create a schema validation with query operators to ensure that totalWithGST matches the
expected combination of total and GST.
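A sketch of such a validator, matching the $expr shown in the error details below (whether it is applied at createCollection time or via collMod is an assumption):

db.createCollection( "orders", {
   validator: {
      "$expr": {
         "$eq": [
            "$totalWithGST",
            { "$multiply": [ "$total", { "$sum": [ 1, "$GST" ] } ] }
         ]
      }
   }
} )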
The following operation fails because the totalWithGST field does not equal the correct value: 141 * (1 + 0.20) equals 169.2, so the value of the totalWithGST field must be 169.2.
db.orders.insertOne( {
  total: NumberDecimal("141"),
  GST: NumberDecimal("0.20"),
  totalWithGST: NumberDecimal("169")
})
The operation returns this error:
MongoServerError: Document failed validation
Additional information: {
failingDocumentId: ObjectId("62bcc9b073c105dde9231293"),
details: {
operatorName: '$expr',
specifiedAs: {
'$expr': {
'$eq': [
'$totalWithGST', { '$multiply': [ '$total', { '$sum': [ 1, '$GST' ] } ] }
]
}
},
reason: 'expression did not match',
expressionResult: false
The following example specifies validator rules using the query expression $or:
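A sketch of such a validator (reusing the field rules from the contacts2 example above; the collection name is illustrative):

db.createCollection( "contacts3", {
   validator: { $or: [
      { phone: { $type: "string" } },
      { email: { $regex: /@mongodb\.com$/ } },
      { status: { $in: [ "Unknown", "Incomplete" ] } }
   ] }
} )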
Syntax :>db.COLLECTION_NAME.aggregate(AGGREGATE_OPERATION)
Example
In the collection you have the following data −
{
   _id: ObjectId(7df78ad8902c),
   title: 'MongoDB Overview',
   description: 'MongoDB is no sql database',
   by_user: 'tutorials point',
   url: 'http://www.tutorialspoint.com',
   tags: ['mongodb', 'database', 'NoSQL'],
   likes: 100
},
{
   _id: ObjectId(7df78ad8902d),
   title: 'NoSQL Overview',
   description: 'No sql database is very fast',
   by_user: 'tutorials point',
   url: 'http://www.tutorialspoint.com',
   tags: ['mongodb', 'database', 'NoSQL'],
   likes: 10
},
{
   _id: ObjectId(7df78ad8902e),
   title: 'Neo4j Overview',
   description: 'Neo4j is no sql database',
   by_user: 'Neo4j',
   url: 'http://www.neo4j.com',
   tags: ['neo4j', 'database', 'NoSQL'],
   likes: 750
}
Now from the above collection, if you want to display a list stating how many tutorials are written by each user, then you will use the following aggregate() method −
> db.mycol.aggregate([{$group : {_id : "$by_user", num_tutorial : {$sum : 1}}}])
{ "_id" : "tutorials point", "num_tutorial" : 2 }
{ "_id" : "Neo4j", "num_tutorial" : 1 }
>
In this example, we have grouped the documents by the field by_user, and on each occurrence of by_user the previous value of the sum is incremented. The SQL equivalent query for the above use case would be: select by_user, count(*) from mycol group by by_user.
Pipeline Concept
In a UNIX shell, a pipeline means the ability to execute an operation on some input and use the output as the input for the next command, and so on. MongoDB supports the same concept in its aggregation framework.
There is a set of possible stages, and each of those takes a set of documents as input and produces a resulting set of documents (or the final resulting JSON document at the end of the pipeline). This can then in turn be used as input for the next stage, and so on.
Aggregation Pipeline
The most basic pipeline stages provide filters that operate like queries and document
transformations that modify the form of the output document.
Other pipeline operations provide tools for grouping and sorting documents by specific field or
fields as well as tools for aggregating the contents of arrays, including arrays of documents. In
addition, pipeline stages can use operators for tasks such as calculating the average or
concatenating a string.
The pipeline provides efficient data aggregation using native operations within MongoDB, and is
the preferred method for data aggregation in MongoDB.
The aggregation pipeline can operate on a sharded collection.
Map-Reduce
Aggregation pipeline provides better performance and a more coherent interface than map-
reduce. For examples of aggregation alternatives to map-reduce operations, see Map-Reduce
Examples. See also Map-Reduce to Aggregation Pipeline.
MongoDB also provides map-reduce operations to perform aggregation. Map-reduce uses custom
JavaScript functions to perform the map and reduce operations, as well as the optional finalize
operation.
Aggregation Pipeline
Pipeline
The MongoDB aggregation pipeline consists of stages. Each stage transforms the documents as they
pass through the pipeline. Pipeline stages do not need to produce one output document for every
input document; e.g., some stages may generate new documents or filter out documents.
Pipeline stages can appear multiple times in the pipeline with the exception of $out, $merge, and
$geoNear stages. For a list of all available stages, see Aggregation Pipeline Stages.
MongoDB provides the db.collection.aggregate() method in the mongo shell and the aggregate
command to run the aggregation pipeline.
For example usage of the aggregation pipeline, consider Aggregation with User Preference Data and
Aggregation with the Zip Code Data Set.
Starting in MongoDB 4.2, you can use the aggregation pipeline for updates :
Command: findAndModify; mongo shell methods: db.collection.findOneAndUpdate(), db.collection.findAndModify()
Command: update; mongo shell methods: db.collection.updateOne(), db.collection.updateMany()
From the above we can understand the process of aggregate processing.
Here, the aggregate() function is used to perform aggregation. It works with three kinds of building
blocks: stages, expressions and accumulators.
$match: It is used for filtering the documents; it can reduce the number of documents that are
given as input to the next stage.
$project: It is used to select some specific fields from a collection.
$group: It is used to group documents based on some value.
$sort: It is used to sort the documents, that is, rearrange them.
$skip: It is used to skip the first n documents and pass the remaining documents.
$limit: It is used to pass only the first n documents, thus limiting the output.
$unwind: It is used to unwind documents that use arrays, i.e. it deconstructs an array field
in the documents to return a document for each element.
$out: It is used to write the resulting documents to a new collection.
Expressions: These refer to the names of fields in input documents, e.g. in { $group : { _id : "$id",
total: { $sum: "$fare" } } }, $id and $fare are expressions.
Accumulators: These are basically used in the $group stage:
$sum: sums numeric values across the documents in each group (use { $sum: 1 } to count documents)
$avg: calculates the average of the given values across all documents in the group
$min: gets the minimum value across all the documents in the group
$max: gets the maximum value across all the documents in the group
$first: gets the value from the first document in the group
$last: gets the value from the last document in the group
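A minimal sketch combining several of these stages and accumulators (it assumes an orders collection
with status, cust_id and price fields, similar to the order documents used later in these notes):
db.orders.aggregate([
   { $match: { status: "A" } },                  // stage 1: keep only matching documents
   { $group: {
        _id: "$cust_id",                         // expression: group key taken from a field
        total: { $sum: "$price" },               // accumulator: total price per customer
        orders: { $sum: 1 }                      // accumulator: number of documents per group
   } },
   { $sort: { total: -1 } },                     // highest totals first
   { $limit: 5 }                                 // keep only the first five groups
])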
Aggregate syntax
The Aggregation Pipeline Builder in MongoDB Compass provides the ability to create aggregation
pipelines to process data. With aggregation pipelines, documents in a collection or view pass through
multiple stages where they are processed into a set of aggregated results. The particular stages and
results can be modified depending on your application's needs.
To access the aggregation pipeline builder, navigate to the collection or view for which you wish to
create an aggregation pipeline and click the Aggregations tab. You are presented with a blank
aggregation pipeline. The Preview of Documents in the Collection section of the Aggregations view
displays 20 documents sampled from the current collection.
Start the MongoDB instance from command prompt and start the Compass. Select the database
connection store as favorites (authDBCon). Click connect as shown in figure.
Following list of DB will be shown. Click on authDB database. The collections in the database will be
shown. Click on the users collection. The collection window has various tabs at the top. Click on the Aggregations tab.
Fill in your selected stage. As you modify the pipeline stage, the preview documents shown in the pane
to the right of the stage update automatically to reflect the results of your pipeline as it progresses,
provided Auto Preview is enabled:
As shown below, we have added a $match stage to select all the documents where country = 'Mexico'
and base = 'DWC'. The results satisfying the expression will be displayed in the right pane.
Additional aggregation stages can be added as desired by clicking the Add Stage button below
your last aggregation stage. Repeat the above two operations to define each additional stage.
NOTE : The toggle to the right of the name of each pipeline stage dictates whether that stage is
included in the pipeline. Toggling a pipeline stage also updates the pipeline preview, which reflects
whether or not that stage is included.
EXAMPLE : The following pipeline excludes the first $match stage and then includes the $project stage:
Limitations : The $out stage is not available if you are connected to a Data Lake.
Compass creates a view from your pipeline results in the same database where the pipeline was
created.
Pipeline Options
The toggles at the top of the pipeline builder control the following options:
Option: Sample Mode (Recommended). When enabled, limits input documents passed to the $group, $bucket,
and $bucketAuto stages. Set the document limit with the Limit setting.
Option: Auto Preview. When enabled, Compass automatically updates the preview documents pane to
reflect the results of each active stage as the pipeline progresses.
Example
The following example walks through creating and executing a pipeline for a collection containing
airline data. You can download this dataset from the following link: air_airlines.json.
For instructions on importing JSON data into your cluster, see mongoimport. This procedure assumes
you have the data in the example.air_airlines namespace.
The following pipeline has two aggregation stages: $group and $match.
1. The $group stage groups documents by their active status and country. The stage also adds a new
field called flightCount containing the number of documents in each group.
2. The $match stage filters documents to only return those with a flightCount value greater than or
equal to 5.
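A sketch of this two-stage pipeline in mongo shell syntax (it assumes the air_airlines documents carry
active and country fields; the flightCount name comes from the description above):
db.air_airlines.aggregate([
   { $group: {
        _id: { active: "$active", country: "$country" },   // group by active status and country
        flightCount: { $sum: 1 }                            // number of documents in each group
   } },
   { $match: { flightCount: { $gte: 5 } } }                 // keep only groups with 5 or more airlines
])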
1. Click the Export to Language button at the top of the pipeline builder. The button opens the
following modal:
2. Click the Copy button in the My Pipeline panel on the left. This button copies your pipeline to
your clipboard in mongo shell syntax. You will use the pipeline in the following section.
Syntax
The imported pipeline must be in the MongoDB query language (i.e. the same syntax used as the
pipeline parameter of the db.collection.aggregate() method). The imported pipeline must be an array,
even if there is only one stage in the pipeline.
Procedure
Navigate to the collection for which you wish to import your aggregation pipeline. Click the
Aggregations tab.
a. Click the arrow next to the icon at the top of the pipeline builder.
b. Click New Pipeline From Text.
If you have a pre-written pipeline you wish to import into the Aggregation Pipeline Builder, copy it to
your clipboard and paste it into the New Pipeline from Plain Text dialog. Otherwise, type your pipeline
in the input.
Once you import your pipeline, you can add and modify individual stages and see the results reflected
in the Output of each respective stage.
$match
Filter the documents, and only the documents that meet the specified conditions are passed to the
next pipeline stage.
$match accepts a document with specified query criteria. The query syntax is the same as the read
operation query syntax. Syntax : { $match: { <query> } }
$match can use all conventional query operators except the geospatial ones. In practice, try to
place $match early in the pipeline. This has two advantages: first, it quickly filters out unnecessary
documents, saving time and reducing the pipeline workload; second, if $match is executed before
projection and grouping, the query can use indexes.
Restrictions:
You cannot use $where as part of a $match query in an aggregation pipeline.
To use $text in the $match stage, the $match stage must be the first stage of the pipeline.
Views do not support text search.
Example
Sample data:
/* 1 */
{
"_id" : ObjectId("512bc95fe835e68f199c8686"),
"author" : "dave", "score" : 80, "views" : 100
}
/* 2 */
{
"_id" : ObjectId("512bc962e835e68f199c8687"),
"author" : "dave", "score" : 85, "views" : 521
}
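A minimal $match stage over this sample data might look like the following (the collection name
articles is assumed here):
db.articles.aggregate([
   { $match: { author: "dave" } }    // both sample documents pass to the next stage
])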
$count
Returns the count of documents input to the stage, which can be understood as the count of
documents that would match a find() query on the collection or view. The db.collection.count() method
does not perform the find() operation itself, but counts and returns the number of results that would
match the query.
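A sketch of the $count stage (the collection name and the passing_scores output field below are
illustrative):
db.articles.aggregate([
   { $match: { score: { $gt: 80 } } },   // filter the documents first
   { $count: "passing_scores" }          // then count whatever remains
])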
$group
A group key is often a field, or group of fields. The group key can also be the result of an expression.
Use the _id field in the $group pipeline stage to set the group key. The output documents can also
contain additional fields that are set using accumulator expressions.
The memory limit of the $group stage is 100 MB. By default, if the stage exceeds this limit, $group will
generate an error. However, to allow processing of large datasets, set the allowDiskUse option to true
to enable the $group operation to write temporary files.
Example:
Sample data:
1. The following operation uses the $group stage to group documents by month, day and
year, calculate the total price and average quantity, and count the number of documents in
each group (a sketch of the command follows; the result is shown after it):
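A sketch of such a grouping, assuming a sales collection whose documents have date, price and
quantity fields (consistent with the output shown next):
db.sales.aggregate([
   { $group: {
        _id: { month: { $month: "$date" }, day: { $dayOfMonth: "$date" }, year: { $year: "$date" } },
        totalPrice: { $sum: { $multiply: [ "$price", "$quantity" ] } },
        averageQuantity: { $avg: "$quantity" },
        count: { $sum: 1 }
   } }
])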
return:
/* 1 */
{
"_id" : { "month" : 4, "day" : 4, "year" : 2014 },
"totalPrice" : 200, "averageQuantity" : 15.0, "count" : 2.0
}
/* 2 */
{
"_id" : { "month" : 3, "day" : 15, "year" : 2014 },
"totalPrice" : 50, "averageQuantity" : 10.0, "count" : 1.0
}
/* 3 */
{
"_id" : { "month" : 3, "day" : 1, "year" : 2014 },
"totalPrice" : 40, "averageQuantity" : 1.5, "count" : 2.0
}
2. Group by null: the following aggregation operation specifies a group _id of null, calculating the total
price, average quantity and count of all documents in the collection:
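A sketch of this null-key grouping over the same assumed sales collection:
db.sales.aggregate([
   { $group: {
        _id: null,                                                   // one group containing every document
        totalPrice: { $sum: { $multiply: [ "$price", "$quantity" ] } },
        averageQuantity: { $avg: "$quantity" },
        count: { $sum: 1 }
   } }
])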
4. Data conversion
Group the data in the collection by price and convert the grouped values into an item array.
The returned _id value is the field specified as the group key; the array field can be given any name and
holds the grouped list.
The next aggregation uses the system variable $$ROOT to group whole documents by item. The
generated documents cannot exceed the BSON document size limit.
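A sketch of grouping whole documents with $$ROOT (the collection name is assumed; the books output
field name is taken from the result shown below):
db.sales.aggregate([
   { $group: {
        _id: "$item",                 // group key is the item name
        books: { $push: "$$ROOT" }    // push each complete document into the group's array
   } }
])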
return:
{
"_id" : "xyz",
"books" : [
{
"_id" : 3,
"item" : "xyz", "price" : 5, "quantity" : 10,
"date" : ISODate("2014-03-15T09:00:00.000Z")
},
{
"_id" : 4,
"item" : "xyz", "price" : 5, "quantity" : 20,
"date" : ISODate("2014-04-04T11:21:39.736Z")
}
]
}
{
  "_id" : "jkl",
  ...
}
$unwind
This operator deconstructs an array field from the input document to output a document for each
element. In short, it is used to split an array into separate documents.
Syntax: You can pass a field path operand or a document operand to unwind an array field.
Field Path Operand You can pass the array field path to $unwind. When using this
syntax, $unwind does not output a document if the field value is null,
missing, or an empty array.
{ $unwind: <field path> }
When you specify the field path, prefix the field name with a dollar
sign $ and enclose in quotes.
Document Operand You can pass a document to $unwind to specify various behavior options.
with Options {
$unwind:
{
path: <field path>,
includeArrayIndex: <string>,
preserveNullAndEmptyArrays: <boolean>
}
}
includeArrayIndex : Optional. The name of a new field to hold the array index
of the element. The name cannot start with a dollar sign $.
preserveNullAndEmptyArrays: Optional. The default value is false. If true, $unwind also outputs
documents where the array field is null, missing, or an empty array.
Example:
The following aggregation uses $unwind to output a document for each element in the sizes array:
db.getCollection('test').aggregate(
   [ { $unwind : "$sizes" } ]
)
return:
{ "_id" : 1, "item" : "ABC1", "sizes" : "S" }
{ "_id" : 1, "item" : "ABC1", "sizes" : "M" }
{ "_id" : 1, "item" : "ABC1", "sizes" : "L" }
Each document is the same as the input document, except that the values of the sizes field are the
values of the original sizes array.
The following $unwind operation uses the include array index option to output the array index of an
array element.
db.getCollection('test').aggregate( [
{ $unwind: { path: "$sizes", includeArrayIndex: "arrayIndex" } }
])
return:
{ "_id" : 1, "item" : "ABC", "sizes" : "S", "arrayIndex" : NumberLong(0) }
{ "_id" : 1, "item" : "ABC", "sizes" : "M", "arrayIndex" : NumberLong(1) }
{ "_id" : 1, "item" : "ABC", "sizes" : "L", "arrayIndex" : NumberLong(2) }
{ "_id" : 3, "item" : "IJK", "sizes" : "M", "arrayIndex" : null }
The following $unwind operation uses the preserveNullAndEmptyArrays option to also output documents
whose sizes field is null, missing, or an empty array:
db.inventory.aggregate( [
{ $unwind: { path: "$sizes", preserveNullAndEmptyArrays: true } }
])
return:
{ "_id" : 1, "item" : "ABC", "sizes" : "S" }
{ "_id" : 1, "item" : "ABC", "sizes" : "M" }
{ "_id" : 1, "item" : "ABC", "sizes" : "L" }
{ "_id" : 2, "item" : "EFG" }
{ "_id" : 3, "item" : "IJK", "sizes" : "M" }
{ "_id" : 4, "item" : "LMN" }
{ "_id" : 5, "item" : "XYZ", "sizes" : null }
$project
$project selects which fields of the document to include in or exclude from the output.
The specified field can be an existing field from an input document or a new calculated field.
It can also perform some complex operations through pipeline expressions, such as mathematical
operations, date operations, string operations, and logical operations.
Syntax : { $project: { <specification(s)> } }
The $project pipeline stage is used to select fields (include fields, add fields, hide fields such as
_id: 0), rename fields, and derive new fields. The specification forms are:
<field>: <1 or true>  Include the field in the output (field: 1/0 means select / do not select the field).
_id: <0 or false>  Exclude the _id field.
<field>: <expression>  Add a new field or reset the value of an existing field. Changed in version 3.6:
MongoDB 3.6 adds the variable $$REMOVE. If the expression evaluates to $$REMOVE, the field is
excluded from the output.
<field>: <0 or false>  Exclude the field (new in version 3.4).
By default the _id field is included in the output document. To include any other field of the input
document in the output document, the inclusion must be explicitly specified in $project. If you specify
the inclusion of a field that does not exist in the document, $project ignores that field and
does not add it to the output document.
To add a new field or reset the value of an existing field, specify the field name and set its value
to an expression.
To set a field value directly to a numeric or Boolean literal, rather than to an expression that
resolves to that literal, use the $literal operator. Otherwise, $project treats the numeric or Boolean
literal as a flag to include or exclude the field.
Example:
Sample data:
{
"_id" : 1,
title: "abc123", isbn: "0001122223334", author: { last: "zzz", first: "aaa" },
copies: 5, lastModified: "2016-07-28"
}
1. The output document of the following $project stage contains only the _id, title and author fields
(a sketch of the command follows below). To exclude the _id field from the output of the $project
stage, set it to 0.
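A sketch of these inclusion projections over the books sample data above:
// Include only _id (implicit), title and author:
db.books.aggregate( [ { $project: { title: 1, author: 1 } } ] )
// Additionally exclude _id:
db.books.aggregate( [ { $project: { _id: 0, title: 1, author: 1 } } ] )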
2. Exclude fields from nested documents. In the $project stage, the author.first and lastModified
fields are excluded from the output (sample data first, then a sketch of the command):
Sample data:
{
"_id" : 1,
title: "abc123", isbn: "0001122223334",
author: { last: "zzz", first: "aaa" }, copies: 5, lastModified: "2016-07-28"
}
{
"_id" : 2,
title: "Baked Goods", isbn: "9999999999999",
author: { last: "xyz", first: "abc", middle: "" }, copies: 2, lastModified: "2017-07-21"
}
{
"_id" : 3,
title: "Ice Cream Cakes", isbn: "8888888888888",
author: { last: "xyz", first: "abc", middle: "mmm" }, copies: 5, lastModified: "2017-07-22"
}
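A sketch of the exclusion projection described in item 2 (fields not listed are kept):
db.books.aggregate( [
   { $project: { "author.first": 0, lastModified: 0 } }
] )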
3. The following $project stage uses the $$REMOVE variable to exclude the author.middle field if it is
equal to "":
db.books.aggregate( [
  {
    $project: {
      title: 1,
      "author.first": 1, "author.last" : 1,
      "author.middle": {
        $cond: {
          if: { $eq: [ "", "$author.middle" ] },
          then: "$$REMOVE",
          else: "$author.middle"
        }
      }
    }
  }
] )
return:
{ "_id" : 1, "title" : "abc123", "author" : { "last" : "zzz", "first" : "aaa" } }
{ "_id" : 2, "title" : "Baked Goods", "author" : { "last" : "xyz", "first" : "abc" } }
{ "_id" : 3, "title" : "Ice Cream Cakes", "author" : { "last" : "xyz", "first" : "abc", "middle" : "mmm" } }
You can also project specific fields from an embedded document; the result then returns only the
projected nested fields (plus the _id field by default).
Sample data:
{
"_id" : 1,
title: "abc123", isbn: "0001122223334",
author: { last: "zzz", first: "aaa" }, copies: 5
}
The isbn field is split into its parts, and lastName and copiesSold are added to the returned fields:
db.books.aggregate(
  [
    {
      $project: {
        title: 1,
        isbn: {
          prefix: { $substr: [ "$isbn", 0, 3 ] },
          group: { $substr: [ "$isbn", 3, 2 ] },
          publisher: { $substr: [ "$isbn", 5, 4 ] },
          title: { $substr: [ "$isbn", 9, 3 ] },
          checkDigit: { $substr: [ "$isbn", 12, 1] }
        },
        lastName: "$author.last",
        copiesSold: "$copies"
      }
    }
  ]
)
Return result :
{
  "_id" : 1,
  "title" : "abc123",
  "isbn" : {
    "prefix" : "000",
    "group" : "11",
    "publisher" : "2222",
    "title" : "333",
    "checkDigit" : "4"
  },
  "lastName" : "zzz",
  "copiesSold" : 5
}
Project a new array field
Sample data:
The following aggregation operation returns the new array field myArray:
If the returned array contains fields that do not exist, null will be returned:
db.collection.aggregate( [ { $project: { myArray: [ "$x", "$y", "$someField" ] } } ] )
return:
{ "_id" : ObjectId("55ad167f320c6be244eb3b95"), "myArray" : [ 1, 1, null ] }
$limit
Limits the number of documents passed to the next stage in the pipeline.
Syntax : { $limit: <positive integer> }
Example:
db.article.aggregate(
   { $limit : 5 }
);
This operation returns only the first five documents passed to it by the pipeline. $limit has no effect
on the content of the documents it passes.
$skip
Skips the specified number of documents entering the stage and passes the remaining documents to
the next stage in the pipeline.
Syntax : { $skip: <positive integer> }
Example:
db.article.aggregate(
   { $skip : 5 }
);
This operation skips the first five documents passed to it by the pipeline. $skip has no effect on the
content of the documents passed along the pipeline.
$sort
Sort all input documents and return them to the pipeline in sort order.
$sort specifies the fields to sort and the documents in the corresponding sort order. Can have one of
the following values:
1 specifies the ascending order.
-1 specifies descending order.
{ $meta: "textScore" } sorts by the calculated textScore metadata in descending order.
Example:
To sort on fields, set the sort order to 1 or -1 to specify an ascending or descending sort respectively, as
shown in the following example:
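The example command itself is not shown in these notes; a minimal sketch (the collection and field
names are illustrative) would be:
db.users.aggregate([
   { $sort: { age: -1, posts: 1 } }   // oldest users first; ties broken by fewest posts first
])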
$sortByCount
Groups incoming documents based on the value of a specified expression, then computes the count
of documents in each distinct group.
Each output document contains two fields: an _id field containing the distinct grouping value,
and a count field containing the number of documents belonging to that grouping or category.
The documents are sorted by count in descending order.
To specify a field path, prefix the field name with a dollar sign $ and enclose it in quotes. For example,
to group by the field employee, specify "$employee" as the expression.
{ $sortByCount: "$employee" }
Example
The following operation unwinds the tags array and uses the $sortByCount stage to count the number
of documents associated with each tag:
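The operation is not reproduced here; a sketch consistent with the description (the exhibits collection
name is an assumption) would be:
db.exhibits.aggregate([
   { $unwind: "$tags" },          // one document per tag value
   { $sortByCount: "$tags" }      // count documents per tag and sort descending by count
])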
The operation returns the following documents, sorted in descending order by count:
{ "_id" : "painting", "count" : 6 }
{ "_id" : "oil", "count" : 4 }
{ "_id" : "Expressionism", "count" : 3 }
{ "_id" : "Surrealism", "count" : 2 }
{ "_id" : "abstract", "count" : 2 }
{ "_id" : "woodblock", "count" : 1 }
{ "_id" : "woodcut", "count" : 1 }
{ "_id" : "ukiyo-e", "count" : 1 }
{ "_id" : "satire", "count" : 1 }
{ "_id" : "caricature", "count" : 1 }
"textScore" is a metadata which returns the score associated with the corresponding $text query for
each matching document. The text score signifies how well the document matched the search term or
terms.
We can use the { $meta: "textScore" } argument to sort by descending relevance score when
using $text searches.
db.posts.find(
   { $text: { $search: "funny" } },
   { score: { $meta: "textScore" } }
).sort({ score: { $meta: "textScore" } }
).pretty()
From MongoDB 4.4 the line that goes { score: { $meta: "textScore" } } is optional. Omitting it
will omit the score field from the results. So we can do the following (from MongoDB 4.4):
db.posts.find(
   { $text: { $search: "funny" } }
).sort({ score: { $meta: "textScore" } }
).pretty()
Result: sorted by { $meta: "textScore" }.
{
"_id" : 2, "title" : "Animals", "body" : "Animals are funny things...",
"date" : ISODate("2020-01-01T00:00:00Z"), "score" : 0.6666666666666666
}
{
"_id" : 3, "title" : "Oceans", "body" : "Oceans are wide and vast, but definitely not funny...",
"date" : ISODate("2021-01-01T00:00:00Z"), "score" : 0.6
}
{
"_id" : 1, "title" : "Web", "body" : "Create a funny website with these three easy steps...",
"date" : "2021-01-01T00:00:00.000Z", "score" : 0.5833333333333334
}
Note : Doing $text searches like this requires that we’ve created a text index. If not,
an IndexNotFound error will be returned.
To understand metadata-based sorting, let us discuss text-based search in the MongoDB engine (it
behaves much like a web search engine).
db.recipes.insertMany([
{"name": "Cafecito", "description": "A sweet and rich Cuban hot coffee made by topping an
espresso shot with a thick sugar cream foam."},
{"name": "New Orleans Coffee", "description": "Cafe Noir from New Orleans is a spiced, nutty
coffee made with chicory."},
{"name": "Affogato", "description": "An Italian sweet dessert coffee made with fresh-brewed
espresso and vanilla ice cream."},
{"name": "Maple Latte", "description": "A wintertime classic made with espresso and steamed
milk and sweetened with some maple syrup."},
{"name": "Pumpkin Spice Latte", "description": "It wouldn't be autumn without pumpkin spice
lattes made with espresso, steamed milk, cinnamon spices, and pumpkin puree."}
])
To start using MongoDB’s full-text search capabilities, we must create a text index on a collection.
A text index is a special type of index used to further facilitate searching fields containing text data.
When a user creates a text index, MongoDB will automatically drop any language-specific stop
words from searches. This means that MongoDB will ignore the most common words for the given
language (in English, words like “a”, “an”, “the”, or “this”).
MongoDB will also implement a form of suffix-stemming in searches. This involves MongoDB
identifying the root part of the search term and treating other grammar forms of that root (created
by adding common suffixes like “-ing”, “-ed”, or perhaps “-er”) as equivalent to the root for the
purposes of the search.
You can only create one text index for any given MongoDB collection, but the index can be created
using more than one field.
In our example collection, there is useful text stored in both the name and description fields of each
document. It could be useful to create a text index for both fields.
Run the following createIndex() method, which will create a text index for the two fields:
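One way to write that index creation, covering both fields, is:
db.recipes.createIndex({ name: "text", description: "text" })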
Perhaps the most common search problem is to look up documents containing one or more
individual words.
Typically, users expect the search engine to be flexible in determining where the given search
terms should appear. As an example, if you were to use any popular web search engine and type in
"coffee sweet spicy", you are likely not expecting results that contain those three words in that
exact order. That’s also how MongoDB approaches typical search queries when using text indexes.
Here we outline how MongoDB interprets search queries with a few examples.
Example : 1
We want to search for coffee drinks with spices in their recipe, so we search for the word spiced alone
using the following command:
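The command referred to here is not shown; it would look like:
db.recipes.find({ $text: { $search: "spiced" } })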
Notice that the syntax when using full-text search is slightly different from regular queries. Individual
field names — like name or description — don’t appear in the filter document. Instead, the query uses
the $text operator, telling MongoDB that this query intends to use the text index we created
previously. We don’t need to be any more specific than that, because a collection may only have a
single text index.
After running this command, MongoDB produces the following list of documents:
Output
{ "_id" : ObjectId("61895d2787f246b334ece915"), "name" : "Pumpkin Spice Latte", "description" : "It
wouldn't be autumn without pumpkin spice lattes made with espresso, steamed milk, cinnamon
spices, and pumpkin puree." }
{ "_id" : ObjectId("61895d2787f246b334ece912"), "name" : "New Orleans Coffee", "description" :
"Cafe Noir from New Orleans is a spiced, nutty coffee made with chicory." }
There are two documents in the result set, both of which contain words resembling the search query.
While the New Orleans Coffee document does have the word spiced in the description, the Pumpkin
Spice Latte document doesn’t.
As MongoDB uses stemming, it stripped the word spiced down to just spice, looked up spice in the
index, and also stemmed the indexed terms. Because of this, the words spice and spices in the Pumpkin
Spice Latte document matched the search query successfully, even though you didn’t search for either
of those words specifically.
Example : 2
Look up documents with a two-word query, spiced espresso, to look for a spicy, espresso-based coffee.
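A sketch of that query:
db.recipes.find({ $text: { $search: "spiced espresso" } })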
When using multiple words in a search query, MongoDB performs a logical OR operation, so a
document only has to match one part of the expression to be included in the result set. The results
contain documents containing both spiced and espresso or either term alone. Notice that words do not
necessarily need to appear near each other as long as they appear in the document somewhere.
When a query, especially a complex one, returns multiple results, some documents are likely to be a
better match than others. For example, when you look for spiced espresso drinks, those that are both
spiced and espresso-based are more fitting than those without spices or not using espresso as the
base.
Full-text search engines typically assign a relevance score to the search results, indicating how well
they match the search query. MongoDB also does this, but the search relevance is not visible by
default.
Search once again for spiced espresso, but this time have MongoDB also return each result’s search
relevance score. To do this, you could add a projection after the query filter document:
db.recipes.find(
{ $text: { $search: "spiced espresso" } },
{ score: { $meta: "textScore" } }
)
The projection { score: { $meta: "textScore" } } uses the $meta operator, a special kind of projection
that returns specific metadata from returned documents. This example returns the
documents’ textScore metadata, a built-in feature of MongoDB’s full-text search engine that contains
the search relevance score.
After executing the query, the returned documents will include a new field named score, as was
specified in the filter document:
Notice how much higher the score is for Pumpkin Spice Latte, the only coffee drink that contains both
the words spiced and espresso. According to MongoDB’s relevance score, it’s the most relevant
document for that query. However, by default, the results are not returned in order of relevance.
To change that, you could add a sort() clause to the query, like this:
db.recipes.find(
   { $text: { $search: "spiced espresso" } },
   { score: { $meta: "textScore" } }
).sort( { score: { $meta: "textScore" } } );
The syntax for the sorting document is the same as that of the projection. Now, the list of documents is
the same, but their order is different:
Output
{ "_id" : ObjectId("61895d2787f246b334ece915"), "name" : "Pumpkin Spice Latte", "description" : "It
wouldn't be autumn without pumpkin spice lattes made with espresso, steamed milk, cinnamon
spices, and pumpkin puree.", "score" : 2.0705128205128203 }
{ "_id" : ObjectId("61895d2787f246b334ece914"), "name" : "Maple Latte", "description" : "A
wintertime classic made with espresso and steamed milk and sweetened with some maple syrup.",
"score" : 0.55 }
{ "_id" : ObjectId("61895d2787f246b334ece913"), "name" : "Affogato", "description" : "An Italian
sweet dessert coffee made with fresh-brewed espresso and vanilla ice cream.", "score" :
0.5454545454545454 }
{ "_id" : ObjectId("61895d2787f246b334ece912"), "name" : "New Orleans Coffee", "description" :
"Cafe Noir from New Orleans is a spiced, nutty coffee made with chicory.", "score" :
0.5454545454545454 }
{ "_id" : ObjectId("61895d2787f246b334ece911"), "name" : "Cafecito", "description" : "A sweet and
rich Cuban hot coffee made by topping an espresso shot with a thick sugar cream foam.", "score" :
0.5384615384615384 }
Text Score
The $text operator assigns a score to each document that contains the search term in the indexed
fields. The score represents the relevance of a document to a given text search query. The score can be
part of a sort() method specification as well as part of the projection expression.
The { $meta: "textScore" } expression provides information on the processing of the $text operation.
Examples
Sample Data
{ _id: 1, subject: "coffee", author: "xyz", views: 50 },
{ _id: 2, subject: "Coffee Shopping", author: "efg", views: 5 },
{ _id: 3, subject: "Baking a cake", author: "abc", views: 90 },
{ _id: 4, subject: "baking", author: "xyz", views: 100 },
{ _id: 5, subject: "Café Con Leche", author: "abc", views: 200 },
{ _id: 6, subject: "Сырники", author: "jkl", views: 80 },
{ _id: 7, subject: "coffee and cream", author: "efg", views: 10 },
{ _id: 8, subject: "Cafe con Leche", author: "xyz", views: 10 }
The query specifies a $search string of coffee: db.articles.find( { $text: { $search: "coffee" } } )
This query returns the documents that contain the term coffee in the indexed subject field, or more
precisely, the stemmed version of the word:
If the search string is a space-delimited string, the $text operator performs a logical OR search on each
term and returns documents that contain any of the terms.
The following query specifies a $search string of three terms delimited by space, "bake coffee cake":
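That query would look like:
db.articles.find({ $text: { $search: "bake coffee cake" } })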
This query returns documents that contain either bake or coffee or cake in the indexed subject field, or
more precisely, the stemmed version of these words:
SQL to Aggregation Mapping
Consider an orders collection where each document has the following form:
{
cust_id: "abc123",
ord_date: ISODate("2012-11-02T17:04:11.102Z"),
status: 'A',
price: 50,
items: [ { sku: "xxx", qty: 25, price: 1 },
{ sku: "yyy", qty: 25, price: 1 } ]
}
MongoDB vs. MySQL
1. Count all records in the orders table:
db.orders.aggregate( [
   { $group: { _id: null, count: { $sum: 1 } } }
] )
SELECT COUNT(*) AS count FROM orders
2. Sum the price field across all orders:
db.orders.aggregate( [
   { $group: { _id: null, total: { $sum: "$price" } } }
] )
SELECT SUM(price) AS total FROM orders
3. For each unique cust_id, calculate the sum of prices:
db.orders.aggregate( [
   { $group: { _id: "$cust_id", total: { $sum: "$price" } } }
] )
SELECT cust_id, SUM(price) AS total
FROM orders
GROUP BY cust_id
In the following steps, we’ll prepare a test database to serve as an example data set. We’ll then learn
how to use a few of the most common aggregation pipeline stages individually. Finally, you’ll combine
these stages together to form a complete example pipeline.
To understand how the aggregation pipelines work, we need a collection of documents with multiple
fields of different types we can filter, sort, group, and summarize in different ways. Say we will use a
sample collection describing the twenty most populated cities in the world.
Run the following insertMany() method in the MongoDB shell to simultaneously create a collection
named cities and insert twenty sample documents into it. These documents describe the twenty most
populated cities in the world:
db.cities.insertMany([
{"name": "Seoul", "country": "South Korea", "continent": "Asia", "population": 25.674 },
{"name": "Mumbai", "country": "India", "continent": "Asia", "population": 19.980 },
{"name": "Lagos", "country": "Nigeria", "continent": "Africa", "population": 13.463 },
{"name": "Beijing", "country": "China", "continent": "Asia", "population": 19.618 },
{"name": "Shanghai", "country": "China", "continent": "Asia", "population": 25.582 },
{"name": "Osaka", "country": "Japan", "continent": "Asia", "population": 19.281 },
{"name": "Cairo", "country": "Egypt", "continent": "Africa", "population": 20.076 },
{"name": "Tokyo", "country": "Japan", "continent": "Asia", "population": 37.400 },
{"name": "Karachi", "country": "Pakistan", "continent": "Asia", "population": 15.400 },
{"name": "Dhaka", "country": "Bangladesh", "continent": "Asia", "population": 19.578 },
{"name": "Rio de Janeiro", "country": "Brazil", "continent": "South America", "population": 13.293 },
{"name": "São Paulo", "country": "Brazil", "continent": "South America", "population": 21.650 },
{"name": "Mexico City", "country": "Mexico", "continent": "North America", "population": 21.581 },
{"name": "Delhi", "country": "India", "continent": "Asia", "population": 28.514 },
{"name": "Buenos Aires", "country": "Argentina", "continent": "South America",
"population": 14.967 },
{"name": "Kolkata", "country": "India", "continent": "Asia", "population": 14.681 },
{"name": "New York", "country": "United States", "continent": "North America",
"population": 18.819 },
{"name": "Manila", "country": "Philippines", "continent": "Asia", "population": 13.482 },
{"name": "Chongqing", "country": "China", "continent": "Asia", "population": 14.838 },
{"name": "Istanbul", "country": "Turkey", "continent": "Europe", "population": 14.751 }
])
The output will contain a list of object identifiers assigned to the newly inserted objects.
We can verify that the documents were properly inserted by running the find() method on
the cities collection with no arguments like db.cities.find(). This will retrieve all the documents in the
collection.
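The first aggregate() call the following paragraph refers to is not reproduced above; a minimal sketch
consistent with the description (a single $match stage with an empty query document) is:
db.cities.aggregate([
   { $match: { } }    // empty query document: every city passes through
])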
Each element inside this array is an object describing a stage. The stage is written as { $match: { } }. It is
describing the processing stage, the key $match refers to the stage type, and the value { } describes its
parameters. In our example, the $match stage uses the empty query document as its parameter and is
the only stage in the whole processing pipeline.
Remember that $match narrows down the list of documents from the collection. With no filtering
parameters applied, MongoDB will return the list of all cities from the collection.
Next, run the aggregate() method again, but this time include a query document as a parameter to
the $match stage. Any valid query document can be used here.
We can think of using the $match stage as equivalent to querying the collection with find(). The biggest
difference is that $match can be used multiple times in the aggregation pipeline, allowing us to query
documents that have already been processed and reshaped by earlier stages.
Run the following aggregate() method. This example includes a $match stage to select only cities from
North America:
db.cities.aggregate([
   { $match: { "continent": "North America" } }
])
Here the { "continent": "North America" } query document appears as the parameter to the $match stage.
Output
{ "_id" : ObjectId("612d1e835ebee16872a109b0"), "name" : "Mexico City", "country" : "Mexico",
"continent" : "North America", "population" : 21.581 }
{ "_id" : ObjectId("612d1e835ebee16872a109b4"), "name" : "New York", "country" : "United States",
"continent" : "North America", "population" : 18.819 }
This command returns the same output as the following one which instead uses the find() method to
query the database: db.cities.find({ "continent": "North America" })
Following aggregate() method will return cities from North America and Asia:
db.cities.aggregate([
{ $match: { "continent": { $in: ["North America", "Asia"] } } }
])
With that, we’ve learned how to execute an aggregation pipeline and how to use the $match stage to
narrow down the collection’s documents.
$match does nothing to change or transform the data as it passes through the pipeline. When querying
the database, it’s common to expect a certain order when retrieving the results. Using the standard
query mechanism, you can specify the document order by appending a sort() method to the end of
a find() query. For example, to retrieve every city in the collection and sort them in descending order
by population, you could run an operation like this:
db.cities.find().sort({ "population": -1 })
We can alternatively sort the documents in an aggregation pipeline by including a $sort stage. To
illustrate this, run the following aggregate() method. This follows a similar syntax to the previous
examples that used a $match stage:
db.cities.aggregate([
{ $sort: { "population": -1 } }
])
Output
{ "_id" : ObjectId("612d1e835ebee16872a109ab"), "name" : "Tokyo", "country" : "Japan",
"continent" : "Asia", "population" : 37.4 }
{ "_id" : ObjectId("612d1e835ebee16872a109b1"), "name" : "Delhi", "country" : "India",
"continent" : "Asia", "population" : 28.514 }
{ "_id" : ObjectId("612d1e835ebee16872a109a4"), "name" : "Seoul", "country" : "South Korea",
"continent" : "Asia", "population" : 25.674 }
...
Suppose we want to retrieve cities just from North America sorted by population in ascending order.
To do so, we can apply two processing stages as shown below :
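A sketch of that two-stage pipeline:
db.cities.aggregate([
   { $match: { "continent": "North America" } },   // filter first
   { $sort: { "population": 1 } }                  // then sort ascending by population
])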
To obtain these results, MongoDB first passed the document collection through the $match stage,
filtered the documents against the query criteria, and then forwarded the results to the next stage in
line responsible for sorting the results. Just like the $match stage, $sort can appear multiple times in
the aggregation pipeline and can sort documents by any field you might need, including fields that will
only appear in the document structure during the aggregation.
Note: When running filtering and sorting stages at the beginning of the aggregation pipeline, before
any projection, grouping, or other transformation stages, MongoDB will use indexes to maximize the
performance, just like it would with a standard query.
The output documents of the $group stage hold information about the group and can contain additional
computed fields like sums or averages across the list of documents from the group.
Here we include a $group stage that will group the resulting documents by the continent in which
each city is located:
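The grouping stage described here, sketched in shell syntax:
db.cities.aggregate([
   { $group: { "_id": "$continent" } }   // one output document per distinct continent value
])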
This aggregate() method, though, does specify an _id value; namely, each value found in
the continent field of each document in the cities collection. Any time we want to refer to the values of a
field in an aggregation pipeline like this, we must precede the name of the field with a dollar sign ($).
In this example, "$continent" tells MongoDB to take the continent field from the original
document and use its value to construct the expression value in the aggregation pipeline. MongoDB
will output a single document for each unique value of that grouping expression:
Output
{ "_id" : "Africa" } Here output is a single document for each of the five
{ "_id" : "Asia" } continents represented in the collection. By default, the
{ "_id" : "South America" } grouping stage doesn’t include any additional fields from
{ "_id" : "Europe" } the original document, since it wouldn’t know how or
{ "_id" : "North America" } from which document to source the other values.
We can, however, specify multiple single-field values in a grouping expression. The following example
method will group documents based on the values in the continent and country fields:
db.cities.aggregate([
   {
      $group: {
         "_id": { "continent": "$continent", "country": "$country" }
      }
   }
])
Here the _id field of the grouping expression uses an embedded document which, in turn, has two
fields inside: one for the continent name and another for the country name. Both fields refer to fields
from the original documents using the field path dollar sign notation.
This time MongoDB returns 14 results as there are 14 distinct country-continents pairs in the
collection:
Output
{ "_id" : { "continent" : "Europe", "country" : "Turkey" } }
{ "_id" : { "continent" : "South America", "country" : "Argentina" } }
{ "_id" : { "continent" : "Asia", "country" : "Bangladesh" } }
{ "_id" : { "continent" : "Asia", "country" : "Philippines" } }
{ "_id" : { "continent" : "Asia", "country" : "South Korea" } }
{ "_id" : { "continent" : "Asia", "country" : "Japan" } }
{ "_id" : { "continent" : "Asia", "country" : "China" } }
{ "_id" : { "continent" : "North America", "country" : "United States" } }
{ "_id" : { "continent" : "North America", "country" : "Mexico" } }
{ "_id" : { "continent" : "Africa", "country" : "Nigeria" } }
{ "_id" : { "continent" : "Asia", "country" : "India" } }
{ "_id" : { "continent" : "Asia", "country" : "Pakistan" } }
{ "_id" : { "continent" : "Africa", "country" : "Egypt" } }
{ "_id" : { "continent" : "South America", "country" : "Brazil" } }
MongoDB provides a number of accumulator operators which allow us to find more granular details
about the data. An accumulator operator, sometimes referred to simply as an accumulator, is a
special operator that maintains its state as documents pass through the pipeline, computing values
such as sums, averages, minimums and maximums over the documents in each group.
To illustrate, run the following aggregate() method. This method’s $group stage creates the
required _id grouping expression as well as three additional computed fields. These computed fields all
include an accumulator operator and its value. Here’s a breakdown of these computed fields:
highest_population: this field contains the maximum population value in the group.
The $max accumulator operator computes the maximum value for "$population" across all
documents in a group.
first_city: contains the name of the first city in the group. The $first accumulator operator takes the
value of "$name" from the first document appearing in the group. Notice that since the list of
documents is now unordered, this doesn’t automatically make it the city with the highest
population, but rather the first city MongoDB finds within each group.
cities_in_top_20: holds the number of cities in the collection for each continent-country pair. To
accomplish this, the $sum accumulator operator is used to compute the sum of all the pairs in the
list. In this example, the sum takes one for each document and doesn’t refer to a particular field in
the source document.
We can add as many additional computed fields as needed for your use case, but for now run this
example query:
db.cities.aggregate([
{
$group: {
"_id": { "continent": "$continent", "country": "$country" },
"highest_population": { $max: "$population" },
"first_city": { $first: "$name" },
"cities_in_top_20": { $sum: 1 }
}
}
])
MongoDB returns the following 14 documents, one for each unique group defined by the grouping
expression:
Output
{ "_id" : { "continent" : "North America", "country" : "United States" }, "highest_population" : 18.819,
"first_city" : "New York", "cities_in_top_20" : 1 }
{ "_id" : { "continent" : "Asia", "country" : "Philippines" }, "highest_population" : 13.482,
"first_city" : "Manila", "cities_in_top_20" : 1 }
{ "_id" : { "continent" : "North America", "country" : "Mexico" }, "highest_population" : 21.581,
"first_city" : "Mexico City", "cities_in_top_20" : 1 }
{ "_id" : { "continent" : "Africa", "country" : "Nigeria" }, "highest_population" : 13.463,
"first_city" : "Lagos", "cities_in_top_20" : 1 }
{ "_id" : { "continent" : "Asia", "country" : "India" }, "highest_population" : 28.514,
The field names in the returned documents correspond to the computed field names in the grouping
stage document. To examine the results more closely, let’s narrow our focus to a single document:
Note: In addition to the three described in this step, there are several more accumulator operators
available in MongoDB that can be used for a variety of aggregations.
When working with aggregation pipelines, you’ll sometimes want to return only a few of a document
collection’s multiple fields or change the structure slightly to move some fields into embedded
documents.
To illustrate, run the following aggregate() method which includes a $project stage:
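The method is not reproduced here; a sketch matching the description below (suppressed _id, computed
name and population fields, and an embedded location document) is:
db.cities.aggregate([
   { $project: {
        "_id": 0,
        "name": "$name",
        "population": "$population",
        "location": { "country": "$country", "continent": "$continent" }
   } }
])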
In an aggregation pipeline, projections can also include additional computed fields. In such cases, the
projection automatically becomes an inclusion projection, and only the _id field can be suppressed by
appending "_id": 0 to the projection document. Computed fields use the dollar sign field path notation
for their values and can refer to the values from input documents.
In this example, the document identifier is suppressed with "_id": 0, the name and population are
computed fields referring to the name and population fields from the input documents, respectively.
The location field becomes an embedded document with two additional keys: country and continent,
referring to fields from the input documents.
Using this projection stage, MongoDB will return the following documents:
Output
Each document now follows the new format transformed through the projection stage.
We’re now ready to join together all the previous stages to form a fully functional aggregation pipeline
that both filters and transforms documents.
Suppose the task at hand is to find the most populated city for each country in Asia and
North America and return both its name and population. The results should be sorted by the highest
population, returning countries with the largest cities first, and you are interested only in countries
where the most populated city crosses the threshold of 20 million people. Lastly, the document
structure you aim for should replicate the following:
Example document
{
    "location" : {
        "country" : "Japan",
        "continent" : "Asia"
    },
    "most_populated_city" : {
        "name" : "Tokyo",
        "population" : 37.4
    }
}
To start building this pipeline, first filter the cities by continent and sort them by population:
db.cities.aggregate([
   {
      $match: { "continent": { $in: ["North America", "Asia"] } }
   },
   {
      $sort: { "population": -1 }
   }
])
The $sort stage tells MongoDB to order the documents by population in descending order.
Once again the returned documents have the same structure, but this time Tokyo comes first since it
has the highest population:
Output
{ "_id" : ObjectId("612d1e835ebee16872a109ab"), "name" : "Tokyo", "country" : "Japan",
"continent" : "Asia", "population" : 37.4 }
...
We now have the list of cities sorted by the population coming from the expected continents, so the
next necessary action for this scenario is to group cities by their countries, choosing only the most
populated city from each group. To do so, add a $group stage to the pipeline:
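A sketch of the pipeline with the added $group stage (the field names match the output below):
db.cities.aggregate([
   { $match: { "continent": { $in: ["North America", "Asia"] } } },
   { $sort: { "population": -1 } },
   { $group: {
        "_id": { "continent": "$continent", "country": "$country" },
        "first_city": { $first: "$name" },
        "highest_population": { $max: "$population" }
   } }
])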
Adding this $group stage changes the number of documents returned by this method as well as their
structure. This time, the method only returns nine documents, as there are only nine unique country
and continent pairs in the previously filtered cities list. Each document corresponds to one of these
pairs, and consists of the grouping expression value in the _id field and two computed fields:
Output
{ "_id" : { "continent" : "North America", "country" : "United States" }, "first_city" : "New York",
"highest_population" : 18.819 }
{ "_id" : { "continent" : "Asia", "country" : "China" }, "first_city" : "Shanghai",
"highest_population" : 25.582 }
{ "_id" : { "continent" : "Asia", "country" : "Japan" }, "first_city" : "Tokyo", "highest_population" : 37.4 }
{ "_id" : { "continent" : "Asia", "country" : "South Korea" }, "first_city" : "Seoul",
"highest_population" : 25.674 }
{ "_id" : { "continent" : "Asia", "country" : "Bangladesh" }, "first_city" : "Dhaka",
"highest_population" : 19.578 }
{ "_id" : { "continent" : "Asia", "country" : "Philippines" }, "first_city" : "Manila",
"highest_population" : 13.482 }
{ "_id" : { "continent" : "Asia", "country" : "India" }, "first_city" : "Delhi", "highest_population" : 28.514 }
{ "_id" : { "continent" : "Asia", "country" : "Pakistan" }, "first_city" : "Karachi",
"highest_population" : 15.4 }
{ "_id" : { "continent" : "North America", "country" : "Mexico" }, "first_city" : "Mexico City",
"highest_population" : 21.581 }
Notice that the resulting documents for each group are not ordered by the population value. New York
comes first, but the second city, Shanghai, has a population of almost 7 million people more. Also, recall
that we only want countries whose most populated city crosses the 20 million threshold. To filter on
that computed value, add another $match stage at the end of the pipeline:
db.cities.aggregate([
   {
      $match: { "continent": { $in: ["North America", "Asia"] } }
   },
   { $sort: { "population": -1 } },
   {
      $group: {
         "_id": { "continent": "$continent", "country": "$country" },
         "first_city": { $first: "$name" },
         "highest_population": { $max: "$population" }
      }
   },
   {
      $match: { "highest_population": { $gt: 20.0 } }
   }
])
This filtering $match stage refers to the highest_population field available in the documents coming
from the grouping stage, even though such a field is not part of the structure of the original documents.
Next, sort the results by their highest_population value. To do so, add another $sort stage at the end
of the pipeline:
db.cities.aggregate([
   { $match: { "continent": { $in: ["North America", "Asia"] } } },
   { $sort: { "population": -1 } },
   { $group: {
        "_id": { "continent": "$continent", "country": "$country" },
        "first_city": { $first: "$name" },
        "highest_population": { $max: "$population" } } },
   { $match: { "highest_population": { $gt: 20.0 } } },
   { $sort: { "highest_population": -1 } }
])
Output
{ "_id" : { "continent" : "Asia", "country" : "Japan" }, "first_city" : "Tokyo", "highest_population" : 37.4 }
{ "_id" : { "continent" : "Asia", "country" : "India" }, "first_city" : "Delhi", "highest_population" : 28.514 }
{ "_id" : { "continent" : "Asia", "country" : "South Korea" }, "first_city" : "Seoul",
"highest_population" : 25.674 }
{ "_id" : { "continent" : "Asia", "country" : "China" }, "first_city" : "Shanghai",
"highest_population" : 25.582 }
{ "_id" : { "continent" : "North America", "country" : "Mexico" }, "first_city" : "Mexico City",
"highest_population" : 21.581 }
The last requirement is to transform the document structure to match the sample shown previously.
For your review, here’s that sample once more:
Example document
{ This sample’s location embedded document resembles
"location" : { the _id grouping expression value, as both
"country" : "Japan", include country and continent fields. The most populated
"continent" : "Asia" city name and population are nested as an embedded
}, document under the most_populated_city field. This is
"most_populated_city" : { different from the grouping results, where all computed
"name" : "Tokyo", fields are top-level fields.
"population" : 37.4
}
}
To transform the results to align with this structure, add a $project stage to the pipeline:
db.cities.aggregate([
   { $match: { "continent": { $in: ["North America", "Asia"] } } },
   { $sort: { "population": -1 } },
   { $group: {
        "_id": { "continent": "$continent", "country": "$country" },
        "first_city": { $first: "$name" },
        "highest_population": { $max: "$population" } } },
   { $match: { "highest_population": { $gt: 20.0 } } },
   { $sort: { "highest_population": -1 } },
   { $project: {
        "_id": 0,
        "location": { "country": "$_id.country", "continent": "$_id.continent" },
        "most_populated_city": { "name": "$first_city", "population": "$highest_population" } } }
])
This projection stage effectively constructs an entirely new structure for the output as shown below:
Output
{ "location" : { "country" : "Japan", "continent" : "Asia" }, "most_populated_city" : { "name" : "Tokyo",
"population" : 37.4 } }
{ "location" : { "country" : "India", "continent" : "Asia" }, "most_populated_city" : { "name" : "Delhi",
"population" : 28.514 } }
{ "location" : { "country" : "South Korea", "continent" : "Asia" },
"most_populated_city" : { "name" : "Seoul", "population" : 25.674 } }
{ "location" : { "country" : "China", "continent" : "Asia" },
"most_populated_city" : { "name" : "Shanghai", "population" : 25.582 } }
{ "location" : { "country" : "Mexico", "continent" : "North America" },
"most_populated_city" : { "name" : "Mexico City", "population" : 21.581 } }
This output meets all the requirements defined at the beginning of this step:
It only includes cities from Asia and North America in the lists.
For each country and continent pair, a single city is selected, and it’s the city with the highest
population.
The selected city’s name and population are listed.
Cities are sorted from the most populated to least populated.
The output format is altered to align with the example document.