Breaking the Sound Barrier: MongoDB (WiredTiger) on AWS – Part II

You are reading: Breaking the Sound Barrier: 1.2M inserts/sec with MongoDB WiredTiger on a single AWS instance – Part II of II, by Dmitry Agranat (LinkedIn). The first part of this series can be read here.

Introduction – MongoDB is a cross-platform document-oriented database. Classified as a NoSQL database, MongoDB eschews the traditional table-based relational database structure in favour of JSON-like documents with dynamic schemas (MongoDB calls the format BSON), making the integration of data in certain types of applications easier and faster. Released under a combination of the GNU Affero General Public License and the Apache License, MongoDB is free and open-source software.

First developed by the software company MongoDB Inc. in October 2007 as a component of a planned platform as a service, the company shifted to an open source development model in 2009, with MongoDB offering commercial support and other services. Since then, MongoDB has been adopted as backend software by a number of major websites and services, including Craigslist, eBay, and Foursquare among others. As of July 2015, MongoDB is the fourth most popular type of database management system, and the most popular for document stores.

With the MongoDB WiredTiger release, two of the major features customers can immediately benefit from are improved write performance and storage-side compression. I wanted to see the benefit of these new features for myself, so I decided to investigate them through a simple hands-on exercise.

The first test was executed with journaling on, and for the second test we turned journaling off.

Test#2 – Journaling Off: The setup for this run consisted of:

  • 16 mongo clients
  • Each one inserting 3,125,000,000 documents
  • In total over 50 Billion documents were inserted
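The loader logic can be sketched as follows. This is a minimal sketch, assuming pymongo, unordered bulk inserts, and a batch size of 1,000; the batch size, driver, and collection-per-loader layout are illustrative assumptions, not details taken from the original benchmark:

```python
DOCS_PER_LOADER = 3_125_000_000   # from the setup above
LOADERS = 16
BATCH = 1_000                     # illustrative batch size, not from the post

def batches(total, batch):
    """Yield the size of each insert batch needed to cover `total` docs."""
    full, rem = divmod(total, batch)
    for _ in range(full):
        yield batch
    if rem:
        yield rem

# One loader's insert loop (requires pymongo and a running mongod; shown
# commented out so the sketch stays self-contained):
# from pymongo import MongoClient
# coll = MongoClient()["test"]["insert_demo_01"]
# for n in batches(DOCS_PER_LOADER, BATCH):
#     coll.insert_many([{} for _ in range(n)], ordered=False)

print(LOADERS * sum(batches(DOCS_PER_LOADER, BATCH)))  # 50000000000
```

Unordered inserts (`ordered=False`) let the server apply writes in parallel, which matters at these rates.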

The MongoDB server parameters used for the test are listed further below; first, let's look at data compression.

Data Compression:
> db.insert_demo_01.count()
3125000000

Compressed:
> db.insert_demo_01.stats()
{
    "ns" : "test.insert_demo_01",
    "count" : 3125000000,
    "size" : 68750000000,
    "avgObjSize" : 22,
    "storageSize" : 15036657664,
    "capped" : false,
    "wiredTiger" : {
        "metadata" : {
            "formatVersion" : 1
        },
        "block-manager" : {
            "file allocation unit size" : 4096,
            "blocks allocated" : 469102,
            "checkpoint size" : 15035752448,
            "allocations requiring file extension" : 466590,
            "blocks freed" : 3215,
            "file magic number" : 120897,
            "file major version number" : 1,
            "minor version number" : 0,
            "file bytes available for reuse" : 1105920,
            "file size in bytes" : 15036657664
        },
    "nindexes" : 1,
    "totalIndexSize" : 31461847040,
    "indexSizes" : {
        "_id_" : 31461847040
    },

Uncompressed:
{
    "ns" : "test.insert_demo_03",
    "count" : 3125000000,
    "size" : 68750000000,
    "avgObjSize" : 22,
    "storageSize" : 90922614784,
    "capped" : false,
    "wiredTiger" : {
        "metadata" : {
            "formatVersion" : 1
        },
        "block-manager" : {
            "file allocation unit size" : 4096,
            "blocks allocated" : 2827200,
            "checkpoint size" : 90922704896,
            "allocations requiring file extension" : 2825278,
            "blocks freed" : 11650,
            "file magic number" : 120897,
            "file major version number" : 1,
            "minor version number" : 0,
            "file bytes available for reuse" : 98304,
            "file size in bytes" : 90922614784
        },
    "nindexes" : 1,
    "totalIndexSize" : 66059067392,
    "indexSizes" : {
        "_id_" : 66059067392
    },

Documents: 16 Loaders x 3,125,000,000 = 50,000,000,000 (50 Billion)

Compressed:
Data:
-rw-r--r-- 1 mongod mongod 15036657664 Apr 11 04:24 42--1201249500126757072.wt
Index:
-rw-r--r-- 1 mongod mongod 31461715968 Apr 11 02:53 45--1201249500126757072.wt

Data: 15036657664 Bytes x 16 collections = 240586522624 Bytes = 224 GB
Index: 31461715968 Bytes x 16 collections = 503387455488 Bytes = 468.8 GB
50 Billion documents = 693 GB compressed with zlib.

Uncompressed:
Data:
-rw-r--r-- 1 mongod mongod 90922614784 Apr 11 21:40 0-7873777076658133305.wt
Index:
-rw-r--r-- 1 mongod mongod 66059067392 Apr 11 21:40 1-7873777076658133305.wt

Data: 22 Byte Document x 3,125,000,000 = 68750000000 Bytes x 16 collections = 1100000000000 Bytes = 1024 GB
Index: 66059067392 Bytes x 16 collections = 1056945078272 Bytes = 984 GB
50 Billion documents = 2008 GB uncompressed
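Incidentally, 22 bytes is exactly the BSON size of a document containing nothing but an auto-generated ObjectId _id, which suggests (though the post does not say so explicitly) that the benchmark inserted empty documents. A minimal sketch in pure Python, hand-encoding such a document per the BSON spec:

```python
import os
import struct

def bson_doc_with_objectid(oid: bytes) -> bytes:
    """Encode a BSON document whose only field is an ObjectId _id."""
    assert len(oid) == 12
    element = b"\x07" + b"_id\x00" + oid   # type 0x07 = ObjectId, cstring key
    body = element + b"\x00"               # 0x00 document terminator
    return struct.pack("<i", 4 + len(body)) + body  # int32 total length prefix

doc = bson_doc_with_objectid(os.urandom(12))
print(len(doc))  # 22 -- matching avgObjSize in the stats above
```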

Ratio:
Data (Uncompressed/zlib-compressed): 1024/224 = 4.57
Index (Uncompressed/prefix-compressed): 984/468.8 = 2.10
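As a sanity check, the storage arithmetic above can be reproduced directly from the stats output (pure Python; "GB" here means GiB, i.e. 1024^3 bytes):

```python
GIB = 1024 ** 3

# Per-collection sizes taken from the stats and ls output above, x16 collections
compressed_data    = 15_036_657_664 * 16   # zlib-compressed collection files
compressed_index   = 31_461_715_968 * 16   # prefix-compressed _id index files
uncompressed_data  = 68_750_000_000 * 16   # 22-byte docs x 3.125B x 16
uncompressed_index = 66_059_067_392 * 16

print(round(compressed_data / GIB))           # ~224 GB
print(round(compressed_index / GIB, 1))       # ~468.8 GB
print(round(uncompressed_data / GIB))         # ~1024 GB
print(round(uncompressed_index / GIB))        # ~984 GB
print(round(uncompressed_data / compressed_data, 2))    # data ratio ~4.57
print(round(uncompressed_index / compressed_index, 2))  # index ratio ~2.10
```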

The MongoDB parameters used for this test are as follows:

net:
  maxIncomingConnections: 20000
  port: 27017
processManagement:
  fork: true
storage:
  dbPath: /data/db/myProcess
  engine: wiredTiger
  syncPeriodSecs: 30
  wiredTiger:
    collectionConfig:
      blockCompressor: zlib
    engineConfig:
      cacheSizeGB: 16
      directoryForIndexes: true
    indexConfig:
      prefixCompression: true
systemLog:
  destination: file
  path: /logs/mongodb.log
  quiet: true
Results: The results from the second run are quite interesting. Let's look at the Opcounters graph below, which shows ~1M inserts/sec with spiky throughput at various intervals.

pic1_241015

Throughput for our workload has almost doubled. This confirms the earlier hypothesis that the point of contention in the SUT (System Under Test) is the commit and journaling code path. The spiky throughput is a direct result of memory pressure, which we can investigate with the mongostat utility. Looking at the mongostat "flushes" field, we can see a direct correlation between each flush and a drop in throughput. For the WiredTiger storage engine, "flushes" is the number of WiredTiger checkpoints triggered between polling intervals. Another metric worth watching is "dirty", the percentage of the WiredTiger cache holding dirty bytes. We need to stabilize both "flushes" and "dirty" to smooth out the throughput.
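To make the correlation concrete, here is a sketch over hypothetical mongostat-style samples; the numbers are invented for illustration and are not measurements from the run:

```python
# Each sample mimics one mongostat polling interval: insert rate,
# WiredTiger checkpoints triggered ("flushes"), and dirty cache %.
samples = [
    {"insert": 1_050_000, "flushes": 0, "dirty": 4.1},
    {"insert": 1_020_000, "flushes": 0, "dirty": 9.8},
    {"insert":   640_000, "flushes": 1, "dirty": 19.7},  # checkpoint interval
    {"insert": 1_010_000, "flushes": 0, "dirty": 5.2},
]

checkpoint_rates = [s["insert"] for s in samples if s["flushes"] > 0]
steady_rates = [s["insert"] for s in samples if s["flushes"] == 0]
# Rough size of the throughput dip during a checkpoint:
print(min(steady_rates) - max(checkpoint_rates))  # 370000
```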

Let’s try to tune this for our next test run.

Test#3 – Journaling Off & WT Cache Tuning & FS Cache Tuning: The setup for this run consisted of:

  • 16 mongo clients
  • Each one inserting 3,125,000,000 documents
  • In total over 50 Billion documents were inserted

The MongoDB server setup is the same as above; the parameters for this run are provided below:

net:
  maxIncomingConnections: 20000
  port: 27017
processManagement:
  fork: true
storage:
  dbPath: /data/db/myProcess
  engine: wiredTiger
  journal:
    enabled: false
  syncPeriodSecs: 30
  wiredTiger:
    collectionConfig:
      blockCompressor: zlib
    engineConfig:
      cacheSizeGB: 16
      configString: eviction_dirty_target=10,eviction_trigger=50,eviction_target=20,checkpoint=(wait=100)
      directoryForIndexes: true
    indexConfig:
      prefixCompression: true
systemLog:
  destination: file
  path: /logs/mongodb.log
  quiet: true

Some of the Linux kernel settings (FS cache) were tuned for this test run: dirty_background_ratio, dirty_ratio, dirty_bytes, and dirty_expire_centisecs. Some of the WT cache parameters were also tuned: eviction_dirty_target, eviction_trigger, eviction_target, and checkpoint.

The operating system is known to trigger a flush by itself, based on parameters such as dirty_background_ratio, dirty_ratio, dirty_writeback_centisecs, and dirty_expire_centisecs (see sysctl -a | grep dirty). We would like to start the background flush much earlier and the blocking flush much later. This can be achieved by tweaking dirty_ratio (the maximum percentage of system memory that can be filled with dirty pages before everything must get committed to disk) and dirty_background_ratio (the percentage of system memory at which the background kernel flusher threads start writing out dirty data).

We also need to increase dirty_expire_centisecs from its default value of 3000. When the pdflush/flush/kdmflush processes kick in, they check how old each dirty page is; if it is older than this value, it is written asynchronously to disk.

  • dirty_ratio = 5
  • dirty_background_ratio = 5
  • dirty_expire_centisecs = 12000
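Assuming a sysctl-based setup, the values above could be persisted as a fragment like the following (the dirty_bytes value was not given in the post, so it is omitted; the file path is illustrative):

```
# /etc/sysctl.d/99-wt-benchmark.conf
vm.dirty_ratio = 5
vm.dirty_background_ratio = 5
vm.dirty_expire_centisecs = 12000
```

Apply with `sysctl --system` (or `sysctl -p <file>`) and verify with `sysctl -a | grep dirty`.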

Results: The opcounters graph below shows a consistent ~1M inserts/sec. I even added more load towards the end of the run to see if the consistency would change, and it didn't!

pic2_241015

I went a step further and added some more load in another run towards the end of the test to see if consistency breaks – it didn’t!

pic3_241015

Client side Network (Bytes/out): ~1.7GB (Graph below)

pic4_241015

Server side CPU Utilization: ~45% (Graph below)

pic5_241015

Server side Network (Bytes/out): 1.7GB (Graph below)

pic6_241015

Data Volume – VolumeConsumedReadWriteOps (count): ~17000 (Graph below)

pic7_241015

Data Volume – VolumeWriteBytes (Bytes): 63000 (Graph below)

pic8_241015

Data Volume – VolumeWriteOps (count): 5200 (Graph below)

pic9_241015

Index Volume – VolumeConsumedReadWriteOps (count): 40000 (Graph below)

pic10_241015

Index Volume – VolumeWriteBytes (Bytes): 63000 (Graph below)

pic11_241015

Index Volume – VolumeWriteOps (count): 10000 (Graph below)

pic12_241015

In summary, with journaling for MongoDB turned off, we have managed to almost double the throughput of the SUT (System Under Test). Here is a summary of the results for the test run.

pic13_241015

The Cost Of The Setup: The cost of the entire benchmark is as follows:

  • c4.8xlarge x 2 instances (client & server) x $2.112 per hour x 14 hours = $59.136
  • m3.xlarge x 1 instance (MMS monitoring) x $0.308 per hour x 14 hours = $4.312
  • Amazon EBS Provisioned IOPS (SSD) volumes (1K GB + 10K PIOPS) x 14 hours = $14.583
  • Detailed Monitoring for Amazon EC2 instances x 2 instances x 14 hours = $0.131

The total overall cost was around $78.16, and I am glad to say that we complied with our experiment's condition that the cost be kept below $100.
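The line items above can be rechecked with a few lines of arithmetic:

```python
# Recomputing the benchmark cost from the line items above.
costs = {
    "c4.8xlarge x2 (client + server)": 2 * 2.112 * 14,
    "m3.xlarge (MMS monitoring)":      1 * 0.308 * 14,
    "EBS PIOPS volumes":               14.583,
    "detailed EC2 monitoring":         0.131,
}
total = sum(costs.values())
print(round(total, 3))  # 78.162
```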

Conclusions: So what did we learn from all of this? The results from this benchmark demonstrate a stable and consistent system for the duration of the entire 14-hour experiment. The system did not fully consume the 36 vCPUs available on the cloud instance during the experiment. The benchmark also identified WT journaling as a bottleneck, which presents an opportunity for improvement within the WiredTiger storage engine.

Disabling the journal is not recommended for production use, but even if you did, data loss would be limited to a few hundred milliseconds' worth of data. This may be an acceptable trade-off for the performance gains in certain use cases such as IoT, logging/monitoring, or in-memory aggregation. You should decide what works best for your business depending on the nature of your application. If performance is a major requirement and you cannot tolerate data loss of any sort, you can work around the risk by carefully managing copies/replicas of the data across multiple systems. But those discussions are for another day.

One more thing we would like to mention is that journaling is overrated when you also use replication, which should always be used in production with a minimum of a three-node replica set (see this page on replica set architectures). Also keep in mind that no matter how many times you fsync an operation to the journal, it can still be rolled back during a replica set failover. Hence, for durability, replication trumps journaling. In other words, using the majority write concern is the most durable way to write to MongoDB.
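For reference, the write concern travels with each write command. Below is a sketch of the wire-level shape of an insert with w:"majority" (the collection name and document are illustrative; the field names follow MongoDB's insert command):

```python
import json

# With pymongo this would typically be expressed as, e.g.:
#   from pymongo.write_concern import WriteConcern
#   coll.with_options(write_concern=WriteConcern(w="majority")).insert_one({...})
# which the driver translates into a command document shaped like:
cmd = {
    "insert": "insert_demo_01",          # target collection
    "documents": [{"x": 1}],             # illustrative payload
    "writeConcern": {"w": "majority"},   # ack only after a majority replicates
}
print(json.dumps(cmd["writeConcern"]))
```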

Further work: In our future benchmarks we are considering reviewing the following aspects of the system:

  • MongoDB Cluster scalability limits while running a heavy write workload.
  • Further Tuning of WT Cache and FS Cache for a heavy write workload.

We hope you have enjoyed this two-part series on MongoDB, "Breaking the Sound Barrier: 1.2M inserts/sec with MongoDB WiredTiger on a single AWS instance". Write to me with your thoughts, inputs and comments at dagranat at gmail dot com.


Dmitry Agranat (LinkedIn) is currently a Technical Services Engineer at MongoDB. Dmitry is an experienced Performance Engineer who focuses on optimization of end-to-end system performance. His professional career started in 2005 at Mercury (now HP Software). During this period, he has played various roles in Performance Engineering (PE) with an extremely high focus on Application and Database Performance. He has served as Performance Doctor (fixing unhealthy systems), Performance Fireman (triaging production fires), Performance Paramedic (putting the system together by applying field triage, duct tape, whatever data, domain knowledge, best guesses, wishful thinking or folklore is available to fix a problem) and Performance Plumber (removing blockages and searching for leaks). He has worked with many internal and external customers both to solve immediate performance problems and to educate on building a solid organizational performance management methodology. Dmitry believes that Performance Engineering (PE) is still not well understood, agreed upon, unified or quantifiable, and thus requires building solid and effective entrepreneurship around the performance domain.
