Breaking the Sound Barrier: MongoDB (WiredTiger) on AWS

You are reading: Breaking the Sound Barrier: 1.2M inserts/sec with MongoDB WiredTiger on a single AWS instance – Part I of II, by Dmitry Agranat (LinkedIn). The second part of this series will be published next week here at Practical Performance Analyst.

Introduction – MongoDB is a cross-platform, document-oriented database. Classified as a NoSQL database, MongoDB eschews the traditional table-based relational database structure in favour of JSON-like documents with dynamic schemas (MongoDB calls the format BSON), making the integration of data in certain types of applications easier and faster. Released under a combination of the GNU Affero General Public License and the Apache License, MongoDB is free and open-source software.

MongoDB was first developed by the software company MongoDB Inc. in October 2007 as a component of a planned platform-as-a-service product; the company shifted to an open-source development model in 2009, with MongoDB Inc. offering commercial support and other services. Since then, MongoDB has been adopted as backend software by a number of major websites and services, including Craigslist, eBay, and Foursquare. As of July 2015, MongoDB is the fourth most popular type of database management system, and the most popular for document stores.

With the MongoDB WiredTiger release, two of the major features customers can immediately profit from are write performance and storage-side compression. I wanted to see the benefit of these new features for myself, so I decided to investigate them through a simple hands-on exercise.

Insert Performance – The new WiredTiger storage engine in MongoDB 3.0 is supposed to deliver 7-10x greater throughput for write-intensive applications thanks to more granular, document-level concurrency control. On well-specced hardware, this should translate to massive performance gains for write-heavy projects such as your next Internet of Things initiative, customer data management, social, and mobile apps.

You can read more in the great post by Asya Kamsky, “Throughput Improvements Measured With YCSB”.

Compression – MongoDB now supports native compression, allowing you to reduce your physical storage footprint by as much as 80%. You also have the added flexibility to choose between different compression algorithms to optimize for performance and storage efficiency, depending on your application's needs.
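For example, the compressor can be chosen per collection at creation time. The snippet below is a minimal sketch from the mongo shell (the collection name is purely illustrative):

// Create a collection that uses zlib block compression instead of the default
db.createCollection("sensor_events", {
    storageEngine: { wiredTiger: { configString: "block_compressor=zlib" } }
})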

With more granular concurrency control and built-in compression, MongoDB 3.0 lets you simplify your architecture, allowing you to do far more with less hardware. You can read more in the great post by John Page, “WiredTiger – how to reduce your MongoDB hosting costs 10x”.

Goals & Conditions – Let’s start by taking a look at the goals of our exercise:

  • Find the performance and scalability limitations of a typical single MongoDB server on AWS running a heavy write workload.
  • Measure compression benefits for optimal performance and scalability of a typical MongoDB server running on AWS with a heavy write workload.

Having defined our goals, let's consider the conditions I set myself for the benchmark:

  • The benchmark should run long enough to generate stable and consistent throughput over time.
  • The experiment should be easily reproducible by anyone with minimal tweaking of the configuration and setup.
  • The overall experiment should cost no more than 100 USD.

Exploration – My initial dilemma was around the choice of an appropriate load testing tool for the experiment. I have used LoadRunner and JMeter for the last 10 years as a performance engineer, first at HP Software and later at RSA. Initially I was tempted to use one of those, but on further thought I realized that both HP LoadRunner and JMeter would not serve one of my goals: the benchmark results should be easily reproducible by anyone with minimal configuration. Every load testing tool (enterprise, open source or in-house) has its own unique features and requires expert-level knowledge as well as tool-specific configuration.

Benchmarking, as you know, is a true science, yet in most cases benchmarks have little true relationship to the performance of production systems. The intention of this benchmark was to allow anyone to replicate what I had done, and that required keeping the tests simple, easy to run and repeatable. The goal is not to produce a new benchmark standard, but rather to give an indicator of performance for a ‘typical’ system running MongoDB in the cloud.

Hence I decided to use the following code for testing, which anyone could use to generate simple Bulk.insert() operations inside a for loop. There was no need for complex and expensive industry-standard performance testing tools, no need for performance testers to come in and write complex parameterized and correlated test scripts, and no need for large performance testing environments.

// Insert `count` empty documents into db.c using unordered bulk batches of 1,000
function insert(count) {
    var every = 1000;
    var t = new Date(); // start time, handy for ad-hoc timing
    for (var i = 0; i < count; ) {
        var bulk = db.c.initializeUnorderedBulkOp();
        for (var j = 0; j < every; j++, i++) {
            bulk.insert({});
        }
        bulk.execute();
    }
}
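Calling the function from a mongo shell connected to the target server is then a one-liner; for example, a quick smoke test with one million documents (the count here is illustrative):

insert(1000000)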

Hardware Setup – In order to stress test the instance I needed to adequately provision networking between the MongoDB server and the load generator (client). Also, WT differs from MMAPv1 in that it makes heavy use of the CPUs, so adequately provisioning these is now important for MongoDB installations. I intended to use Amazon Web Services (AWS) to build the performance testing setup. After reviewing all the relevant EC2 compute options and filtering them for instances with 10 Gigabit and Enhanced Networking, the following 4 candidates remained.

[Image: the four candidate EC2 instance types with 10 Gigabit and Enhanced Networking]

Eventually I chose the c4.8xlarge EC2 instance due to its higher CPU clock speed and Intel Turbo Boost Technology. In addition, it is the only EBS-Optimized instance of the 4 options mentioned above.

Environment setup – The MongoDB Cloud Manager Automation solution (read more here – https://docs.cloud.mongodb.com/) lets us quickly deploy large-scale MongoDB clusters. In this case, a few clicks on a web page and about 10 minutes later, all the MongoDB infrastructure required for the benchmark was up and running.

Monitoring and analysis – I decided to use MongoDB Cloud Manager and AWS CloudWatch for monitoring of the infrastructure, with the objective of collecting information at a one-second resolution through iostat, sysmon, the MongoDB logs and WT Server Status statistics. For the analysis I also ended up using some time-series visualization tools (an internal tool being developed, which should be released by MongoDB sometime soon). The goal was to visualize a large number of different metrics from across all layers of the system – iostat, sysmon, the MongoDB logs and WT Server Status statistics side by side – using “sparkline” graphs, in a way that allows easy and accurate correlation of performance at every layer.
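As an aside, the one-second disk samples mentioned above can be captured with a plain iostat command running in the background; the exact flags below are my assumption rather than the ones used in this test:

# Extended device and CPU statistics, in MB, with timestamps, every second
iostat -xmt 1 > /var/tmp/iostat.out &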

Test Configuration #1 with Journaling On – The setup consisted of 16 mongo clients, each one inserting 3,125,000,000 documents, so overall 50 billion documents would be inserted.
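Each loader was just an independent mongo shell running the insert() function shown earlier. Below is a sketch of how such a fleet could be launched from the client instance (the host name and script file name are illustrative, and in the actual test each loader wrote to its own collection):

# Launch 16 loader shells in parallel, each inserting 3,125,000,000 documents
for i in $(seq 1 16); do
    mongo --host mongodb-server test --eval "load('insert.js'); insert(3125000000)" &
done
wait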

The MongoDB server setup for the test, and the resulting data sizes, are detailed below.

Data Compression:
> db.insert_demo_01.count()
3125000000

Compressed:
> db.insert_demo_01.stats()
{
    "ns" : "test.insert_demo_01",
    "count" : 3125000000,
    "size" : 68750000000,
    "avgObjSize" : 22,
    "storageSize" : 15036657664,
    "capped" : false,
    "wiredTiger" : {
        "metadata" : {
            "formatVersion" : 1
        },
        "block-manager" : {
            "file allocation unit size" : 4096,
            "blocks allocated" : 469102,
            "checkpoint size" : 15035752448,
            "allocations requiring file extension" : 466590,
            "blocks freed" : 3215,
            "file magic number" : 120897,
            "file major version number" : 1,
            "minor version number" : 0,
            "file bytes available for reuse" : 1105920,
            "file size in bytes" : 15036657664
        },
    "nindexes" : 1,
    "totalIndexSize" : 31461847040,
    "indexSizes" : {
        "_id_" : 31461847040
    },

Uncompressed:
> db.insert_demo_03.stats()
{
    "ns" : "test.insert_demo_03",
    "count" : 3125000000,
    "size" : 68750000000,
    "avgObjSize" : 22,
    "storageSize" : 90922614784,
    "capped" : false,
    "wiredTiger" : {
        "metadata" : {
            "formatVersion" : 1
        },
        "block-manager" : {
            "file allocation unit size" : 4096,
            "blocks allocated" : 2827200,
            "checkpoint size" : 90922704896,
            "allocations requiring file extension" : 2825278,
            "blocks freed" : 11650,
            "file magic number" : 120897,
            "file major version number" : 1,
            "minor version number" : 0,
            "file bytes available for reuse" : 98304,
            "file size in bytes" : 90922614784
        },
    "nindexes" : 1,
    "totalIndexSize" : 66059067392,
    "indexSizes" : {
        "_id_" : 66059067392
    },

Documents: 16 Loaders x 3,125,000,000 = 50,000,000,000 (50 Billion)

Compressed:
Data:
-rw-r--r-- 1 mongod mongod 15036657664 Apr 11 04:24 42--1201249500126757072.wt
Index:
-rw-r--r-- 1 mongod mongod 31461715968 Apr 11 02:53 45--1201249500126757072.wt

Data: 15036657664 Bytes x 16 collections = 240586522624 Bytes = 224 GB
Index: 31461715968 Bytes x 16 collections = 503387455488 Bytes = 468.8 GB
50 Billion documents = 693 GB compressed with zlib.

Uncompressed:
Data:
-rw-r--r-- 1 mongod mongod 90922614784 Apr 11 21:40 0-7873777076658133305.wt
Index:
-rw-r--r-- 1 mongod mongod 66059067392 Apr 11 21:40 1-7873777076658133305.wt

Data: 22 Byte Document x 3,125,000,000 = 68750000000 Bytes x 16 collections = 1100000000000 Bytes = 1024 GB
Index: 66059067392 Bytes x 16 collections = 1056945078272 Bytes = 984 GB
50 Billion documents = 2008 GB uncompressed

Ratio:
Data (Uncompressed/zlib compressed): 1024/224 = 4.57
Index (Uncompressed/zlib compressed): 984/468.8 = 2.1
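These ratios can also be re-derived directly from the collection stats shown earlier; a small mongo shell sketch using the same collection names:

var c = db.insert_demo_01.stats();   // zlib-compressed collection
var u = db.insert_demo_03.stats();   // uncompressed collection
print("data ratio:  " + (c.size / c.storageSize).toFixed(2));              // ~4.57
print("index ratio: " + (u.totalIndexSize / c.totalIndexSize).toFixed(2)); // ~2.1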

The MongoDB parameters used for this test are as follows:

net:
  maxIncomingConnections: 20000
  port: 27017
processManagement:
  fork: true
storage:
  dbPath: /data/db/myProcess
  engine: wiredTiger
  syncPeriodSecs: 30
  wiredTiger:
    collectionConfig:
      blockCompressor: zlib
    engineConfig:
      cacheSizeGB: 16
      directoryForIndexes: true
    indexConfig:
      prefixCompression: true
systemLog:
  destination: file
  path: /logs/mongodb.log
  quiet: true
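For completeness, applying a configuration file like this is just a matter of pointing mongod at it (the path below is illustrative):

# Start mongod with the YAML configuration shown above
mongod --config /etc/mongod.conf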

Results for Test Configuration #1 With Journaling On:

[Graph] Cloud Manager Opcounters, showing ~500K inserts/sec

[Graph] Client side Network (Bytes/Out): ~770 MB

[Graph] Server side CPU Utilization: ~60%

[Graph] Server side Network (Bytes/In): ~770 MB

[Graph] Data Volume – VolumeConsumedReadWriteOps (count): 8,500

[Graph] Data Volume – VolumeWriteBytes (Bytes): 53,000

[Graph] Data Volume – VolumeWriteOps (count): 2,600

[Graph] Index Volume – VolumeConsumedReadWriteOps (count): 20,000

[Graph] Index Volume – VolumeWriteBytes (Bytes): 55,000

[Graph] Index Volume – VolumeWriteOps (count): 5,000

[Graph] Journal Volume – VolumeConsumedReadWriteOps (count): 225,000

[Graph] Journal Volume – VolumeWriteBytes (Bytes): 60,000

[Graph] Journal Volume – VolumeWriteOps (count): 60,000

Performance Analysis – Analysis of the benchmark results suggests that the performance of the system is relatively steady. The IO and CPU metrics show that, for the given workload, performance is steady but low. The graphs suggest that the system can be pushed further and hasn't maxed out yet, so something is probably holding us back, most likely inside MongoDB. I used gdbprof.py (a stack-trace sample collection tool we are developing internally) to drill down and see exactly what is going on inside MongoDB.
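gdbprof.py itself is an internal tool, but broadly similar stack-trace samples can be collected with plain gdb in the spirit of the classic "poor man's profiler"; the loop below is only a rough sketch of that idea, not the actual tool:

# Take 10 one-second-spaced stack samples of all mongod threads
for i in $(seq 1 10); do
    gdb --batch -ex "set pagination 0" -ex "thread apply all bt" -p $(pidof mongod)
    sleep 1
done > mongod_stack_samples.txt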

Examining the profiler tree, 10 out of 16 loader threads are in the __wt_log_write function (which manages WT journaling), and this further breaks down into various yield, wait and mutex functions (not shown):

[Image: gdbprof profiler tree showing 10 of the 16 loader threads under __wt_log_write]

Right off the bat it looks like the WT commit + journaling path is a contention point where threads are mostly waiting for their turn. In the HTML version of the profiler output you can also see a Tufte “sparkline” graph showing that this situation is constant throughout the 10 seconds of sampling.

The other significant code paths are:

3.60   13.00   ││  │ ├mongo::Collection::insertDocument
3.60   13.00   ││  │ │ mongo::Collection::_insertDocument
3.00   13.00   ││  │ │ ├mongo::WiredTigerRecordStore::insertRecord

This means that only 10% of threads are actually working on the insert itself (this includes a small amount of index updates on the _id index). Roughly the same number of threads are working on reading data from the network:

4.70    7.00   │└mongo::MessagingPort::recv

The remaining threads are 11 different background threads plus the main thread. Running the experiment with 32 loaders highlighted something interesting: essentially all of the 16 new threads accumulate under __wt_log_write, while the rest of the profiler tree is similar to the experiment with 16 loader threads. This strongly indicates that this is the point of contention for this benchmark.

[Image: gdbprof profiler tree for the 32-loader run – the 16 new threads also accumulate under __wt_log_write]

The final proof, of course, is to disable WT journaling to verify that it removes the bottleneck.
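A minimal sketch of that configuration change, assuming the same YAML file shown above, would be:

storage:
  journal:
    enabled: false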

The second part of this series will be published next week here at Practical Performance Analyst.


Dmitry Agranat (LinkedIn) is currently a Technical Services Engineer at MongoDB. Dmitry is an experienced Performance Engineer who focuses on optimization of end-to-end system performance. His professional career started in 2005 at Mercury (now HP Software). During this period, he has played various roles in Performance Engineering (PE) with an extremely high focus on Application and Database Performance. He has served as Performance Doctor (fixing unhealthy systems), Performance Fireman (triage of production fires), Performance Paramedic (putting the system together by applying field triage, duct tape, whatever data, domain knowledge, best guess, wishful thinking or folklore is available to fix a problem) and Performance Plumber (removing blockages and searching for leaks). He has worked with many internal and external customers both to solve immediate performance problems and to educate on building a solid organizational performance management methodology. Dmitry believes that Performance Engineering (PE) is still not well understood, agreed upon, unified or quantifiable, and thus requires building solid and effective entrepreneurship around the Performance domain.

  • Rajneesh Mitharwal

    Hi Dmitry,
    I was also benchmarking Mongo 3.0.7 with WT for our workload. Our workload is write/read heavy with no document updates. To keep it close to our production, I used a document size (application-side JSON size) distribution of 5% 15 KB, 25% 100 Bytes, 70% 500 Bytes. I chose an i2.xlarge AWS instance (4 cores, 30 GB, 800 GB SSD) and I am not using any bulk inserts.
    During the benchmarking I bottlenecked on CPU utilization just after ~4K inserts/sec, while running the same test with Mongo 3.0.7 and the MMAPv1 storage engine I was able to reach the same throughput at lower CPU utilization, ~150% out of 400% of 4 cores.
    Can you suggest some tuning to me?
    I just set wiredTigerConcurrentWriteTransactions to 5 (no. of cores = 4, + 1).
    Now I am able to get a throughput of 9K inserts/sec but CPU utilization is still high.
    db.adminCommand( { setParameter: 1, wiredTigerConcurrentWriteTransactions: 5 } )

    • Gret

      Hi Rajneesh,

      It appears your issue is not directly related to MongoDB. You are using a single-threaded insert method, thus saturating a single CPU core. By increasing wiredTigerConcurrentWriteTransactions you can gain some improvement, but you will only see a real increase in throughput once you start using unordered Bulk() write operations.

  • Nachiket Kate

    Hi Dmitry,
    I read both parts of this benchmark. The insights are really nice! I have a few questions, such as why only a write-heavy workload and not a read-heavy one (I think Mongo is better in terms of reads than writes – I did Mongo/Cassandra benchmarking using YCSB a few months ago).
    Another question: you tested on SSDs, which is a less common case. A large set of deployments are still on disk-based storage, for which this benchmark would be less useful.
    In my opinion, a 22-byte document is a very small document and would affect the benchmark.
    I read that MongoDB 3.0 also supports an in-memory database feature. Can we have that benchmark in the next part? 🙂
    Please correct me if I am wrong.

    • Gret

      Hi Nachiket,

      Thanks for reading my post. I’ll break this answer up to address the different questions.

      “why only write heavy workload and why not read heavy” – In this exercise, I was evaluating the new WiredTiger storage engine which delivers 7-10X greater throughput for write-intensive applications.

      “I think Mongo is good in terms of read than write – I did mongo/cassandra benchmarking using YCSB few months ago” – This is incorrect; the new WiredTiger storage engine delivers 7-10x greater throughput for write-intensive applications.

      “Another question is like you tested it on SSDs which is less common case. large set of deployements are still on disk based storage for which this benchmark would be less useful.” – First of all, I showed that the MongoDB EC2 compute cost was around 0.05 cents per 1 million document inserts, with a daily storage cost of 2 cents per million documents (using very high IOPS storage), which is very impressive in terms of cost even for SSDs. Second, you need to understand how the disk access patterns for the WT btree work:

      Checkpoints (which are triggered at least once per minute) will write a lot of data, with no order guarantees, followed by an fsync. Eviction, which is triggered by heavy insert load or when the cache becomes full, will write smaller amounts of data to random locations.
      Data will be read into cache as required in small reads from random locations. Doing a scan over a newly-created collection or index should result in a mostly sequential scan of the file, but once there are updates and multiple checkpoints written, it becomes increasingly likely that key order will not match disk order.
      All of this means that SSDs will be more efficient than spinning disks for complex workloads, because there is no seek time.

      “According to me 22 bytes of document is very low sized document and would affect the benchmark.” – This was not exactly a benchmark, but rather an experiment with a workload similar to IoT, logging/monitoring, in-memory aggregation or time-series use cases, where documents are usually small (think of sensor data).

      “I read MongoDB 3.0 also supports in-memory database feature. Can we have that benchmark in next part?” – Can you point me to the source where you read this?

      As a side note, please avoid using YCSB where possible, as it only does key-value operations (MongoDB is not a key-value database).

      Finally, feel free to ping me in private if you’d like to further discuss this.