How to benchmark MongoDB

There are generally three components to any benchmark project:

  1. Create the benchmark application
  2. Execute it
  3. Publish your results

I suspect many people want to run more benchmarks but give up, since step 2 becomes extremely time-consuming as you expand the number of different configurations and scenarios.

I'm hoping that this blog post will encourage more people to dive-in and participate, as I'll be sharing the bash script I used to test the various compression options coming in the MongoDB 3.0 storage engines. It enabled me to run a few different tests against 8 different configurations, recording insertion speed and size-on-disk for each one.

If you're into this sort of thing, please read on and share any feedback or improvements you can think of. You might also want to grab a Snickers, as there is a lot to cover. I've commented along the way, so hopefully it's an interesting read. Also, links to the full script and configuration files are at the bottom of the blog. Let's get started!

#!/bin/bash

# remember the directory we are starting from
# the script expects the MongoDB configuration files
export homeDirectory=$PWD

# directory where MongoDB/TokuMX tarballs are located
export tarDirectory=${BACKUP_DIR}/mongodb

# directory used for MongoDB server binaries and data folder
export MONGO_DIR=~/temp

# perform some sanity checks

# check that $MONGO_DIR is defined
if [ -z "$MONGO_DIR" ]; then
    echo "Need to set MONGO_DIR"
    exit 1
fi

# check that $MONGO_DIR exists
if [ ! -d "$MONGO_DIR" ]; then
    echo "Need to create directory $MONGO_DIR"
    exit 1
fi

# check that $MONGO_DIR is empty
# force manual cleanup before starting
if [ "$(ls -A ${MONGO_DIR})" ]; then
    echo "Directory $MONGO_DIR must be empty before starting"
    exit 1
fi


I'm a big fan of two things at the top of all my scripts: directory locations and sanity checks. The three directories needed for this particular benchmark run are as follows:

  • homeDirectory = The directory from where we are executing the script.
  • tarDirectory = The directory where the tar files exist for the various MongoDB flavors/versions that we are benchmarking. You'll likely need to change this for your benchmarks.
  • MONGO_DIR = The directory where we'll be unpacking the tar files (to execute the mongod binary) as well as creating a directory for storing the data for the benchmark. Make sure this is on decent storage if you are running a performance benchmark; a single SATA drive isn't fast. You'll likely need to change this for your benchmarks.

The sanity checks follow: we want to make sure that $MONGO_DIR is defined (just in case), that the $MONGO_DIR directory exists, and that it is empty. The empty check is one I think is important; you might have something interesting in that directory, and you should manually clear it out before starting the benchmark.


# decide which tarballs and configurations we want to benchmark
# use semi-colon list of "tarball;id;config;mongo_type"
# tarball : MongoDB or TokuMX tarball
# id : Shorthand description of this particular benchmark run, ends up in the log file and the summary log
# config : YAML configuration file to use for this benchmark run
# mongo_type : Identifies which "type" of MongoDB, tokumx|mxse|wt|mongo
export benchmarkList=""
export benchmarkList="${benchmarkList} mongodb-linux-x86_64-tokumxse-1.0.0-rc.2.tgz;mxse_100rc2_none;tokumxse-uncompressed.conf;mxse"
export benchmarkList="${benchmarkList} mongodb-linux-x86_64-tokumxse-1.0.0-rc.2.tgz;mxse_100rc2_quicklz;tokumxse-quicklz.conf;mxse"
export benchmarkList="${benchmarkList} mongodb-linux-x86_64-tokumxse-1.0.0-rc.2.tgz;mxse_100rc2_zlib;tokumxse-zlib.conf;mxse"
export benchmarkList="${benchmarkList} mongodb-linux-x86_64-tokumxse-1.0.0-rc.2.tgz;mxse_100rc2_lzma;tokumxse-lzma.conf;mxse"
export benchmarkList="${benchmarkList} mongodb-linux-x86_64-3.0.0-rc8.tgz;mmapv1_300rc8;mmapv1.conf;mongo"
export benchmarkList="${benchmarkList} mongodb-linux-x86_64-3.0.0-rc8.tgz;wt_300rc8_none;wiredtiger-uncompressed.conf;wt"
export benchmarkList="${benchmarkList} mongodb-linux-x86_64-3.0.0-rc8.tgz;wt_300rc8_snappy;wiredtiger-snappy.conf;wt"
export benchmarkList="${benchmarkList} mongodb-linux-x86_64-3.0.0-rc8.tgz;wt_300rc8_zlib;wiredtiger-zlib.conf;wt"


Benchmarking is usually a single test run against multiple scenarios, and this is the section where we define those scenarios. The benchmarkList variable starts empty and is then appended with one or more scenarios. Each scenario is broken into 4 segments, delimited by semi-colons. The comment above is mostly self-explanatory, but the fourth segment, mongo_type, deserves a mention: this script doesn't care which specific "MongoDB" you are running, but others I've created do, so I always define it in case I need it somewhere else.
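As an aside, the four fields of a scenario entry can also be split in a single step with IFS-based word splitting, rather than four separate cut calls. A minimal sketch, reusing one of the entries from the list above:

```shell
# split a semicolon-delimited scenario entry into its four fields at once
thisBenchmark="mongodb-linux-x86_64-3.0.0-rc8.tgz;wt_300rc8_snappy;wiredtiger-snappy.conf;wt"
IFS=';' read -r TARBALL MINI_BENCH_ID MONGOD_CONFIG MONGO_TYPE <<< "${thisBenchmark}"

echo "tarball = ${TARBALL}"        # → mongodb-linux-x86_64-3.0.0-rc8.tgz
echo "mongo_type = ${MONGO_TYPE}"  # → wt
```

Either approach works; cut is arguably easier to read at a glance, which matters in a script other people will be modifying.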

# make sure we have valid tarballs and config scripts for this benchmark run
echo "checking that all needed tarballs exist."
for thisBenchmark in ${benchmarkList}; do
    TARBALL=$(echo "${thisBenchmark}" | cut -d';' -f1)
    MONGOD_CONFIG=$(echo "${thisBenchmark}" | cut -d';' -f3)

    if [ -e ${tarDirectory}/${TARBALL} ]; then
        echo " located ${tarDirectory}/${TARBALL}"
    else
        echo " unable to locate ${tarDirectory}/${TARBALL}, exiting."
        exit 1
    fi

    if [ -e ${MONGOD_CONFIG} ]; then
        echo " located ${MONGOD_CONFIG}"
    else
        echo " unable to locate ${MONGOD_CONFIG}, exiting."
        exit 1
    fi
done


More sanity checking here. Before running any benchmarks we want to make sure that all the tar files and configuration files actually exist on the server. Nothing is more disappointing than starting a long-running series of benchmarks only to come back in a day and find that some of them failed because of a typo or a missing file.

export DB_NAME=test
export NUM_CLIENTS=2
export DOCS_PER_CLIENT=$((512 * 80000))
export NUM_INSERTS=$((NUM_CLIENTS * DOCS_PER_CLIENT))
export SUMMARY_LOG_NAME=summary.log
rm -f ${SUMMARY_LOG_NAME}


This section allows some control over the benchmark itself, plus gives us information needed for interpreting some of the results.

  • DB_NAME = The MongoDB database we'll be inserting into.
  • NUM_CLIENTS = The number of simultaneous insert clients. You can set this to any value >= 1; if you set it to less than 1 you'll still get a single insert client.
  • DOCS_PER_CLIENT = The number of documents a single client will insert. This is multiplied by NUM_CLIENTS to find the total number of inserts (NUM_INSERTS), which is needed to calculate inserts per second later in the script. The value of 512 * 80000 is taken directly from the JavaScript code; I'd normally inject it into the benchmark, but didn't due to a lack of time.
  • NUM_INSERTS = Total number of inserts for the benchmark. A cooler way to do this would be to get a count from the collection itself, but that might take a while if an exact count is important and the particular storage engine supports document-level locking. And remember, benchmarking isn't always about being cool; efficiency counts too.
  • SUMMARY_LOG_NAME = A single log file that will contain all results, summarized. And yes, delete it if it exists.


for thisBenchmark in ${benchmarkList}; do
export TARBALL=$(echo "${thisBenchmark}" | cut -d';' -f1)
export MINI_BENCH_ID=$(echo "${thisBenchmark}" | cut -d';' -f2)
export MONGOD_CONFIG=$(echo "${thisBenchmark}" | cut -d';' -f3)
export MONGO_TYPE=$(echo "${thisBenchmark}" | cut -d';' -f4)

echo "benchmarking tarball = ${TARBALL}"


Here we start the loop that benchmarks each scenario, cutting each entry into its four components. We also give the user a heads-up as to which TARBALL we're benchmarking this time.

    # clean up + start the new server

pushd ${MONGO_DIR}
if [ "$?" -ne 0 ]; then
    echo "Unable to pushd $MONGO_DIR, exiting."
    exit 1
fi

# erase any files from the previous run
rm -rf *

# untar server binaries to here
tar xzvf ${tarDirectory}/${TARBALL} --strip 1

# create the "data" directory
mkdir data
bin/mongod --config ${homeDirectory}/${MONGOD_CONFIG}
popd


Did I mention how defensively I try to write these benchmarking scripts? Maybe paranoid is a better term. Earlier we confirmed that MONGO_DIR is defined, exists as a directory, and is empty. Guess what? Something might go terribly wrong during the benchmark, and that might no longer be the case. So right after changing to the MONGO_DIR directory using pushd, we check that pushd succeeded (a non-zero exit status means it failed). We then erase any existing files in the directory, untar the current benchmark's tarball, create a data folder, start MongoDB with the current scenario's configuration file, and popd back to our starting directory.

    # wait for mongo to start
while true; do
    $MONGO_DIR/bin/mongostat -n 1 > /dev/null 2>&1
    if [ "$?" -eq 0 ]; then
        break
    fi
    sleep 5
done
sleep 5


We are starting mongod forked, so the MongoDB server isn't yet available. This code executes until the mongostat utility returns data, letting us know that the server is running.

Any ideas on a cleaner way to do this?
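One improvement I can think of myself: bound the wait with a timeout, so a mongod that fails to start doesn't hang the whole run. A sketch of a generic retry helper; the mongostat invocation in the comment at the bottom is the intended use, and the 120-second limit is just a number I picked:

```shell
# wait_for: retry a command (silently) until it succeeds, giving up
# after roughly the specified number of seconds
wait_for() {
    local timeout=$1
    shift
    local waited=0
    until "$@" > /dev/null 2>&1; do
        if [ "${waited}" -ge "${timeout}" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# intended use, replacing the unbounded loop above:
# wait_for 120 $MONGO_DIR/bin/mongostat -n 1 || { echo "mongod never came up"; exit 1; }
```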

    # log for this run
export LOG_NAME=${MINI_BENCH_ID}-${NUM_CLIENTS}-${NUM_INSERTS}.log
rm -f ${LOG_NAME}


Create a custom log file for this particular scenario.

    # TODO : log server performance with mongostat


If you've ever attended one of my benchmark presentations, you've likely heard me say that benchmarking is never done; there is always more to measure and analyze. This script currently records overall (cumulative) inserts per second; catching mongostat output along the way would allow for creating a pretty graph over time. I highly recommend picking a way to add "to-do" tasks to your scripts and code, mine is as simple as "TODO : ".
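If I were to tackle that TODO, the pattern would be a background logger started before the insert clients and killed once they finish. A sketch, with a date call standing in where the real version would run mongostat:

```shell
# sketch: collect periodic stats in the background for the duration of a run;
# the date call is a stand-in for something like: $MONGO_DIR/bin/mongostat -n 1
log_stats() {
    while true; do
        date >> "$1"
        sleep 10
    done
}

log_stats mongostat.log &
statsPid=$!

# ... insert clients run here, followed by wait ...

kill ${statsPid} 2>/dev/null
```

Note that a plain wait would now also wait on the logger, so you'd want to kill it (or wait on the specific client PIDs) before the reporting step.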

    # start the first inserter
T="$(date +%s)"
echo "`date` | starting insert client 1" | tee -a ${LOG_NAME}
$MONGO_DIR/bin/mongo ${DB_NAME} --eval 'load("./compress_test.js")' &
sleep 5


This particular benchmark is simple JavaScript, so we execute it using the mongo shell. Prior to starting the client we grab the current time (the number of seconds since the epoch) so we can later calculate the total inserts per second. I include a "sleep 5" after this first client since it might take a bit of time for the collection to get created; I've found it's always safest to let the first insert client get started on its own.

Again, thanks to Adam Comerford for sharing this benchmark.

    # start the additional insert clients
clientNumber=2
while [ ${clientNumber} -le ${NUM_CLIENTS} ]; do
    echo "`date` | starting insert client ${clientNumber}" | tee -a ${LOG_NAME}
    $MONGO_DIR/bin/mongo ${DB_NAME} --eval 'load("./compress_test.js")' &
    let clientNumber=clientNumber+1
done


If we are running 2 or more insert clients then each gets started with this loop.

    # wait for all of the client(s) to finish
wait


I only learned about the wait command a few months ago, and it is extremely useful. It causes our script to pause (wait) until any child processes we created have finished. So in this example, each of the insert clients will finish before the script continues.
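A tiny self-contained demonstration of the behavior: both background jobs must finish before the line after wait runs.

```shell
# wait blocks until every background child has exited
start=$(date +%s)
sleep 2 &
sleep 1 &
wait
elapsed=$(( $(date +%s) - start ))
echo "all children finished after ${elapsed}s"   # elapsed will be at least 2
```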

    # report insert performance
T="$(($(date +%s)-T))"
printf "`date` | insert duration = %02d:%02d:%02d:%02d\n" "$((T/86400))" "$((T/3600%24))" "$((T/60%60))" "$((T%60))" | tee -a ${LOG_NAME}
DOCS_PER_SEC=`echo "scale=0; ${NUM_INSERTS}/${T}" | bc `
echo "`date` | inserts per second = ${DOCS_PER_SEC}" | tee -a ${LOG_NAME}


Now that the inserts are finished, we find the number of elapsed seconds by subtracting our starting time from the current time (again in seconds since the epoch). Calculating inserts per second is then as simple as dividing the number of inserts by the number of seconds.
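For what it's worth, the same calculation works with bash's built-in integer arithmetic, which truncates just like bc at scale=0. The numbers below reuse the script's 2 x (512 * 80000) insert count with a made-up elapsed time:

```shell
# inserts per second via shell arithmetic (no bc needed for integer division)
NUM_INSERTS=$((2 * 512 * 80000))   # NUM_CLIENTS * DOCS_PER_CLIENT = 81920000
T=245                              # hypothetical elapsed seconds
DOCS_PER_SEC=$(( NUM_INSERTS / T ))
echo "inserts per second = ${DOCS_PER_SEC}"   # → 334367
```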

    # stop the server
T="$(date +%s)"
echo "`date` | shutting down the server" | tee -a ${LOG_NAME}
$MONGO_DIR/bin/mongo admin --eval "db.shutdownServer({force: true})"

# wait for the MongoDB server to shutdown
while true; do
    pgrep -U $USER mongod > /dev/null 2>&1
    if [ "$?" -eq 1 ]; then
        break
    fi
    sleep 5
done
T="$(($(date +%s)-T))"
printf "`date` | shutdown duration = %02d:%02d:%02d:%02d\n" "$((T/86400))" "$((T/3600%24))" "$((T/60%60))" "$((T%60))" | tee -a ${LOG_NAME}


Prior to calculating size on disk I like to stop the server, since that allows each storage engine to perform cleanup, flush old log files, and shut down cleanly. I also like to time the operation. It's always bothered me that the MongoDB server shutdown process is asynchronous; the client requesting the shutdown is immediately disconnected with an unfriendly warning message (which one might mistake for an error).

In any event, the loop immediately following the db.shutdownServer() call is there to wait for the mongod process to disappear. Until it does, MongoDB is not really stopped.

Any ideas on how to improve this?
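One improvement I'd consider: track mongod's PID at startup and bound the wait, rather than polling pgrep forever. A sketch of a generic helper (kill -0 only tests for the process's existence, it sends no signal); capturing mongod's PID and the 300-second limit are assumptions on my part:

```shell
# wait_for_exit: poll until a PID disappears, or give up after a timeout
wait_for_exit() {
    local pid=$1 timeout=$2 waited=0
    while kill -0 "${pid}" 2>/dev/null; do
        if [ "${waited}" -ge "${timeout}" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# intended use (hypothetical): capture mongod's PID at startup, then
# wait_for_exit ${mongodPid} 300 || echo "server did not stop cleanly"
```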

    # report size on disk
SIZE_BYTES=`du -c --block-size=1 ${MONGO_DIR}/data | tail -n 1 | cut -f1`
SIZE_MB=`echo "scale=2; ${SIZE_BYTES}/(1024*1024)" | bc `
echo "`date` | post-load sizing (SizeMB) = ${SIZE_MB}" | tee -a ${LOG_NAME}


Find and report the total megabytes of the data directory (dbPath). I usually only report on the specific collection and its indexes; this is simpler in that it includes the entire data directory.

    # put all the information into the summary log file
echo "`date` | tech = ${MINI_BENCH_ID} | ips = ${DOCS_PER_SEC} | sizeMB = ${SIZE_MB}" | tee -a ${SUMMARY_LOG_NAME}
done


Having all the results go to a single summary log file makes it easy to interpret and graph your results.



So there you have it. Download the script and configuration files, make some changes, and run a few tests for yourself. Oh, and give me some feedback if you can think of areas where I can improve the above.

You are well on your way to your benchmarking black belt!


Links to everything you'll need to try this at home.