There are generally three components to any benchmark
project:
- Create the benchmark application
- Execute it
- Publish your results
I suspect many people want to run more benchmarks but give
up, since step 2 becomes extremely time consuming as you expand the
number of different configurations and scenarios.
I'm hoping that this blog post will encourage more people to
dive in and participate, as I'll be sharing the bash script I
used to test the various compression options coming in the MongoDB
3.0 storage engines. It enabled me to run a few different
tests against 8 different configurations, recording insertion
speed and size-on-disk for each one.
If you're into this sort of thing, please read on and provide any
feedback or improvements you can think of. You also might want to
grab a Snickers, as there is a lot to cover. I've commented along
the way so hopefully it is an interesting read. Also, links to
the full script and configuration files are at the bottom of the
blog. Let's get started!
#!/bin/bash
# remember the directory we are starting from;
# the script expects the MongoDB configuration files to be here
export homeDirectory=$PWD
# directory where MongoDB/TokuMX tarballs are located
export tarDirectory=${BACKUP_DIR}/mongodb
# directory used for MongoDB server binaries and data folder
export MONGO_DIR=~/temp
# perform some sanity checks
# check that $MONGO_DIR is defined
if [ -z "$MONGO_DIR" ]; then
echo "Need to set MONGO_DIR"
exit 1
fi
# check that $MONGO_DIR exists
if [ ! -d "$MONGO_DIR" ]; then
echo "Need to create directory $MONGO_DIR"
exit 1
fi
# check that $MONGO_DIR is empty
# force manual cleanup before starting
if [ "$(ls -A ${MONGO_DIR})" ]; then
echo "Directory $MONGO_DIR must be empty before starting"
exit 1
fi
I'm a big fan of two things at the top of all my scripts:
directory locations and sanity checks. The three directories
needed for this particular benchmark run are as follows:
- homeDirectory = The directory from where we are executing the script.
- tarDirectory = The directory where the tar files exist for the various MongoDB flavors/versions that we are benchmarking. You'll likely need to change this for your benchmarks.
- MONGO_DIR = The directory where we'll be unpacking the tar files (to execute the mongod binary), as well as creating a directory to store the data for the benchmark. Make sure this is on decent storage if you are running a performance benchmark; a single SATA drive isn't fast. You'll likely need to change this for your benchmarks.
The sanity checks follow: we want to make sure that $MONGO_DIR is
defined (just in case), that the $MONGO_DIR directory exists, and
that the $MONGO_DIR directory is empty. The empty check is one I
consider important; you might have something interesting in that
directory, and you should clear it out manually before starting
the benchmark.
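If you find yourself repeating these checks across scripts, they collapse nicely into a helper function. Here's a minimal sketch; `require_empty_dir` is a name I made up for illustration, not something from the original script:

```shell
#!/bin/bash
# Sketch: bundle the "defined, exists, empty" checks into one function.
require_empty_dir() {
  local dir=$1
  [ -n "$dir" ] || { echo "directory variable is not set"; return 1; }
  [ -d "$dir" ] || { echo "need to create directory $dir"; return 1; }
  [ -z "$(ls -A "$dir")" ] || { echo "$dir must be empty before starting"; return 1; }
}
```

The caller can then do `require_empty_dir "$MONGO_DIR" || exit 1` and keep the top of the script tidy.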
# decide which tarballs and configurations we want to benchmark
# each entry is a semicolon-delimited list: "tarball;id;config;mongo_type"
# tarball : MongoDB or TokuMX tarball
# id : Shorthand description of this particular benchmark run, ends up in the log file and the summary log
# config : YAML configuration file to use for this benchmark run
# mongo_type : Identifies which "type" of MongoDB, tokumx|mxse|wt|mongo
export benchmarkList=""
export benchmarkList="${benchmarkList} mongodb-linux-x86_64-tokumxse-1.0.0-rc.2.tgz;mxse_100rc2_none;tokumxse-uncompressed.conf;mxse"
export benchmarkList="${benchmarkList} mongodb-linux-x86_64-tokumxse-1.0.0-rc.2.tgz;mxse_100rc2_quicklz;tokumxse-quicklz.conf;mxse"
export benchmarkList="${benchmarkList} mongodb-linux-x86_64-tokumxse-1.0.0-rc.2.tgz;mxse_100rc2_zlib;tokumxse-zlib.conf;mxse"
export benchmarkList="${benchmarkList} mongodb-linux-x86_64-tokumxse-1.0.0-rc.2.tgz;mxse_100rc2_lzma;tokumxse-lzma.conf;mxse"
export benchmarkList="${benchmarkList} mongodb-linux-x86_64-3.0.0-rc8.tgz;mmapv1_300rc8;mmapv1.conf;mongo"
export benchmarkList="${benchmarkList} mongodb-linux-x86_64-3.0.0-rc8.tgz;wt_300rc8_none;wiredtiger-uncompressed.conf;wt"
export benchmarkList="${benchmarkList} mongodb-linux-x86_64-3.0.0-rc8.tgz;wt_300rc8_snappy;wiredtiger-snappy.conf;wt"
export benchmarkList="${benchmarkList} mongodb-linux-x86_64-3.0.0-rc8.tgz;wt_300rc8_zlib;wiredtiger-zlib.conf;wt"
Benchmarking is usually a single test run against multiple
scenarios, and this is the section where we define those
scenarios. The benchmarkList variable starts empty and is then
appended with one or more scenarios. Each scenario is broken into
4 segments, delimited by semicolons. The comment above is mostly
self-explanatory, but the fourth segment, mongo_type, is worth a
note: this script doesn't care which specific "MongoDB" you are
running, but other scripts I've created do, so I always define it
in case I need it somewhere else.
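As a hypothetical example of where mongo_type could come in handy elsewhere, a case statement can branch on it. The function and labels below are purely my own illustration; this script never inspects the field:

```shell
#!/bin/bash
# Illustrative only: map the mongo_type segment to a readable label.
# describe_mongo_type is not part of the original script.
describe_mongo_type() {
  case "$1" in
    tokumx) echo "TokuMX" ;;
    mxse)   echo "TokuMXse" ;;
    wt)     echo "WiredTiger" ;;
    mongo)  echo "stock MongoDB (MMAPv1)" ;;
    *)      echo "unknown" ;;
  esac
}

describe_mongo_type wt   # prints "WiredTiger"
```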
# make sure we have valid tarballs and config scripts for this benchmark run
echo "checking that all needed tarballs exist."
for thisBenchmark in ${benchmarkList}; do
TARBALL=$(echo "${thisBenchmark}" | cut -d';' -f1)
MONGOD_CONFIG=$(echo "${thisBenchmark}" | cut -d';' -f3)
if [ -e ${tarDirectory}/${TARBALL} ]; then
echo " located ${tarDirectory}/${TARBALL}"
else
echo " unable to locate ${tarDirectory}/${TARBALL}, exiting."
exit 1
fi
if [ -e ${MONGOD_CONFIG} ]; then
echo " located ${MONGOD_CONFIG}"
else
echo " unable to locate ${MONGOD_CONFIG}, exiting."
exit 1
fi
done
More sanity checking here. Before running any benchmarks we want
to make sure that all the tar files and configuration files
actually exist on the server. Nothing is more disappointing than
starting a long-running series of benchmarks only to come back in
a day and find that some of them failed because of a typo or a
missing file.
export DB_NAME=test
export NUM_CLIENTS=2
export DOCS_PER_CLIENT=$((512 * 80000))
export NUM_INSERTS=$((NUM_CLIENTS * DOCS_PER_CLIENT))
export SUMMARY_LOG_NAME=summary.log
rm -f ${SUMMARY_LOG_NAME}
This section allows some control over the benchmark itself, plus
gives us information needed for interpreting some of the
results.
- DB_NAME = The MongoDB database we'll be inserting into.
- NUM_CLIENTS = The number of simultaneous insert clients. You can set this to any value >= 1; if you set it to < 1 you'll still get a single insert client.
- DOCS_PER_CLIENT = The number of documents a single client will insert. This is multiplied by NUM_CLIENTS to find the total number of inserts (NUM_INSERTS), and is needed to calculate inserts per second later in the script. The value of 512 * 80000 is taken directly from the JavaScript code; I'd normally inject it for the benchmark but didn't due to a lack of time.
- NUM_INSERTS = Total number of inserts for the benchmark. A cooler way to do this would be to get a count from the collection itself, but that might take a while if an exact count is important and the particular storage engine supports document-level locking. And remember, benchmarking isn't always about being cool; efficiency counts too.
- SUMMARY_LOG_NAME = A single log file that will contain all results, summarized. And yes, delete it if it exists.
for thisBenchmark in ${benchmarkList}; do
export TARBALL=$(echo "${thisBenchmark}" | cut -d';' -f1)
export MINI_BENCH_ID=$(echo "${thisBenchmark}" | cut -d';' -f2)
export MONGOD_CONFIG=$(echo "${thisBenchmark}" | cut -d';' -f3)
export MONGO_TYPE=$(echo "${thisBenchmark}" | cut -d';' -f4)
echo "benchmarking tarball = ${TARBALL}"
Here we start the loop that benchmarks each scenario, grabbing
each entry and cutting it into its four components, then giving
the user a heads-up as to which TARBALL we're benchmarking this
time.
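As an aside, the four cut calls can be collapsed into a single read by setting IFS to the delimiter for just that command. A sketch using one of the scenario strings from above:

```shell
#!/bin/bash
# Split a semicolon-delimited scenario string in one step.
thisBenchmark="mongodb-linux-x86_64-3.0.0-rc8.tgz;wt_300rc8_zlib;wiredtiger-zlib.conf;wt"

# IFS is changed only for the duration of the read command.
IFS=';' read -r TARBALL MINI_BENCH_ID MONGOD_CONFIG MONGO_TYPE <<< "${thisBenchmark}"

echo "${TARBALL} / ${MINI_BENCH_ID} / ${MONGOD_CONFIG} / ${MONGO_TYPE}"
```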
# clean up + start the new server
pushd ${MONGO_DIR}
if [ "$?" -eq 1 ]; then
echo "Unable to pushd $MONGO_DIR, exiting."
exit 1
fi
# erase any files from the previous run
rm -rf *
# untar server binaries to here
tar xzvf ${tarDirectory}/${TARBALL} --strip 1
# create the "data" directory
mkdir data
bin/mongod --config ${homeDirectory}/${MONGOD_CONFIG}
popd
Did I mention how defensive I try to write these benchmarking
scripts? Maybe paranoid is a better term. Earlier we confirmed
that MONGO_DIR is defined, exists as a directory, and is empty.
Guess what? Something might go terribly wrong during the
benchmark and that might no longer be the case. So right after
changing to the MONGO_DIR directory using pushd, check that pushd
succeeded. Erase any existing files in the directory, untar the
current benchmark's tarball, create a data folder, start MongoDB
with the current scenario's configuration file, and popd back to
our starting directory.
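On the paranoia theme, bash's set -e is another option: it aborts the script as soon as any command fails, which can replace a lot of hand-rolled "$?" checks (though set -e has well-known edge cases, so I'd still keep the explicit checks for the important steps). A self-contained sketch, run in a subshell so the failure is contained:

```shell
#!/bin/bash
# With set -e, a failed cd (or pushd) aborts the script automatically.
bash -c 'set -e; cd /this/path/does/not/exist 2>/dev/null; echo "never reached"' \
  || echo "script aborted as expected"
# prints "script aborted as expected"
```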
# wait for mongo to start
while [ 1 ]; do
$MONGO_DIR/bin/mongostat -n 1 > /dev/null 2>&1
if [ "$?" -eq 0 ]; then
break
fi
sleep 5
done
sleep 5
We are starting mongod forked, so the MongoDB server isn't
immediately available. This loop executes until the mongostat
utility successfully returns data, letting us know that the
server is running.
Any ideas on a cleaner way to do this?
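One improvement I can offer on my own question: bound the loop with a retry limit, so a mongod that never comes up fails the run instead of hanging it forever. A sketch; `wait_for` is a helper name I invented, demonstrated here with `true` as a stand-in command:

```shell
#!/bin/bash
# Retry a command up to N times, one second apart; fail if it never succeeds.
wait_for() {
  local tries=$1
  shift
  local i
  for ((i = 1; i <= tries; i++)); do
    "$@" > /dev/null 2>&1 && return 0
    sleep 1
  done
  return 1
}

# In the script this would be something like:
#   wait_for 60 "$MONGO_DIR/bin/mongostat" -n 1 || { echo "mongod never started"; exit 1; }
wait_for 3 true && echo "command succeeded"
```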
# log for this run
export LOG_NAME=${MINI_BENCH_ID}-${NUM_CLIENTS}-${NUM_INSERTS}.log
rm -f ${LOG_NAME}
Create a custom log file for this particular scenario.
# TODO : log server performance with mongostat
If you've ever attended one of my benchmark presentations you've
likely heard me say that benchmarking is never done; there is
always more to measure and analyze. This script currently records
overall (cumulative) inserts per second, catching mongostat
output along the way would allow for creating a pretty graph over
time. I highly recommend picking a way to add "to-do" tasks to
your scripts and code, mine is as simple as "TODO : ".
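The TODO might look something like the pattern below: start a sampler in the background, run the workload, then kill the sampler when the inserts finish. Here `date` stands in for mongostat so the sketch is self-contained; in the real script the backgrounded command would be `$MONGO_DIR/bin/mongostat` redirected to a log file:

```shell
#!/bin/bash
# Background-sampler pattern: roughly one sample per second into a log file.
statLog=$(mktemp)
( while true; do date +%s; sleep 1; done ) > "$statLog" &
SAMPLER_PID=$!

sleep 3                       # stand-in for the insert workload
kill "$SAMPLER_PID"

samples=$(wc -l < "$statLog" | tr -d ' ')
echo "collected ${samples} samples"
rm -f "$statLog"
```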
# start the first inserter
T="$(date +%s)"
echo "`date` | starting insert client 1" | tee -a ${LOG_NAME}
$MONGO_DIR/bin/mongo ${DB_NAME} --eval 'load("./compress_test.js")' &
sleep 5
This particular benchmark is simple JavaScript, so we execute it
using the mongo shell. Prior to starting the client we grab the
current time (seconds since the epoch) so we can later calculate
the overall inserts per second. I include a "sleep 5" after this
first client since it might take a bit of time for the collection
to get created; I've found it's always safest to let the first
insert client get started on its own.
Again, thanks to Adam Comerford for sharing this benchmark.
# start the additional insert clients
clientNumber=2
while [ ${clientNumber} -le ${NUM_CLIENTS} ]; do
echo "`date` | starting insert client ${clientNumber}" | tee -a ${LOG_NAME}
$MONGO_DIR/bin/mongo ${DB_NAME} --eval 'load("./compress_test.js")' &
let clientNumber=clientNumber+1
done
If we are running 2 or more insert clients, each additional
client gets started by this loop.
# wait for all of the client(s) to finish
wait
I only learned about the wait command a few months ago, and it is
extremely useful. It causes our script to pause (wait) until all
of the child processes we created have finished. So in this
example, each of the insert clients finishes before the script
continues.
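A minimal, self-contained demonstration of the behavior: three background sleeps of different lengths, and execution resumes only after the longest one finishes.

```shell
#!/bin/bash
start=$(date +%s)
sleep 1 &
sleep 2 &
sleep 3 &
wait                           # blocks until all three children exit
elapsed=$(( $(date +%s) - start ))
echo "all children finished after ${elapsed}s"   # ~3s total, not 1+2+3=6s
```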
# report insert performance
T="$(($(date +%s)-T))"
printf "`date` | insert duration = %02d:%02d:%02d:%02d\n" "$((T/86400))" "$((T/3600%24))" "$((T/60%60))" "$((T%60))" | tee -a ${LOG_NAME}
DOCS_PER_SEC=`echo "scale=0; ${NUM_INSERTS}/${T}" | bc `
echo "`date` | inserts per second = ${DOCS_PER_SEC}" | tee -a ${LOG_NAME}
Now that the inserts are finished, we find the number of elapsed
seconds by subtracting our starting time from the current time
(again, seconds since the epoch). Calculating inserts per second
is then as simple as dividing the number of inserts by the number
of seconds.
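The printf in the script also converts the raw seconds into a days:hours:minutes:seconds stamp. The arithmetic checks out against a hand-picked value: 93784 seconds is 1 day, 2 hours, 3 minutes, 4 seconds.

```shell
#!/bin/bash
T=93784   # 1*86400 + 2*3600 + 3*60 + 4
printf "%02d:%02d:%02d:%02d\n" "$((T/86400))" "$((T/3600%24))" "$((T/60%60))" "$((T%60))"
# prints 01:02:03:04
```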
# stop the server
T="$(date +%s)"
echo "`date` | shutting down the server" | tee -a ${LOG_NAME}
$MONGO_DIR/bin/mongo admin --eval "db.shutdownServer({force: true})"
# wait for the MongoDB server to shutdown
while [ 1 ]; do
pgrep -U $USER mongod > /dev/null 2>&1
if [ "$?" -eq 1 ]; then
break
fi
sleep 5
done
T="$(($(date +%s)-T))"
printf "`date` | shutdown duration = %02d:%02d:%02d:%02d\n" "$((T/86400))" "$((T/3600%24))" "$((T/60%60))" "$((T%60))" | tee -a ${LOG_NAME}
Prior to calculating size on disk I like to stop the server,
since that allows each storage engine to perform cleanup, flush
old log files, and shut down cleanly. I also like to time the
operation. It's always bothered me that the MongoDB server
shutdown process is asynchronous; the client requesting the
shutdown is immediately disconnected with an unfriendly warning
message (which one might mistake for an error).
In any event, the loop immediately following the
db.shutdownServer() call is there to wait for the mongod process
to disappear. Until it does, MongoDB is not really stopped.
Any ideas on how to improve this?
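One answer to my own question: if you capture mongod's PID at startup (mongod can write one out via --pidfilepath), then kill -0 can poll that single known process instead of pgrep-ing everything, and a one-second interval notices the exit faster. A sketch using a short-lived background process as a stand-in for mongod:

```shell
#!/bin/bash
# Stand-in for mongod: a background process whose PID we know.
sleep 2 &
MONGOD_PID=$!   # in the real script, this would come from mongod's pid file

# Poll until the process is gone; kill -0 only checks for existence,
# it sends no signal.
while kill -0 "$MONGOD_PID" 2>/dev/null; do
  sleep 1
done
echo "process ${MONGOD_PID} has exited"
```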
# report size on disk
SIZE_BYTES=`du -c --block-size=1 ${MONGO_DIR}/data | tail -n 1 | cut -f1`
SIZE_MB=`echo "scale=2; ${SIZE_BYTES}/(1024*1024)" | bc `
echo "`date` | post-load sizing (SizeMB) = ${SIZE_MB}" | tee -a ${LOG_NAME}
Find and report the total megabytes in the data directory
(dbPath). I usually report only on the specific collection and
its indexes; this approach is simpler in that it includes the
entire data directory.
# put all the information into the summary log file
echo "`date` | tech = ${MINI_BENCH_ID} | ips = ${DOCS_PER_SEC} | sizeMB = ${SIZE_MB}" | tee -a ${SUMMARY_LOG_NAME}
done
Having all the results go to a single summary log file makes it
easy to interpret and graph your results.
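For graphing, the key = value fields pull out of the summary lines easily with awk. A sketch against one hard-coded line in the format produced above (the date and numbers are made up for illustration):

```shell
#!/bin/bash
line='Mon Mar  2 12:00:00 UTC 2015 | tech = wt_300rc8_zlib | ips = 25000 | sizeMB = 1234.56'

# Split on " | ", then keep only the value side of each "key = value" pair.
parsed=$(echo "$line" | awk -F' [|] ' '{
  split($2, a, " = "); split($3, b, " = "); split($4, c, " = ");
  print a[2], b[2], c[2]
}')
echo "$parsed"   # prints: wt_300rc8_zlib 25000 1234.56
```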
So there you have it. Download the script and configuration
files, make some changes, and run a few tests for yourself. Oh,
and give me some feedback if you can think of areas where the
above can be improved.
You are well on your way to your benchmarking black belt!
Links to everything you'll need to try this at
home.
- run.benchmark.bash (the script we picked apart in this blog)
- mmapv1.conf
- wiredtiger-uncompressed.conf
- wiredtiger-snappy.conf
- wiredtiger-zlib.conf
- tokumxse-uncompressed.conf
- tokumxse-quicklz.conf
- tokumxse-zlib.conf
- tokumxse-lzma.conf
- compress_test.js