Mutilate with scons patch
Daniel Byrne 9e7fa3d1e9 shm
2022-05-25 18:43:34 -04:00

Mutilate

Mutilate is a memcached load generator designed for high request rates, good tail-latency measurements, and realistic request stream generation.

Requirements

  1. A C++0x (C++11) compiler
  2. scons
  3. libevent
  4. gengetopt
  5. zeromq (optional)

Mutilate has only been thoroughly tested on Ubuntu 11.10. We'll flesh out compatibility over time.

Building

apt-get install scons libevent-dev gengetopt libzmq-dev
scons

Basic Usage

Type './mutilate -h' for a full list of command-line options. At minimum, a server must be specified.

$ ./mutilate -s localhost
#type       avg     min     1st     5th    10th    90th    95th    99th
read       52.4    41.0    43.1    45.2    48.1    55.8    56.6    71.5
update      0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0
op_q        1.5     1.0     1.0     1.1     1.1     1.9     2.0     2.0

Total QPS = 18416.6 (92083 / 5.0s)

Misses = 0 (0.0%)

RX   22744501 bytes :    4.3 MB/s
TX    3315024 bytes :    0.6 MB/s

Mutilate reports the latency (average, minimum, and various percentiles) for get and set commands, as well as achieved QPS and network goodput.
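As a rough illustration of how these columns relate to raw measurements, the sketch below computes the average, minimum, and nearest-rank percentiles from per-request latencies in microseconds, plus achieved QPS and goodput. It is not mutilate's implementation; mutilate's samplers may bin or interpolate differently. The names `LatencySummary`, `summarize`, `qps`, and `mb_per_sec` are hypothetical. The goodput helper divides by 2^20 because the sample output above is consistent with that convention: 22744501 bytes / 5.0 s / 2^20 ≈ 4.34, printed as 4.3 MB/s.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Columns of the latency report (all values in microseconds).
struct LatencySummary {
  double avg, min, p1, p5, p10, p90, p95, p99;
};

// Nearest-rank percentile of an ascending-sorted, non-empty vector:
// the smallest sample with at least p percent of samples <= it.
double percentile(const std::vector<double>& sorted, double p) {
  assert(!sorted.empty() && p > 0.0 && p <= 100.0);
  auto rank = static_cast<std::size_t>(std::ceil(p * sorted.size() / 100.0));
  return sorted[std::max<std::size_t>(rank, 1) - 1];
}

// Takes samples by value so it can sort its own copy.
LatencySummary summarize(std::vector<double> samples) {
  assert(!samples.empty());
  std::sort(samples.begin(), samples.end());
  double avg = std::accumulate(samples.begin(), samples.end(), 0.0) / samples.size();
  return {avg,
          samples.front(),
          percentile(samples, 1),
          percentile(samples, 5),
          percentile(samples, 10),
          percentile(samples, 90),
          percentile(samples, 95),
          percentile(samples, 99)};
}

// Achieved throughput: completed requests / elapsed seconds.
double qps(std::uint64_t requests, double seconds) { return requests / seconds; }

// Goodput in MB/s, taking 1 MB = 2^20 bytes to match the sample output.
double mb_per_sec(std::uint64_t bytes, double seconds) {
  return bytes / seconds / (1024.0 * 1024.0);
}
```

For example, `qps(92083, 5.0)` reproduces the 18416.6 QPS line of the first run.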

To achieve a high request rate, you must configure mutilate to use multiple threads, multiple connections, connection pipelining, or remote agents.

$ ./mutilate -s zephyr2-10g -T 24 -c 8
#type       avg     min     1st     5th    10th    90th    95th    99th
read      598.8    86.0   437.2   466.6   482.6   977.0  1075.8  1170.6
update      0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0
op_q        1.5     1.0     1.0     1.1     1.1     1.9     1.9     2.0

Total QPS = 318710.8 (1593559 / 5.0s)

Misses = 0 (0.0%)

RX  393609073 bytes :   75.1 MB/s
TX   57374136 bytes :   10.9 MB/s

Suggested Usage

Real deployments of memcached often handle the requests of dozens, hundreds, or thousands of front-end clients simultaneously. However, by default, mutilate establishes one connection per server and meters requests one at a time (it waits for a reply before sending the next request). This artificially limits throughput (i.e. queries per second), as the round-trip network latency is almost certainly far longer than the time it takes for the memcached server to process one request.
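One way to quantify this limit is Little's law: a closed-loop client with C connections, each allowing at most D outstanding requests, cannot exceed C * D / latency requests per second. The sketch below applies that bound to the two runs shown earlier; `max_closed_loop_qps` is a hypothetical helper, not part of mutilate.

```cpp
#include <cassert>

// Little's law (L = lambda * W) rearranged: lambda = L / W. A closed-loop
// client keeps at most connections * depth requests outstanding, so its
// throughput cannot exceed that many requests per average round trip.
double max_closed_loop_qps(int connections, int depth, double avg_latency_us) {
  assert(connections > 0 && depth > 0 && avg_latency_us > 0.0);
  double outstanding = static_cast<double>(connections) * depth;
  return outstanding / (avg_latency_us * 1e-6);  // convert us to seconds
}
```

With the default single connection and 52.4 µs average latency, the bound is about 19,084 QPS, and the measured 18,416.6 QPS sits just under it. With `-T 24 -c 8` (192 connections to one server) and 598.8 µs, the bound is about 320,641 QPS against a measured 318,710.8. In both runs, throughput is set by round-trip latency and the number of outstanding requests rather than by the server.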

In order to get reasonable benchmark results with mutilate, it needs to be configured to more accurately portray a realistic client workload. In general, this means ensuring that (1) there are a large number of client connections, (2) there is the potential for a large number of outstanding requests, and (3) the memcached server saturates and experiences queuing delay far before mutilate does. I suggest the following guidelines:

  1. Establish on the order of 100 connections per memcached server thread.
  2. Don't use more than about 16 connections per mutilate thread.
  3. Use multiple mutilate agents in order to achieve (1) and (2).
  4. Do not use more mutilate threads than hardware cores/threads.
  5. Use -Q to configure the "master" agent to take latency samples at a slow, constant rate.

Here's an example:

memcached_server$ memcached -t 4 -c 32768
agent1$ mutilate -T 16 -A
agent2$ mutilate -T 16 -A
agent3$ mutilate -T 16 -A
agent4$ mutilate -T 16 -A
agent5$ mutilate -T 16 -A
agent6$ mutilate -T 16 -A
agent7$ mutilate -T 16 -A
agent8$ mutilate -T 16 -A
master$ mutilate -s memcached_server --loadonly
master$ mutilate -s memcached_server --noload \
    -B -T 16 -Q 1000 -D 4 -C 4 \
    -a agent1 -a agent2 -a agent3 -a agent4 \
    -a agent5 -a agent6 -a agent7 -a agent8 \
    -c 4 -q 200000

This will create 8*16*4 = 512 connections total (8 agents, 16 threads each, 4 connections per thread), which is 128 per memcached server thread. This ought to be enough outstanding requests to cause server-side queuing delay, with no possibility of client-side queuing delay adulterating the latency measurements.
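The connection arithmetic above, and guidelines (1) and (2), can be checked mechanically. In this sketch the `Plan` struct and its field names are hypothetical, and reading "on the order of 100" as a 50 to 200 band is an arbitrary choice. Like the count above, it ignores the master's own measurement connections (`-C`).

```cpp
#include <cassert>

// Hypothetical description of a test setup (field names are illustrative).
struct Plan {
  int agents;             // machines running `mutilate -A`
  int threads_per_agent;  // -T on each agent
  int conns_per_thread;   // -c: connections per server, per thread
  int servers;            // number of memcached servers (-s)
  int server_threads;     // memcached -t on each server
};

// Agent connections only; the master's measurement connections (-C) are
// not counted, matching the count in the text.
int total_connections(const Plan& p) {
  return p.agents * p.threads_per_agent * p.conns_per_thread * p.servers;
}

double conns_per_server_thread(const Plan& p) {
  assert(p.servers > 0 && p.server_threads > 0);
  return static_cast<double>(total_connections(p)) / (p.servers * p.server_threads);
}

// Guideline (1): "on the order of 100" per server thread, read here as an
// arbitrary 50..200 band. Guideline (2): at most about 16 per mutilate thread
// (each thread connects to every server unless -R is used).
bool follows_connection_guidelines(const Plan& p) {
  double per_server_thread = conns_per_server_thread(p);
  int per_mutilate_thread = p.conns_per_thread * p.servers;
  return per_server_thread >= 50.0 && per_server_thread <= 200.0 &&
         per_mutilate_thread <= 16;
}
```

`Plan{8, 16, 4, 1, 4}` describes the example above: 512 agent connections, 128 per server thread, and 4 per mutilate thread.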

Command-line Options

mutilate 0.1

Usage: mutilate -s server[:port] [options]

"High-performance" memcached benchmarking tool

  -h, --help                    Print help and exit
      --version                 Print version and exit
  -v, --verbose                 Verbosity. Repeat for more verbose.
      --quiet                   Disable log messages.

Basic options:
  -s, --server=STRING           Memcached server hostname[:port].  Repeat to
                                  specify multiple servers.
      --unix_socket             Use UNIX socket instead of TCP.
      --binary                  Use binary memcached protocol instead of ASCII.
      --redis                   Use Redis RESP protocol instead of memcached.
      --getset                  Use getset mode: first issue a GET and, if the
                                  response is a MISS, issue a SET on that key
                                  with a value size drawn from the value-size
                                  distribution.
      --getsetorset             Use getset mode and allow for direct writes
                                  (with optype == 2).
      --successful              Only record latency and throughput stats for
                                  successful queries
      --prefix=STRING           Prefix all keys with a string (helps with
                                  multi-tenant eval)
      --delete90                Delete 90 percent of keys after halfway through
                                  the workload, used to model the workloads of
                                  Rumble et al., USENIX FAST '14. MUST BE IN
                                  GETSET MODE and have a set number of queries.
      --assoc=INT               Create hash tables by truncating each key by b
                                  bytes: the remaining n-b bytes of the original
                                  (key,value) key become the Redis key, whose
                                  value is a hash table, and field b of that
                                  hash holds the value. This effectively makes
                                  Redis an n-way associative cache. Only works
                                  in Redis mode. For small key sizes, the normal
                                  (key,value) store is used, with no hash table.
                                  (default=`4')
  -q, --qps=INT                 Target aggregate QPS. 0 = peak QPS.
                                  (default=`0')
  -t, --time=INT                Maximum time to run (seconds).  (default=`5')
      --read_file=STRING        Read keys from file.  (default=`')
      --twitter_trace=INT       Use Twitter memcached trace format from file.
                                  (default=`0')
  -K, --keysize=STRING          Length of memcached keys (distribution).
                                  (default=`30')
  -V, --valuesize=STRING        Length of memcached values (distribution).
                                  (default=`200')
  -r, --records=INT             Number of memcached records to use.  If
                                  multiple memcached servers are given, this
                                  number is divided by the number of servers.
                                  (default=`10000')
  -m, --misswindow=INT          Window for recording misses, used to find
                                  steady state. By default there is no window,
                                  which gives summary stats over the whole run.
                                  (default=`0')
  -N, --queries=INT             Number of queries to make. 0 is unlimited
                                  (default). If multiple memcached servers are
                                  given, this number is divided by the number
                                  of servers.  (default=`0')
  -u, --update=FLOAT            Ratio of set:get commands.  (default=`0.0')
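The `--getset` and `--assoc` descriptions are easier to follow in code. The sketch below uses a toy in-memory cache (`ToyCache`, hypothetical, standing in for the server) to show the GET-then-SET-on-MISS flow, and a hypothetical `split_assoc_key` helper for one reading of `--assoc`: the first n-b bytes become the Redis key and the last b bytes name the hash field. That split, and treating keys of b bytes or fewer as plain keys, are assumptions; mutilate's source is the authority on the exact rule.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <unordered_map>

// Toy stand-in for the server (hypothetical; not mutilate code).
class ToyCache {
 public:
  std::optional<std::string> get(const std::string& key) const {
    auto it = store_.find(key);
    if (it == store_.end()) return std::nullopt;  // MISS
    return it->second;
  }
  void set(const std::string& key, const std::string& value) { store_[key] = value; }

 private:
  std::unordered_map<std::string, std::string> store_;
};

// --getset: GET first; on a MISS, SET the same key. `value_size` stands in
// for a draw from the -V value-size distribution. Returns true on a hit.
bool getset(ToyCache& cache, const std::string& key, std::size_t value_size) {
  if (cache.get(key)) return true;
  cache.set(key, std::string(value_size, 'x'));
  return false;
}

// --assoc=b, under one reading of the help text (assumption): the first n-b
// bytes of an n-byte key are the Redis hash key and the last b bytes are the
// field. Keys of b bytes or fewer stay plain (key,value) entries (assumed cutoff).
struct AssocKey {
  std::string redis_key;
  std::string field;  // empty: plain (key,value), no hash table
};

AssocKey split_assoc_key(const std::string& key, std::size_t b) {
  assert(b > 0);
  if (key.size() <= b) return {key, ""};
  return {key.substr(0, key.size() - b), key.substr(key.size() - b)};
}
```

Under this reading, with the default b = 4 every key sharing its first n-4 bytes lands in the same Redis hash.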

Advanced options:
  -U, --username=STRING         Username to use for SASL authentication.
  -P, --password=STRING         Password to use for SASL authentication.
  -T, --threads=INT             Number of threads to spawn.  (default=`1')
      --affinity                Set CPU affinity for threads, round-robin
  -c, --connections=INT         Connections to establish per server.
                                  (default=`1')
  -d, --depth=INT               Maximum depth to pipeline requests.
                                  (default=`1')
  -R, --roundrobin              Assign threads to servers in round-robin
                                  fashion.  By default, each thread connects to
                                  every server.
  -i, --iadist=STRING           Inter-arrival distribution (distribution).
                                  Note: The distribution will automatically be
                                  adjusted to match the QPS given by --qps.
                                  (default=`exponential')
  -S, --skip                    Skip transmissions if previous requests are
                                  late.  This harms the long-term QPS average,
                                  but reduces spikes in QPS after long latency
                                  requests.
      --moderate                Enforce a minimum delay of ~1/lambda between
                                  requests.
      --noload                  Skip database loading.
      --loadonly                Load database and then exit.
  -B, --blocking                Use blocking epoll().  May increase latency.
      --no_nodelay              Don't use TCP_NODELAY.
  -w, --warmup=INT              Warmup time before starting measurement.
  -W, --wait=INT                Time to wait after startup to start
                                  measurement.
      --save=STRING             Record latency samples to given file.
      --search=N:X              Search for the QPS where N-order statistic <
                                  Xus.  (i.e. --search 95:1000 means find the
                                  QPS where 95% of requests are faster than
                                  1000us).
      --scan=min:max:step       Scan latency across QPS rates from min to max.
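`--search N:X` amounts to finding the largest QPS whose N-th percentile latency stays under X µs. Assuming latency grows monotonically with load, a binary search over QPS does this. The sketch below pairs one with a toy M/M/1 latency model so it can run without a server; `search_qps` and `mm1_quantile_us` are hypothetical, and mutilate's actual search procedure may differ.

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <limits>

// Largest QPS in [lo, hi] whose measured percentile latency stays below
// target_us, to within `tolerance` QPS. `measure(qps)` is a hypothetical
// callback returning the N-th percentile latency (us) at that load; it must
// be non-decreasing in qps for the bisection to be valid.
double search_qps(const std::function<double(double)>& measure, double target_us,
                  double lo, double hi, double tolerance) {
  assert(lo < hi && tolerance > 0.0 && measure(lo) < target_us);
  while (hi - lo > tolerance) {
    double mid = lo + (hi - lo) / 2.0;
    if (measure(mid) < target_us) {
      lo = mid;  // still under the target: look at higher loads
    } else {
      hi = mid;  // target violated: look at lower loads
    }
  }
  return lo;
}

// Toy M/M/1 model with service rate mu (requests/s): response time is
// exponential with rate (mu - qps), so its p-quantile is -ln(1 - p) / (mu - qps).
double mm1_quantile_us(double qps, double mu, double p) {
  if (qps >= mu) return std::numeric_limits<double>::infinity();
  return -std::log(1.0 - p) / (mu - qps) * 1e6;  // seconds -> microseconds
}
```

With a service rate of 100,000 requests/s and the 95th-percentile model, the equivalent of `--search 95:1000` converges to roughly 97,004 QPS, where -ln(0.05) / (100000 - qps) equals 1 ms.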

Agent-mode options:
  -A, --agentmode               Run client in agent mode.
  -a, --agent=host              Enlist remote agent.
  -p, --agent_port=STRING       Agent port.  (default=`5556')
  -l, --lambda_mul=INT          Lambda multiplier.  Increases share of QPS for
                                  this client.  (default=`1')
  -C, --measure_connections=INT Master client connections per server, overrides
                                  --connections.
  -Q, --measure_qps=INT         Explicitly set master client QPS, spread across
                                  threads and connections.
  -D, --measure_depth=INT       Set master client connection depth.
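`--lambda_mul` changes a client's share of the aggregate `--qps`. One plausible reading, sketched below under that assumption, weights every connection by its client's multiplier and splits the aggregate rate proportionally. The `Client` struct and `per_connection_lambda` are hypothetical and may not match how mutilate actually divides the load.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical per-client summary: total connections and --lambda_mul.
struct Client {
  int connections;
  int lambda_mul;
};

// Assumed weighting: split the aggregate --qps over all connections in
// proportion to each client's lambda_mul. Returns requests/s for one
// connection of client i.
double per_connection_lambda(const std::vector<Client>& clients, std::size_t i,
                             double qps) {
  int weighted = std::accumulate(
      clients.begin(), clients.end(), 0,
      [](int acc, const Client& c) { return acc + c.connections * c.lambda_mul; });
  assert(i < clients.size() && weighted > 0);
  return qps * clients[i].lambda_mul / weighted;
}
```

For example, two clients with 64 connections each and multipliers 1 and 3 would, at `--qps 256000`, run at 1,000 and 3,000 requests/s per connection, which still sums to 256,000.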

The --measure_* options aid in taking latency measurements of the
memcached server without incurring significant client-side queuing
delay.  --measure_connections allows the master to override the
--connections option.  --measure_depth allows the master to operate as
an "open-loop" client while other agents continue as regular
closed-loop clients.  --measure_qps lets you modulate the QPS the
master queries at independently of other clients.  This theoretically
normalizes the baseline queuing delay you expect to see across a wide
range of --qps values.

Some options take a 'distribution' as an argument.
Distributions are specified by <distribution>[:<param1>[,...]].
Parameters are not required.  The following distributions are supported:

   [fixed:]<value>              Always generates <value>.
   uniform:<max>                Uniform distribution between 0 and <max>.
   normal:<mean>,<sd>           Normal distribution.
   exponential:<lambda>         Exponential distribution.
   pareto:<loc>,<scale>,<shape> Generalized Pareto distribution.
   gev:<loc>,<scale>,<shape>    Generalized Extreme Value distribution.
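The `<distribution>[:<param1>[,...]]` syntax is straightforward to parse. The sketch below (hypothetical `DistSpec`, `parse_dist`, and `sample`) splits the name from its comma-separated parameters, treats a bare number as `fixed:<value>`, and draws from the simple distributions using the standard `<random>` distributions. It does not supply defaults for omitted parameters, and mutilate's own parser and generators may behave differently.

```cpp
#include <cassert>
#include <cstdlib>
#include <random>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// "<distribution>[:<param1>[,...]]" split into a name and numeric parameters.
struct DistSpec {
  std::string name;
  std::vector<double> params;
};

DistSpec parse_dist(const std::string& spec) {
  assert(!spec.empty());
  std::size_t colon = spec.find(':');
  std::string head = spec.substr(0, colon);  // whole string if no ':'

  // A bare number such as "200" is shorthand for "fixed:200".
  char* end = nullptr;
  double value = std::strtod(head.c_str(), &end);
  if (colon == std::string::npos && !head.empty() && *end == '\0') {
    return {"fixed", {value}};
  }

  DistSpec d{head, {}};
  if (colon != std::string::npos) {
    std::stringstream rest(spec.substr(colon + 1));
    std::string field;
    while (std::getline(rest, field, ',')) d.params.push_back(std::stod(field));
  }
  return d;
}

// Draw one value using the standard <random> distributions. Only the simple
// cases are handled; gev, pareto, and the fb_* presets are omitted.
double sample(const DistSpec& d, std::mt19937_64& rng) {
  if (d.name == "fixed") return d.params.at(0);
  if (d.name == "uniform")
    return std::uniform_real_distribution<double>(0.0, d.params.at(0))(rng);
  if (d.name == "normal")
    return std::normal_distribution<double>(d.params.at(0), d.params.at(1))(rng);
  if (d.name == "exponential")
    return std::exponential_distribution<double>(d.params.at(0))(rng);
  throw std::invalid_argument("unsupported distribution: " + d.name);
}
```

Note that `std::exponential_distribution` is parameterized by the rate lambda, matching `exponential:<lambda>`.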

   To recreate the Facebook "ETC" request stream from [1], the
   following hard-coded distributions are also provided:

   fb_value   = a hard-coded discrete and GPareto PDF of value sizes
   fb_key     = "gev:30.7984,8.20449,0.078688", key-size distribution
   fb_ia      = "pareto:0.0,16.0292,0.154971", inter-arrival time dist.
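The `gev` and `pareto` distributions can be sampled by inverse-transform sampling: draw u uniformly from (0, 1) and return the inverse CDF at u. The sketch below writes out both inverse CDFs for nonzero shape from their standard definitions, using the parameter order from the help text, and plugs in the `fb_key` and `fb_ia` parameters. The function names are hypothetical and this is not necessarily how mutilate generates samples; the hard-coded `fb_value` PDF is not reproduced.

```cpp
#include <cassert>
#include <cmath>
#include <random>

// Generalized Pareto, CDF F(x) = 1 - (1 + shape*(x - loc)/scale)^(-1/shape).
// Inverse: x = loc + scale * ((1 - u)^(-shape) - 1) / shape, for shape != 0.
double gpareto_quantile(double u, double loc, double scale, double shape) {
  assert(u > 0.0 && u < 1.0 && shape != 0.0);
  return loc + scale * (std::pow(1.0 - u, -shape) - 1.0) / shape;
}

// Generalized Extreme Value, CDF F(x) = exp(-(1 + shape*(x - loc)/scale)^(-1/shape)).
// Inverse: x = loc + scale * ((-ln u)^(-shape) - 1) / shape, for shape != 0.
double gev_quantile(double u, double loc, double scale, double shape) {
  assert(u > 0.0 && u < 1.0 && shape != 0.0);
  return loc + scale * (std::pow(-std::log(u), -shape) - 1.0) / shape;
}

// Inverse-transform sampling: map a uniform draw in (0, 1) through a quantile.
template <typename Quantile>
double draw(Quantile quantile, std::mt19937_64& rng) {
  std::uniform_real_distribution<double> unif(0.0, 1.0);
  double u = 0.0;
  while (u <= 0.0 || u >= 1.0) u = unif(rng);  // keep u strictly inside (0, 1)
  return quantile(u);
}

// fb_key = "gev:30.7984,8.20449,0.078688", fb_ia = "pareto:0.0,16.0292,0.154971"
double fb_key_sample(std::mt19937_64& rng) {
  return draw([](double u) { return gev_quantile(u, 30.7984, 8.20449, 0.078688); }, rng);
}
double fb_ia_sample(std::mt19937_64& rng) {
  return draw([](double u) { return gpareto_quantile(u, 0.0, 16.0292, 0.154971); }, rng);
}
```

Plugging a quantile back into its CDF returns the original u, which is a quick way to check each formula.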

[1] Berk Atikoglu et al., Workload Analysis of a Large-Scale Key-Value Store,
    SIGMETRICS 2012