Netapp ONTAP deswizzle drizzle dizzee rascal singing snapmirror

We noticed some performance issues on one of our NetApp filers, and during my investigation I came across a few things that happen under the covers that I wanted to share.

NetApp SnapMirror is an asynchronous way of replicating a volume from one filer to another. At my company we use it to replicate property images from one datacentre to another. I've always wondered what happens on the replica: SnapMirror is a block-level copy, but the destination aggregate may be laid out differently, so the destination filer has to do something 'smart' to reconcile the differences between the two filers.

That 'smart'ness is what was causing our performance problems - so let's go through it.

Here is some of the output of a 'statit' run on the SnapMirror replica. The filer is idle - it's not serving any traffic and it's not SnapMirroring.

nas02*> statit -e

Hostname: nas02 ID: xxxx Memory: 16376 MB
NetApp Release 7.2.6.1P4: Fri Apr 3 16:34:33 PDT 2009
....


Disk Statistics (per second)
ut% is the percent of time the disk was busy.
xfers is the number of data-transfer commands issued per second.
xfers = ureads + writes + cpreads + greads + gwrites
chain is the average number of 4K blocks per command.
usecs is the average disk round-trip time per 4K block.

disk ut% xfers ureads--chain-usecs writes--chain-usecs cpreads-chain-usecs greads--chain-usecs gwrites-chain-usecs
/aggr0/plex0/rg0:
0a.33 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0c.66 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0g.52 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0a.80 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0c.32 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0c.76 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0a.81 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0c.34 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0c.77 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0a.82 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0a.35 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0a.83 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0a.36 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0c.84 0 0.02 0.02 1.00 7000 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
/aggr0/plex0/rg1:
0c.85 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0a.37 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0a.86 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0a.38 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0a.87 0 0.02 0.02 1.00 6000 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0c.39 0 0.02 0.02 1.00 11000 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0c.88 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0a.40 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0c.89 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0c.41 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0a.90 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0c.42 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0a.91 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0c.43 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
/aggr1/plex0/rg0:
0f.16 2 3.40 0.12 1.00 7833 3.03 30.12 247 0.24 17.75 263 0.00 .... . 0.00 .... .
0f.32 2 3.52 0.12 1.00 27500 3.15 29.03 207 0.24 17.75 263 0.00 .... . 0.00 .... .
0h.17 32 127.22 123.86 1.00 4305 3.05 29.71 293 0.30 16.40 520 0.00 .... . 0.00 .... .
0f.33 30 120.27 117.07 1.00 4036 2.93 30.91 273 0.26 16.62 407 0.00 .... . 0.00 .... .
0h.18 31 121.20 118.09 1.00 4468 2.89 31.35 297 0.22 19.45 369 0.00 .... . 0.00 .... .
0f.34 30 120.02 116.91 1.01 4160 2.89 31.35 315 0.22 18.55 358 0.00 .... . 0.00 .... .
0h.19 29 116.97 113.86 1.00 4124 2.89 31.35 281 0.22 21.27 286 0.00 .... . 0.00 .... .
0h.35 31 120.81 117.64 1.00 4294 2.89 31.35 291 0.28 15.29 397 0.00 .... . 0.00 .... .
0f.20 31 117.83 114.71 1.00 4352 2.89 31.35 284 0.22 19.64 319 0.00 .... . 0.00 .... .
0h.36 30 120.22 117.09 1.00 4075 2.89 31.35 293 0.24 19.67 309 0.00 .... . 0.00 .... .
0h.21 30 122.81 119.68 1.00 3960 2.89 31.35 283 0.24 17.67 392 0.00 .... . 0.00 .... .
0h.37 30 118.41 115.30 1.00 4123 2.89 31.35 295 0.22 18.73 388 0.00 .... . 0.00 .... .
0h.22 29 117.83 114.69 1.00 4102 2.89 31.35 284 0.24 19.67 284 0.00 .... . 0.00 .... .


WOOOO - what the? What the above is telling me is that on 'aggr1' raid group 0 there's a LOT of read and write disk activity, while on 'aggr0' the disks in both raid groups are doing nothing! Again, the filer is not serving traffic and it's not SnapMirroring - it shouldn't be doing anything.
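If you want to pick the busy raid groups out of a long statit dump without eyeballing every row, a rough parser does the job. The Python sketch below is based purely on the column layout in the sample above (disk name, ut%, xfers, then the per-operation triples); statit's format isn't something I'd treat as stable, so take it as illustrative only.

import re
import sys
from collections import defaultdict

# Rough parser for the per-disk section of 'statit -e' output.
# Assumes lines like:
#   /aggr1/plex0/rg0:
#   0h.17  32  127.22  123.86 1.00 4305 ...
# Field positions are inferred from the sample above, not from any
# documented format, so treat this as a sketch.

RG_HEADER = re.compile(r'^(/\S+/plex\d+/rg\d+):')
DISK_LINE = re.compile(r'^(\d+[a-z]\.\d+)\s+(\d+)\s+([\d.]+)')

def summarise(lines):
    busy = defaultdict(list)          # raid group -> [(disk, ut%, xfers)]
    current_rg = None
    for line in lines:
        line = line.strip()
        m = RG_HEADER.match(line)
        if m:
            current_rg = m.group(1)
            continue
        m = DISK_LINE.match(line)
        if m and current_rg:
            disk, ut, xfers = m.group(1), int(m.group(2)), float(m.group(3))
            busy[current_rg].append((disk, ut, xfers))
    return busy

if __name__ == '__main__':
    stats = summarise(sys.stdin)
    for rg, disks in stats.items():
        avg_ut = sum(ut for _, ut, _ in disks) / len(disks)
        total_xfers = sum(x for _, _, x in disks)
        print(f'{rg}: avg ut% {avg_ut:.1f}, total xfers/s {total_xfers:.1f}')

Fed the output above, it would report aggr1/plex0/rg0 doing well over a thousand transfers a second while both aggr0 raid groups sit at essentially zero.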

What is it doing?!

Well, let's have a little look at the WAFL scan status:

nas02*> wafl scan status
.....

Volume objects08:
Scan id Type of scan progress
632337 volume deswizzling snap 30, inode 10082231 of 42105252. level 1 of normal files. Totals: Normal files: L1:1755627/3583238 L2:0/0 L3:0/0 L4:0/0 Inode file: L0:0/0 L1:0/0 L2:0/0 L3:0/0 L4:0/0
Volume objects07:
Scan id Type of scan progress
632343 volume deswizzling snap 137, inode 9325585 of 42105252. level 1 of normal files. Totals: Normal files: L1:1599575/3269944 L2:0/0 L3:0/0 L4:0/0 Inode file: L0:0/0 L1:0/0 L2:0/0 L3:0/0 L4:0/0
nas02*>



So the filer is deswizzling - what is deswizzling!?

Deswizzling is a background process that runs after a SnapMirror transfer and realigns the volume's virtual (logical) block addresses with the physical blocks on the destination aggregate. While a volume is being deswizzled, the filer may need two reads to satisfy a request for data on that volume - one to translate the virtual block address into a physical one, and then another to read the block itself.


Here is what the NetApp KB says:

When reading data from a FlexVol, the physical location of that data must be known to perform the read. Under normal circumstances, reads use a fast-path to translate a logical address into a physical address.

When SnapMirror transfers a FlexVol to a different aggregate, however, the relationship between logical and physical addresses is changed. The destination aggregate lays out the volume into different physical blocks.

Immediately after the SnapMirror transfer, reads to the destination volume will use a different process to translate logical addresses into physical addresses. This slow-path is less efficient and may require an additional disk read to access a block of data.

As soon as a FlexVol SnapMirror update finishes, the destination storage controller will launch a scanner to recreate the fast-path metadata. Once the scan finishes, reads to the SnapMirror destination will use the normal fast-path. This process is known as deswizzling. (https://now.netapp.com/Knowledgebase/solutionarea.asp?id=kb16746)
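ONTAP's internals obviously aren't exposed, but a toy model makes the fast-path/slow-path distinction easier to picture. The Python sketch below is purely illustrative and assumes nothing about WAFL's real on-disk structures: a block pointer that still only knows the volume-relative (virtual) block number forces an extra lookup through a translation map, while a pointer that has been 'deswizzled' carries the physical block number and goes straight to the data.

# Toy illustration only - not WAFL's actual on-disk structures.
# A block pointer that still carries only a virtual block number (VVBN)
# forces an extra lookup in a translation map (the 'slow path');
# once deswizzling has stamped in the physical block number (PVBN),
# the read goes straight to disk (the 'fast path').

disk = {}                 # physical block number -> data
vvbn_to_pvbn = {}         # per-volume translation map (the extra read on the slow path)

class BlockPointer:
    def __init__(self, vvbn, pvbn=None):
        self.vvbn = vvbn
        self.pvbn = pvbn  # None until deswizzling fills it in

def read(ptr):
    if ptr.pvbn is not None:
        return disk[ptr.pvbn], 1              # fast path: one disk read
    pvbn = vvbn_to_pvbn[ptr.vvbn]             # slow path: translation read...
    return disk[pvbn], 2                      # ...plus the data read

def deswizzle(ptr):
    """Background scanner: resolve the mapping once and cache it in the pointer."""
    ptr.pvbn = vvbn_to_pvbn[ptr.vvbn]

# After a SnapMirror transfer the destination only knows virtual addresses:
vvbn_to_pvbn[7] = 4242
disk[4242] = b'property-image.jpg'
ptr = BlockPointer(vvbn=7)

print(read(ptr))    # (b'property-image.jpg', 2) -> two reads while still swizzled
deswizzle(ptr)
print(read(ptr))    # (b'property-image.jpg', 1) -> one read afterwards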


So, to return to the statit output: aggr1 is currently deswizzling away, which is why the 'idle' filer is doing a lot of disk I/O. Under production load, that extra lookup on every read of a not-yet-deswizzled block, plus the I/O of the deswizzle scan itself, is what caused our performance problems.
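If you want to keep an eye on how far along a deswizzle scan is, the 'inode X of Y' part of the 'wafl scan status' output gives a rough percentage. The small Python sketch below pulls that out; the pattern is based only on the sample output above, and getting the text off the filer (rsh/ssh or however you reach it) is left out.

import re

# Pulls a rough completion percentage out of 'wafl scan status' output.
# The pattern is based solely on the sample lines above
# ("... volume deswizzling snap 30, inode 10082231 of 42105252 ..."),
# so treat it as a sketch rather than a parser for every ONTAP release.

SCAN_LINE = re.compile(
    r'(\d+)\s+volume deswizzling\s+snap (\d+), inode (\d+) of (\d+)')

def deswizzle_progress(text):
    for scan_id, snap, done, total in SCAN_LINE.findall(text):
        yield scan_id, snap, 100.0 * int(done) / int(total)

sample = """\
632337 volume deswizzling snap 30, inode 10082231 of 42105252. level 1 of normal files.
632343 volume deswizzling snap 137, inode 9325585 of 42105252. level 1 of normal files.
"""

for scan_id, snap, pct in deswizzle_progress(sample):
    print(f'scan {scan_id} (snap {snap}): {pct:.0f}% of inodes processed')
# scan 632337 (snap 30): 24% of inodes processed
# scan 632343 (snap 137): 22% of inodes processed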