We noticed some performance issues on one of our NetApp filers, and during my investigation I came across a few things that happen under the covers that I wanted to share with you.
NetApp SnapMirror is an asynchronous method of mirroring files from one location to another. At my company we use it to replicate property images from one datacentre to another. I've always wondered what happens to the replica: SnapMirror is a block-level copy, but the destination's physical layout may be different, so it needs to do something 'smart' to correct the differences between the two filers.
That 'smart'ness is what is causing the performance problems - so let's go through it.
Here is some of the output of a 'statit' on the SnapMirror replica. The filer is idle - it's not serving any traffic and it's not SnapMirroring.
nas02*> statit -e
Hostname: nas02 ID: xxxx Memory: 16376 MB
NetApp Release 7.2.6.1P4: Fri Apr 3 16:34:33 PDT 2009
....
Disk Statistics (per second)
ut% is the percent of time the disk was busy.
xfers is the number of data-transfer commands issued per
second.
xfers = ureads + writes + cpreads + greads + gwrites
chain is the average number of 4K blocks per command.
usecs is the average disk round-trip time per 4K block.
disk ut% xfers ureads--chain-usecs writes--chain-usecs cpreads-chain-usecs greads--chain-usecs gwrites-chain-usecs
/aggr0/plex0/rg0:
0a.33 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0c.66 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0g.52 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0a.80 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0c.32 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0c.76 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0a.81 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0c.34 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0c.77 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0a.82 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0a.35 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0a.83 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0a.36 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0c.84 0 0.02 0.02 1.00 7000 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
/aggr0/plex0/rg1:
0c.85 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0a.37 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0a.86 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0a.38 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0a.87 0 0.02 0.02 1.00 6000 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0c.39 0 0.02 0.02 1.00 11000 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0c.88 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0a.40 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0c.89 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0c.41 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0a.90 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0c.42 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0a.91 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
0c.43 0 0.00 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... . 0.00 .... .
/aggr1/plex0/rg0:
0f.16 2 3.40 0.12 1.00 7833 3.03 30.12 247 0.24 17.75 263 0.00 .... . 0.00 .... .
0f.32 2 3.52 0.12 1.00 27500 3.15 29.03 207 0.24 17.75 263 0.00 .... . 0.00 .... .
0h.17 32 127.22 123.86 1.00 4305 3.05 29.71 293 0.30 16.40 520 0.00 .... . 0.00 .... .
0f.33 30 120.27 117.07 1.00 4036 2.93 30.91 273 0.26 16.62 407 0.00 .... . 0.00 .... .
0h.18 31 121.20 118.09 1.00 4468 2.89 31.35 297 0.22 19.45 369 0.00 .... . 0.00 .... .
0f.34 30 120.02 116.91 1.01 4160 2.89 31.35 315 0.22 18.55 358 0.00 .... . 0.00 .... .
0h.19 29 116.97 113.86 1.00 4124 2.89 31.35 281 0.22 21.27 286 0.00 .... . 0.00 .... .
0h.35 31 120.81 117.64 1.00 4294 2.89 31.35 291 0.28 15.29 397 0.00 .... . 0.00 .... .
0f.20 31 117.83 114.71 1.00 4352 2.89 31.35 284 0.22 19.64 319 0.00 .... . 0.00 .... .
0h.36 30 120.22 117.09 1.00 4075 2.89 31.35 293 0.24 19.67 309 0.00 .... . 0.00 .... .
0h.21 30 122.81 119.68 1.00 3960 2.89 31.35 283 0.24 17.67 392 0.00 .... . 0.00 .... .
0h.37 30 118.41 115.30 1.00 4123 2.89 31.35 295 0.22 18.73 388 0.00 .... . 0.00 .... .
0h.22 29 117.83 114.69 1.00 4102 2.89 31.35 284 0.24 19.67 284 0.00 .... . 0.00 .... .
WOOOO - what the? What the above is telling me is that on 'aggr1' raid group 0 there's a LOT of read and write disk activity, while on 'aggr0' the disks in both raid groups are doing nothing! Again, the filer is not serving traffic and it's not SnapMirroring - it shouldn't be doing anything.
What is it doing?!
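A quick way to spot which spindles are actually working is to parse the per-disk lines mechanically. This is just a sketch of my own (the busy_disks helper and the 10% threshold are arbitrary, not a NetApp tool), assuming the field layout from the statit legend:

```python
# Hypothetical helper, not a NetApp tool: flag busy disks in statit
# per-disk output. Field layout assumed from the legend above: disk name,
# then ut%, then the transfer counters.
def busy_disks(statit_lines, ut_threshold=10):
    """Return (disk, ut%) pairs whose utilisation exceeds the threshold."""
    busy = []
    for line in statit_lines:
        fields = line.split()
        # Skip raid-group headers like '/aggr1/plex0/rg0:' and short lines
        if len(fields) < 3 or fields[0].startswith("/"):
            continue
        disk, ut = fields[0], int(fields[1])
        if ut > ut_threshold:
            busy.append((disk, ut))
    return busy

sample = [
    "/aggr0/plex0/rg0:",
    "0a.33 0 0.00 0.00 .... . 0.00 .... .",
    "/aggr1/plex0/rg0:",
    "0h.17 32 127.22 123.86 1.00 4305 3.05 29.71 293",
    "0f.33 30 120.27 117.07 1.00 4036 2.93 30.91 273",
]
print(busy_disks(sample))  # -> [('0h.17', 32), ('0f.33', 30)]
```

Run against the full output, this picks out exactly the aggr1/rg0 data disks and nothing in aggr0.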
Well, let's have a little look at the WAFL status:
nas02*> wafl scan status
.....
Volume objects08:
Scan id   Type of scan          progress
 632337   volume deswizzling    snap 30, inode 10082231 of 42105252. level 1 of normal files. Totals: Normal files: L1:1755627/3583238 L2:0/0 L3:0/0 L4:0/0 Inode file: L0:0/0 L1:0/0 L2:0/0 L3:0/0 L4:0/0
Volume objects07:
Scan id   Type of scan          progress
 632343   volume deswizzling    snap 137, inode 9325585 of 42105252. level 1 of normal files. Totals: Normal files: L1:1599575/3269944 L2:0/0 L3:0/0 L4:0/0 Inode file: L0:0/0 L1:0/0 L2:0/0 L3:0/0 L4:0/0
nas02*>
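The scan lines report progress as 'inode N of M', so a back-of-the-envelope percent-complete is easy to get (numbers lifted from the output above; the helper is my own, not part of ONTAP):

```python
# Convert the scanner's 'inode N of M' progress counters into a percentage.
def scan_progress(inodes_done, inodes_total):
    return 100.0 * inodes_done / inodes_total

print(round(scan_progress(10082231, 42105252), 1))  # objects08: 23.9
print(round(scan_progress(9325585, 42105252), 1))   # objects07: 22.1
```

So both volumes are barely a quarter of the way through their scans.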
So the filer is deswizzling - what is deswizzling!?
Deswizzling is a post-SnapMirror process that runs in the background and aligns virtual (volume-level) block addresses with their physical (aggregate-level) locations. While a volume is being deswizzled, the filer must perform two reads to satisfy a request for data on that volume - one to resolve the virtual block, then another to fetch the physical block.
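To make the cost concrete, here's a toy model of the slow path versus the fast path - my own illustration of the idea, not ONTAP internals: until a block has been deswizzled, fetching it costs an extra disk read.

```python
# Toy model: before deswizzling, each read pays an extra lookup to map a
# virtual block number to its physical location; afterwards it does not.
def reads_needed(vbn, deswizzled):
    """Disk reads to fetch one block: 1 on the fast path, 2 on the slow path."""
    return 1 if vbn in deswizzled else 2

blocks = [100, 101, 102, 103]
before = sum(reads_needed(b, deswizzled=set()) for b in blocks)       # scan not yet run
after = sum(reads_needed(b, deswizzled=set(blocks)) for b in blocks)  # scan complete
print(before, after)  # 8 4 - up to twice the disk IO until the scan finishes
```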
Here is what the Netapp KB says:
When reading data from a FlexVol, the physical location of that
data must be known to perform the read. Under normal
circumstances, reads use a fast-path to translate a logical
address into a physical address.
When SnapMirror transfers a FlexVol to a different aggregate,
however, the relationship between logical and physical addresses
is changed. The destination aggregate lays out the volume into
different physical blocks.
Immediately after the SnapMirror transfer, reads to the
destination volume will use a different process to translate
logical addresses into physical addresses. This slow-path is less
efficient and may require an additional disk read to access a
block of data.
As soon as a FlexVol SnapMirror update finishes, the destination
storage controller will launch a scanner to recreate the
fast-path metadata. Once the scan finishes, reads to the
SnapMirror destination will use the normal fast-path. This
process is known as deswizzling.
(https://now.netapp.com/Knowledgebase/solutionarea.asp?id=kb16746)
So, to return to the statit output: aggr1 is currently deswizzling away, which is why the 'idle' filer is doing a lot of disk IO. Under production load, this extra lookup on every read, plus the IO of the reallocation itself, is what caused the performance problems.
Apr 05 2010