Bob Bawn
2013-07-22 14:36:26 UTC
Hello,
I'm testing high-density SATA storage with FreeBSD 9.1-STABLE. The
hardware is:
Drives: 45 * Seagate Altos ST3000NC002
Port Multipliers: 9 * SiI3826
SATA Controller: 3 * Marvell 88SX7042
After a few hours of a database-like workload over ZFS (NCQ enabled, disk
write caches disabled), a disk becomes unresponsive (we think due to a
drive firmware problem):
Jun 14 21:39:54 adlax12st002 root: sysbench tests are now underway
Jun 15 12:12:07 adlax12st002 kernel: mvsch1: SNTF 15
Jun 15 12:12:37 adlax12st002 kernel: mvsch1: Timeout on slot 12
Jun 15 12:12:37 adlax12st002 kernel: mvsch1: iec 00000000 sstat 00000123 serr 00400000 edma_s 00000024 dma_c 10000708 dma_s 00000008 rs 08c81408 status 40
Jun 15 12:12:37 adlax12st002 kernel: mvsch1: ... waiting for slots 08c80408
Jun 15 12:12:37 adlax12st002 kernel: mvsch1: Timeout on slot 3
Jun 15 12:12:37 adlax12st002 kernel: mvsch1: iec 00000000 sstat 00000123 serr 00400000 edma_s 00000024 dma_c 10000708 dma_s 00000008 rs 08c81408 status 40
Jun 15 12:12:37 adlax12st002 kernel: mvsch1: ... waiting for slots 08c80400
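If I'm reading the mvs(4) output right (an assumption on my part), the "rs" and "waiting for slots" values are bitmasks of outstanding command slots, and they decode consistently with the timeouts above:

```python
# Hedged guess: treat the mvs "rs" / "waiting for slots" hex values as
# bitmasks of outstanding command slots, and list which slot numbers are set.
def slots(mask_hex):
    mask = int(mask_hex, 16)
    return [bit for bit in range(32) if mask & (1 << bit)]

print(slots("08c81408"))  # -> [3, 10, 12, 19, 22, 23, 27]  (before the slot-12 timeout)
print(slots("08c80408"))  # -> [3, 10, 19, 22, 23, 27]      (slot 12 dropped)
print(slots("08c80400"))  # -> [10, 19, 22, 23, 27]         (slot 3 dropped after its timeout)
```

Slot 12 disappears from the "waiting" mask after its timeout and slot 3 after its, which at least matches the driver's story of commands being abandoned one at a time.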
After a few timeout/reset cycles, the afflicted device is removed:
Jun 15 12:13:41 adlax12st002 kernel: (aprobe1:mvsch1:0:1:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Jun 15 12:13:41 adlax12st002 kernel: (aprobe1:mvsch1:0:1:0): CAM status: Command timeout
Jun 15 12:13:41 adlax12st002 kernel: (aprobe1:mvsch1:0:1:0): Error 5, Retry was blocked
Jun 15 12:13:41 adlax12st002 kernel: (ada6:mvsch1:0:1:0): lost device
Jun 15 12:13:41 adlax12st002 kernel: (pass7:mvsch1:0:1:0): lost device
Jun 15 12:13:41 adlax12st002 kernel: (pass7:mvsch1:0:1:0): removing device entry
Jun 15 12:13:41 adlax12st002 kernel: mvsch1: MVS reset: device ready after 500ms
All of that seems like reasonable OS behavior when a drive is
unresponsive. In fact Linux/CentOS/ZoL behaves pretty much the same up
to this point.
The problem is that the other four drives behind the port multiplier
start timing out and get removed, one at a time, in target order, over
the next few minutes:
# grep "lost device" adlax12st002-messages.log
Jun 15 12:13:41 adlax12st002 kernel: (ada6:mvsch1:0:1:0): lost device
Jun 15 12:13:41 adlax12st002 kernel: (pass7:mvsch1:0:1:0): lost device
Jun 15 12:16:16 adlax12st002 kernel: (ada7:mvsch1:0:2:0): lost device
Jun 15 12:16:16 adlax12st002 kernel: (pass8:mvsch1:0:2:0): lost device
Jun 15 12:18:50 adlax12st002 kernel: (ada8:mvsch1:0:3:0): lost device
Jun 15 12:18:50 adlax12st002 kernel: (pass9:mvsch1:0:3:0): lost device
Jun 15 12:22:23 adlax12st002 kernel: (ada9:mvsch1:0:4:0): lost device
Jun 15 12:22:23 adlax12st002 kernel: (pass10:mvsch1:0:4:0): lost device
Jun 15 12:26:57 adlax12st002 kernel: (ada5:mvsch1:0:0:0): lost device
Jun 15 12:26:57 adlax12st002 kernel: (pass6:mvsch1:0:0:0): lost device
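The cadence is fairly regular. Pulling the gaps out of the ada* timestamps above (timestamps copied from the excerpt; the year is irrelevant for the deltas):

```python
# Compute the gap in seconds between consecutive "lost device" removals,
# using the ada* timestamps from the log excerpt above.
from datetime import datetime

events = [
    "Jun 15 12:13:41", "Jun 15 12:16:16", "Jun 15 12:18:50",
    "Jun 15 12:22:23", "Jun 15 12:26:57",
]
times = [datetime.strptime(e, "%b %d %H:%M:%S") for e in events]
gaps = [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]
print(gaps)  # -> [155, 154, 213, 274]
```

So the drives drop roughly 2.5 to 4.5 minutes apart, and the interval seems to grow with each removal, which to me looks like an escalating per-target retry/timeout cycle rather than a single catastrophic event.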
It looks like the timeout/reset/recovery sequence for the initial frozen
disk has somehow broken connectivity to all the drives behind the port
multiplier. This part does not happen on Linux. Sometimes the entire
machine locks up after the "lost device" sequence. In all cases, a
full power cycle is required to make the devices available again. When I
soft reset the box over IPMI, the boot process gets stuck in a loop with
"mvsch2: MVS reset" and "mvsch2: Wait status d0".
The full /var/log/messages is at:
http://pastebin.com/xCJyfvSN
Unfortunately, I failed to grab the dmesg output and the box has since
been re-imaged. Here is a dmesg from a machine which I believe to be
identical to the test box:
http://pastebin.com/NYjezuMX
/var/log/messages for the CentOS/Linux case is at:
http://pastebin.com/qrWm0HJ0
Maybe this is a topic for a different post, but has anybody successfully
used high-density port-multiplied SATA platforms with FreeBSD? I've
heard lots of anecdotes about hardware and/or driver flakiness (like the
above), undocumented hardware, etc. (Actually, I've heard similar
complaints from Linux folks.) SAS machines seem to handle this workload
without any problems. We have tried 9.1-RELEASE and the behavior was
worse.
We're actually more interested in archive-type workloads than this
database workload, and we have not observed the problem with an archive
workload. However, we're worried that general single-drive failures
could turn into unavailability of five drives regardless of workload.
Any guidance would be appreciated.
Thanks!
Bob Bawn