Discussion:
ahcich Timeouts SATA SSD
nate keegan
2012-10-14 23:04:06 UTC
I originally posted this to the FreeBSD hardware forum and then on
freebsd-questions at the direction of a moderator in the forum.

Based on what I'm seeing for post types on freebsd-questions this
might be the best forum for this issue as it looks like some sort of a
strange issue or bug between FreeBSD 8.2/9.0 and SATA SSD drives.

My configuration is as follows:

FreeBSD 8.2-RELEASE
Supermicro X8DTi-LN4F (Intel Tylersburg 5520 chipset) motherboard
24 GB system memory
32 x Hitachi Deskstar 5K3000 disks connected to 4 x Intel SASUC8I (LSI
3081E-R) in IT mode
2 x Crucial M4 64 Gb SATA SSD for FreeBSD OS (zroot)
2 x Intel 320 MLC 80 Gb SATA SSD for L2ARC and swap
SSD are connected to on-board SATA port on motherboard

This system was commissioned in February of 2012 and ran without issue
as a ZFS backup system on our network until about 3 weeks ago.

At that time I started getting kernel panics due to timeouts to the
on-board SATA devices. The only change to the system since it was
built was to add an SSD for swap (32 Gb swap device) and this issue
did not happen until several months after this was added.

My initial thought was that I might have a bad SSD drive so I swapped
out one of the Crucial SSD drives and the problem happened again a few
days later.

I then moved to systematically replacing items such as SATA cables,
memory, motherboard, etc and the problem continued. For example, I
swapped out the 4 SATA cables with brand new SATA cables and waited to
see if the problem happened again. Once it did I moved on to replacing
the motherboard with an identical motherboard, waited, etc.

I could not find an obvious hardware related explanation for this
behavior so about a week and a half ago I did a fresh install of
FreeBSD 9.0-RELEASE to move from the ATA driver to the AHCI driver as
I found some evidence that this was helpful.

The problem continued with something like this:

ahcich0: Timeout on slot 29 port 0
ahcich0: is 00000000 cs 00000000 ss e0000000 rs e0000000 tfd 40 serr
00000000 cmd 0004df17

ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich0: Timeout on slot 31 port 0
ahcich0: is 00000000 cs 80000000 ss 00000000 rs 80000000 tfd 80 serr
00000000 cmd 0004df17
(ada0:ahcich0:0:0:0): lost device

ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich0: Timeout on slot 31 port 0
ahcich0: is 00000000 cs 80000003 ss 80000003 rs 80000003 tfd 80 serr
00000000 cmd 0004df17
(ada0:ahcich0:0:0:0): removing device entry

ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich0: Poll timeout on slot 1 port 0
ahcich0: is 00000000 cs 00000002 ss 00000002 rs 00000002 tfd 80 serr
00000000 cmd 0004c117

When this happens the only way to recover the system is to hard boot
via IPMI (yanking the power vs hitting reset). I cannot say that every
time this happens a hard reset is necessary but more often than not a
hard reset is necessary as the on-board AHCI portion of the BIOS does
not always see the disks after the event without a hard system power
reset.

I have done a bunch of Google work on this and have seen the issue
appear in FreeNAS and FreeBSD but no clear cut resolution in terms of
how to address it or what causes it. Some people had a bad SSD, others
had to disable NCQ or power management on their SSD, particular brands
of SSD (Samsung), etc.

Nothing conclusive so far.

At the present time the issue happens every 1-2 hours unless I have
the following in my /boot/loader.conf after the ahci_load statement:

ahci_load="YES"

# See ahci(4)
hint.ahcich.0.sata_rev=1
hint.ahcich.1.sata_rev=1
hint.ahcich.2.sata_rev=1
hint.ahcich.3.sata_rev=1

hint.ahcich.0.pm_level=1
hint.ahcich.1.pm_level=1
hint.ahcich.2.pm_level=1
hint.ahcich.3.pm_level=1

I have a script in /usr/local/etc/rc.d which disables NCQ on these drives:

#!/bin/sh

CAMCONTROL=/sbin/camcontrol

$CAMCONTROL tags ada0 -N 1 > /dev/null
$CAMCONTROL tags ada1 -N 1 > /dev/null
$CAMCONTROL tags ada2 -N 1 > /dev/null
$CAMCONTROL tags ada3 -N 1 > /dev/null

exit 0
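The per-device lines collapse naturally into a loop; here is a sketch along those lines (the four-device list matches the setup above, while the overridable CAMCONTROL path and the existence guard are assumptions added so the loop can be dry-run, not part of the original script):

```shell
#!/bin/sh
# Sketch: disable NCQ by forcing the tag depth to 1 on each SSD.
# CAMCONTROL is overridable so the loop can be dry-run without hardware.
CAMCONTROL=${CAMCONTROL:-/sbin/camcontrol}
DEVICES="ada0 ada1 ada2 ada3"

if [ -x "$CAMCONTROL" ]; then
    for dev in $DEVICES; do
        $CAMCONTROL tags "$dev" -N 1 > /dev/null || echo "tags failed on $dev" >&2
    done
fi
```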

I went ahead and pulled the Intel SSDs as they were showing ASR and
hardware reset counts which kept incrementing. Removing both of these
disks from the system did not change the situation.

The combination of /boot/loader.conf and this script gets me 6 days or
so of operation before the issue pops up again. If I remove these two
items I get maybe 2 hours before the issue happens again.

Right now I'm down to one OS disk and one swap disk and that is it for
SSD disks on the system.

At the last reboot (yesterday) I disabled APM on the disks (ada0 and
ada1 at this point) to see if that makes a difference as I found a
reference to this being a potential problem.

I'm looking for insight/help on this as I'm about out of options. If
there is a way to gather more information when this happens, post up
information, etc I'm open to trying it.

What is driving me crazy is that I can't seem to come up with a
concrete explanation as to why now and not back when the system was
built. The issue only seems to happen when the system is idle; the
SSD drives do not see much action other than hosting the OS and
scripts, while the Intel/LSI-based drives are where the actual I/O happens.

The system logs do not show anything prior to event happening and the
OS will respond to ping requests after the issue and if you have an
active SSH session you will remain connected to the system until you
attempt to do something like 'ls', 'ps', etc.

New SSH requests to the system get 'connection refused'.

As far as I can see I have three real options left:

* Hope that someone here knows something I don't
* Ditch SSD for straight SATA disks (plan on doing this next week
before next likely happening sometime Wed am) as perhaps there is some
odd SATA/SSD interaction with FreeBSD or with controller I'm not aware
of (haven't had this happen with plain SATA and FreeBSD before)
* Ditch FreeBSD for Solaris so I can keep ZFS lovin for the intended
purpose of this system

I'm open to suggestions, direction, etc to see if I can nail down what
is going on and put this issue to bed for not only myself but for
anyone else who might run into it in the future.
Peter Jeremy
2012-10-15 09:59:24 UTC
Post by nate keegan
Based on what I'm seeing for post types on freebsd-questions this
might be the best forum for this issue as it looks like some sort of a
strange issue or bug between FreeBSD 8.2/9.0 and SATA SSD drives.
This system was commissioned in February of 2012 and ran without issue
as a ZFS backup system on our network until about 3 weeks ago.
At that time I started getting kernel panics due to timeouts to the
on-board SATA devices. The only change to the system since it was
built was to add an SSD for swap (32 Gb swap device) and this issue
did not happen until several months after this was added.
This _does_ sound more like hardware than software - it's difficult
to envisage a software bug that does nothing for 6 months and then
makes the system hang regularly.

Has there been any significant change to the system load, how much
data is being transferred, clients, how full the data zpool is, etc
that might correlate with the onset of hangs?
Post by nate keegan
I then moved to systematically replacing items such as SATA cables,
memory, motherboard, etc and the problem continued. For example, I
swapped out the 4 SATA cables with brand new SATA cables and waited to
see if the problem happened again. Once it did I moved on to replacing
the motherboard with an identical motherboard, waited, etc.
Have you tried replacing RAM & PSU?
Post by nate keegan
The system logs do not show anything prior to event happening and the
OS will respond to ping requests after the issue and if you have an
active SSH session you will remain connected to the system until you
attempt to do something like 'ls', 'ps', etc.
This implies that the kernel is still active but the filesystem is
deadlocked. Are you able to drop into DDB? Is anything displayed
on the console?
Post by nate keegan
New SSH requests to the system get 'connection refused'.
This implies that sshd has died - a filesystem deadlock should result
in connection attempts either timing out or just hanging.
Post by nate keegan
I'm open to suggestions, direction, etc to see if I can nail down what
is going on and put this issue to bed for not only myself but for
anyone else who might run into it in the future.
Are you running a GENERIC kernel? If not, what changes have you made?
Have you set any loader tunables or sysctls?
Have you scrubbed the pools?
If you run "gstat -a", do any devices have anomalous readings?
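A sketch of that gstat check (flags per gstat(8); the one-second interval and the head cutoff are arbitrary choices, and the command-existence guard is an addition so the sketch is harmless elsewhere):

```shell
#!/bin/sh
# Show only devices with activity (-a), batch output (-b), refresh every
# second (-I 1s); watch the ms/r and ms/w columns for outliers.
INTERVAL="1s"
if command -v gstat > /dev/null 2>&1; then
    gstat -ab -I "$INTERVAL" | head -n 40
fi
```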

I can't offer any definite fixes but can suggest a few more things to
try:
1) Try FreeBSD-9.1RC2 and see if the problem persists.
2) Try a new kernel with
options WITNESS
options WITNESS_SKIPSPIN
this may make a software bug more obvious (but will somewhat increase
kernel overheads)
3) If you can afford it, detach the L2ARC - which removes one potential issue.
4) If you haven't already, build a kernel with
makeoptions DEBUG=-g
options KDB
options KDB_TRACE
options KDB_UNATTENDED
options DDB
this won't have any impact on normal operation but will simplify debugging.
--
Peter Jeremy
Patrick Proniewski
2012-10-15 10:35:28 UTC
Post by Peter Jeremy
This _does_ sound more like hardware than software
I do agree with that.
Post by Peter Jeremy
Have you tried replacing RAM & PSU?
I, too, was about to suggest a test or replacement of the PSU.

Also, I've had a (quite) similar problem years ago (no RAID, no ZFS, older FreeBSD…) where HDDs would detach or be lost by the system on a random basis. I searched the software side for a long time, but it was cured by a firmware update on the HDDs.

good luck with this issue.
Patrick
nate keegan
2012-10-15 14:54:44 UTC
The system is dual PSU behind a UPS so I don't think that this is an issue.

My notes show that we replaced one of the DIMMs on this system a few
months ago as it was detected as bad during a POST.

During the cycle of reboots that I have taken on with testing
resolutions to this issue I have seen a single time where the BIOS
detected a bad DIMM but only one time.

I do have a complete set of replacement memory (Crucial vs Kingston
that is in the system now) and will swap out the memory in case one of
the DIMMs is flaky but not poor enough for the BIOS to notice on a
consistent basis.

I am not able to drop into DDB when the issue happens as the system is
locked up completely. Could be a failure on my part to
understand/engage in how to do this, will try if the issue happens
again (should on Wednesday AM unless setting camcontrol apm to off for
the disks somehow fixes the issue).

I am running GENERIC kernel and have not set any loader tunables or
sysctls other than that related to addressing this issue (SATA power
management, AHCI, etc).

The problem first started around the time when we setup pool scrubbing
and at that time it was a single instance which seemed to be tied to
the bad DIMM. Have not run pool scrubbing since that time.

Will get the output of gstat -a and post it up here.

Will upgrade to FreeBSD 9.1RC2 today and compile kernel with the
options you suggested.

I already went ahead and removed the L2ARC and one of the OS SSD
drives to simplify things - now I have 1 x SSD with OS and 1 x SSD for
swap and that is it.

I ran the Crucial firmware update ISO and it did not see any firmware
updates as necessary on the SSD disks.

I appreciate the feedback as part of the difficulty here has been
making a determination of whether this is software/driver or hardware.
If software I agree that it would not make sense that this would
suddenly pop-up after months of operation with no issues.
Post by Peter Jeremy
Are you running a GENERIC kernel? If not, what changes have you made?
Have you set any loader tunables or sysctls?
Have you scrubbed the pools?
If you run "gstat -a", do any devices have anomalous readings?
I can't offer any definite fixes but can suggest a few more things to
1) Try FreeBSD-9.1RC2 and see if the problem persists.
2) Try a new kernel with
options WITNESS
options WITNESS_SKIPSPIN
this may make a software bug more obvious (but will somewhat increase
kernel overheads)
3) If you can afford it, detach the L2ARC - which removes one potential issue.
4) If you haven't already, build a kernel with
makeoptions DEBUG=-g
options KDB
options KDB_TRACE
options KDB_UNATTENDED
options DDB
this won't have any impact on normal operation but will simplify debugging.
nate keegan
2012-10-15 17:21:22 UTC
I took a look at the DDB man page and I am not able to do this when
the issue happens as the system is completely blown up (meaning no
keyboard input on the IPMI console, existing SSH sessions, etc.).

No changes have been seen in the ZFS load on the system. The nature of
this system (backup) is such that the heaviest load would be created
in the first week or so of going online as we use rsync to copy files
down from our Windows servers and during this first week or so the
system has to 'seed' the initial copies which would be much heavier on
I/O than after that first week where things are relatively constant in
terms of I/O.

I have 48 Gb of Crucial memory that I will put in this system today to
replace the 24 Gb or so of Kingston memory I have in the system. If
the issue happens again with the memory change I plan on replacing
both SSD (Crucial M4) with two non-SSD SATA disks with the idea that
maybe the Crucial firmware on the disks (002 on both disks) is the
culprit somehow.

If neither item turns out to solve the issue I will move on to 9.1-RC2
(or 9.1-RELEASE if it is out by then) and add the kernel options
requested.

The amount of monkeying that I have had to do via /boot/loader.conf
and the camcontrol script I run is telling me that the SSD, the
firmware on the SSD, etc is somehow causing the issue as we have
plenty of other FreeBSD 8.x and 9.x systems that use non-SSD SATA
drives without this issue popping up in their daily workloads.

My /boot/loader.conf looks like this currently:

# Set in the BIOS as well to activate
ahci_load="YES"

# Should be auto-negotiation in FreeBSD 9.x
# See ahci(4)
hint.ahcich.0.sata_rev=1
hint.ahcich.1.sata_rev=1

hint.ahcich.0.pm_level=1
hint.ahcich.1.pm_level=1


And /usr/local/etc/rc.d/camcontrol:

#!/bin/sh
CAMCONTROL=/sbin/camcontrol

# Disable NCQ
$CAMCONTROL tags ada0 -N 1 > /dev/null
$CAMCONTROL tags ada1 -N 1 > /dev/null

# Disable APM (ATA SET FEATURES 0xEF, subcommand 0x85 = disable APM)
$CAMCONTROL cmd ada0 -a "EF 85 00 00 00 00 00 00 00 00 00 00" > /dev/null
$CAMCONTROL cmd ada1 -a "EF 85 00 00 00 00 00 00 00 00 00 00" > /dev/null

Without both of these shims in place I get maybe 1.5 hours to two
hours or so before the system goes kablooie and that is without the
system doing any real I/O work just running FreeBSD during the
business day and a few scripts from cron to check for data and shuffle
it around.
Peter Jeremy
2012-10-15 21:55:20 UTC
Post by nate keegan
The system is dual PSU behind a UPS so I don't think that this is an issue.
OK
Post by nate keegan
I do have a complete set of replacement memory (Crucial vs Kingston
that is in the system now) and will swap out the memory in case one of
the DIMMs is flaky but not poor enough for the BIOS to notice on a
consistent basis.
I presume this is registered ECC RAM - which makes it more robust.
Non-ECC RAM can develop pattern-sensitive faults - which are virtually
impossible to test for. And BIOS RAM 'tests' generally can't be
relied on to do much more than verify that something is responding.
Swapping RAM is the best way to rule out RAM issues.
Post by nate keegan
I am not able to drop into DDB when the issue happens as the system is
locked up completely.
That's surprising. I haven't seen a failure mode where the kernel
will respond to pings but not the console.
Post by nate keegan
Will get the output of gstat -a and post it up here.
"gstat -a" gives a dynamic picture of disk activity. I was hoping
you could watch it for a minute or so (on a tall window) whilst
the system was running and see if any disks look odd - significantly
higher or lower than expected I/O volume or long ms/r or ms/w.
Post by nate keegan
I took a look at the DDB man page and I am not able to do this when
the issue happens as the system is completely blown up (meaning no
keyboard input on the IPMI console, existing SSH sessions, etc.).
Note that I'm referring to ddb(4), not ddb(8). The former is
entered via a "magic" key sequence on the console and should work
even if the system won't react to normal commands. To enter ddb,
use Ctrl-Alt-ESC on a graphical console or the character sequence
CR ~ Ctrl-B on a serial console (in the latter case, the sysctl
debug.kdb.alt_break_to_debugger also needs to be set to 1).

If you do get into ddb, a useful set of initial commands is:
show all procs
show alllocks
show allpcpu
show lockedvnods
call doadump

Note that the first 4 commands will generate lots of output - ideally
you would have a serial console with logging. The last command
generates a crashdump and needs 'dumpdev="AUTO"' in /etc/rc.conf (run
"service dumpon start" after editing rc.conf to enable it without
rebooting).
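Pulled together, the crash-dump and serial-break prerequisites might look like this (file placement per rc.conf(5) and sysctl.conf(5); nothing here beyond what ddb(4) and the advice above describe):

```shell
# /etc/rc.conf -- let savecore(8) pick the dump device automatically
dumpdev="AUTO"

# /etc/sysctl.conf -- allow CR ~ Ctrl-B on the serial console to enter ddb(4)
debug.kdb.alt_break_to_debugger=1
```

Both can be applied without a reboot: "service dumpon start" after editing rc.conf, and "sysctl debug.kdb.alt_break_to_debugger=1" for the sysctl.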
Post by nate keegan
The amount of monkeying that I have had to do via /boot/loader.conf
and the camcontrol script I run is telling me that the SSD, the
firmware on the SSD, etc is somehow causing the issue as we have
plenty of other FreeBSD 8.x and 9.x systems that use non-SSD SATA
drives without this issue popping up in their daily workloads.
Are you able to move the SSD(s) to a different type of SATA port? One
(not especially likely) possibility is it's an interaction between the
SSD and the SATA controller.
--
Peter Jeremy
Dieter BSD
2012-10-15 22:17:09 UTC
Post by nate keegan
SSD are connected to on-board SATA port on motherboard
Presumably to controllers provided by the Intel Tylersburg 5520 chipset.
Post by nate keegan
This system was commissioned in February of 2012 and ran without issue
as a ZFS backup system on our network until about 3 weeks ago.
The system is dual PSU behind a UPS so I don't think that this is an issue.
No changes? e.g. no added hardware to increase power load.
Overloading the power supply and/or the wiring (with too many splitters)
can result in flaky problems like this.
Post by nate keegan
OS will respond to ping requests after the issue and if you have an
active SSH session you will remain connected to the system until you
attempt to do something like 'ls', 'ps', etc.
I am not able to drop into DDB when the issue happens as the system is
locked up completely. Could be a failure on my part to
understand/engage in how to do this, will try if the issue happens
again (should on Wednesday AM unless setting camcontrol apm to off for
the disks somehow fixes the issue).
If the system is alive enough to respond to ping, I'd expect you
should be able to get into DDB? Can you get into DDB when the system
is working normally?
Post by nate keegan
2 x Crucial M4 64 Gb SATA SSD for FreeBSD OS (zroot)
2 x Intel 320 MLC 80 Gb SATA SSD for L2ARC and swap
I ran the Crucial firmware update ISO and it did not see any firmware
updates as necessary on the SSD disks.
Does the problem happen with both the Crucial and the Intel SSDs?
Post by nate keegan
If software I agree that it would not make sense that this would
suddenly pop-up after months of operation with no issues.
If something causes the software/firmware to take a different
path, new issues can appear. E.g. error handling or even timing.
Infrequently used code paths might not have been tested sufficiently.

Does the controller have firmware? Part of the BIOS I suppose.
Is there a BIOS update available? Have you considered connecting the
SSDs to a different controller?
Post by nate keegan
the on-board AHCI portion of the BIOS does
not always see the disks after the event without a hard system power
reset.
That's at least one bug somewhere, probably the hardware isn't getting reset
properly. Does Supermicro know about this bug?
Post by nate keegan
I have 48 Gb of Crucial memory that I will put in this system today to
replace the 24 Gb or so of Kingston memory I have in the system.
Which in addition to being different memory, should reduce swap activity.

Suggestion: move everything to conventional drives. Keep at least one
SSD connected to system, but normally unused. Now you can beat on the
SSD in a controlled manner to debug the problem. Does reading trigger
the problem? Writing? Try dd with different blocksizes, accessing
multiple SSDs at once, etc. I have to wonder if there is a timing problem,
or missing interrupt, or...
Post by nate keegan
* Ditch FreeBSD for Solaris so I can keep ZFS lovin for the intended
purpose of this system
If it fails with FreeBSD but works with Solaris on the same hardware,
then it is almost certainly a problem with the device driver. (Or
at least a problem that Solaris has a workaround for.)
nate keegan
2012-10-16 19:48:32 UTC
I'm only seeing gstat output of a few percentage points for the OS disks.

I am using ECC memory (both the Kingston and the new Crucial memory)
and went ahead and swapped out the SSD for SATA disks this morning.

Since both SSD were the same firmware and type/manufacturer I figured
it was a good time to address this variable.

I also went ahead and put in a serial console server this morning so I
have proper console access instead of relying on the Supermicro iLO
utility.

Will keep an eye on the pure SATA setup to see if it barfs or not.
Will try to gather some ddb(4) information if it does barf again.
Post by Dieter BSD
Post by nate keegan
SSD are connected to on-board SATA port on motherboard
Presumably to controllers provided by the Intel Tylersburg 5520 chipset.
Post by nate keegan
This system was commissioned in February of 2012 and ran without issue
as a ZFS backup system on our network until about 3 weeks ago.
The system is dual PSU behind a UPS so I don't think that this is an issue.
No changes? e.g. no added hardware to increase power load.
Overloading the power supply and/or the wiring (with too many splitters)
can result in flaky problems like this.
Post by nate keegan
OS will respond to ping requests after the issue and if you have an
active SSH session you will remain connected to the system until you
attempt to do something like 'ls', 'ps', etc.
I am not able to drop into DDB when the issue happens as the system is
locked up completely. Could be a failure on my part to
understand/engage in how to do this, will try if the issue happens
again (should on Wednesday AM unless setting camcontrol apm to off for
the disks somehow fixes the issue).
If the system is alive enough to respond to ping, I'd expect you
should be able to get into DDB? Can you get into DDB when the system
is working normally?
Post by nate keegan
2 x Crucial M4 64 Gb SATA SSD for FreeBSD OS (zroot)
2 x Intel 320 MLC 80 Gb SATA SSD for L2ARC and swap
I ran the Crucial firmware update ISO and it did not see any firmware
updates as necessary on the SSD disks.
Does the problem happen with both the Crucial and the Intel SSDs?
Post by nate keegan
If software I agree that it would not make sense that this would
suddenly pop-up after months of operation with no issues.
If something causes the software/firmware to take a different
path, new issues can appear. E.g. error handling or even timing.
Infrequently used code paths might not have been tested sufficiently.
Does the controller have firmware? Part of the BIOS I suppose.
Is there a BIOS update available? Have you considered connecting the
SSDs to a different controller?
Post by nate keegan
the on-board AHCI portion of the BIOS does
not always see the disks after the event without a hard system power
reset.
That's at least one bug somewhere, probably the hardware isn't getting reset
properly. Does Supermicro know about this bug?
Post by nate keegan
I have 48 Gb of Crucial memory that I will put in this system today to
replace the 24 Gb or so of Kingston memory I have in the system.
Which in addition to being different memory, should reduce swap activity.
Suggestion: move everything to conventional drives. Keep at least one
SSD connected to system, but normally unused. Now you can beat on the
SSD in a controlled manner to debug the problem. Does reading trigger
the problem? Writing? Try dd with different blocksizes, accessing
multiple SSDs at once, etc. I have to wonder if there is a timing problem,
or missing interrupt, or...
Post by nate keegan
* Ditch FreeBSD for Solaris so I can keep ZFS lovin for the intended
purpose of this system
If it fails with FreeBSD but works with Solaris on the same hardware,
then it is almost certainly a problem with the device driver. (Or
at least a problem that Solaris has a workaround for.)
_______________________________________________
http://lists.freebsd.org/mailman/listinfo/freebsd-hardware
nate keegan
2012-10-23 19:45:42 UTC
Since replacing the SSD disks with good old plain SATA in external
enclosures I have not experienced a single issue.

I can only surmise that something is wonky with the Crucial M4
firmware with FreeBSD 8.2/9.0 under certain circumstances.

Thanks to everyone who contributed on this as the information about
debugging kernels, etc was very helpful from a procedural point of
view.
Post by nate keegan
I'm only seeing gstat output of a few percentage points for the OS disks.
I am using ECC memory (both the Kingston and the new Crucial memory)
and went ahead and swapped out the SSD for SATA disks this morning.
Since both SSD were the same firmware and type/manufacturer I figured
it was a good time to address this variable.
I also went ahead and put in a serial console server this morning so I
have proper console access instead of relying on the Supermicro iLO
utility.
Will keep an eye on the pure SATA setup to see if it barfs or not.
Will try to gather some ddb(4) information if it does barf again.
Peter Jeremy
2012-10-23 19:46:28 UTC
Post by nate keegan
Will keep an eye on the pure SATA setup to see if it barfs or not.
Will try to gather some ddb(4) information if it does barf again.
Any news on this?
--
Peter Jeremy
Lanny Baron
2012-11-09 10:30:33 UTC
Hi,
I don't know how far apart you added memory from the time you
bought/built your server. I say that because the DRAM chips on the
modules might be slightly different. When we build servers, we use a
particular brand for certain reasons, one of which is that the DRAM
specs do not change on a given SKU.

Here is what I recommend you try.

Take out all the memory and add one DIMM only. See if the problem
persists. If the problem stops, add a second DIMM. Still good? Add a
third DIMM, and keep adding DIMMs one by one. When the problem comes
back, remove all DIMMs again and put the last DIMM you added (the one
where the problem came back) in the first slot. If the problem persists,
you found the winner. If not, add all the DIMMs back except the one you
just tested and use another DIMM. If the problem then persists, you have
found a bad memory slot.

It's a real PITA <tm> but that is the only way to find the issue if it
is indeed memory or a bad memory slot.

One more thing you should try: did you enable IPMI? If so, run

# ipmitool -H x.x.x.x sel list

and take a look at the output. If you did not enable IPMI (IP
address/netmask/gateway), the BIOS should have a place to do so. Sorry,
we don't sell/build Supermicro, so I am unfamiliar with those boards.
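That SEL check as a runnable sketch (the x.x.x.x address stays a placeholder, and the ADMIN user name is an assumption; credentials vary per BMC):

```shell
#!/bin/sh
# Dump the BMC's System Event Log; ECC events from a failing DIMM usually
# land here with a timestamp and slot number. The guard keeps the sketch
# inert until a real BMC address is filled in.
BMC_HOST="x.x.x.x"   # placeholder -- substitute your BMC's IP
BMC_USER="ADMIN"     # assumption -- use whatever user your BMC is set up with
if [ "$BMC_HOST" != "x.x.x.x" ] && command -v ipmitool > /dev/null 2>&1; then
    ipmitool -H "$BMC_HOST" -U "$BMC_USER" sel list
fi
```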

If you are using both Kingston and Crucial, just use one of them; do
not mix them.

Hope this can help you out.

Lanny
Servaris Corporation
http://www.servaris.com
Post by nate keegan
I'm only seeing gstat output of a few percentage points for the OS disks.
I am using ECC memory (both the Kingston and the new Crucial memory)
and went ahead and swapped out the SSD for SATA disks this morning.
Since both SSD were the same firmware and type/manufacturer I figured
it was a good time to address this variable.
I also went ahead and put in a serial console server this morning so I
have proper console access instead of relying on the Supermicro iLO
utility.
Will keep an eye on the pure SATA setup to see if it barfs or not.
Will try to gather some ddb(4) information if it does barf again.