Strange problem with... ZFS? Disk? Controller?

Discussion:

Alex Povolotsky

2012-12-22 08:47:24 UTC

Hello,

I'm running FreeBSD 9.0/amd64 with a pure ZFS setup: one Seagate disk
ST2000NM0011 SN02 on an LSI Logic (mpt) controller.

Yes, I know that running a single disk on a RAID controller is a bit weird;
I have yet to find out whether the disk can be connected to the internal
SATA controller.

About two days ago the system became SLOW. Disk usage is constantly at 100%,
and sometimes I get a "swap_pager: indefinite wait buffer" error. I
have had to reset the computer twice in two days.

mptutil does not show any errors, and smartctl shows:

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   067   063   044    Pre-fail  Always       -       6218970
  3 Spin_Up_Time            0x0003   093   092   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       14
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       21
  7 Seek_Error_Rate         0x000f   091   060   030    Pre-fail  Always       -       1433294073
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       8825
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       16
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       12885098499
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   068   047   045    Old_age   Always       -       32 (Min/Max 31/32)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       859
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       15
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       26
194 Temperature_Celsius     0x0022   032   053   000    Old_age   Always       -       32 (0 21 0 0 0)
195 Hardware_ECC_Recovered  0x001a   103   099   000    Old_age   Always       -       6218970
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged
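
One way to read the table above is to compare each normalized VALUE (and the lifetime WORST) against THRESH; smartmontools treats an attribute as failed when the normalized value is less than or equal to its threshold. The following is only a rough sketch of that comparison in Python, using the numbers copied from the output above. The names Attr, parse and tripped are made up for illustration and are not part of smartmontools.

```python
from typing import NamedTuple


class Attr(NamedTuple):
    ident: int
    name: str
    value: int   # current normalized value
    worst: int   # lowest normalized value seen so far
    thresh: int  # vendor threshold


# ID, name, VALUE, WORST, THRESH copied from the smartctl output in this post.
TABLE = """\
1 Raw_Read_Error_Rate 067 063 044
3 Spin_Up_Time 093 092 000
4 Start_Stop_Count 100 100 020
5 Reallocated_Sector_Ct 100 100 036
7 Seek_Error_Rate 091 060 030
9 Power_On_Hours 090 090 000
10 Spin_Retry_Count 100 100 097
12 Power_Cycle_Count 100 100 020
184 End-to-End_Error 100 100 099
187 Reported_Uncorrect 100 100 000
188 Command_Timeout 100 099 000
189 High_Fly_Writes 100 100 000
190 Airflow_Temperature_Cel 068 047 045
191 G-Sense_Error_Rate 100 100 000
192 Power-Off_Retract_Count 100 100 000
193 Load_Cycle_Count 100 100 000
194 Temperature_Celsius 032 053 000
195 Hardware_ECC_Recovered 103 099 000
197 Current_Pending_Sector 100 100 000
198 Offline_Uncorrectable 100 100 000
199 UDMA_CRC_Error_Count 200 200 000
"""


def parse(table: str) -> list[Attr]:
    """Turn the whitespace-separated rows into Attr records."""
    attrs = []
    for line in table.splitlines():
        ident, name, value, worst, thresh = line.split()
        attrs.append(Attr(int(ident), name, int(value), int(worst), int(thresh)))
    return attrs


def tripped(attrs: list[Attr], column: str = "value") -> list[str]:
    """Names of attributes whose chosen column is at or below the threshold."""
    return [a.name for a in attrs if getattr(a, column) <= a.thresh]


if __name__ == "__main__":
    attrs = parse(TABLE)
    print("failing now:", tripped(attrs, "value"))
    print("failed in the past:", tripped(attrs, "worst"))
```

Both lists come out empty for this drive, which matches the "-" in every WHEN_FAILED field. That only says no threshold has been crossed; it does not by itself prove the disk is healthy.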

I have removed most of snapshots, it does not help.

I have stopped all active processes, disk load did not decrease, same 100%.

What can I check and/or replace to get the problem fixed? Any ideas?

Alex

Derek Kulinski

2012-12-22 09:10:33 UTC

Hello Alex,

SMART values are collected by the disk itself (smartmontools only
reads them).

This would imply that the problem is between disk and controller.

Since you have a very large Hardware_ECC_Recovered count and a zero
UDMA_CRC_Error_Count, I would think that the problem is with the disk
itself.
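
A caveat when reading those raw numbers: it is commonly reported, though as far as I know not documented by the vendor, that Seagate drives pack several counters into one 48-bit raw value, so a huge raw number is not necessarily a huge error count. The sketch below assumes that convention (upper 16 bits as an error count, lower 32 bits as an operation count) and simply splits the values from the output quoted in this thread. The function names are hypothetical and the interpretation is an assumption, not something the thread confirms.

```python
def words16(raw: int) -> list[int]:
    """Split a 48-bit raw value into three 16-bit words, most significant first."""
    return [(raw >> shift) & 0xFFFF for shift in (32, 16, 0)]


def split_rate(raw: int) -> tuple[int, int]:
    """Assumed layout: (upper 16 bits, lower 32 bits) of a 48-bit raw value."""
    return raw >> 32, raw & 0xFFFFFFFF


if __name__ == "__main__":
    # Raw values copied from the smartctl output in this thread.
    print("Command_Timeout words:", words16(12885098499))
    print("Raw_Read_Error_Rate (high, low):", split_rate(6218970))
    print("Seek_Error_Rate (high, low):", split_rate(1433294073))
```

Under that assumption Command_Timeout 12885098499 is just three small counters of 3 each, and both Raw_Read_Error_Rate and Seek_Error_Rate have a zero in the upper 16 bits. Note also that Hardware_ECC_Recovered reports exactly the same raw number as Raw_Read_Error_Rate (6218970).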

I think the long waits are due to the disk trying to re-read a given
sector multiple times.

Your drive is 2TB, and according to this the bigger the drive the more
likely you'll run into problems like these:
http://forums.storagereview.com/index.php/topic/27994-smart-hardware-ecc-recovered-values/

I don't know how serious it is but if you keep anything important
there I would recommend a backup.

You should try SMART self tests.

Best regards,
Derek

--
Best regards,
Derek mailto:***@cs.ucla.edu

If you choke a Smurf, what color does it turn?

Patrick Proniewski

2012-12-22 09:55:54 UTC

Post by Derek Kulinski
Your drive is 2TB, and according to this the bigger the drive the more
http://forums.storagereview.com/index.php/topic/27994-smart-hardware-ecc-recovered-values/

Thanks, Derek, for this interesting pointer. It's the first time I've read about a problem like this... It's frightening. I had no idea big drives could have such problems.
Any other source that would confirm the issue?

patpro

Mark Felder

2012-12-22 09:26:30 UTC

Try running diskinfo -t /dev/...

If it says your device is really slow it's probably dying. I'd suspect it's having trouble seeking.

Alex Povolotsky

2013-01-17 13:29:18 UTC

Post by Mark Felder
Try running diskinfo -t /dev/...
If it says your device is really slow it's probably dying. I'd suspect it's having trouble seeking.

It was a break-in. Some dumb PHP script running with user privileges
managed to make FreeBSD hang on disk I/O, to the point where it stopped
responding to anything except a reset.

Alex

Mark Felder

2013-01-17 13:57:12 UTC

On Thu, 17 Jan 2013 07:22:26 -0600, Alex Povolotsky

Post by Alex Povolotsky

Post by Mark Felder
Try running diskinfo -t /dev/...
If it says your device is really slow it's probably dying. I'd suspect
it's having trouble seeking.

It was a break-in. Some dumb php script running with user privileges
managed FreeBSD to hang on disk io up to stopping responding to anything
besides reset.
Alex

Yikes! Make sure to run "freebsd-update IDS" to check the base OS's
checksums, and if you're using pkgng you can use "pkg check -s" to look for
any tampered-with files owned by packages.
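
The idea behind both commands is to compare checksums of installed files against known-good values. Below is a minimal Python sketch of that general idea only; it is not how freebsd-update or pkg are implemented, and the function names and manifest format are invented for illustration. It also assumes the manifest was recorded before the break-in and kept somewhere the attacker could not modify.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """SHA-256 of a file, read in chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: str) -> dict[str, str]:
    """Map each file under root (by relative path) to its SHA-256."""
    base = Path(root)
    return {
        str(p.relative_to(base)): sha256_of(p)
        for p in sorted(base.rglob("*"))
        if p.is_file()
    }


def find_tampered(root: str, manifest: dict[str, str]) -> list[str]:
    """Relative paths from the manifest that are now missing or have changed."""
    current = build_manifest(root)
    return sorted(
        name for name, digest in manifest.items() if current.get(name) != digest
    )
```

A typical use would be to call build_manifest on a known-clean tree, store the result offline, and later run find_tampered against the live tree. Files added after the manifest was taken are not reported by this sketch.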