MCA error, possible causes?
Ultima
2016-02-13 01:11:51 UTC
I recently installed some CPUs and received two MCA errors. Using mcelog, I
found that the version in ports is about 5 years out of date and didn't
support my CPU. I updated it to the newest version (will post on
Bugzilla shortly) to pull some more info. Below are the raw MCA records
and the decoded mcelog output.


Raw:
MCA: Bank 20, Status 0xc800084000310e0f
MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x306f1, APIC ID 0
MCA: CPU 0 COR (33) OVER BUSLG ??? ERR Other
MCA: Misc 0x1df87b000d9eff
MCA: Bank 5, Status 0xc800008000310e0f
MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x306f1, APIC ID 42
MCA: CPU 34 COR (2) OVER BUSLG ??? ERR Other
MCA: Misc 0xdf87b008d9eff

mcelog v131:
Hardware event. This is not a software error.
CPU 0 BANK 20
MISC 1df87b000d9eff
MCG status:
QPI: Rx detected CRC error - successful LLR wihout Phy re-init
STATUS c800084000310e0f MCGSTATUS 0
MCGCAP 7000c16 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 63
Hardware event. This is not a software error.
CPU 34 BANK 5
MISC df87b008d9eff
MCG status:
QPI: Rx detected CRC error - successful LLR wihout Phy re-init
STATUS c800008000310e0f MCGSTATUS 0
MCGCAP 7000c16 APICID 2a SOCKETID 0
CPUID Vendor Intel Family 6 Model 63
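
For readers who want to see how the raw "Status" values relate to the decoded fields, here is a minimal sketch in Python. It assumes the architectural IA32_MCi_STATUS bit layout from the Intel SDM (with the corrected-error count in bits 52:38, which applies when MCG_CAP bit 11 is set, as it is in 0x7000c16). The function name decode_mci_status and the dictionary keys are made up for illustration; this is not how mcelog or the kernel implements decoding, and it does not translate the model-specific code into the QPI message text.

```python
def decode_mci_status(status: int) -> dict:
    """Split a 64-bit MCi_STATUS value into its architectural fields.

    Hypothetical helper for illustration only; bit positions are assumed
    from the Intel SDM machine-check architecture chapter.
    """
    mca_code = status & 0xFFFF  # bits 15:0, architectural MCA error code
    fields = {
        "valid": bool((status >> 63) & 1),            # VAL
        "overflow": bool((status >> 62) & 1),         # OVER
        "uncorrected": bool((status >> 61) & 1),      # UC
        "enabled": bool((status >> 60) & 1),          # EN
        "misc_valid": bool((status >> 59) & 1),       # MISCV
        "addr_valid": bool((status >> 58) & 1),       # ADDRV
        "context_corrupt": bool((status >> 57) & 1),  # PCC
        "corrected_count": (status >> 38) & 0x7FFF,   # bits 52:38
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
        "mca_code": mca_code,
    }
    # Bus/interconnect compound codes have the form 0000 1PPT RRRR IILL.
    fields["is_bus_error"] = (mca_code & 0xF800) == 0x0800
    if fields["is_bus_error"]:
        fields["participation"] = (mca_code >> 9) & 0x3  # PP
        fields["timeout"] = bool((mca_code >> 8) & 1)    # T
        fields["request"] = (mca_code >> 4) & 0xF        # RRRR
        fields["mem_or_io"] = (mca_code >> 2) & 0x3      # II
        fields["level"] = mca_code & 0x3                 # LL
    return fields


if __name__ == "__main__":
    for raw in (0xC800084000310E0F, 0xC800008000310E0F):
        decoded = decode_mci_status(raw)
        print(hex(raw), decoded)
```

Applied to the two values above, the corrected-error counts come out as 33 and 2, matching "COR (33)" and "COR (2)" in the raw lines. Both records have OVER set and UC clear, and both carry a bus/interconnect error code.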

After receiving this error, the system was in a frozen state. Any ideas
what may cause this?


Please cc me, I'm not on this list. Thanks =]

Ultima
John Baldwin
2016-02-24 20:17:23 UTC
Post by Ultima
After receiving this error, the system was in a frozen state. Any ideas
what may cause this?
Well, hardware causes it. QPI is the point-to-point interconnect between
your CPU sockets (accesses to RAM attached to the other socket also
travel over it). "Rx detected CRC error" implies that a CPU detected a
corrupted message on that link, but when it requested a resend the
resent message was OK. Normally corrected errors shouldn't hang your
machine, but perhaps your machine had another hardware error after this
one that broke it too badly to report and/or log the subsequent error.
--
John Baldwin
Ultima
2016-02-24 20:51:15 UTC
Hi John,

Thanks for the explanation. I ran some tests and the cause ended up being a
power-saving mode (aka unstable mode?). Disabling this feature put an end to
the freezes. I came to this conclusion by stress testing the box for 3 days,
during which there were no issues at all. Then I stopped the stress test and
about 15-30 minutes later it froze. The freezes seemed to occur only during
periods of low load. I have not received any of these errors since turning
off this power-saving mode.