ECC support

Post by Dieter BSD
Assuming that a board does have the necessary connections but
the firmware does not have ECC support, is there some reason that
ECC support could not be added to the OS instead of the firmware?

Yes, there is. The memory controller is programmed by the code that runs from
ROM and uses no RAM (or the CPU cache is used as the RAM). Once the real RAM
gets used it's too late to reprogram the DRAM controller. This is true at least
for most or all of the modern day x86 hardware.

--
Andriy Gapon

Konstantin Belousov

2015-09-16 03:59:32 UTC

Post by Andriy Gapon

For modern Intel hardware, the IMC config is locked before BIOS passes
the control to the user code, i.e. OS loader. It does not help much that
the documentation for IMC is not provided even under NDA.

Bob Bishop

2015-09-16 07:52:19 UTC

Hi,

Arriving late to this thread, a few observations:

- Obviously the more RAM you have, the more errors you are going to see. In other words, ECC makes increasing sense as RAM sizes get larger. All server-class hardware should have it.

- DRAM has to be refreshed. In sensible designs, ECC scrub is integrated with refresh to minimise overhead. It doesn’t have to be very frequent, maybe every 24 hours.

- On server-class hardware, the platform management (BMC or whatever) should be picking up, logging, and possibly alarming on ECC errors regardless of the OS.

- You might think that as memory density increases (i.e. bit cell size shrinks), error rates would increase. Apparently this wasn’t so up to 2009 at least; see:

http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf

which reports on a study of these issues across Google’s estate at the time. I don’t know of any more recent similar work.

--
Bob Bishop
***@gid.co.uk

Igor Mozolevsky

2015-09-16 10:49:34 UTC

On 16 September 2015 at 08:51, Bob Bishop <***@gid.co.uk> wrote:

<snip>

Post by Bob Bishop
- You might think that as memory density increases (ie bit cell size
shrinks), error rates would increase. Apparently this wasn’t so up to 2009
http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf

subsection 5.1:

"… Figure 6 indicates a trend towards worse error behavior
for increased capacities, although this trend is not consistent.
While in some cases the doubling of capacity has a
clear negative effect (factors larger than 1 in the graph),
in others it has hardly any effect (factor close to 1 in the
graph). For example, for Platform A-Mfg1 and Platform F-Mfg1
doubling the capacity increases uncorrectable errors,
but not correctable errors. Conversely, for Platform D-Mfg6
doubling the capacity affects correctable errors, but
not uncorrectable error."

There are also other environmental factors which would be more apparent in
"lone-server" configuration vs well maintained and insulated data centres
with very good power conditioning ;-)

--
Igor M.

Bob Bishop

2015-09-16 11:35:26 UTC

Hi,

Post by Igor Mozolevsky
<snip>

Post by Bob Bishop
- You might think that as memory density increases (ie bit cell size
shrinks), error rates would increase. Apparently this wasn’t so up to 2009
http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf

"… Figure 6 indicates a trend towards worse error behavior
for increased capacities, although this trend is not consistent. [etc]

That’s talking about DIMM capacity, not the capacity (density) of individual chips on which they say (at the end of the same subsection):

"The best we can conclude therefore is that any chip size effect is unlikely to dominate error rates given that the trends are not consistent across various other confounders such as age and manufacturer."

I’ll admit to talking that point up a bit but it is counterintuitive. Memory designers have always been scared of cosmic rays etc but the suspected effects simply have not been noticeable. Most likely as they shrink features ever smaller, other factors like material purity dominate.

Post by Igor Mozolevsky
There are also other environmental factors which would be more apparent in
"lone-server" configuration vs well maintained and insulated data centres
with very good power conditioning ;-)

Indeed, and that’s a whole other PITA. We went to colo and never looked back, but low-power options for small servers are getting better.

Post by Igor Mozolevsky
--
Igor M.
_______________________________________________
https://lists.freebsd.org/mailman/listinfo/freebsd-hardware

--
Bob Bishop
***@gid.co.uk

Igor Mozolevsky

2015-09-16 11:53:55 UTC

On 16 September 2015 at 12:34, Bob Bishop <***@gid.co.uk> wrote:

<snip>

Post by Bob Bishop
"The best we can conclude therefore is that any chip size effect is
unlikely to dominate error rates given that the trends are not consistent
across various other confounders such as age and manufacturer."
I’ll admit to talking that point up a bit but it is counterintuitive.
Memory designers have always been scared of cosmic rays etc but the
suspected effects simply have not been noticeable. Most likely as they
shrink features ever smaller, other factors like material purity dominate.

I saw that after I posted, and had a long ponder as to why it would be so.
The only thing I could think of is that the fab process was(/is?) large
enough to not worry about "nonsense" like cosmic rays &c (but then I've not
had much exposure to semi-conductor electronics theory since late 90s).
Perhaps we're at a point where the fab process can't really shrink much
more with DRAM due to the underlying tech (effectively many tiny RC
circuits), which is the reason the manufacturers just stack ranks to get
more capacity per DIMM instead of packing more in a single chip?..

--
Igor M.

Bob Bishop

2015-09-16 12:04:42 UTC

Post by Igor Mozolevsky
<snip>

Dunno. I’ll ask my tame semiconductor expert when I see him tomorrow...


--
Bob Bishop
***@gid.co.uk

Bob Bishop

2015-09-17 23:06:09 UTC

Hi,

[…]The only thing I could think of is that the fab process was(/is?) large
enough to not worry about "nonsense" like cosmic rays &c (but then I've not
had much exposure to semi-conductor electronics theory since late 90s).
[…]

Dunno. I’ll ask my tame semiconductor expert when I see him tomorrow…

The answer is quite interesting. A few process shrinks ago, alpha particle effects were becoming worryingly intrusive and everybody was concerned how much smaller features on ICs could actually be pushed.

Then they did the next process shrink, and the effects disappeared completely! A couple more shrinks later and they still haven’t reappeared. Nobody understands why, but they don’t worry about it any more.

--
Bob Bishop

Jim Thompson

2015-09-15 21:52:49 UTC

ECC is implemented by a ‘hashing’ algorithm that works on eight (8) bytes (64 bits) at a time, and places the result into an 8-bit ECC ‘word’.
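
As an illustration of the 64-data-bit / 8-check-bit arrangement described above, here is a minimal sketch of a textbook extended Hamming (SECDED) code: 7 Hamming check bits plus 1 overall parity bit. This is an assumption made for illustration only. The thread does not say which code any particular memory controller uses, and the bit layout and the names `encode` and `decode` are hypothetical.

```python
# Textbook SECDED sketch: 64 data bits + 8 check bits = 72-bit codeword.
# Bit 0 holds overall parity; bits 1..71 form a Hamming code whose check
# bits sit at the power-of-two positions. Layout is illustrative only.
PARITY_POS = [1, 2, 4, 8, 16, 32, 64]
DATA_POS = [p for p in range(1, 72) if p not in PARITY_POS]  # 64 positions


def _syndrome(word: int) -> int:
    """XOR of the positions (1..71) of all set bits."""
    syn = 0
    for pos in range(1, 72):
        if (word >> pos) & 1:
            syn ^= pos
    return syn


def encode(data: int) -> int:
    """Spread 64 data bits over the codeword and fill in the 8 check bits."""
    assert 0 <= data < (1 << 64)
    word = 0
    for i, pos in enumerate(DATA_POS):
        if (data >> i) & 1:
            word |= 1 << pos
    syn = _syndrome(word)
    for p in PARITY_POS:          # choose check bits so the syndrome becomes 0
        if syn & p:
            word |= 1 << p
    if bin(word).count("1") % 2:  # overall parity: make the bit count even
        word |= 1
    return word


def decode(word: int):
    """Return (data, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    syn = _syndrome(word)
    odd = bin(word).count("1") % 2
    if syn == 0 and not odd:
        status = "ok"
    elif odd:
        # An odd number of flips is treated as a single-bit error at
        # position syn (syn == 0 means the overall parity bit itself).
        word ^= 1 << syn
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits flipped.
        return None, "uncorrectable"
    data = 0
    for i, pos in enumerate(DATA_POS):
        if (word >> pos) & 1:
            data |= 1 << i
    return data, status
```

In this sketch a single flipped bit is repaired on every read, while two flipped bits in the same 72-bit word are detected but cannot be repaired, which is the distinction the rest of the thread relies on.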

Errors are corrected "on-the-fly," corrected data is almost never placed back in memory. If the same corrupt data is read again, the correction process is repeated. Replacing the data in memory would require processing overhead that could accumulate and significantly diminish system performance. If the error occurred because of random events and isn't a defect in the memory, the memory address will be cleaned of the error when the data is overwritten with other data.

In terms of expense, at a minimum, where you had 8 bytes to make up a memory system, you will now have 9 (to hold the extra 8 bits). This means your memory, without the extra complexity of the controller, is 12.5% more expensive. This isn’t a huge impact at 8GB, (you’ll need another 1GB of RAM), but at 1024GB you’ll need another 128GB, and that much ram still costs enough that your wallet won’t be happy.
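
The arithmetic above can be checked with a short sketch. The function name `ecc_extra_gb` is hypothetical, and the only assumption is the 8 check bits per 64 data bits stated earlier in this message.

```python
def ecc_extra_gb(usable_gb: float, data_bits: int = 64, check_bits: int = 8) -> float:
    """Extra RAM needed to hold the check bits for a given usable capacity."""
    return usable_gb * check_bits / data_bits


for size in (8, 1024):
    extra = ecc_extra_gb(size)
    print(f"{size} GB usable needs {extra:g} GB extra ({extra / size:.1%} overhead)")
```

The ratio is the same 12.5% at every size; only the absolute amount of extra RAM grows.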

The memory controller has to be able to run the ECC algorithm on every read, *and* supply the corrected data as needed, within the cycle time of the read. If you involve software in this path, the performance of your machine will be glacial.

Yes, the firmware has to program the memory controller. “Program a few registers” is all you need, except that the MRC setup on Intel and AMD is both complex and proprietary. Good luck getting the details for this. This is “Intel Red Book” territory, and you’ll need to be an employee with a need to know. The MRC setup code is a binary blob for otherwise open source boot firmware such as Coreboot.

Others have answered (in the positive) about the OS reporting ECC errors on FreeBSD.

Jim

Post by Dieter BSD
Many of AMD's CPU/APU parts support ECC memory. Not just the top of the
line parts, but also many of the less expensive, less power hungry parts.
However, many (most?) of the boards for these chips do not support ECC,
or at least do not admit to it. They specify "non-ECC memory".
Obviously there have to be connections between the memory controller and
the memory for the extra bits. Aside from a little extra time for the
board designer to add a few traces to the wire list, this would not
raise the cost of the board. Despite this I have read that some boards
lack the necessary traces.
Does the firmware have to do anything to support ECC? Program a few
registers in the memory controller perhaps? A few boards have FLOSS
firmware available, so this code could be added, but most boards do not
have firmware sources available.
Assuming that a board does have the necessary connections but
the firmware does not have ECC support, is there some reason that
ECC support could not be added to the OS instead of the firmware?
I grepped through FreeBSD 8.2 and 10.1 sources but couldn't find
anything that looked relevant. Also did not find any code that
reported ECC errors, other than one device. Perhaps I missed it?
I've been running machines with ECC for 15-20 years and have never seen
a report of an ECC error from either NetBSD or FreeBSD. I have seen
reports of ECC errors from Digital Unix. And remember getting panics
due to parity errors on machines before ECC. So I'm thinking that
the BSDs must ignore hardware reports of single bit ECC errors. :-(
_______________________________________________
https://lists.freebsd.org/mailman/listinfo/freebsd-hackers

Igor Mozolevsky

2015-09-15 22:20:24 UTC

On 15 September 2015 at 22:52, Jim Thompson <***@netgate.com> wrote:

<snip>

Post by Jim Thompson
Errors are corrected "on-the-fly," corrected data is almost never placed
back in memory. If the same corrupt data is read again, the correction
process is repeated. Replacing the data in memory would require processing
overhead that could accumulate and significantly diminish system
performance. If the error occurred because of random events and isn't a
defect in the memory, the memory address will be cleaned of the error when
the data is overwritten with other data.

<snip>

Just to correct a small oversight- most (if not all?) boards have an option
to scrub ECC memory in the background so as to prevent single bit
(recoverable) errors from turning into double bit (irrecoverable but
detectable) errors ;-)

--
Igor M.

Jim Thompson

2015-09-15 22:35:03 UTC

Post by Igor Mozolevsky
<snip>
Errors are corrected "on-the-fly," corrected data is almost never placed back in memory. If the same corrupt data is read again, the correction process is repeated. Replacing the data in memory would require processing overhead that could accumulate and significantly diminish system performance. If the error occurred because of random events and isn't a defect in the memory, the memory address will be cleaned of the error when the data is overwritten with other data.
<snip>
Just to correct a small oversight- most (if not all?) boards have an option to scrub ECC memory in the background so as to prevent single bit (recoverable) errors from turning into double bit (irrecoverable but detectable) errors ;-)

I think you’ll find that the default for ‘scrub’ is off on most (perhaps all) boards. There are reasons, and these relate directly to “significantly diminish system performance”, (above), as well as the greatly increased RAM sizes in use today.

‘Scrub’ was popular about a decade ago, when DDR2 RAM was around $100/GB. DDR3-1600 is about $6/GB today.

Jim

Igor Mozolevsky

2015-09-15 22:53:11 UTC

On 15 September 2015 at 23:34, Jim Thompson <***@netgate.com> wrote:

<snip>

Post by Jim Thompson
I think you’ll find that the default for ‘scrub’ is off on most (perhaps
all) boards. There are reasons, and these relate directly to
“significantly diminish system performance”, (above), as well as the
greatly increased RAM sizes in use today.

Perhaps I missed something- what point is it that you're trying to make? I
was saying that scrubbing aims to remove errors at the source (cf. "on
demand") and prevent multi-bit errors that become detectable but
irrecoverable, or worse, undetectable. Get hit by a few of the latter two
at "interesting" points and you'd wish that scrubbing were on!

And seriously, ECC scrubbing is slow but ZFS (or even hardware RAID)
scrubbing is lightning fast??! C'mon are we going for data integrity or
speed here?!

Post by Jim Thompson
‘Scrub’ was popular about a decade ago, when DDR2 RAM was around $100/GB.
DDR3-1600 is about $6/GB today.

Yup- with a much higher density of smaller memory bits! ;-)

--
Igor M.

alex.burlyga.ietf alex.burlyga.ietf

2015-09-15 23:02:04 UTC

Post by Igor Mozolevsky
<snip>

Perhaps I missed something- what point is it that you're trying to make? I
was saying that scrubbing aims to remove errors at the source (cf. "on
demand") and prevent multi-bit errors that become detectable but
irrecoverable, or worse, undetectable. Get hit by a few of the latter two
at "interesting" points and you'd wish that scrubbing were on!
And seriously, ECC scrubbing is slow but ZFS (or even hardware RAID)
scrubbing is lightning fast??! C'mon are we going for data integrity or
speed here?!

If I remember correctly, enabling Patrol Scrub guarantees that each
address gets hit once per 24 hours. So on a 128GB system you are
generating maybe 1-2MiB/s of reads. I'd say it's a good trade-off if
you bothered to put ECC memory in.
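
A quick back-of-the-envelope check of that estimate, as a sketch. It assumes the scrubber reads every address exactly once per period and treats "128GB" as 128 GiB; the function name is hypothetical.

```python
def patrol_scrub_rate_mib_s(ram_gib: float, period_hours: float = 24.0) -> float:
    """Average read bandwidth needed to touch all of RAM once per period."""
    total_mib = ram_gib * 1024          # GiB -> MiB
    seconds = period_hours * 3600
    return total_mib / seconds


print(f"{patrol_scrub_rate_mib_s(128):.2f} MiB/s")
```

For 128 GiB over 24 hours this comes to roughly 1.5 MiB/s, which is inside the 1-2MiB/s range quoted above.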


Don Lewis

2015-09-16 01:24:13 UTC

Post by Igor Mozolevsky
<snip>
Errors are corrected "on-the-fly," corrected data is almost never
placed back in memory. If the same corrupt data is read again, the
correction process is repeated. Replacing the data in memory would
require processing overhead that could accumulate and significantly
diminish system performance. If the error occurred because of random
events and isn't a defect in the memory, the memory address will be
cleaned of the error when the data is overwritten with other data.
<snip>
Just to correct a small oversight- most (if not all?) boards have an
option to scrub ECC memory in the background so as to prevent single
bit (recoverable) errors from turning into double bit (irrecoverable
but detectable) errors ;-)

I think you’ll find that the default for ‘scrub’ is off on most
(perhaps all) boards. There are reasons, and these relate directly to
“significantly diminish system performance”, (above), as well as the
greatly increased RAM sizes in use today.

The Gigabyte AM3+ motherboards that I'm using have all sorts of knobs
for controlling the scrub rate, with different knobs for cache scrubbing
vs. main memory scrubbing. My somewhat more recent Asus AM3+ board with
a different BIOS brand basically just has an ECC on/off knob.

Don Lewis

2015-09-15 22:10:45 UTC

I don't think the current APU parts support ECC. My guess is that the
current APU sockets don't have the connections to support it.

I'm typing on a FreeBSD machine with an AMD CPU and ECC RAM. I won't put
together a machine without ECC. My experience is that many ASUS
motherboards support ECC RAM and usually document that fact. Many
Gigabyte motherboards also support ECC RAM, but don't document it. Even
if you look at the BIOS screenshots in the manual, you won't see the
knobs to configure ECC, I suspect because those knobs are not displayed
unless ECC RAM is installed.

Post by Dieter BSD
Does the firmware have to do anything to support ECC? Program a few
registers in the memory controller perhaps? A few boards have FLOSS
firmware available, so this code could be added, but most boards do not
have firmware sources available.
Assuming that a board does have the necessary connections but
the firmware does not have ECC support, is there some reason that
ECC support could not be added to the OS instead of the firmware?
I grepped through FreeBSD 8.2 and 10.1 sources but couldn't find
anything that looked relevant. Also did not find any code that
reported ECC errors, other than one device. Perhaps I missed it?

It's in there ...

Post by Dieter BSD
I've been running machines with ECC for 15-20 years and have never seen
a report of an ECC error from either NetBSD or FreeBSD. I have seen
reports of ECC errors from Digital Unix. And remember getting panics
due to parity errors on machines before ECC. So I'm thinking that
the BSDs must ignore hardware reports of single bit ECC errors. :-(

From daily mail to root about a month ago:

+MCA: Bank 4, Status 0x944a400096080a13
+MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
+MCA: Vendor "AuthenticAMD", ID 0x100f53, APIC ID 0
+MCA: CPU 0 COR BUSLG Responder RD Memory
+MCA: Address 0x213e98b10
+MCA: Bank 4, Status 0xd44a400096080a13
+MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
+MCA: Vendor "AuthenticAMD", ID 0x100f53, APIC ID 0
+MCA: CPU 0 COR OVER BUSLG Responder RD Memory
+MCA: Address 0x213e98b10
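
For readers trying to interpret the Status values in a log like this, here is a minimal sketch that decodes only the architectural flag bits at the top of a machine-check status register (VAL at bit 63, OVER 62, UC 61, EN 60, MISCV 59, ADDRV 58, PCC 57) and the low 16-bit error code. The helper name `decode_mca_status` is hypothetical, and the remaining fields are model-specific and need the vendor documentation for the CPU family in question.

```python
# Architectural flag bits of a machine-check bank status register.
MCI_STATUS_FLAGS = [
    (63, "VAL"),    # register contains a valid error
    (62, "OVER"),   # an earlier error was overwritten (overflow)
    (61, "UC"),     # uncorrected error
    (60, "EN"),     # error reporting was enabled
    (59, "MISCV"),  # the MISC register holds valid data
    (58, "ADDRV"),  # the ADDR register holds a valid address
    (57, "PCC"),    # processor context corrupt
]


def decode_mca_status(status: int) -> dict:
    """Decode the flag bits and the low 16-bit error code of a status value."""
    flags = [name for bit, name in MCI_STATUS_FLAGS if (status >> bit) & 1]
    return {
        "flags": flags,
        "corrected": "UC" not in flags,
        "error_code": status & 0xFFFF,
    }


for value in (0x944A400096080A13, 0xD44A400096080A13):
    info = decode_mca_status(value)
    kind = "corrected" if info["corrected"] else "uncorrected"
    print(f"{value:#018x}: {' '.join(info['flags'])} ({kind}), "
          f"error code {info['error_code']:#06x}")
```

Applied to the two Status values above, the first decodes as a valid, corrected error with a valid address, and the second additionally has the overflow bit set, matching the "COR" and "COR OVER" markers in the log lines.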

Jim Thompson

2015-09-15 22:40:16 UTC

Post by Don Lewis

I don't think the current APU parts support ECC. My guess is that the
current APU sockets don't have the connections to support it.

The G-Series (such as the T40E used on the APU) doesn’t support ECC.

“Kabini” (“G-Series 2.0” aka GX-210 / GX-415/420) supports a single channel of ECC ram.

Honestly, at the densities used by some of these boards, ECC doesn’t make much sense.
(Obviously, if you’re running a storage appliance, this position is reversed.)

Don Lewis

2015-09-16 01:18:12 UTC

Post by Jim Thompson

Post by Don Lewis

I don't think the current APU parts support ECC. My guess is that the
current APU sockets don't have the connections to support it.

The G-Series (such as the T40E used on the APU) doesn’t support ECC.
“Kabini” (“G-Series 2.0” aka GX-210 / GX-415/420) supports a single channel of ECC ram.

Interesting ... it's been a while since I looked. I think the primary
sockets at the time were FM1, FM2, and FM2+, and the mobile sockets, and
they didn't support ECC.

AM1 motherboard ECC support seems to be pretty lacking, though.

Dieter BSD

2015-09-16 17:57:05 UTC

Post by Andriy Gapon

Yes, there is. The memory controller is programmed by the code that
runs from ROM and uses no RAM (or the CPU cache is used as the RAM).
Once the real RAM gets used it's too late to reprogram the DRAM controller.

Perhaps one of the several bootloader stages could get itself into
CPU cache, program the memory controller, then load and execute the
next stage or the OS?

Post by Jim Thompson
Replacing the data in memory would require processing overhead
that could accumulate and significantly diminish system performance.

If it only replaces data when there is a correctable error,
and the errors are occasional soft errors, the effect on
performance should be minimal. If there is a hard error,
you would want to replace the defective memory before you get
an additional error and it becomes uncorrectable.

Post by Jim Thompson
If the error occurred because of random events and isn't a defect in
the memory, the memory address will be cleaned of the error when the
data is overwritten with other data.

If and when new data gets written to that location. If that location
contains info that never changes, such as kernel text, the bad bit will
never get fixed.

Post by Jim Thompson
memory, without the extra complexity of the controller, is 12.5% more
expensive. This isn’t a huge impact at 8GB, (you’ll need
another 1GB of RAM), but at 1024GB you’ll need another 128GB,
and that much ram still costs enough that your wallet won’t be happy.

It is 12.5% in both cases. How much does it cost to have undetected
errors in your data? How much does it cost when an Interstate
bridge collapses? How much does it cost when one of NASA's missions
fails? How much does it cost when your pharmacy receives a
prescription with an error in the dose?

Post by Jim Thompson
the MRC setup on Intel and AMD is both complex and proprietary

One wonders why the secrecy. AMD has been much more open than many
(most?) chipmakers. They even forced the ATI people to document
how to program their chips. I don't see a lot of companies popping up
making competing chips. #include standard joke: "How do you make a small
fortune in chipmaking? Start with a very large fortune." I can't
see what secret would be revealed by saying "set bit 7 of register 4
to 1 to enable ECC".

Post by Jim Thompson
Intel Red Book

So the secret books are red this week, yawn. I remember the nightmare
of the merced orange books and the brain damaged "features" the chips had.
Not recommended. I'm interested in chips that work correctly, hence the
interest in ECC and AMD. Looked for ARM boards with ECC but didn't find
any. Is the Sparc stuff any more reliable than it used to be? Other
arch choices?

Post by Jim Thompson
The MRC setup code is a binary blob for otherwise open source boot
firmware such as Coreboot.

So the libreboot people are forced to work on reverse engineering
these blobs? :-(

Post by Don Lewis
I don't think the current APU parts support ECC.

According to wikipedia, socket FM2+ does not support ECC. :-(
Kabini has support for ECC. And Berlin, (and I assume Toronto) but
word is that Berlin and Toronto are basically dead. :-(
I think Carrizo and Turion are supposed to support ECC? There really
ought to be a list of which CPUs/APUs/sockets/boards do or do not
support ECC.

Post by Don Lewis
My experience is that many ASUS motherboard support ECC RAM and
usually document that fact. Also many Gigabyte mother boards also
support ECC RAM, but don't document it.

From what I've been reading, both Asus and Gigabyte make good boards.
I've seen reviews that complained about Gigabyte's firmware.
http://www.xbitlabs.com/articles/mainboards/display/gigabyte-ga-990fxa-ud5_8.html
I've also seen claims that the firmware bricked boards.
Reviewers like Asus' firmware. I've seen complaints about Asus's support,
and their website has significant problems.

The firmware on my Tyan board is crap, and they refused to tell me
how much power it needs. Which means I don't know how much other stuff
I can run from the same P/S. It should have *way* more power than needed,
but experience says "not enough", so I added a 2nd p/s for the disk farm
and suddenly had fewer problems. The 2 p/s setup does allow powercycling
the mainboard (because of the crappy firmware) without powercycling the disks.

Given my experience with the Tyan board, and the apparent lack of
FLOSS firmware for recent boards, I'm not real excited about the
Gigabyte boards. Asus has a couple of AM3+ boards that I could
probably live with, if their website actually had things like
lists of exactly which CPUs and memory are approved, and firmware
updates, ... But there are also applications that could use a lower wattage
solution.

Anyone have opinions on other mainboard companies? ECS? Asrock?
MSI? Zotac? Others?

Post by Andriy Gapon
+MCA: Bank 4, Status 0x944a400096080a13
+MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
+MCA: Vendor "AuthenticAMD", ID 0x100f53, APIC ID 0
+MCA: CPU 0 COR BUSLG Responder RD Memory
+MCA: Address 0x213e98b10
+MCA: Bank 4, Status 0xd44a400096080a13
+MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
+MCA: Vendor "AuthenticAMD", ID 0x100f53, APIC ID 0
+MCA: CPU 0 COR OVER BUSLG Responder RD Memory
+MCA: Address 0x213e98b10
MCA: Bank 1, Status 0x9400000000000151
MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
MCA: Vendor "AuthenticAMD", ID 0x100f52, APIC ID 2
MCA: Address 0x81cc0e9f0
Kind of freaky. I've never had this error on this board before.
On others tho.
Try a search for MCA instead.

Is there a decoder ring for those messages? I don't recall seeing
messages like that, although I wasn't looking for them, and they
don't leap out at you screaming ERROR! ERROR! Digital Unix had its
problems, but at least the error messages were fairly clear.
Something like "single bit memory error at address 0x12345..."
A simple edit to sys/x86/x86/mca.c
s/printf("UNCOR ");/printf("Uncorrectable ");/
s/printf("COR ");/printf("Correctable ");/
would make the messages at least slightly more meaningful to a viewer
who isn't intimately familiar with the mca. Which most people aren't.
I used to maintain code that dealt with a memory controller, and
used a hardware circuit to inject errors into a memory board.
But looking at those messages doesn't tell me anything beyond
"Something happened, maybe I should grep through the source
code for clues about those messages." Looking at the source
doesn't add much, you'd need documentation for the mca.
Which most people aren't going to have. And you'd need a lot
of time to figure it out.

# find /var/log | xargs bzgrep -i mca
found no error messages.

I seem to be buried under a mountain of boards that would be useful,
if only they supported ECC. (and had firmware that actually works...)
And I'm hardly the only one. So how do we fix this?
Lobby AMD (and other chipmakers) to include ECC support in *all* memory
controllers and sockets? It isn't like they have to redesign the logic
for every chip, they only need one design per memory width. Lobby AMD
to publish documentation on how to program the memory controller?
Lobby the companies that make boards?

Don Lewis

2015-09-17 05:25:26 UTC

Post by Andriy Gapon

Yes, there is. The memory controller is programmed by the code that
runs from ROM and uses no RAM (or the CPU cache is used as the RAM).
Once the real RAM gets used it's too late to reprogram the DRAM controller.

Perhaps one of the several bootloader stages could get itself into
CPU cache, program the memory controller, then load and execute the
next stage or the OS?

Post by Jim Thompson
Replacing the data in memory would require processing overhead
that could accumulate and significantly diminish system performance.

Post by Jim Thompson
If the error occurred because of random events and isn't a defect in
the memory, the memory address will be cleaned of the error when the
data is overwritten with other data.

If and when new data gets written to that location. If that location
contains info that never changes, such as kernel text, the bad bit will
never get fixed.

Post by Jim Thompson
the MRC setup on Intel and AMD is both complex and proprietary

AMD documents a lot of this stuff in the BIOS and Kernel Developer's
Guide (BKDG) for each CPU family.

Post by Jim Thompson
Intel Red Book

Supermicro has some Atom motherboards with ECC support.

Post by Jim Thompson
The MRC setup code is a binary blob for otherwise open source boot
firmware such as Coreboot.

So the libreboot people are forced to work on reverse engineering
these blobs? :-(

Post by Don Lewis
I don't think the current APU parts support ECC.

Socket AM1 (Kabini) is supposed to support ECC, but motherboards with
this socket that support ECC are another story.

Post by Don Lewis
My experience is that many ASUS motherboard support ECC RAM and
usually document that fact. Also many Gigabyte mother boards also
support ECC RAM, but don't document it.

I've got one of the Gigabyte GA-990FXA-UD5 boards. I actually like the
BIOS. I'm not trying to overclock, but it does have lots of ECC-related
knobs. I think you can even tell it to gang the two memory controller
channels so that you can enable Chipkill. The latter isn't as good as
it sounds because it really only works properly with DIMMs that use x4
DRAM chips, and there don't seem to be any unbuffered versions of those.
The only unbuffered DDR3 DIMMs I've found use x8 DRAM chips. In that
case, if multiple bits coming out of the chip are incorrect, the ECC
checker has a just under 100% chance of detecting the error, but it is
still uncorrectable. With x4 DRAM chips, Chipkill can correct the error
even if all four bits from the DRAM are incorrect. Unfortunately, the
only DDR3 DIMMs that use x4 chips are registered. Also, ganging the
memory controllers does hurt performance.

The things that I don't like about this board are the SATA connector
placement (though it wasn't too bad in my specific application), and the
combined keyboard/mouse PS/2 connector. I'm still using a PS/2 KVM
switch here and I need motherboards with separate keyboard and mouse
connectors, and the Y-adaptors don't seem to work. I'd love to upgrade
to a newer KVM, but I'd want to also switch from VGA to DVI and KVMs
that handle more than two dual-link DVI inputs are serious $$$.

My newest motherboard is an Asus M5A97 R2.0. I bought it because it was
inexpensive, had sufficient expansion potential, and had separate
keyboard and mouse PS/2 connectors. I don't like the BIOS nearly as
much. It's got lots of whizzy graphics, but it's hard to find where the
various knobs are hidden. As I recall, ECC control is basically on/off.
I also wasn't able to get WOL to work. If I power off the machine with
shutdown -p, the LAN link light stays on, but sending a WOL packet
doesn't start the machine. It might wake from sleep mode, but I didn't
try that.

Post by Dieter BSD
The firmware on my Tyan board is crap, and they refused to tell me
how much power it needs. Which means I don't know how much other stuff
I can run from the same P/S. It should have *way* more power than needed,
but experience says "not enough", so I added a 2nd p/s for the disk farm
and suddenly had fewer problems. The 2 p/s setup does allow powercycling
the mainboard (because of the crappy firmware) without powercycling the disks.
Given my experience with the Tyan board, and the apparent lack of
FLOSS firmware for recent boards, I'm not real excited about the
Gigabyte boards. Asus has a couple of AM3+ boards that I could
probably live with, if their website actually had things like
lists of exactly which CPUs and memory are approved, and firmware
updates, ... But there are also applications that could use a lower wattage
solution.
Anyone have opinions on other mainboard companies? ECS? Asrock?
MSI? Zotac? Others?

If you are interested in something with low power consumption, take a
look at the Supermicro C2000 series Atom boards:
<http://www.supermicro.com/products/motherboard/ATOM/>

I'm seriously considering picking up an A1SRM-LN5F-2358. At first
glance it seems pricey, especially considering the amount of CPU grunt,
but I don't need much and I can use the extra LAN ports and possibly
IPMI, so I don't have to add the cost of a CPU, a decent aftermarket
cooler, extra NICs, or a video card.

I think jhb@ has some software that decodes this stuff. I'm not sure if
it is in ports.

John Baldwin

2015-10-22 18:14:12 UTC

Post by Andriy Gapon
MCA: Bank 1, Status 0x9400000000000151
MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
MCA: Vendor "AuthenticAMD", ID 0x100f52, APIC ID 2
MCA: Address 0x81cc0e9f0
Kind of freaky. I've never had this error on this board before.
On others tho.
Try a search for MCA instead.

The problem is that there are other fields to decode and you can only fit so
much in one line. Also, there is not a CPU-independent way to know the
address of an ECC error. On Intel Core i3/5/7 (anything with QPI) you can
identify the individual DIMM at least, but the label that the motherboard
manufacturer uses varies by manufacturer. (You can maybe scrape that text
from the SMBIOS tables, but only if they aren't wrong which they sometimes
are, and good luck knowing if they are wrong or right.) Digital UNIX had the
luxury of running on hardware built by the same company, not on a random
assortment of boards built by various vendors. FreeBSD does not.
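
As a rough illustration of the "other fields to decode" point, here is a minimal Python sketch that splits the Status value from the quoted record into its high flag bits and low error-code fields. The bit positions are the architectural ones shared by Intel and AMD machine-check banks; the function name and the returned dictionary layout are made up for this example, and interpreting the error codes beyond this split is CPU-specific, which is exactly the difficulty described above.

```python
# Flag bits in the upper part of a machine-check bank status register.
STATUS_FLAGS = [
    (63, "VAL"),    # register contains a valid error
    (62, "OVER"),   # an earlier error was overwritten
    (61, "UC"),     # uncorrected error
    (60, "EN"),     # error reporting was enabled
    (59, "MISCV"),  # MISC register holds extra information
    (58, "ADDRV"),  # ADDR register holds a valid address
    (57, "PCC"),    # processor context may be corrupt
]


def decode_mca_status(status):
    """Split a 64-bit MCA bank status into flags and error-code fields.

    Hypothetical helper for illustration; not part of any FreeBSD tool.
    """
    flags = [name for bit, name in STATUS_FLAGS if status & (1 << bit)]
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,               # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


if __name__ == "__main__":
    decoded = decode_mca_status(0x9400000000000151)
    print(decoded["flags"], hex(decoded["mca_error_code"]))
```

For the Bank 1 status above this prints ['VAL', 'EN', 'ADDRV'] 0x151: a valid, corrected error with a recorded address, which is consistent with the "MCA: Address" line in the same record. Turning 0x151 into a human-readable cause is where the per-CPU knowledge comes in.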

sysutils/mcelog does some more verbose decoding of MCA records, but I find
it to be equally gibberish for anyone not intimately familiar with a specific
CPU.

I wrote a tool for a previous employer that was able to do some simple parsing
of MCA errors for Supermicro X7-X10 boards (Intel CPUs) and give a short
summary that was used in a nagios check. However, it only handles a narrow
set of systems.

https://github.com/freebsd/freebsd/compare/master...bsdjhb:ecc

--
John Baldwin

Bob Bishop

2015-10-22 18:57:57 UTC

Hi,

The problem is that there are other fields to decode and you can only fit so
much in one line. Also, there is not a CPU-independent way to know the
address of an ECC error. [etc]

On server-class hardware, the platform management (BMC or whatever) is probably decoding this stuff for event logs and can be interrogated via IPMI (or whatever).

--
Bob Bishop
***@gid.co.uk

John Baldwin

2015-10-22 21:17:33 UTC

Post by Bob Bishop
Hi,

The problem is that there are other fields to decode and you can only fit so
much in one line. Also, there is not a CPU-independent way to know the
address of an ECC error. [etc]

On server-class hardware, the platform management (BMC or whatever) is probably decoding this stuff for event logs and can be interrogated via IPMI (or whatever).

Not always well and not always with side effects you want. On Core 2 and
Nehalem i7 class hardware I measured that it took on the order of 400
milliseconds (not micro) in SMM (system management mode, so your entire
OS is halted) to write out each log entry to NVRAM. At least one place I
worked at turned the BIOS ECC logging off because that delay was too costly.
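
To put the 400 millisecond figure in perspective, here is a back-of-the-envelope Python sketch. The error rates are hypothetical, chosen only to show how the stall time scales; the only number taken from the measurement above is the 0.4 second stall per log entry.

```python
SMM_STALL_SECONDS = 0.4  # "on the order of 400 milliseconds" per log entry


def stall_fraction(errors_per_hour, stall_seconds=SMM_STALL_SECONDS):
    """Fraction of wall-clock time the OS spends halted in SMM."""
    return errors_per_hour * stall_seconds / 3600.0


if __name__ == "__main__":
    # Hypothetical correctable-error rates, e.g. from one failing DIMM.
    for rate in (1, 100, 1000):
        share = stall_fraction(rate)
        print(f"{rate:>5} errors/hour -> {share:.2%} of time halted")
```

The averaged fraction understates the problem for latency-sensitive work: each individual error is still a full 0.4 second pause during which nothing in the OS runs.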

Also, even though your BMC may log it, the format for doing so isn't
standard. The details such as the affected DIMM are in the OEM bits of
the log record, so not something you can easily extract from, say,
ipmitool sel elist. You'd have to log into the BIOS itself (or the BMC's
web UI) to see which DIMM is affected. Neither of those are really great
for automated reporting.

--
John Baldwin

Bob Bishop

2015-10-23 11:37:44 UTC

Hi,

Post by Bob Bishop
Hi,

The problem is that there are other fields to decode and you can only fit so
much in one line. Also, there is not a CPU-independent way to know the
address of an ECC error. [etc]

On server-class hardware, the platform management (BMC or whatever) is probably decoding this stuff for event logs and can be interrogated via IPMI (or whatever).

Not always well and not always with side effects you want. On Core 2 and
Nehalem i7 class hardware I measured that it took on the order of 400
milliseconds (not micro) in SMM (system management mode, so your entire
OS is halted) to write out each log entry to NVRAM. At least one place I
worked at turned the BIOS ECC logging off because that delay was too costly.
Also, even though your BMC may log it, the format for doing so isn't
standard. The details such as the affected DIMM are in the OEM bits of
the log record, so not something you can easily extract from, say,
ipmitool sel elist. You'd have to log into the BIOS itself (or the BMC's
web UI) to see which DIMM is affected. Neither of those are really great
for automated reporting.

All agreed. I was just flagging up the existence of another possible channel to get at ECC logging.

--
Bob Bishop
***@gid.co.uk

Dieter BSD

2015-09-18 02:49:39 UTC

It appears that they are no longer selling the MSI 880GMA-E45.

There used to be a web page with useful info about how well
various boards worked with FreeBSD. My notes say it was
http://www.freebsd.org/platforms/amd64/motherboards.html
but that URL now gives: "Page not found. Oh no. :("

Post by Don Lewis
Supermicro has some Atom motherboards with ECC support.

Thanks, but the company that designed the atom has a rather long
history of design problems. The whole point of ECC is to avoid
corrupting the right answer, not to avoid corrupting the wrong answer.
They also steal technology from other companies, admit to it,
and somehow usually get away with it.

Post by Don Lewis
Socket AM1 (Kabini) is supposed to support ECC, but motherboards with
this socket that support ECC is another story.

Word is that Asus updated the AM1M-A manual to say that it supports ECC.

http://www.planet3dnow.de/vbulletin/threads/421749-Geruecht-Zen-kommt-zuerst-als-Opteron?p=4988619&viewfull=1#post4988619
Google translation from pages 1 & 2:
Onkel_Dithmeyer: I have Athlon 5350, Asus AM1M-A and ECC Ram
drSeehas: Simply read the CPU registers D18F3xE8 and post the result here.
I bet there comes 1F74F00h out.
Onkel_Dithmeyer: The bet you've won!
https://en.wikipedia.org/wiki/List_of_AMD_accelerated_processing_unit_microprocessors
lists 5350 as "Kabini"
So it sounds like Kabini doesn't support ECC after all? :-(
Word is that old versions of memtest86 incorrectly assumed that ECC
was available and can therefore give incorrect results.

Post by Don Lewis
Gigabyte GA_990FXA-UD5
I actually like the BIOS.

Will the firmware talk to a RS-232 console?

The slot selection looks better than most: x16 x8 x8 x4 x4 x1 pci
The x1 slot looks crowded, some of the newegg user reviews complained
about that. Some of the x1 cards are small, if not, a riser should work.
At least it *has* a 7th slot. More than 7 slots would be great, but such
a board doesn't seem to exist.

The hardware looks ok. Does FreeBSD have *good* support for all the
hardware? (Other than the VIA VT6308P firewire which probably has the
same problems as the 6307. If so, that uses up one slot for a firewire
card.) Looks like the firmware chips are soldered to the board; the
Asus Sabertooth has a socket.

Newegg user review: "North and South will get hot."

Word is that UD5 and UD7 have "vdroop issues due to lack of an LLC unit"
Is that something to be concerned about if I'm not overclocking?

Gigabyte's website is obscenely slow: 186 B/s :-(
They know about it: "#1. Download speed may be varied in different
region. If you have experienced lower download speed, please try
other region download sites." At least I found the lists of approved
CPUs and memory, and the firmware and manuals, unlike Asus.

Post by Don Lewis
The things that I don't like about this board are the SATA connector
placement

Location looks okay, as long as they aren't too crowded or something.
Sata cables can be reasonably long. 2 meters works for me, even with
ports that don't claim to be e-sata. (e-sata is supposed to have
slightly higher voltages) Bad placement is putting a pata connector
on the far left next to the i/o panel. Wimpy short cable barely
reaches the drive. HVD SCSI was nice, the drives could be on the other
side of the room, and often were. *grin*

Post by Don Lewis
I'd want to also switch from VGA to DVI

Nothing against DVI, but isn't it in the process of going away?
Displayport looks good, as long as you don't need analog. High
resolution, Freesync, inexpensive adapters to DVI and HDMI.

Post by Don Lewis
combined keyboard/mouse PS/2 connector

They get to save a bit of space on the i/o panel, and they get to sell
you a Y cable.

Post by Don Lewis
the Y-adaptors don't seem to work

I assume you've tried more than one adapter, and tried them without the KVM?
As long as both the firmware and BSD will listen to USB, I guess I can go
shopping for USB ones that I can stand. Presumably the USB ones are safe
to hotplug, unlike ps/2. I have proper Unix style keyboards with control
next to 'a' but both the firmware and FreeBSD think I have some brain-dead
keyboard with control and cap-lock switched. Xmodmap fixes it in X, but I
get a lot of typos in firmware and single user mode. :-(

Post by Don Lewis
Asus M5A97 R2.0
It's got lots of whizzy graphics

Firmware shouldn't have graphics at all. Firmware needs to be absolutely
reliable. Graphics adds a lot of unneeded complexity. Graphics over
RS-232 are rather slow. Word is that Asus firmware doesn't support an
RS-232 console which is a major negative.

YA Asus board with only 6 slots. Can't they count to at least 7?

The best Asus board I've found is the Sabertooth 990FX R2.0.
Again only 6 slots, but at least they have more lanes. And it has
4 extra sata ports. But again, Asus firmware is said to not talk RS-232.
Sometimes things fly by too fast to read. With RS-232 you can scroll
back, capture it in a disk file, etc.

Post by Don Lewis
If you are interested in something with low power consumption,

I need a new firewall/gateway/proxy/uucp/mail/... machine, which
shouldn't need massive cpu power, could be headless if RS-232 console
works, and doesn't need massive amounts of i/o. A low power consumption
machine should work great, if I can find one. Current machine is dying,
so need a replacement asap. Same or similar machine with a good
framebuffer (>= 4K, Freesync) and UVD could be X terminal / HTPC.
Minimal GPU, if any, needed. But can't find a video card with a good
framebuffer that doesn't also have some total overkill gpu that
is expensive, lots of power&heat, uses up 2 slots and has a fan.
Using up 2 slots is unacceptable when there are so few slots to start
with. I'm sure it will be easy to find a replacement for the oddball
board specific fan when it dies.

Also need a faster box with more i/o. Take the i/o on the UD5, twice
as many of everything would be about right.

I don't see what IPMI can do that an RS-232 console can't, other than
talk to the firmware with the machine mostly powered down, and
powering the machine up and down. I don't need to do that. At least
there is *some* feature I don't need! (besides a hyperthyroid gpu)

Post by Don Lewis
Then they did the next process shrink, and the effects disappeared
completely! [ ... ] Nobody understands why

Sounds very bizarre. Figuring out why would make a good project for
a phd student?

Tom Evans via freebsd-hardware

2015-09-18 09:14:24 UTC

Post by Dieter BSD
Current machine is dying,
so need a replacement asap. Same or similar machine with a good
framebuffer (>= 4K, Freesync) and UVD could be X terminal / HTPC.
Minimal GPU, if any, needed. But can't find a video card with a good
framebuffer that doesn't also have some total overkill gpu that
is expensive, lots of power&heat, uses up 2 slots and has a fan.
Using up 2 slots is unacceptable when there are so few slots to start
with. I'm sure it will be easy to find a replacement for the oddball
board specific fan when it dies.

nVidia GT 720 Silent. Single slot, no fan, 4k H264 decoding, enough GL
to do anything you might want to do on a desktop.

Cheers

Tom

Pokala, Ravi

2015-10-23 15:23:20 UTC

-----Original Message-----

Date: Thu, 22 Oct 2015 11:09:50 -0700
Subject: Re: ECC support
The problem is that there are other fields to decode and you can only fit so much in one line.

At Panasas, we did in-kernel parsing and got it down to a one-liner like this:

Detected HW Err (CMC) - Correctable ECC error Channel:0; Dimm:0; Syndrome:2151686160

But that was only for main-memory corrected ECCs; for all other MCAs, it was a multi-line format (which I think we got from backporting MCA support from (8-STABLE?)):

MCA: Bank 8, Status 0xb20000000004008f
MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000004
MCA: Vendor "GenuineIntel", ID 0x106e4, APIC ID 0
MCA: CPU 0 UNCOR PCC GEN channel ?? memory error

Also, there is not a CPU-independent way to know the address of an ECC error. On Intel Core i3/5/7 (anything with QPI) you can identify the individual DIMM at least, but the label that the motherboard manufacturer uses varies by manufacturer. (You can maybe scrape that text from the SMBIOS tables,

That's exactly what we did when using off-the-shelf motherboards. We were able to extract the name of the DIMM slot, as defined in SMBIOS, as well as the part and serial numbers of the DIMM, and the physical address range of the DIMM. For example:

hw.mem.dimm.s: locator serial# part# bank size addr0 addrN
hw.mem.dimm.0: DIMM_A1 DC917AEF 36KDZS2G72PZ-1G4D1 [NODE 0 CHANNEL 0 DIMM 0] 16384MB 0x00000000000 0x003FFFFFFFF
hw.mem.dimm.1: DIMM_B1 DDA0C793 36KDZS2G72PZ-1G4D1 [NODE 0 CHANNEL 1 DIMM 0] 16384MB 0x00400000000 0x007FFFFFFFF
hw.mem.dimm.2: DIMM_C1 DDA0C7B6 36KDZS2G72PZ-1G4D1 [NODE 0 CHANNEL 2 DIMM 0] 16384MB 0x00800000000 0x00BFFFFFFFF
hw.mem.dimm.3: DIMM_D1 DDA0C7DE 36KDZS2G72PZ-1G4D1 [NODE 0 CHANNEL 3 DIMM 0] 16384MB 0x00C00000000 0x00FFFFFFFFF
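
A listing like this is enough to turn a reported physical address into a slot label. The following is a minimal Python sketch, not the Panasas code: it parses the rows exactly as shown above and does a range lookup. The function names and the test address are made up for illustration, and it assumes each DIMM covers one contiguous address range, as in this listing. With channel interleaving enabled, the real mapping would be more involved.

```python
import re

# Rows copied from the hw.mem.dimm listing above (header row included).
DIMM_ROWS = """\
hw.mem.dimm.s: locator serial# part# bank size addr0 addrN
hw.mem.dimm.0: DIMM_A1 DC917AEF 36KDZS2G72PZ-1G4D1 [NODE 0 CHANNEL 0 DIMM 0] 16384MB 0x00000000000 0x003FFFFFFFF
hw.mem.dimm.1: DIMM_B1 DDA0C793 36KDZS2G72PZ-1G4D1 [NODE 0 CHANNEL 1 DIMM 0] 16384MB 0x00400000000 0x007FFFFFFFF
hw.mem.dimm.2: DIMM_C1 DDA0C7B6 36KDZS2G72PZ-1G4D1 [NODE 0 CHANNEL 2 DIMM 0] 16384MB 0x00800000000 0x00BFFFFFFFF
hw.mem.dimm.3: DIMM_D1 DDA0C7DE 36KDZS2G72PZ-1G4D1 [NODE 0 CHANNEL 3 DIMM 0] 16384MB 0x00C00000000 0x00FFFFFFFFF
"""

ROW_RE = re.compile(
    r"^hw\.mem\.dimm\.(?P<index>\d+):\s+(?P<locator>\S+)\s+(?P<serial>\S+)\s+"
    r"(?P<part>\S+)\s+\[(?P<bank>[^\]]+)\]\s+(?P<size_mb>\d+)MB\s+"
    r"(?P<addr0>0x[0-9A-Fa-f]+)\s+(?P<addrn>0x[0-9A-Fa-f]+)$"
)


def parse_dimms(text):
    """Return one dict per DIMM row; the header and blank lines are skipped."""
    dimms = []
    for line in text.splitlines():
        match = ROW_RE.match(line.strip())
        if match is None:
            continue
        row = match.groupdict()
        row["addr0"] = int(row["addr0"], 16)
        row["addrn"] = int(row["addrn"], 16)
        dimms.append(row)
    return dimms


def find_dimm(dimms, phys_addr):
    """Return the DIMM whose [addr0, addrN] range contains phys_addr, or None."""
    for dimm in dimms:
        if dimm["addr0"] <= phys_addr <= dimm["addrn"]:
            return dimm
    return None


if __name__ == "__main__":
    dimms = parse_dimms(DIMM_ROWS)
    hit = find_dimm(dimms, 0x4A0000000)  # hypothetical error address
    print(hit["locator"], hit["serial"], hit["bank"])
```

Returning the serial and part number along with the locator means a monitoring check can name the exact module to pull, provided the SMBIOS labels have been validated against the silkscreen on the board.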

Re-whacking that code for -CURRENT and getting it upstream has been on my to-do list for a depressingly long time; it keeps getting pre-empted. :-S

but only if they aren't wrong which they sometimes are, and good luck knowing if they are wrong or right.)

Making sure the SMBIOS identifier matches the label on the motherboard is part of the process of validating the motherboard as usable by us. :-)

Digital UNIX had the luxury of running on hardware built by the same company, not on a random assortment of boards built by various vendors. FreeBSD does not.

Yeah. Like I said, we scraped SMBIOS *for off-the-shelf motherboards*. For our in-house designs, we hardcoded the Channel/DIMM mapping into an unambiguous form inside the driver itself.

sysutils/mcelog does some more verbose decoding of MCA records, but I find it to be equally gibberish for anyone not intimately familiar with a specific CPU.
I wrote a tool for a previous employer that was able to do some simple parsing of MCA errors for Supermicro X7-X10 boards (Intel CPUs) and give a short summary that was used in a nagios check. However, it only handles a narrow set of systems.
https://github.com/freebsd/freebsd/compare/master...bsdjhb:ecc

Oooo, that looks nice! Is this something that can be committed to the main tree? If nothing else, I'll need to make a note of the way you're getting the MCA records into userland.

Thanks,

Ravi

John Baldwin

2015-11-11 23:30:35 UTC