How To Diagnose Memory Errors on AMD x86_64 using EDAC

This is a writeup I put together to help identify the defective DIMM from EDAC errors on Linux x86_64.

Over several years of managing a Linux cluster I have occasionally had systems with a bad memory DIMM. An early manifestation of these errors is EDAC (Error Detection and Correction kernel module) errors reported in the kernel ring buffer. One frustrating problem is identifying the bad DIMM. In the past I have used a brute-force approach: running the system with a single DIMM at a time until I found the offending one. However, as systems have grown to include more CPUs and more DIMMs, this has become impractical.

It seems that the information in the EDAC error messages should be sufficient to identify the offending DIMM. Unfortunately, it is not obvious how to do this, and I have found no single source that explains the process. Using information from a number of sources, I have figured it out for some of our current motherboards (e.g. SuperMicro AMD64). The following is a summary of the steps I used, which I believe can be generalized to other motherboards.

Author: Martin Stumpf
Last Update: November 2nd, 2012

Contents

  1. Which EDAC modules are in use?
  2. Get the memory error information from the kernel log
  3. Get the memory controller (MCx) device information
  4. Analysis of the information given
  5. Conclusion
  6. Appendix

*****************************************************************************
1. Which EDAC modules are in use? This HowTo is for the amd64_edac module.

# lsmod | grep -i amd
amd64_edac_mod 55921 0
edac_mc 61217 1 amd64_edac_mod

*****************************************************************************
2. Get the memory error information from the kernel log.

# dmesg | grep -E -i edac\|northbridge
Northbridge Error (node 3): DRAM ECC error detected on the NB.
EDAC amd64 MC3: CE ERROR_ADDRESS= 0x6281d4710
EDAC MC3: CE page 0x6281d4, offset 0x710, grain 0, syndrome 0x2845, row 3,
channel 1, label "": amd64_edac

The salient parts are: MC3, row 3, and channel 1.
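
If you have many such lines to go through, the three salient values can be pulled out mechanically. The following is only a sketch: the regular expression is modeled on the single sample line above (joined onto one line) and may need adjusting for other kernel versions, and the function name parse_ce_line is my own. It also rebuilds the error address from the page and offset fields, assuming 4 KiB pages, which agrees with the ERROR_ADDRESS line.

```python
import re

# Modeled on the one sample CE line above; other kernels may format it differently.
CE_LINE = re.compile(
    r"EDAC MC(?P<mc>\d+): CE page 0x(?P<page>[0-9a-fA-F]+), "
    r"offset 0x(?P<offset>[0-9a-fA-F]+), .*?"
    r"row (?P<row>\d+),\s*channel (?P<channel>\d+)"
)

def parse_ce_line(line):
    """Return mc, row, channel and the rebuilt address, or None if no match."""
    m = CE_LINE.search(line)
    if m is None:
        return None
    page = int(m.group("page"), 16)
    offset = int(m.group("offset"), 16)
    return {
        "mc": int(m.group("mc")),
        "row": int(m.group("row")),
        "channel": int(m.group("channel")),
        # Assumption: 4 KiB pages, so the page number is the address >> 12.
        "address": (page << 12) | offset,
    }

sample = ('EDAC MC3: CE page 0x6281d4, offset 0x710, grain 0, '
          'syndrome 0x2845, row 3, channel 1, label "": amd64_edac')
info = parse_ce_line(sample)
print(info["mc"], info["row"], info["channel"], hex(info["address"]))
# prints: 3 3 1 0x6281d4710
```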

*****************************************************************************
3. Get the memory controller (MCx) device information.

If you have cleared the kernel log, you will have to reboot to regenerate it. A fresh log contains the EDAC driver initialization messages, which help identify the DIMMs.
(Blank lines have been added to the output for clarity.)

# dmesg | grep -E -i edac\|northbridge
EDAC MC: Ver: 2.0.1 Oct 20 2011
EDAC amd64_edac: Ver: 3.4.0

EDAC amd64: ECC is enabled by BIOS.
EDAC amd64: F10h detected (node 0).
EDAC MC: DCT0 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 2048MB 3: 2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC MC: DCT1 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 2048MB 3: 2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC amd64: using x8 syndromes.
EDAC amd64: MCT channel count: 2
EDAC amd64: CS2: Registered DDR3 RAM
EDAC amd64: CS3: Registered DDR3 RAM
EDAC MC0: Giving out device to amd64_edac F10h: DEV 0000:00:18.2

EDAC amd64: ECC is enabled by BIOS.
EDAC amd64: F10h detected (node 1).
EDAC MC: DCT0 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 2048MB 3: 2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC MC: DCT1 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 2048MB 3: 2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC amd64: using x8 syndromes.
EDAC amd64: MCT channel count: 2
EDAC amd64: CS2: Registered DDR3 RAM
EDAC amd64: CS3: Registered DDR3 RAM
EDAC MC1: Giving out device to amd64_edac F10h: DEV 0000:00:19.2

EDAC amd64: ECC is enabled by BIOS.
EDAC amd64: F10h detected (node 2).
EDAC MC: DCT0 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 2048MB 3: 2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC MC: DCT1 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 2048MB 3: 2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC amd64: using x8 syndromes.
EDAC amd64: MCT channel count: 2
EDAC amd64: CS2: Registered DDR3 RAM
EDAC amd64: CS3: Registered DDR3 RAM
EDAC MC2: Giving out device to amd64_edac F10h: DEV 0000:00:1a.2

EDAC amd64: ECC is enabled by BIOS.
EDAC amd64: F10h detected (node 3).
EDAC MC: DCT0 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 2048MB 3: 2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC MC: DCT1 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 2048MB 3: 2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC amd64: using x8 syndromes.
EDAC amd64: MCT channel count: 2
EDAC amd64: CS2: Registered DDR3 RAM
EDAC amd64: CS3: Registered DDR3 RAM
EDAC MC3: Giving out device to amd64_edac F10h: DEV 0000:00:1b.2

EDAC amd64: ECC is enabled by BIOS.
EDAC amd64: F10h detected (node 4).
EDAC MC: DCT0 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 2048MB 3: 2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC MC: DCT1 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 2048MB 3: 2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC amd64: using x8 syndromes.
EDAC amd64: MCT channel count: 2
EDAC amd64: CS2: Registered DDR3 RAM
EDAC amd64: CS3: Registered DDR3 RAM
EDAC MC4: Giving out device to amd64_edac F10h: DEV 0000:00:1c.2

EDAC amd64: ECC is enabled by BIOS.
EDAC amd64: F10h detected (node 5).
EDAC MC: DCT0 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 2048MB 3: 2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC MC: DCT1 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 2048MB 3: 2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC amd64: using x8 syndromes.
EDAC amd64: MCT channel count: 2
EDAC amd64: CS2: Registered DDR3 RAM
EDAC amd64: CS3: Registered DDR3 RAM
EDAC MC5: Giving out device to amd64_edac F10h: DEV 0000:00:1d.2

EDAC amd64: ECC is enabled by BIOS.
EDAC amd64: F10h detected (node 6).
EDAC MC: DCT0 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 2048MB 3: 2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC MC: DCT1 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 2048MB 3: 2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC amd64: using x8 syndromes.
EDAC amd64: MCT channel count: 2
EDAC amd64: CS2: Registered DDR3 RAM
EDAC amd64: CS3: Registered DDR3 RAM
EDAC MC6: Giving out device to amd64_edac F10h: DEV 0000:00:1e.2

EDAC amd64: ECC is enabled by BIOS.
EDAC amd64: F10h detected (node 7).
EDAC MC: DCT0 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 2048MB 3: 2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC MC: DCT1 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 2048MB 3: 2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC amd64: using x8 syndromes.
EDAC amd64: MCT channel count: 2
EDAC amd64: CS2: Registered DDR3 RAM
EDAC amd64: CS3: Registered DDR3 RAM
EDAC MC7: Giving out device to amd64_edac F10h: DEV 0000:00:1f.2
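
With eight nearly identical blocks it is easy to miss the one node that differs. Here is a rough sketch that turns the driver messages into a per-node table of chip select sizes. It assumes the exact message wording shown above; the function name parse_chip_selects and the shortened sample are mine.

```python
import re

def parse_chip_selects(lines):
    """Map node -> {"DCT0": {cs: MB}, "DCT1": {cs: MB}} from driver init lines."""
    nodes = {}
    node = dct = None
    for line in lines:
        m = re.search(r"detected \(node (\d+)\)", line)
        if m:
            node = int(m.group(1))
            nodes[node] = {}
            dct = None
            continue
        m = re.search(r"(DCT\d+) chip selects", line)
        if m and node is not None:
            dct = m.group(1)
            nodes[node][dct] = {}
            continue
        # Size lines look like "EDAC amd64: MC: 2: 2048MB 3: 2048MB".
        if dct is not None and "amd64: MC:" in line:
            for cs, mb in re.findall(r"(\d+):\s*(\d+)MB", line):
                nodes[node][dct][int(cs)] = int(mb)
    return nodes

sample = """\
EDAC amd64: F10h detected (node 3).
EDAC MC: DCT0 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 2048MB 3: 2048MB
EDAC MC: DCT1 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 2048MB 3: 2048MB
EDAC amd64: MCT channel count: 2
""".splitlines()

table = parse_chip_selects(sample)
populated = {dct: [cs for cs, mb in sizes.items() if mb > 0]
             for dct, sizes in table[3].items()}
print(populated)  # {'DCT0': [2, 3], 'DCT1': [2, 3]}
```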

*****************************************************************************
4. Analysis of the information given.

This board, a Supermicro H8QG6, has 4 processors, each with 8 DIMM slots. As seen above, the EDAC driver has enumerated 8 memory controller instances (MC0-MC7). There are two MCs for each processor, and each MC serves 4 DIMM slots. The board is physically labeled like this: P1-DIMM1A, P1-DIMM1B, P1-DIMM2A, P1-DIMM2B … P1-DIMM4B, and so on up to P4-DIMM4B.

In order to make sure that you are interpreting the EDAC information correctly, you have to know the current actual DIMM setup. I have four 4GB DIMMs in the ‘A’ slots of each processor. That is a total of 16GB per processor and 64GB on the board, with 16 DIMMs installed in total.

The memory controller/EDAC device control files can be examined in the directory /sys/devices/system/edac/mc. There you will find the control files that toggle logging of correctable and uncorrectable errors (log_ce and log_ue), and a directory for each memory controller instance.

# ls -F1 /sys/devices/system/edac/mc
log_ce
log_ue
mc0/
mc1/
mc2/
mc3/
mc4/
mc5/
mc6/
mc7/
panic_on_ue
poll_msec

Here is the error again:

Northbridge Error (node 3): DRAM ECC error detected on the NB.
EDAC amd64 MC3: CE ERROR_ADDRESS= 0x6281d4710
EDAC MC3: CE page 0x6281d4, offset 0x710, grain 0, syndrome 0x2845, row 3, channel 1, label "": amd64_edac

The last line names EDAC MC3, so we look in the mc3 directory:

# cd /sys/devices/system/edac/mc
# ls -F1 mc3
ce_count
ce_noinfo_count
csrow2/
csrow3/
device@
mc_name
reset_counters
seconds_since_reset
size_mb
ue_count
ue_noinfo_count

All of these files except for the device link are text files so they can be easily examined. Look at the file size_mb for the entire controller instance:

# cd mc3
# cat size_mb
8192

This is half of the 16GB present for processor number 2. Again, I am using 4GB DDR3 DIMMs. Remember that each memory controller instance manages half of the slots adjacent to each processor. This board has 8 slots per processor and currently has 4 DIMMs installed in the A slots for each processor. There is a total of 64GB of RAM on the board: 16GB per processor, 8GB per MC, and 4GB per DIMM. Processor 2 is served by MC2 and MC3.

Each of the DIMMs is ‘dual ranked’, which means that there are 2GB per ‘chip select row’ (csrow). A ‘rank’ corresponds to a populated csrow. Thus, each 4GB DIMM shows up in two csrows.

The csrow2/ and csrow3/ directories contain the following files:

# ls -1 csrow2
ce_count
ch0_ce_count
ch0_dimm_label
dev_type
edac_mode
mem_type
size_mb
ue_count

The size_mb file contains the amount of RAM that this chip select row is
managing.

# cat csrow2/size_mb
4096

# cat csrow3/size_mb
4096

Why 4096 and not 2048 (one half of the DIMM) in both rows? Because the csrows are interleaved across two channels! This means that the memory of the two 4GB DIMMs served by one controller (for example slots 1A and 2A under MC2, or 3A and 4A under MC3) shows up in two rows and two channels. For MC3, the csrow2 and csrow3 sizes together add up to the 8192MB managed by this memory controller instance. (The other 8GB of processor 2 is managed by MC2.)
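
The same cat commands can be scripted to check that the csrow sizes add up to the controller total. This is a minimal sketch that assumes the sysfs layout shown in the listings above (a size_mb file per mcN directory and per csrowN subdirectory); the helper names are mine, and newer kernels may lay things out differently.

```python
import glob
import os

def read_int(path):
    """Read a sysfs-style text file holding a single integer."""
    with open(path) as fh:
        return int(fh.read().strip())

def csrow_sizes(mc_dir):
    """Return {csrow_name: size_mb} for one memory controller directory."""
    sizes = {}
    for path in sorted(glob.glob(os.path.join(mc_dir, "csrow*"))):
        sizes[os.path.basename(path)] = read_int(os.path.join(path, "size_mb"))
    return sizes

def mc_total(mc_dir):
    """Return the size_mb reported for the whole controller instance."""
    return read_int(os.path.join(mc_dir, "size_mb"))

# Example call on a live system (not executed here):
#   csrow_sizes("/sys/devices/system/edac/mc/mc3")
```

On the system described here, the call in the comment should return 4096 for both csrow2 and csrow3, matching the mc3 total of 8192.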

This can be confusing. Here is the correspondence between memory controllers and processors:

MC0, MC1 -> processor 1
MC2, MC3 -> processor 2
MC4, MC5 -> processor 3
MC6, MC7 -> processor 4

Memory controller MC2 manages slots 1-4 for processor 2, and MC3 manages
slots 5-8 for processor 2. The first 4 slots are P2-DIMM1A,
P2-DIMM1B, P2-DIMM2A, P2-DIMM2B, and the second 4 slots are P2-DIMM3A,
P2-DIMM3B, P2-DIMM4A, P2-DIMM4B.

Take a look at the EDAC messages for MC3 again:

EDAC amd64: ECC is enabled by BIOS.
EDAC amd64: F10h detected (node 3).
EDAC MC: DCT0 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 2048MB 3: 2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC MC: DCT1 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 2048MB 3: 2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC amd64: using x8 syndromes.
EDAC amd64: MCT channel count: 2
EDAC amd64: CS2: Registered DDR3 RAM
EDAC amd64: CS3: Registered DDR3 RAM
EDAC MC3: Giving out device to amd64_edac F10h: DEV 0000:00:1b.2

This memory controller has 8 chip select rows (listed as MC: 0-7) and, with the current DIMM installation, shows 2 channels (DCT0 and DCT1). The printout is confusing because the two characters, MC, are used in multiple places and seem to mean different things.

If we removed the DIMM in P2-DIMM4A, the EDAC driver messages would look like this:

EDAC amd64: ECC is enabled by BIOS.
EDAC amd64: F10h detected (node 3).
EDAC MC: DCT0 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 2048MB 3: 2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC MC: DCT1 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 0MB 3: 0MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC amd64: using x8 syndromes.
EDAC amd64: MCT channel count: 1
EDAC amd64: CS2: Registered DDR3 RAM
EDAC amd64: CS3: Registered DDR3 RAM
EDAC MC3: Giving out device to amd64_edac F10h: DEV 0000:00:1b.2

Note that the MCT channel count is now 1. There are still two csrows involved for the single DIMM in slot P2-DIMM3A (it is dual ranked), but the total size for each csrow is now only 2048. There is nothing in DCT1 which is channel 1.

The total for the entire memory controller mc3 with one DIMM is 4096 as expected:
# cd /sys/devices/system/edac/mc/mc3
# cat size_mb
4096

The size_mb file for mc3/csrow2 and mc3/csrow3 now contain:

# cat csrow2/size_mb
2048

# cat csrow3/size_mb
2048

That is 2048MB, or one half of the DIMM, allocated to each csrow (rank).
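
The arithmetic behind both cases can be written down in a few lines. This is a sketch under the assumption that every populated channel holds an identical DIMM, which is true of my setup but not of every system; the function name is mine.

```python
def csrow_size_mb(dimm_mb, ranks_per_dimm, channels):
    """Size of one csrow: one rank from the DIMM on each populated channel."""
    rank_mb = dimm_mb // ranks_per_dimm
    return rank_mb * channels

print(csrow_size_mb(4096, 2, 2))  # both channels populated: 4096
print(csrow_size_mb(4096, 2, 1))  # P2-DIMM4A removed: 2048
```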

It should be obvious now that the EDAC log messages and error messages do not by default show the actual physical DIMM slot on the motherboard. That has to be deduced from the triplet of mc/row/channel as explained in the conclusion.

*****************************************************************************
5. Conclusion

Take a look at the EDAC error one more time:

# dmesg | grep -E -i edac\|northbridge
Northbridge Error (node 3): DRAM ECC error detected on the NB.
EDAC amd64 MC3: CE ERROR_ADDRESS= 0x6281d4710
EDAC MC3: CE page 0x6281d4, offset 0x710, grain 0, syndrome 0x2845, row 3,
channel 1, label "": amd64_edac

As we said before, the error is on MC3, row 3, channel 1. We now know that MC3 is managing the second 4 slots of processor 2’s eight slots, and that row 3 is the 2nd rank of a dual ranked DIMM. There have also been EDAC errors for row 2, channel 1 which makes perfect sense. Row 2 is the first rank on the same DIMM.

But what physical DIMM slot contains the defective DIMM?

The reported channel number, in this case 1, corresponds to DCT1 (the 2nd channel), which is DIMM4A or DIMM4B. It must be DIMM4A, because rows 2 and 3 correspond to the A slots and rows 0 and 1 correspond to the B slots. We also know that there are no DIMMs in the B slots, which confirms it.

So the defective DIMM is P2-DIMM4A.

Here is a diagram for processor 2 showing the correspondence between rows, channels, and DIMMS. Recall that the MCx tells us which processor as explained above.

MC2
  Channel 0 (DCT0)
    row0 row1  P2-DIMM1B
    row2 row3  P2-DIMM1A
    row4 row5  unused
    row6 row7  unused
  Channel 1 (DCT1)
    row0 row1  P2-DIMM2B
    row2 row3  P2-DIMM2A
    row4 row5  unused
    row6 row7  unused
MC3
  Channel 0 (DCT0)
    row0 row1  P2-DIMM3B
    row2 row3  P2-DIMM3A
    row4 row5  unused
    row6 row7  unused
  Channel 1 (DCT1)
    row0 row1  P2-DIMM4B
    row2 row3  P2-DIMM4A
    row4 row5  unused
    row6 row7  unused
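
The diagram can be expressed as a small lookup. The sketch below hard-codes the layout of this particular board (Supermicro H8QG6, two MCs per processor, two channels per MC, dual-ranked DIMMs) and extends the processor 2 pattern to the other three processors, as described earlier. The function name dimm_label is mine, and other boards or BIOS versions may map things differently.

```python
def dimm_label(mc, row, channel):
    """Map an EDAC (mc, row, channel) triplet to the silkscreen slot label.

    Only valid for the layout in the diagram above; rows 4-7 are unused there.
    """
    if not (0 <= mc <= 7 and channel in (0, 1) and 0 <= row <= 3):
        raise ValueError("triplet outside the layout described here")
    processor = mc // 2 + 1                   # MC0,MC1 -> P1 ... MC6,MC7 -> P4
    slot_number = (mc % 2) * 2 + channel + 1  # 1..4 within the processor
    slot_letter = "B" if row < 2 else "A"     # rows 0/1 -> B, rows 2/3 -> A
    return f"P{processor}-DIMM{slot_number}{slot_letter}"

print(dimm_label(3, 3, 1))  # P2-DIMM4A
```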

*****************************************************************************
6. Appendix

On this SuperMicro H8QG6 with AMD processors and the amd64 EDAC driver code, there is a strange occurrence: if you populate the B DIMM slots, their memory shows up in csrows 0 and 1. I did experiments to demonstrate this, and it seems to be linked to the fact that the DMI enumeration recognizes the B slots before the A slots. I expected the A slots to come first, but that assumption may be mistaken.

Here is the output of dmidecode for the memory devices. As you can see, the info for P1_DIMM1B shows up before P1_DIMM1A:

# dmidecode -t 17
SMBIOS 2.6 present.

Handle 0x001E, DMI type 17, 28 bytes
Memory Device
Array Handle: 0x001C
Error Information Handle: Not Provided
Total Width: Unknown
Data Width: Unknown
Size: No Module Installed
Form Factor:
Set: None
Locator: P1_DIMM1B
Bank Locator: BANK0
Type:
Type Detail: None
Speed: Unknown
Manufacturer:
Serial Number:
Asset Tag:
Part Number:
Rank: Unknown

Handle 0x0020, DMI type 17, 28 bytes
Memory Device
Array Handle: 0x001C
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 4096 MB
Form Factor: DIMM
Set: None
Locator: P1_DIMM1A
Bank Locator: BANK1
Type: DDR3
Type Detail: Synchronous
Speed: 1333 MHz
Manufacturer: Samsung
Serial Number: 34363238
Asset Tag:
Part Number: M393B5170FH0-CH9
Rank: 2
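
To see the enumeration order at a glance, the locators can be listed in the order dmidecode prints them. This is a sketch that parses text in the format shown above; the function name memory_devices is mine, and the sample is shortened to the fields it uses. On a live system the text would come from running dmidecode -t 17 as root.

```python
def memory_devices(dmidecode_text):
    """Return [(locator, size)] in the order the type 17 handles are listed."""
    devices = []
    current = None
    for raw in dmidecode_text.splitlines():
        line = raw.strip()
        if line.startswith("Handle "):
            current = {}
            devices.append(current)
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return [(d["Locator"], d.get("Size", "")) for d in devices if "Locator" in d]

sample = """\
SMBIOS 2.6 present.

Handle 0x001E, DMI type 17, 28 bytes
Memory Device
Size: No Module Installed
Locator: P1_DIMM1B
Bank Locator: BANK0

Handle 0x0020, DMI type 17, 28 bytes
Memory Device
Size: 4096 MB
Locator: P1_DIMM1A
Bank Locator: BANK1
"""

for index, (locator, size) in enumerate(memory_devices(sample)):
    print(index, locator, size)
```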

There is a package named edac-util that has a helpful script for examining the contents of the /sys/devices/system/edac/mc directories. It is available via yum as an rpm on CentOS.

dmidecode is also very helpful with the -t 16 or -t 17 switches.

That’s it. I hope this can be of help to you as it took me a couple of days to get this far.

12 thoughts on “How To Diagnose Memory Errors on AMD x86_64 using EDAC”

  1. Great article. But for reasons unknown, with the identical motherboard and SuSE Enterprise (SLES11SP3, kernel 3.0.101-0.31) the EDAC sysfs /sys/devices/system/edac/mc directory is empty. Worse yet, edac-util and mcelog no longer work (as in older SuSE on the same board.. perhaps related to a switch to edac_core / edac_mce_amd instead of amd64_edac_mod ?) Furthermore, edac documentation is very out of date, and the [Hardware Error] that appear in dmesg give you nothing more than the mem. controller and a mem. address (see in drivers/edac/mce_amd.c)

    Any ideas? Is the absent sysfs a possible bug (maybe, or not, related to “GHES: HEST is not enabled!” ?) or SuSE weirdness? Memory errors appear within 4mb boundaries, is this a likely DIMM interleaving step? It’s rather frustrating to have too little information from the kernel to simply identify a bad RAM chip…

  2. Nice idea! I wrote a shell script for this based on /sys/devices/system/edac/mc/ and dmidecode. It uses the following parameters: .

    I count /sys/devices/system/edac/mc/mc* directories for the number of MCs.
    Dmidecode knows how many DIMM slots there are and with /sys/devices/system/edac/mc/mc$MC_id/csrow$row_id/ch* I count the channels per MC.
    The DIMM slot ID is calculated like this (in shell):
    MC_id * slots / mcs + channel_id * slots / channels + row_id / 2

    With the DIMM slot ID I just read its locator from dmidecode output.

    Your example with the SuperMicro H8QG6:
    Input: 3 3 1
    Calculation: 3 * 32 / 8 + 1 * 32 / (2 * 8) + 3 / 2 = 15
    Output: P2_DIMM4A
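
The calculation in this comment can be sketched in a few lines. This is a reading of the formula as quoted, not the commenter's actual script: the parameter order (MC, row, channel) is inferred from the "3 3 1" example, the total channel count is taken as channels per MC times the number of MCs, and the divisions are integer divisions as in shell arithmetic.

```python
def dimm_slot_index(mc_id, row_id, channel_id, slots, mcs, channels_per_mc):
    """Zero-based index into the dmidecode type 17 list, per the quoted formula."""
    channels = channels_per_mc * mcs  # total channels on the board
    return (mc_id * slots // mcs
            + channel_id * slots // channels
            + row_id // 2)

# H8QG6 example from the comment: 32 slots, 8 MCs, 2 channels per MC.
print(dimm_slot_index(3, 3, 1, slots=32, mcs=8, channels_per_mc=2))  # 15
```

Index 15 is the sixteenth memory device in the dmidecode listing, which is P2_DIMM4A when the B slot of each pair is listed first.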

    1. I’ve already seen that an mc directory or a channel is missing in /sys/devices/system/edac/mc/. In this case the integrated MC of the CPU is defective and the CPU has to be replaced.

    2. On the H8QG6 the DIMM locator names changed after a BIOS update (“P2_DIMM4A” -> “P2_4A”) and the order in dmidecode changed. The ones ending with ‘A’ are listed first and belong to row 2/3 now. So using “3 2 1” as parameters points to P2_4B but this slot is empty and it should be P2_4A instead.
      Having to invert the logic for the rows is really annoying. So better check twice the logic used on your server.

    3. I have a bug report that this whole method does not work correctly. 🙁 Use the “CE ERROR_ADDRESS” instead. E.g. we have an error at 0x24bcfff3d0. In dmidecode there is a section “type 20” below each “type 17” DIMM. This section shows to which address range the DIMM above is mapped.

      Example:
      Handle 0x0020, DMI type 20, 19 bytes
      Memory Device Mapped Address
      Starting Address: 0x00400000000
      Ending Address: 0x007FFFFFFFF
      Range Size: 16 GB

      My script points to DIMM ID 8 but it is DIMM ID 9 instead. You get the same if dividing the CE error address by the size of a DIMM.

      0x24bcfff3d0 = 157,789,713,360 bytes = 146.95 GB
      146.95 GB / 16 GB = 9.18
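
The division above can be checked with a short sketch. It assumes equal-size DIMMs mapped back to back from address 0, with no interleaving across DIMMs and no memory hole; where that does not hold, the type 20 address ranges from dmidecode are the safer guide. The function name is mine.

```python
GIB = 1 << 30

def dimm_index_from_address(address, dimm_gib):
    """Index of the DIMM whose address range would contain the given address."""
    return address // (dimm_gib * GIB)

addr = 0x24bcfff3d0
print(addr)                               # 157789713360
print(round(addr / GIB, 2))               # 146.95
print(dimm_index_from_address(addr, 16))  # 9
```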

  3. Implicitly, it is assumed that the failure of each bit in a word of memory is independent, resulting in improbability of two simultaneous errors.
