How To Diagnose Memory Errors on AMD x86_64 using EDAC

This is a writeup I put together to help identify a defective DIMM from EDAC errors on Linux x86_64.

Over several years of managing a Linux cluster I have occasionally had systems with a bad memory DIMM. An early sign of such a fault is EDAC messages in the kernel ring buffer (EDAC is the kernel's Error Detection and Correction module). The frustrating part is identifying which DIMM is bad. In the past I took a brute-force approach, running the system with a single DIMM at a time until I found the offending one. As systems have grown, with more CPUs and more DIMMs, this has become impractical. The information in the EDAC error messages ought to be sufficient to identify the offending DIMM. Unfortunately, it is not obvious how to do this, and I have found no single source that explains the process. So, using information from a number of sources, I have worked it out for some of our current motherboards (e.g. SuperMicro AMD64). The following is a summary of the steps I used, which I believe can be generalized to other motherboards.