When I run an executable compiled with Intel mpiicc, I get the following errors after about 30 minutes of running:
kernel:[29585.573874] [Hardware Error]: Corrected error, no action required.
Message from syslogd@pablo at Nov 8 09:53:25 ...
kernel:[29585.573881] [Hardware Error]: CPU:2 (17:31:0) MC18_STATUS[Over|CE|MiscV|-|AddrV|-|-|SyndV|-|CECC]: 0xdc2041000000011b
Message from syslogd@pablo at Nov 8 09:53:25 ...
kernel:[29585.573887] [Hardware Error]: Error Addr: 0x0000000a6c12d280
Message from syslogd@pablo at Nov 8 09:53:25 ...
kernel:[29585.573888] [Hardware Error]: IPID: 0x0000009600550f00, Syndrome: 0xc54c00040a800611
Message from syslogd@pablo at Nov 8 09:53:25 ...
kernel:[29585.573891] [Hardware Error]: Unified Memory Controller Extended Error Code: 0
Message from syslogd@pablo at Nov 8 09:53:25 ...
kernel:[29585.573893] [Hardware Error]: Unified Memory Controller Error: DRAM ECC error.
Message from syslogd@pablo at Nov 8 09:53:25 ...
kernel:[29585.573895] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
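The raw MC18_STATUS value in the log can be cross-checked by hand. Below is a minimal Python sketch, not an official decoder. The bit positions in STATUS_BITS are an assumption based on the Linux kernel's MCI_STATUS_* definitions, and decode_status and error_codes are hypothetical helper names introduced only for this illustration.

```python
# Assumed bit layout of an x86 MCA status register, following the Linux
# kernel's MCI_STATUS_* definitions. Verify against the AMD documentation
# for your CPU family before relying on it.
STATUS_BITS = {
    63: "Val",       # register holds a valid error record
    62: "Over",      # overflow: an earlier error record was overwritten
    61: "UC",        # uncorrected error (clear means the error was corrected)
    60: "En",        # error reporting was enabled
    59: "MiscV",     # MISC register contents are valid
    58: "AddrV",     # ADDR register contents are valid
    57: "PCC",       # processor context corrupt
    53: "SyndV",     # syndrome register contents are valid
    46: "CECC",      # error was corrected by ECC
    45: "UECC",      # error was not correctable by ECC
    44: "Deferred",  # deferred error
    43: "Poison",    # poisoned data consumed
}


def decode_status(status):
    """Return the names of the flag bits set in `status`, highest bit first."""
    return [
        name
        for bit, name in sorted(STATUS_BITS.items(), reverse=True)
        if (status >> bit) & 1
    ]


def error_codes(status):
    """Return (error code, extended error code) taken from the low bits."""
    return status & 0xFFFF, (status >> 16) & 0x3F


# Value copied from the MC18_STATUS line in the log above.
status = 0xDC2041000000011B
print(decode_status(status))
print([hex(code) for code in error_codes(status)])
```

For the value in the log, Over, MiscV, AddrV, SyndV and CECC are set and UC is clear, which agrees with the kernel's own decoded line: the error was corrected by ECC. The extended error code is 0, matching the "Extended Error Code: 0" line.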
I am working on an AMD EPYC 7702P 64-Core Processor with 1 TB of RAM, running Debian:
Linux pablo 4.19.0-9-amd64 #1 SMP Debian 4.19.118-2+deb10u1 (2020-06-07) x86_64 GNU/Linux
To check the memory configuration, I ran the command dmidecode -t memory, which gives:
# dmidecode 3.2
Getting SMBIOS data from sysfs.
SMBIOS 3.2.0 present.
Handle 0x0023, DMI type 16, 23 bytes
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: Multi-bit ECC
Maximum Capacity: 2 TB
Error Information Handle: 0x0022
Number Of Devices: 8
Handle 0x002B, DMI type 17, 84 bytes
Memory Device
Array Handle: 0x0023
Error Information Handle: 0x002A
Total Width: 72 bits
Data Width: 64 bits
Size: 128 GB
Form Factor: DIMM
Set: None
Locator: DIMM 0
Bank Locator: P0 CHANNEL A
Type: DDR4
Type Detail: Synchronous Registered (Buffered) LRDIMM
Speed: 2933 MT/s
Manufacturer: Samsung
Serial Number: 03C6F701
Asset Tag: Not Specified
Part Number: M386AAG40MMB-CVF
Rank: 4
Configured Memory Speed: 2933 MT/s
Minimum Voltage: 1.2 V
Maximum Voltage: 1.2 V
Configured Voltage: 1.2 V
Memory Technology: DRAM
Memory Operating Mode Capability: Volatile memory
Firmware Version: Unknown
Module Manufacturer ID: Bank 1, Hex 0xCE
Module Product ID: Unknown
Memory Subsystem Controller Manufacturer ID: Unknown
Memory Subsystem Controller Product ID: Unknown
Non-Volatile Size: None
Volatile Size: 128 kB
Cache Size: None
Logical Size: None
Handle 0x002E, DMI type 17, 84 bytes
Memory Device
Array Handle: 0x0023
Error Information Handle: 0x002D
Total Width: 72 bits
Data Width: 64 bits
Size: 128 GB
Form Factor: DIMM
Set: None
Locator: DIMM 0
Bank Locator: P0 CHANNEL B
Type: DDR4
Type Detail: Synchronous Registered (Buffered) LRDIMM
Speed: 2933 MT/s
Manufacturer: Samsung
Serial Number: 03C6F3ED
Asset Tag: Not Specified
Part Number: M386AAG40MMB-CVF
Rank: 4
Configured Memory Speed: 2933 MT/s
Minimum Voltage: 1.2 V
Maximum Voltage: 1.2 V
Configured Voltage: 1.2 V
Memory Technology: DRAM
Memory Operating Mode Capability: Volatile memory
Firmware Version: Unknown
Module Manufacturer ID: Bank 1, Hex 0xCE
Module Product ID: Unknown
Memory Subsystem Controller Manufacturer ID: Unknown
Memory Subsystem Controller Product ID: Unknown
Non-Volatile Size: None
Volatile Size: 128 kB
Cache Size: None
Logical Size: None
Handle 0x0031, DMI type 17, 84 bytes
Memory Device
Array Handle: 0x0023
Error Information Handle: 0x0030
Total Width: 72 bits
Data Width: 64 bits
Size: 128 GB
Form Factor: DIMM
Set: None
Locator: DIMM 0
Bank Locator: P0 CHANNEL C
Type: DDR4
Type Detail: Synchronous Registered (Buffered) LRDIMM
Speed: 2933 MT/s
Manufacturer: Samsung
Serial Number: 03C6F4BA
Asset Tag: Not Specified
Part Number: M386AAG40MMB-CVF
Rank: 4
Configured Memory Speed: 2933 MT/s
Minimum Voltage: 1.2 V
Maximum Voltage: 1.2 V
Configured Voltage: 1.2 V
Memory Technology: DRAM
Memory Operating Mode Capability: Volatile memory
Firmware Version: Unknown
Module Manufacturer ID: Bank 1, Hex 0xCE
Module Product ID: Unknown
Memory Subsystem Controller Manufacturer ID: Unknown
Memory Subsystem Controller Product ID: Unknown
Non-Volatile Size: None
Volatile Size: 128 kB
Cache Size: None
Logical Size: None
Handle 0x0034, DMI type 17, 84 bytes
Memory Device
Array Handle: 0x0023
Error Information Handle: 0x0033
Total Width: 72 bits
Data Width: 64 bits
Size: 128 GB
Form Factor: DIMM
Set: None
Locator: DIMM 0
Bank Locator: P0 CHANNEL D
Type: DDR4
Type Detail: Synchronous Registered (Buffered) LRDIMM
Speed: 2933 MT/s
Manufacturer: Samsung
Serial Number: 03C6F396
Asset Tag: Not Specified
Part Number: M386AAG40MMB-CVF
Rank: 4
Configured Memory Speed: 2933 MT/s
Minimum Voltage: 1.2 V
Maximum Voltage: 1.2 V
Configured Voltage: 1.2 V
Memory Technology: DRAM
Memory Operating Mode Capability: Volatile memory
Firmware Version: Unknown
Module Manufacturer ID: Bank 1, Hex 0xCE
Module Product ID: Unknown
Memory Subsystem Controller Manufacturer ID: Unknown
Memory Subsystem Controller Product ID: Unknown
Non-Volatile Size: None
Volatile Size: 128 kB
Cache Size: None
Logical Size: None
Handle 0x0037, DMI type 17, 84 bytes
Memory Device
Array Handle: 0x0023
Error Information Handle: 0x0036
Total Width: 72 bits
Data Width: 64 bits
Size: 128 GB
Form Factor: DIMM
Set: None
Locator: DIMM 0
Bank Locator: P0 CHANNEL E
Type: DDR4
Type Detail: Synchronous Registered (Buffered) LRDIMM
Speed: 2933 MT/s
Manufacturer: Samsung
Serial Number: 03C6F67D
Asset Tag: Not Specified
Part Number: M386AAG40MMB-CVF
Rank: 4
Configured Memory Speed: 2933 MT/s
Minimum Voltage: 1.2 V
Maximum Voltage: 1.2 V
Configured Voltage: 1.2 V
Memory Technology: DRAM
Memory Operating Mode Capability: Volatile memory
Firmware Version: Unknown
Module Manufacturer ID: Bank 1, Hex 0xCE
Module Product ID: Unknown
Memory Subsystem Controller Manufacturer ID: Unknown
Memory Subsystem Controller Product ID: Unknown
Non-Volatile Size: None
Volatile Size: 128 kB
Cache Size: None
Logical Size: None
Handle 0x003A, DMI type 17, 84 bytes
Memory Device
Array Handle: 0x0023
Error Information Handle: 0x0039
Total Width: 72 bits
Data Width: 64 bits
Size: 128 GB
Form Factor: DIMM
Set: None
Locator: DIMM 0
Bank Locator: P0 CHANNEL F
Type: DDR4
Type Detail: Synchronous Registered (Buffered) LRDIMM
Speed: 2933 MT/s
Manufacturer: Samsung
Serial Number: 03C6F394
Asset Tag: Not Specified
Part Number: M386AAG40MMB-CVF
Rank: 4
Configured Memory Speed: 2933 MT/s
Minimum Voltage: 1.2 V
Maximum Voltage: 1.2 V
Configured Voltage: 1.2 V
Memory Technology: DRAM
Memory Operating Mode Capability: Volatile memory
Firmware Version: Unknown
Module Manufacturer ID: Bank 1, Hex 0xCE
Module Product ID: Unknown
Memory Subsystem Controller Manufacturer ID: Unknown
Memory Subsystem Controller Product ID: Unknown
Non-Volatile Size: None
Volatile Size: 128 kB
Cache Size: None
Logical Size: None
Handle 0x003D, DMI type 17, 84 bytes
Memory Device
Array Handle: 0x0023
Error Information Handle: 0x003C
Total Width: 72 bits
Data Width: 64 bits
Size: 128 GB
Form Factor: DIMM
Set: None
Locator: DIMM 0
Bank Locator: P0 CHANNEL G
Type: DDR4
Type Detail: Synchronous Registered (Buffered) LRDIMM
Speed: 2933 MT/s
Manufacturer: Samsung
Serial Number: 03C6F48A
Asset Tag: Not Specified
Part Number: M386AAG40MMB-CVF
Rank: 4
Configured Memory Speed: 2933 MT/s
Minimum Voltage: 1.2 V
Maximum Voltage: 1.2 V
Configured Voltage: 1.2 V
Memory Technology: DRAM
Memory Operating Mode Capability: Volatile memory
Firmware Version: Unknown
Module Manufacturer ID: Bank 1, Hex 0xCE
Module Product ID: Unknown
Memory Subsystem Controller Manufacturer ID: Unknown
Memory Subsystem Controller Product ID: Unknown
Non-Volatile Size: None
Volatile Size: 128 kB
Cache Size: None
Logical Size: None
Handle 0x0040, DMI type 17, 84 bytes
Memory Device
Array Handle: 0x0023
Error Information Handle: 0x003F
Total Width: 72 bits
Data Width: 64 bits
Size: 128 GB
Form Factor: DIMM
Set: None
Locator: DIMM 0
Bank Locator: P0 CHANNEL H
Type: DDR4
Type Detail: Synchronous Registered (Buffered) LRDIMM
Speed: 2933 MT/s
Manufacturer: Samsung
Serial Number: 03C6F3FB
Asset Tag: Not Specified
Part Number: M386AAG40MMB-CVF
Rank: 4
Configured Memory Speed: 2933 MT/s
Minimum Voltage: 1.2 V
Maximum Voltage: 1.2 V
Configured Voltage: 1.2 V
Memory Technology: DRAM
Memory Operating Mode Capability: Volatile memory
Firmware Version: Unknown
Module Manufacturer ID: Bank 1, Hex 0xCE
Module Product ID: Unknown
Memory Subsystem Controller Manufacturer ID: Unknown
Memory Subsystem Controller Product ID: Unknown
Non-Volatile Size: None
Volatile Size: 128 kB
Cache Size: None
Logical Size: None
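The listing above is long, so it may help to condense it to one line per module. Below is a minimal Python sketch, assuming the dmidecode output has been captured as text. The names parse_memory_devices and total_size_gb are hypothetical helpers, and SAMPLE is a shortened excerpt of the output above.

```python
import re

# Shortened excerpt of the `dmidecode -t memory` output shown above.
SAMPLE = """\
Handle 0x002B, DMI type 17, 84 bytes
Memory Device
Size: 128 GB
Bank Locator: P0 CHANNEL A
Serial Number: 03C6F701
Handle 0x002E, DMI type 17, 84 bytes
Memory Device
Size: 128 GB
Bank Locator: P0 CHANNEL B
Serial Number: 03C6F3ED
"""


def parse_memory_devices(text):
    """Return one dict of "Key: value" fields per "Memory Device" record."""
    devices = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Handle "):
            current = None          # a new record starts; type not known yet
        elif line == "Memory Device":
            current = {}
            devices.append(current)
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return devices


def total_size_gb(devices):
    """Sum the sizes of all modules whose Size field is given in GB."""
    total = 0
    for dev in devices:
        match = re.fullmatch(r"(\d+) GB", dev.get("Size", ""))
        if match:
            total += int(match.group(1))
    return total


devices = parse_memory_devices(SAMPLE)
for dev in devices:
    print(dev["Bank Locator"], dev["Serial Number"], dev["Size"])
print("total GB:", total_size_gb(devices))
```

Run against the full output, this gives the channel-to-serial-number mapping for all eight modules, which is useful if a single DIMM later has to be identified and swapped.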
I don't know where these DRAM ECC errors come from. Maybe there is an incompatibility between my motherboard and CPU model, or a bad version of the Intel compiler SDK?
These errors appear roughly every 5 minutes during the execution.
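To see whether the recurring errors always hit the same module, the per-DIMM counters exposed by the kernel's EDAC subsystem can be polled. Below is a minimal Python sketch under several assumptions: an EDAC driver is loaded (presumably amd64_edac on this platform), the counters live under /sys/devices/system/edac/mc, and the per-module directories are named dimmN or rankN depending on the driver. The function name dimm_error_counts is a hypothetical helper.

```python
from pathlib import Path

# Assumed EDAC sysfs location; it is only populated when an EDAC driver
# is loaded for the memory controller.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_text(path, default=""):
    """Read a small sysfs file, returning `default` if it cannot be read."""
    try:
        return path.read_text().strip()
    except OSError:
        return default


def dimm_error_counts(root=EDAC_ROOT):
    """Return corrected/uncorrected error counts for each DIMM or rank entry."""
    root = Path(root)
    entries = []
    # Directory naming differs between drivers, so both patterns are tried.
    for pattern in ("mc*/dimm*", "mc*/rank*"):
        entries.extend(sorted(root.glob(pattern)))
    rows = []
    for entry in entries:
        rows.append({
            "dimm": f"{entry.parent.name}/{entry.name}",
            "label": read_text(entry / "dimm_label"),
            "ce": int(read_text(entry / "dimm_ce_count") or 0),
            "ue": int(read_text(entry / "dimm_ue_count") or 0),
        })
    return rows


# Prints nothing if the EDAC directory is absent on this machine.
for row in dimm_error_counts():
    print(row)
```

Running this before and after a job and comparing the "ce" values would show whether the corrected errors accumulate on a single module or are spread across several.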
I am using the Intel compilers, version compilers_and_libraries_2020.1.217.
I also get the same error messages when I compile with the Open MPI version from the official Debian 10 repository.
Maybe I should change an option in the BIOS, but I am not sure.
If anyone has an idea of how to solve this issue, I would be grateful to hear it.