When running an executable compiled with Intel mpiicc, I get the following errors after about 30 minutes of running:

 kernel:[29585.573874] [Hardware Error]: Corrected error, no action required.

Message from syslogd@pablo at Nov  8 09:53:25 ...
 kernel:[29585.573881] [Hardware Error]: CPU:2 (17:31:0) MC18_STATUS[Over|CE|MiscV|-|AddrV|-|-|SyndV|-|CECC]: 0xdc2041000000011b

Message from syslogd@pablo at Nov  8 09:53:25 ...
 kernel:[29585.573887] [Hardware Error]: Error Addr: 0x0000000a6c12d280

Message from syslogd@pablo at Nov  8 09:53:25 ...
 kernel:[29585.573888] [Hardware Error]: IPID: 0x0000009600550f00, Syndrome: 0xc54c00040a800611

Message from syslogd@pablo at Nov  8 09:53:25 ...
 kernel:[29585.573891] [Hardware Error]: Unified Memory Controller Extended Error Code: 0

Message from syslogd@pablo at Nov  8 09:53:25 ...
 kernel:[29585.573893] [Hardware Error]: Unified Memory Controller Error: DRAM ECC error.

Message from syslogd@pablo at Nov  8 09:53:25 ...
 kernel:[29585.573895] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
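
For reference, the MC18_STATUS value above can be decoded by hand. This is only a minimal sketch in Python: the bit positions are an assumption taken from the Linux kernel's MCA status definitions (the MCI_STATUS_* masks plus the AMD-specific ECC bits) as I understand them, and the names `MCA_STATUS_BITS` and `decode_mca_status` are made up for illustration. It reports the same flags the kernel printed, in particular CECC (corrected ECC) with UC (uncorrected) not set.

```python
# Bit positions are an ASSUMPTION based on the Linux kernel's MCA status
# definitions (MCI_STATUS_* and the AMD ECC bits); verify against your kernel.
MCA_STATUS_BITS = {
    "Val": 63,       # register contains a valid error
    "Over": 62,      # an earlier error was overwritten (overflow)
    "UC": 61,        # uncorrected error
    "En": 60,        # error reporting enabled
    "MiscV": 59,     # MISC register valid
    "AddrV": 58,     # error address valid
    "PCC": 57,       # processor context corrupt
    "TCC": 55,       # task context corrupt (AMD)
    "SyndV": 53,     # syndrome register valid (AMD)
    "CECC": 46,      # corrected by ECC (AMD)
    "UECC": 45,      # uncorrectable ECC error (AMD)
    "Deferred": 44,  # deferred error (AMD)
    "Poison": 43,    # poisoned data consumed (AMD)
}


def decode_mca_status(status):
    """Return the names of all flags set in a 64-bit MCA status value."""
    return [name for name, bit in MCA_STATUS_BITS.items() if (status >> bit) & 1]


# Value copied from the MC18_STATUS line in the log above.
flags = decode_mca_status(0xDC2041000000011B)
print(flags)
# ['Val', 'Over', 'En', 'MiscV', 'AddrV', 'SyndV', 'CECC']
```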

I am working on an AMD EPYC 7702P 64-core processor with 1 TB of RAM, running Debian:

Linux pablo 4.19.0-9-amd64 #1 SMP Debian 4.19.118-2+deb10u1 (2020-06-07) x86_64 GNU/Linux

Based on what I have read, I ran the command dmidecode -t memory, which gives:

# dmidecode 3.2
Getting SMBIOS data from sysfs.
SMBIOS 3.2.0 present.

Handle 0x0023, DMI type 16, 23 bytes
Physical Memory Array
    Location: System Board Or Motherboard
    Use: System Memory
    Error Correction Type: Multi-bit ECC
    Maximum Capacity: 2 TB
    Error Information Handle: 0x0022
    Number Of Devices: 8

Handle 0x002B, DMI type 17, 84 bytes
Memory Device
    Array Handle: 0x0023
    Error Information Handle: 0x002A
    Total Width: 72 bits
    Data Width: 64 bits
    Size: 128 GB
    Form Factor: DIMM
    Set: None
    Locator: DIMM 0
    Bank Locator: P0 CHANNEL A
    Type: DDR4
    Type Detail: Synchronous Registered (Buffered) LRDIMM
    Speed: 2933 MT/s
    Manufacturer: Samsung
    Serial Number: 03C6F701
    Asset Tag: Not Specified
    Part Number: M386AAG40MMB-CVF
    Rank: 4
    Configured Memory Speed: 2933 MT/s
    Minimum Voltage: 1.2 V
    Maximum Voltage: 1.2 V
    Configured Voltage: 1.2 V
    Memory Technology: DRAM
    Memory Operating Mode Capability: Volatile memory
    Firmware Version: Unknown
    Module Manufacturer ID: Bank 1, Hex 0xCE
    Module Product ID: Unknown
    Memory Subsystem Controller Manufacturer ID: Unknown
    Memory Subsystem Controller Product ID: Unknown
    Non-Volatile Size: None
    Volatile Size: 128 kB
    Cache Size: None
    Logical Size: None

Handle 0x002E, DMI type 17, 84 bytes
Memory Device
    Array Handle: 0x0023
    Error Information Handle: 0x002D
    Total Width: 72 bits
    Data Width: 64 bits
    Size: 128 GB
    Form Factor: DIMM
    Set: None
    Locator: DIMM 0
    Bank Locator: P0 CHANNEL B
    Type: DDR4
    Type Detail: Synchronous Registered (Buffered) LRDIMM
    Speed: 2933 MT/s
    Manufacturer: Samsung
    Serial Number: 03C6F3ED
    Asset Tag: Not Specified
    Part Number: M386AAG40MMB-CVF
    Rank: 4
    Configured Memory Speed: 2933 MT/s
    Minimum Voltage: 1.2 V
    Maximum Voltage: 1.2 V
    Configured Voltage: 1.2 V
    Memory Technology: DRAM
    Memory Operating Mode Capability: Volatile memory
    Firmware Version: Unknown
    Module Manufacturer ID: Bank 1, Hex 0xCE
    Module Product ID: Unknown
    Memory Subsystem Controller Manufacturer ID: Unknown
    Memory Subsystem Controller Product ID: Unknown
    Non-Volatile Size: None
    Volatile Size: 128 kB
    Cache Size: None
    Logical Size: None

Handle 0x0031, DMI type 17, 84 bytes
Memory Device
    Array Handle: 0x0023
    Error Information Handle: 0x0030
    Total Width: 72 bits
    Data Width: 64 bits
    Size: 128 GB
    Form Factor: DIMM
    Set: None
    Locator: DIMM 0
    Bank Locator: P0 CHANNEL C
    Type: DDR4
    Type Detail: Synchronous Registered (Buffered) LRDIMM
    Speed: 2933 MT/s
    Manufacturer: Samsung
    Serial Number: 03C6F4BA
    Asset Tag: Not Specified
    Part Number: M386AAG40MMB-CVF
    Rank: 4
    Configured Memory Speed: 2933 MT/s
    Minimum Voltage: 1.2 V
    Maximum Voltage: 1.2 V
    Configured Voltage: 1.2 V
    Memory Technology: DRAM
    Memory Operating Mode Capability: Volatile memory
    Firmware Version: Unknown
    Module Manufacturer ID: Bank 1, Hex 0xCE
    Module Product ID: Unknown
    Memory Subsystem Controller Manufacturer ID: Unknown
    Memory Subsystem Controller Product ID: Unknown
    Non-Volatile Size: None
    Volatile Size: 128 kB
    Cache Size: None
    Logical Size: None

Handle 0x0034, DMI type 17, 84 bytes
Memory Device
    Array Handle: 0x0023
    Error Information Handle: 0x0033
    Total Width: 72 bits
    Data Width: 64 bits
    Size: 128 GB
    Form Factor: DIMM
    Set: None
    Locator: DIMM 0
    Bank Locator: P0 CHANNEL D
    Type: DDR4
    Type Detail: Synchronous Registered (Buffered) LRDIMM
    Speed: 2933 MT/s
    Manufacturer: Samsung
    Serial Number: 03C6F396
    Asset Tag: Not Specified
    Part Number: M386AAG40MMB-CVF
    Rank: 4
    Configured Memory Speed: 2933 MT/s
    Minimum Voltage: 1.2 V
    Maximum Voltage: 1.2 V
    Configured Voltage: 1.2 V
    Memory Technology: DRAM
    Memory Operating Mode Capability: Volatile memory
    Firmware Version: Unknown
    Module Manufacturer ID: Bank 1, Hex 0xCE
    Module Product ID: Unknown
    Memory Subsystem Controller Manufacturer ID: Unknown
    Memory Subsystem Controller Product ID: Unknown
    Non-Volatile Size: None
    Volatile Size: 128 kB
    Cache Size: None
    Logical Size: None

Handle 0x0037, DMI type 17, 84 bytes
Memory Device
    Array Handle: 0x0023
    Error Information Handle: 0x0036
    Total Width: 72 bits
    Data Width: 64 bits
    Size: 128 GB
    Form Factor: DIMM
    Set: None
    Locator: DIMM 0
    Bank Locator: P0 CHANNEL E
    Type: DDR4
    Type Detail: Synchronous Registered (Buffered) LRDIMM
    Speed: 2933 MT/s
    Manufacturer: Samsung
    Serial Number: 03C6F67D
    Asset Tag: Not Specified
    Part Number: M386AAG40MMB-CVF
    Rank: 4
    Configured Memory Speed: 2933 MT/s
    Minimum Voltage: 1.2 V
    Maximum Voltage: 1.2 V
    Configured Voltage: 1.2 V
    Memory Technology: DRAM
    Memory Operating Mode Capability: Volatile memory
    Firmware Version: Unknown
    Module Manufacturer ID: Bank 1, Hex 0xCE
    Module Product ID: Unknown
    Memory Subsystem Controller Manufacturer ID: Unknown
    Memory Subsystem Controller Product ID: Unknown
    Non-Volatile Size: None
    Volatile Size: 128 kB
    Cache Size: None
    Logical Size: None

Handle 0x003A, DMI type 17, 84 bytes
Memory Device
    Array Handle: 0x0023
    Error Information Handle: 0x0039
    Total Width: 72 bits
    Data Width: 64 bits
    Size: 128 GB
    Form Factor: DIMM
    Set: None
    Locator: DIMM 0
    Bank Locator: P0 CHANNEL F
    Type: DDR4
    Type Detail: Synchronous Registered (Buffered) LRDIMM
    Speed: 2933 MT/s
    Manufacturer: Samsung
    Serial Number: 03C6F394
    Asset Tag: Not Specified
    Part Number: M386AAG40MMB-CVF
    Rank: 4
    Configured Memory Speed: 2933 MT/s
    Minimum Voltage: 1.2 V
    Maximum Voltage: 1.2 V
    Configured Voltage: 1.2 V
    Memory Technology: DRAM
    Memory Operating Mode Capability: Volatile memory
    Firmware Version: Unknown
    Module Manufacturer ID: Bank 1, Hex 0xCE
    Module Product ID: Unknown
    Memory Subsystem Controller Manufacturer ID: Unknown
    Memory Subsystem Controller Product ID: Unknown
    Non-Volatile Size: None
    Volatile Size: 128 kB
    Cache Size: None
    Logical Size: None

Handle 0x003D, DMI type 17, 84 bytes
Memory Device
    Array Handle: 0x0023
    Error Information Handle: 0x003C
    Total Width: 72 bits
    Data Width: 64 bits
    Size: 128 GB
    Form Factor: DIMM
    Set: None
    Locator: DIMM 0
    Bank Locator: P0 CHANNEL G
    Type: DDR4
    Type Detail: Synchronous Registered (Buffered) LRDIMM
    Speed: 2933 MT/s
    Manufacturer: Samsung
    Serial Number: 03C6F48A
    Asset Tag: Not Specified
    Part Number: M386AAG40MMB-CVF
    Rank: 4
    Configured Memory Speed: 2933 MT/s
    Minimum Voltage: 1.2 V
    Maximum Voltage: 1.2 V
    Configured Voltage: 1.2 V
    Memory Technology: DRAM
    Memory Operating Mode Capability: Volatile memory
    Firmware Version: Unknown
    Module Manufacturer ID: Bank 1, Hex 0xCE
    Module Product ID: Unknown
    Memory Subsystem Controller Manufacturer ID: Unknown
    Memory Subsystem Controller Product ID: Unknown
    Non-Volatile Size: None
    Volatile Size: 128 kB
    Cache Size: None
    Logical Size: None

Handle 0x0040, DMI type 17, 84 bytes
Memory Device
    Array Handle: 0x0023
    Error Information Handle: 0x003F
    Total Width: 72 bits
    Data Width: 64 bits
    Size: 128 GB
    Form Factor: DIMM
    Set: None
    Locator: DIMM 0
    Bank Locator: P0 CHANNEL H
    Type: DDR4
    Type Detail: Synchronous Registered (Buffered) LRDIMM
    Speed: 2933 MT/s
    Manufacturer: Samsung
    Serial Number: 03C6F3FB
    Asset Tag: Not Specified
    Part Number: M386AAG40MMB-CVF
    Rank: 4
    Configured Memory Speed: 2933 MT/s
    Minimum Voltage: 1.2 V
    Maximum Voltage: 1.2 V
    Configured Voltage: 1.2 V
    Memory Technology: DRAM
    Memory Operating Mode Capability: Volatile memory
    Firmware Version: Unknown
    Module Manufacturer ID: Bank 1, Hex 0xCE
    Module Product ID: Unknown
    Memory Subsystem Controller Manufacturer ID: Unknown
    Memory Subsystem Controller Product ID: Unknown
    Non-Volatile Size: None
    Volatile Size: 128 kB
    Cache Size: None
    Logical Size: None

I don't know where these DRAM ECC errors come from. Maybe there is an incompatibility between my motherboard and CPU model, or a bad version of the Intel compiler SDK?

These errors appear roughly every 5 minutes during execution.

I am using the Intel compilers, version compilers_and_libraries_2020.1.217.

I get the same error messages when I compile with the Open MPI version from the official Debian 10 repository.

Maybe I should change an option in the BIOS, but I am not sure.

If anyone has an idea how to solve this issue, I would be grateful.

1 Answer

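It seems your RAM is faulty; this is a hardware problem, not a compiler or MPI problem. The kernel log reports a corrected DRAM ECC error from the memory controller, and you see it with both the Intel and Open MPI builds. I suggest you either run memtest for a long time, or swap the sticks and try your application again. The application probably allocates enough RAM that it ends up accessing the faulty regions.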
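
Before swapping sticks, it may help to see which DIMM is accumulating the corrected errors. The following is a rough sketch in Python, assuming the kernel's EDAC driver is loaded and exposes per-DIMM counters under /sys/devices/system/edac/mc (files named dimm_ce_count and dimm_label inside dimm* or rank* directories). The exact layout depends on the driver and kernel version, so treat the paths as assumptions; the function name `read_ce_counts` is hypothetical.

```python
from pathlib import Path


def read_ce_counts(edac_root="/sys/devices/system/edac/mc"):
    """Collect corrected-error counts per DIMM from EDAC sysfs.

    Returns a dict mapping (controller, dimm_dir) -> (label, ce_count).
    The sysfs layout is an ASSUMPTION: some drivers expose dimm*
    directories, others rank*; entries without a counter are skipped.
    """
    root = Path(edac_root)
    dimm_dirs = sorted(root.glob("mc*/dimm*")) + sorted(root.glob("mc*/rank*"))
    counts = {}
    for dimm in dimm_dirs:
        ce_file = dimm / "dimm_ce_count"
        if not ce_file.is_file():
            continue
        label_file = dimm / "dimm_label"
        label = label_file.read_text().strip() if label_file.is_file() else ""
        counts[(dimm.parent.name, dimm.name)] = (label, int(ce_file.read_text()))
    return counts


def noisy_dimms(counts):
    """Keep only the entries whose corrected-error count is non-zero."""
    return {key: value for key, value in counts.items() if value[1] > 0}
```

Calling `noisy_dimms(read_ce_counts())` while the application runs should, if the counters are available, show a non-zero count on only one or a few entries. That stick is the first candidate to replace or to move to another slot to check whether the error follows the module or stays with the slot.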
