
I'm just starting to look into mcelog (I've enabled it and seen its syslog output before, but this is the first time I'm trying to do something non-default). I'm looking for information on how to write triggers for it: specifically, what kinds of events mcelog can react to, how it decides which scripts to execute, and so on. The best I can make out from the example trigger is that it sets a bunch of environment variables before invoking the script. So does it just try to execute everything in the trigger directory (which is /etc/mcelog on RHEL) and let each script decide what it wants to act on?

I've seen other trigger scripts with names that look like MCE events. Is that just convention, or does the name have a special function? I created a trigger called /etc/mcelog/joel.sh which just sends a basic email to my Gmail account. A few days ago the trigger apparently went off, because I got an email from the script without running it manually. I didn't think to pipe env output to the mailx command in joel.sh, so I don't know what hardware event triggered the script or why mcelog picked joel.sh as the script to execute for it.

Basically, I'm looking for an answer that will give me a basic orientation to mcelog, its triggering system, and how I can use it to monitor my hardware health. I'm pretty sure I can figure out the more advanced stuff once I get my bearings.

1 Answer

The sample mcelog.conf config file appears to list most, if not all, of the trigger types mcelog can handle. Most sections pair a threshold with the name of the trigger script to run when that threshold is exceeded.

DIMMs

[dimm]
#
# execute these triggers when the rate of corrected or uncorrected
# errors per DIMM exceeds the threshold
# Note when the hardware does not report DIMMs this might also
# be per channel
# The default of 10/24h is reasonable for server quality
# DDR3 DIMMs as of 2009/10
#uc-error-trigger = dimm-error-trigger
uc-error-threshold = 1 / 24h
#ce-error-trigger = dimm-error-trigger
ce-error-threshold = 10 / 24h

Sockets

[socket]
# Threshold and trigger for uncorrected memory errors on a socket
# mem-uc-error-trigger = socket-memory-error-trigger
mem-uc-error-threshold = 100 / 24h
# Threshold and trigger for corrected memory errors on a socket
mem-ce-error-trigger = socket-memory-error-trigger
mem-ce-error-threshold = 100 / 24h

Cache

[cache]
# Processing of cache error thresholds reported by Intel CPUs
cache-threshold-trigger = cache-error-trigger

Page

[page]
# Memory error accounting per 4K memory page
# Threshold for the corrected memory errors trigger script
memory-ce-threshold = 10 / 24h
# Trigger script for corrected errors
# memory-ce-trigger = page-error-trigger

Triggers

The [trigger] section controls how triggers are run:

[trigger]
# Maximum number of running triggers
children-max = 2
# execute triggers in this directory
directory = /etc/mcelog
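
To see at a glance which triggers are actually enabled, it can help to filter out the commented-out lines. The following is a minimal sketch, assuming the config lives at /etc/mcelog/mcelog.conf and that trigger options all end in "-trigger", as they do in the sample above. list_active_triggers is a hypothetical helper name, not something shipped with mcelog.

```shell
#!/bin/sh
# list_active_triggers: hypothetical helper, not part of mcelog.
# Prints every trigger option that is NOT commented out, prefixed with
# the [section] it belongs to.
list_active_triggers() {
    # Assumed default path; pass another file as the first argument.
    conf="${1:-/etc/mcelog/mcelog.conf}"
    awk '
        /^[ \t]*#/  { next }                      # skip commented-out lines
        /^[ \t]*\[/ { section = $0; next }        # remember the current section
        /-trigger[ \t]*=/ { print section " " $0 } # active trigger assignment
    ' "$conf"
}
```

Run against the sample config quoted in this answer, this would list only the [socket] and [cache] triggers, since the [dimm] and [page] trigger lines are commented out there.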

Sample triggers

Some sample triggers are available on the mcelog GitHub page.

Sample trigger script, dimm-error-trigger:

#!/bin/sh
#  This shell script can be executed by mcelog in daemon mode when a DIMM
#  exceeds a pre-configured error threshold
# 
# environment:
# THRESHOLD       Human readable threshold status
# MESSAGE         Human readable consolidated error message
# TOTALCOUNT      Total count of errors for current DIMM of CE/UC depending on
#                 what triggered the event
# LOCATION        Consolidated location as a single string
# DMI_LOCATION    DIMM location from DMI/SMBIOS if available
# DMI_NAME        DIMM identifier from DMI/SMBIOS if available
# DIMM            DIMM number reported by hardware
# CHANNEL         Channel number reported by hardware
# SOCKETID        Socket ID of CPU that includes the memory controller with the DIMM
# CECOUNT         Total corrected error count for DIMM
# UCCOUNT         Total uncorrected error count for DIMM
# LASTEVENT       Time stamp of event that triggered threshold (in time_t format, seconds)
# THRESHOLD_COUNT Total number of events in current threshold time period of specific type
#
# note: will run as mcelog configured user
# this can be changed in mcelog.conf

logger -s -p daemon.err -t mcelog "$MESSAGE"
logger -s -p daemon.err -t mcelog "Location: $LOCATION"

[ -x ./dimm-error-trigger.local ] && . ./dimm-error-trigger.local

exit 0
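
For a custom trigger like joel.sh, recording the environment each time it runs would show which event fired it. Here is a minimal sketch, assuming the variables documented in the sample above (MESSAGE, LOCATION, THRESHOLD, TOTALCOUNT) are the ones of interest. The function name, the log path, and the MCELOG_TRIGGER_LOG override are all made up for illustration and are not mcelog settings.

```shell
#!/bin/sh
# log_trigger_event: hypothetical debugging helper for a custom trigger.
# Appends a timestamped record of the trigger environment to a log file.
log_trigger_event() {
    # MCELOG_TRIGGER_LOG and the default path are assumptions for this sketch.
    log="${MCELOG_TRIGGER_LOG:-/var/log/mcelog-trigger.log}"
    {
        echo "=== $(date '+%Y-%m-%d %H:%M:%S') trigger: $0 ==="
        # Variables named in the sample trigger's header; "unset" if absent.
        echo "MESSAGE=${MESSAGE:-unset}"
        echo "LOCATION=${LOCATION:-unset}"
        echo "THRESHOLD=${THRESHOLD:-unset}"
        echo "TOTALCOUNT=${TOTALCOUNT:-unset}"
        echo "--- full environment ---"
        env | sort
    } >> "$log"
}
```

A trigger script would call log_trigger_event before sending its email and then exit 0. The user mcelog runs triggers as needs write access to the log file.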

  • Thanks, even though the answer is basically common sense I guess it somehow eluded me when I was reading the configuration file. Always good to get someone else to look at the problem for issues like that.
    – Bratchley
    Commented May 18, 2013 at 21:15
  • Any ideas on why my joel.sh script suddenly fired off out of nowhere?
    – Bratchley
    Commented May 18, 2013 at 21:17
  • No, this is good info to have on U&L, so I was glad you asked it. I'd never heard of mcelog, so it was worth it! I was looking into that joel.sh issue b/c it was bugging me too. I couldn't put my finger on it either. I saw a mention somewhere that there is a default trigger scenario, but lost where I saw it. Were there other scripts in that dir, or just joel.sh?
    – slm
    Commented May 18, 2013 at 21:20
  • joel.sh is the only executable file in that directory, the only other file is the mcelog.conf file.
    – Bratchley
    Commented May 18, 2013 at 21:26
  • That's probably why it was run then. I saw it in one of the links I referenced above, if I get a chance I'll dig it up but it seemed to be saying that if there was only a single script in that directory it was going to run it by default. Seemed a little aggressive to me but it kind of makes sense.
    – slm
    Commented May 18, 2013 at 21:42
