Writing triggers for mcelog

Question

I'm just starting to look into mcelog for the first time (I've enabled it and seen syslog output before, but this is the first time I'm trying to do something non-default). I'm looking for information on how to write triggers for it. Specifically, I want to know what kinds of events mcelog can react to, how it decides which scripts to execute, and so on. The best I can make out from the example trigger is that it sets a bunch of environment variables before invoking the script. So does it just try to execute everything in the trigger directory (which is /etc/mcelog on RHEL) and let the script decide what it wants to act on?

I've seen other trigger scripts with names that look like MCE events. Is that a convention, or do the names have a special function? I created a trigger called /etc/mcelog/joel.sh which just sends a basic email to my gmail account. A few days ago the trigger apparently went off, because I got an email from the script without running it manually. I didn't think to pipe env output to the mailx command in joel.sh, so I don't know what hardware event triggered the script or why mcelog picked joel.sh as the script to execute for it.

Basically, I'm looking for an answer that will give me a basic orientation with mcelog, its triggering system, and how I can use it to monitor my hardware health. I'm pretty sure I can figure out the more advanced stuff once I get my bearings.

slm · Accepted Answer · 2013-05-18 20:51:46Z

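The sample mcelog.conf config file appears to list most, if not all, of the event types that can fire a trigger. Most sections pair an error-rate threshold with the name of the trigger script to run when that threshold is exceeded.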

DIMMs

[dimm]
#
# execute these triggers when the rate of corrected or uncorrected
# errors per DIMM exceeds the threshold
# Note when the hardware does not report DIMMs this might also
# be per channel
# The default of 10/24h is reasonable for server quality
# DDR3 DIMMs as of 2009/10
#uc-error-trigger = dimm-error-trigger
uc-error-threshold = 1 / 24h
#ce-error-trigger = dimm-error-trigger
ce-error-threshold = 10 / 24h

Sockets

[socket]
# Threshold and trigger for uncorrected memory errors on a socket
# mem-uc-error-trigger = socket-memory-error-trigger
mem-uc-error-threshold = 100 / 24h
# Threshold and trigger for corrected memory errors on a socket
mem-ce-error-trigger = socket-memory-error-trigger
mem-ce-error-threshold = 100 / 24h

Cache

[cache]
# Processing of cache error thresholds reported by Intel CPUs
cache-threshold-trigger = cache-error-trigger

Page

[page]
# Memory error accounting per 4K memory page
# Threshold for the corrected memory errors trigger script
memory-ce-threshold = 10 / 24h
# Trigger script for corrected errors
# memory-ce-trigger = page-error-trigger

Triggers

Triggers can be controlled in this section.

[trigger]
# Maximum number of running triggers
children-max = 2
# execute triggers in this directory
directory = /etc/mcelog
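
As far as the sample config shows, the link between an event type and a script is the `*-trigger = name` line in each section, with the script looked up in the `[trigger]` directory. A quick way to see which triggers are actually enabled is to list the uncommented trigger lines. This is only a sketch: `list_active_triggers` is a made-up helper name, and the default path assumes mcelog.conf lives in /etc/mcelog as on the RHEL box in the question.

```shell
#!/bin/sh
# Hypothetical helper, not part of mcelog: print the uncommented
# "<something>-trigger = <script>" lines from an mcelog config file.
list_active_triggers() {
    # Assumed default location; pass another path as the first argument.
    conf=${1:-/etc/mcelog/mcelog.conf}
    # Lines starting with '#' cannot match, because the pattern requires
    # the setting name to begin the line (after optional whitespace).
    grep -E '^[[:space:]]*[a-z-]+-trigger[[:space:]]*=' "$conf"
}

# Example call:
#   list_active_triggers /etc/mcelog/mcelog.conf
```

Run against the sample config quoted above, this would print only the `mem-ce-error-trigger` and `cache-threshold-trigger` lines, since the DIMM and page triggers are commented out there.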

Sample triggers

There are some sample triggers in the mcelog GitHub repository.

Sample trigger script, `dimm-error-trigger`:

#!/bin/sh
#  This shell script can be executed by mcelog in daemon mode when a DIMM
#  exceeds a pre-configured error threshold
# 
# environment:
# THRESHOLD     human readable threshold status
# MESSAGE   Human readable consolidated error message
# TOTALCOUNT    total count of errors for current DIMM of CE/UC depending on
#       what triggered the event
# LOCATION  Consolidated location as a single string
# DMI_LOCATION  DIMM location from DMI/SMBIOS if available
# DMI_NAME  DIMM identifier from DMI/SMBIOS if available
# DIMM      DIMM number reported by hardware
# CHANNEL   Channel number reported by hardware
# SOCKETID  Socket ID of CPU that includes the memory controller with the DIMM
# CECOUNT   Total corrected error count for DIMM
# UCCOUNT   Total uncorrected error count for DIMM
# LASTEVENT Time stamp of event that triggered threshold (in time_t format, seconds)
# THRESHOLD_COUNT Total number of events in current threshold time period of specific type
#
# note: will run as mcelog configured user
# this can be changed in mcelog.conf

logger -s -p daemon.err -t mcelog "$MESSAGE"
logger -s -p daemon.err -t mcelog "Location: $LOCATION"

[ -x ./dimm-error-trigger.local ] && . ./dimm-error-trigger.local

exit 0
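
For a custom trigger like the joel.sh in the question, it helps to record the variables listed in the header above so you can tell afterwards which event fired it. Here is a minimal sketch: the variable names are taken from the sample script's comments, while `format_mce_report` is a made-up helper name and the mail address is a placeholder.

```shell
#!/bin/sh
# Hypothetical trigger helper: print every variable the sample DIMM trigger
# documents, one NAME=value pair per line.
format_mce_report() {
    for name in THRESHOLD MESSAGE TOTALCOUNT LOCATION DMI_LOCATION DMI_NAME \
                DIMM CHANNEL SOCKETID CECOUNT UCCOUNT LASTEVENT THRESHOLD_COUNT; do
        # Indirect lookup of the variable called $name; "unset" marks
        # variables that were not provided (or were empty) for this event.
        eval "value=\${$name:-unset}"
        printf '%s=%s\n' "$name" "$value"
    done
}

# In a real trigger, send the report somewhere as the last step, e.g.:
#   format_mce_report | logger -p daemon.err -t mcelog
#   format_mce_report | mailx -s "mcelog trigger fired" you@example.com
```

You can try it without waiting for a hardware error by setting a couple of the variables by hand, for example `MESSAGE="test" LOCATION="test dimm" ./joel.sh`. Other trigger types may export a different set of variables, so check the comments in the matching sample script.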

Thanks, even though the answer is basically common sense I guess it somehow eluded me when I was reading the configuration file. Always good to get someone else to look at the problem for issues like that. — Bratchley, Commented May 18, 2013 at 21:15
Any ideas on why my joel.sh script just suddenly fired off out of nowhere? — Bratchley, Commented May 18, 2013 at 21:17
No, this is good info to have on U&L, so I was glad you asked it. I had never heard of mcelog, so it was worth it! I was looking into that joel.sh issue because it was bugging me too. I couldn't put my finger on it either. I saw a mention somewhere that there is a default trigger scenario, but I lost where I saw it. Were there other scripts in that directory, or just joel.sh? — slm, Commented May 18, 2013 at 21:20
joel.sh is the only executable file in that directory, the only other file is the mcelog.conf file. — Bratchley, Commented May 18, 2013 at 21:26
That's probably why it was run then. I saw it in one of the links I referenced above, if I get a chance I'll dig it up but it seemed to be saying that if there was only a single script in that directory it was going to run it by default. Seemed a little aggressive to me but it kind of makes sense. — slm, Commented May 18, 2013 at 21:42
