The ipmi watchdog pretimeout action can trigger unintentionally in
certain rare, complicated situations. What we have seen at Netflix
is that the BMC can sometimes be sent a continuous stream of
writes to port 0x80, and due to what is a bug or misconfiguration
in the BMC software, this results in the BMC running out of memory,
becoming very slow to respond to KCS requests, and eventually being
rebooted by its own internal watchdog. While that is going on in
the BMC, back in the host OS, a number of requests are pending in
the ipmi request queue, and the kcs_loop thread is working on
processing these requests. All of the KCS accesses to process
those requests are timing out and eventually failing because the
BMC is responding very slowly or not at all, and the kcs_loop thread
is holding the IPMI_IO_LOCK the whole time that is going on.
Meanwhile the watchdogd process in the host is trying to pat the
BMC watchdog, and this process is sleeping waiting to get the
IPMI_IO_LOCK. It's not entirely clear why the watchdogd process
is sleeping for this lock, because the intention is that a thread
holding the IPMI_IO_LOCK should not sleep and thus any thread
that wants the lock should just spin to wait for it. My best guess
is that the kcs_loop thread is spinning waiting for the BMC to
respond for so long that it is eventually preempted, and during
the brief interval when the kcs_loop thread is not running,
the watchdogd thread notices that the lock holder is not running
and sleeps. When the kcs_loop thread eventually finishes processing
one request, it drops the IPMI_IO_LOCK and then immediately takes the
lock again so it can process the next request in the queue.
Because the watchdogd thread is sleeping at this point, the kcs_loop
always wins the race to acquire the IPMI_IO_LOCK, thus starving
the watchdogd thread. The callout for the watchdog pretimeout
would be reset by the watchdogd thread after its request to the BMC
watchdog completes, but since that request never processed, the
pretimeout callout eventually fires, even though there is nothing
actually wrong with the host.
To prevent this saga from unfolding:
- when kcs_driver_request() is called in a context where it can sleep,
queue the request and let the worker thread process it rather than
trying to process in the original thread.
- add a new high-priority queue for driver requests, so that the
watchdog patting requests will be processed as quickly as possible
even if lots of application requests have already been queued.
With these two changes, the watchdog pretimeout action does not trigger
even if the BMC is completely out to lunch for long periods of time
(as long as the watchdogd check command does not also get stuck).
Sponsored by: Netflix
Reviewed by: imp
Differential Revision: https://reviews.freebsd.org/D36555
18db96dbfd introduced a use-after-free bug
in the error handling of the IPMICTL_RECEIVE_MSG ioctl.
Reported by: Coverity (CID 1490456) (via vangyzen)
Differential Revision: https://reviews.freebsd.org/D35605
Handle IPMB requests using SEND_MSG (sent as driver request as we do not
need to return anything back to userland for this) and GET_MSG (sent as
usual request so we can return the data for RECEIVE_MSG ioctl) pair.
This fixes fetching complete sensor data from boards (e.g. HP ProLiant
DL380 Gen10).
Reviewed by: philip
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D35605
Previous code by default setting pre-timeout interval to 120 seconds
made impossible to set timeout interval below that, resulting in error
0xcc (Invalid data field in Request) at least on Supermicro boards.
To fix that limit maximum pre-timeout interval to ~1/4 of the timeout
interval, that sounds like a reasonable default: not too short to fire
too late, but also not too long to give many false reports.
MFC after: 2 weeks
In case we want to use other WD than IPMI-provided, add
sysctl to disable initialization.
Obtained from: Semihalf
Sponsored by: Stormshield
Differential revision: https://reviews.freebsd.org/D31548
Add request submission status checks before checking req->ir_compcode,
otherwise it may be zero just because of initialization.
Add checks for req->ir_compcode errors in ipmi_reset_watchdog() and
ipmi_set_watchdog(). In first case explicitly check for 0x80, which
means timer was not previously set, that I found happening after BMC
cold reset. This change makes watchdog timer to recover instead of
permanently ignoring reset errors after BMC reset or upgraded.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
Copy the CP, PTRIN, etc macros from freebsd32.h into a sys/abi_compat.h
and replace existing definitation with includes where required. This
eliminates duplicate code and allows Linux and FreeBSD compatability
headers to be included in the same files.
Input from: cem, jhb
Obtained from: CheriBSD
MFC after: 2 weeks
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D24275
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.
This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.
Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT
Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718
This allows replacing "sys/eventfilter.h" includes with "sys/_eventfilter.h"
in other header files (e.g., sys/{bus,conf,cpu}.h) and reduces header
pollution substantially.
EVENTHANDLER_DECLARE and EVENTHANDLER_LIST_DECLAREs were moved out of .c
files into appropriate headers (e.g., sys/proc.h, powernv/opal.h).
As a side effect of reduced header pollution, many .c files and headers no
longer contain needed definitions. The remainder of the patch addresses
adding appropriate includes to fix those files.
LOCK_DEBUG and LOCK_FILE_LINE_ARG are moved to sys/_lock.h, as required by
sys/mutex.h since r326106 (but silently protected by header pollution prior
to this change).
No functional change (intended). Of course, any out of tree modules that
relied on header pollution for sys/eventhandler.h, sys/lock.h, or
sys/mutex.h inclusion need to be fixed. __FreeBSD_version has been bumped.
PR: maybe related to 233998 (inconclusive at this time)
Submitted by: byuu <byuu AT tutanota.com> (previous version)
Reviewed by: imp
Differential Revision: https://reviews.freebsd.org/D18506
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
Otherwise an orderly shutdown will initiate a watchdog that will cause
a 7 minute delayed reboot *by default*, In the freebsd.org cluster's case
this often worked out be a surprise reboot a minute or two after the
machine came back up.
hw.ipmi.cycle_time is the time to wait for the power down phase of the
ipmi power cycle before falling back to either reboot or halt.
Sponsored by: Netflix
o Make hw.ipmi.on a tuneable
o Changes to keep shutdown from hanging indefinitately after the wd
would normally have been disabled.
o Add support for setting pretimeout (which fires an interrupt
some time before the actual watchdog expires)
o Allow refinement of the actions to take when the watchdog expires
o Allow special startup timeout to keep us from hanging in boot
before watchdogd is started, but after we've loaded the kernel.
Obtained From: Netflix OCA Firmware
Some BMCs support power cycling the chassis via the chassis control
command 2 subcommand 2 (ipmitool called it 'chassis power cycle'). If
the BMC supports the chassis device, register a shutdown_final handler
that sends the power cycle command if request and waits up to 10s for
it to take effect. To minimize stack strain, we preallocate a ipmi
request in the softc. At the moment, we're verbose about what we're
doing.
Sponsored by: Netflix
Set watchdog timer parameters only when they really need to be changed.
In other cases just restart the timer with single Reset command instead
of two (Set and Reset).
From one side this visually reduces amount of CPU time burned in tight
loop waiting while some slow BMC configures its watchdog hardware, that
seems to be much more complicated task then just resetting the timer.
From another side on some BMCs those slow Set commands sometimes tend to
timeout, that leads to noisy log messages and even more CPU time burned,
so avoiding them can provide even bigger bonuses.
MFC after: 2 weeks
are not permitted to sleep. Only use the IPMI watchdog with backends
which poll driver-initiated requests to meet this requirement.
In practice this means that watchdogs will no longer be used on systems
that use the SSIF backend.
Differential Revision: https://reviews.freebsd.org/D2062
MFC after: 2 weeks
particular, updates to the watchdog should no longer sleep.
- Add a new IPMI_IO_LOCK for low-level I/O access. Use this for
kcs_polled_request() and smic_polled_request().
- Add a new backend callback "ipmi_driver_request" to handle a driver
request. The new callback performs the request sychronously for KCS
and SMIC. SSIF still defers the work to the worker thread since the
worker thread sleeps during request processing anyway.
- Allocate driver requests on the stack rather than using malloc().
Differential Revision: https://reviews.freebsd.org/D1723
Tested by: scottl
MFC after: 2 weeks
Starting or stopping the IPMI watchdog is rather expensive with the
current implementation as all IPMI requests are bounced via thread.
This is not viable during shutdown or dumps, and this avoids headache
in the common case that the watchdog is not enabled. The IPMI watchdog
should probably be reworked to not use a separate thread to fix this
in the case when the watchdog timer is enabled.
MFC after: 2 weeks
The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.
value is obtained by dividing it by 256, not by 2550; also,
one second is 10^9 nanoseconds, not 1800000000 nanoseconds.
- Due to rounding error, setting watchdog to a really small
timeout (<1 sec) was turning the watchdog off. It should
set the watchdog to a small timeout instead.
- Implemented error checking in ipmi_wd_event(), as required
by watchdog(9).
PR: kern/130512
Submitted by: Dmitrij Tejblum
- Additionally, check that the timeout value is within the
supported range, and if it's too large, act as required by
watchdog(9).
MFC after: 3 days
Use the much simpler cdevpriv for per-fd state and enable it. This allows
multiple opens of /dev/ipmi0 (e.g. using ipmitool while ipmievd is running
in the background).
MFC after: 1 week
watchdog might hide the succesful arming of an earlier one. Accept that on
failing to arm any watchdog (because of non-supported timeouts) EOPNOTSUPP is
returned instead of the more appropriate EINVAL.
MFC after: 3 days
behave as expected.
Also:
- Return an error if WD_PASSIVE is passed in to the ioctl as only
WD_ACTIVE is implemented at the moment. See sys/watchdog.h for an
explanation of the difference between WD_ACTIVE and WD_PASSIVE.
- Remove the I_HAVE_TOTALLY_LOST_MY_SENSE_OF_HUMOR define. If you've
lost your sense of humor, than don't add a define.
Specific changes:
i80321_wdog.c
Don't roll your own passive watchdog tickle as this would defeat the
purpose of an active (userland) watchdog tickle.
ichwd.c / ipmi.c:
WD_ACTIVE means active patting of the watchdog by a userland process,
not whether the watchdog is active. See sys/watchdog.h.
kern_clock.c:
(software watchdog) Remove a check for WD_ACTIVE as this does not make
sense here. This reverts r1.181.
- Split out the communication protocols into their own files and use
a couple of function pointers in the softc that the commuication
protocols setup in their own attach routine.
- Add support for the SSIF interface (talking to IPMI over SMBus).
- Add an ACPI attachment.
- Add a PCI attachment that attaches to devices with the IPMI interface
subclass.
- Split the ISA attachment out into its own file: ipmi_isa.c.
- Change the code to probe the SMBIOS table for an IPMI entry to just use
pmap_mapbios() to map the table in rather than trying to setup a fake
resource on an isa device and then activating the resource to map in the
table.
- Make bus attachments leaner by adding attach functions for each
communication interface (ipmi_kcs_attach(), ipmi_smic_attach(), etc.)
that setup per-interface data.
- Formalize the model used by the driver to handle requests by adding an
explicit struct ipmi_request object that holds the state of a given
request and reply for the entire lifetime of the request. By bundling
the request into an object, it is easier to add retry logic to the various
communication backends (as well as eventually support BT mode which uses
a slightly different message format than KCS, SMIC, and SSIF).
- Add a per-softc lock and remove D_NEEDGIANT as the driver is now MPSAFE.
- Add 32-bit compatibility ioctl shims so you can use a 32-bit ipmitool
on FreeBSD/amd64.
- Add ipmi(4) to i386 and amd64 NOTES.
Submitted by: ambrisko (large portions of 2 and 3)
Sponsored by: IronPort Systems, Yahoo!
MFC after: 6 days
to work with ipmitools. It works with other tools that have an OpenIPMI
driver interface. The port will need to get updated to used this.
I have not implemented the IPMB mode yet so ioctl's for that don't
really do much otherwise it should work like the OpenIPMI version.
The ipmi.h definitions was derived from the ipmitool header file.
The bus attachments are done for smbios and pci/smbios. Differences
in bus probe order for modules/static are delt with. ACPI attachment
should be done.
This drivers registers with the watchdod(4) interface
Work to do:
- BT interface
- IPMB mode
This has been tested on Dell PE2850, PE2650 & PE850 with i386 & amd64
kernel.
I will link this into the build on next week.
Tom Rhodes, helped me with the man page.
Sponsored by: IronPort Systems Inc.
Inspired from: ipmitool & Linux