header. This will help to correlate console server logs with dump files,
no matter how precise is clock on a console server appliance, and how
buggy the appliance is.
This is needed for kernel dumps to work, as the panicking thread will call
into code that makes use of kernel locks.
Reported and tested by: Eugene Grosbein
MFC after: 1 week
Changes include modifications in kernel crash dump routines, dumpon(8) and
savecore(8). A new tool called decryptcore(8) was added.
A new DIOCSKERNELDUMP I/O control was added to send a kernel crash dump
configuration in the diocskerneldump_arg structure to the kernel.
The old DIOCSKERNELDUMP I/O control was renamed to DIOCSKERNELDUMP_FREEBSD11 for
backward ABI compatibility.
dumpon(8) generates an one-time random symmetric key and encrypts it using
an RSA public key in capability mode. Currently only AES-256-CBC is supported
but EKCD was designed to implement support for other algorithms in the future.
The public key is chosen using the -k flag. The dumpon rc(8) script can do this
automatically during startup using the dumppubkey rc.conf(5) variable. Once the
keys are calculated dumpon sends them to the kernel via DIOCSKERNELDUMP I/O
control.
When the kernel receives the DIOCSKERNELDUMP I/O control it generates a random
IV and sets up the key schedule for the specified algorithm. Each time the
kernel tries to write a crash dump to the dump device, the IV is replaced by
a SHA-256 hash of the previous value. This is intended to make a possible
differential cryptanalysis harder since it is possible to write multiple crash
dumps without reboot by repeating the following commands:
# sysctl debug.kdb.enter=1
db> call doadump(0)
db> continue
# savecore
A kernel dump key consists of an algorithm identifier, an IV and an encrypted
symmetric key. The kernel dump key size is included in a kernel dump header.
The size is an unsigned 32-bit integer and it is aligned to a block size.
The header structure has 512 bytes to match the block size so it was required to
make a panic string 4 bytes shorter to add a new field to the header structure.
If the kernel dump key size in the header is nonzero it is assumed that the
kernel dump key is placed after the first header on the dump device and the core
dump is encrypted.
Separate functions were implemented to write the kernel dump header and the
kernel dump key as they need to be unencrypted. The dump_write function encrypts
data if the kernel was compiled with the EKCD option. Encrypted kernel textdumps
are not supported due to the way they are constructed which makes it impossible
to use the CBC mode for encryption. It should be also noted that textdumps don't
contain sensitive data by design as a user decides what information should be
dumped.
savecore(8) writes the kernel dump key to a key.# file if its size in the header
is nonzero. # is the number of the current core dump.
decryptcore(8) decrypts the core dump using a private RSA key and the kernel
dump key. This is performed by a child process in capability mode.
If the decryption was not successful the parent process removes a partially
decrypted core dump.
Description on how to encrypt crash dumps was added to the decryptcore(8),
dumpon(8), rc.conf(5) and savecore(8) manual pages.
EKCD was tested on amd64 using bhyve and i386, mipsel and sparc64 using QEMU.
The feature still has to be tested on arm and arm64 as it wasn't possible to run
FreeBSD due to the problems with QEMU emulation and lack of hardware.
Designed by: def, pjd
Reviewed by: cem, oshogbo, pjd
Partial review: delphij, emaste, jhb, kib
Approved by: pjd (mentor)
Differential Revision: https://reviews.freebsd.org/D4712
option, not INVARIANTS. The function is required if we want
to load in a module that is compiled with INVARIANTS.
Reviewed by: jhb
Approved by: re (gjb)
panic string again if set, in case it scrolled out of the active
window. This avoids having to remember the symbol name.
Also add a show callout <addr> command to DDB in order to inspect
some struct callout fields in case of panics in the callout code.
This may help to see if there was memory corruption or to further
ease debugging problems.
Obtained from: projects/vnet
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Reviewed by: jhb (comment only on the show panic initally)
Differential Revision: https://reviews.freebsd.org/D4527
(And 4Kn minidump support, but only for amd64.)
Make sure all I/O to the dump device is of the native sector size. To
that end, we keep a native sector sized buffer associated with dump
devices (di->blockbuf) and use it to pad smaller objects as needed (e.g.
kerneldumpheader).
Add dump_write_pad() as a convenience API to dump smaller objects with
zero padding. (Rather than pull in NPM leftpad, we wrote our own.)
Savecore(1) has been updated to deal with these dumps. The format for
512-byte sector dumps should remain backwards compatible.
Minidumps for other architectures are left as an exercise for the
reader.
PR: 194279
Submitted by: ambrisko@
Reviewed by: cem (earlier version), rpokala
Tested by: rpokala (4Kn/512 except 512 fulldump), cem (512 fulldump)
Relnotes: yes
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D5848
to shut down; close laptop lid" scenario which otherwise tended to end
with a laptop overheating or the battery dying.
The implementation uses a new sysctl, kern.suspend_blocked; init(8) sets
this while rc.suspend runs, and the ACPI sleep code ignores requests while
the sysctl is set.
Discussed on: freebsd-acpi (35 emails)
MFC after: 1 week
Note that the mountlist manipulations are somewhat fragile, and not very
pretty. The reason for this is to avoid changing vfs_mountroot(), which
is (obviously) rather mission-critical, but not very well documented,
and thus hard to test properly. It might be possible to rework it to use
its own simple root mount mechanism instead of vfs_mountroot().
Reviewed by: kib@
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D2698
'buf' is inconvenient and has lead me to some irritating to discover
bugs over the years. It also makes it more challenging to refactor
the buf allocation system.
- Move swbuf and declare it as an extern in vfs_bio.c. This is still
not perfect but better than it was before.
- Eliminate the unused ffs function that relied on knowledge of the buf
array.
- Move the shutdown code that iterates over the buf array into vfs_bio.c.
Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division
longer than 192 bytes will cause the version field of a dump header to
overflow. strncpy doesn't null terminate it, so savecore will print a
corrupted info file. Using strlcpy fixes the bug.
Differential Revision: https://reviews.freebsd.org/D2560
Reviewed by: markj
MFC after: 3 weeks
Sponsored by: Spectra Logic
code in sys/kern/kern_dump.c. Most dumpsys() implementations are nearly
identical and simply redefine a number of constants and helper subroutines;
a generic implementation will make it easier to implement features around
kernel core dumps. This change does not alter any minidump code and should
have no functional impact.
PR: 193873
Differential Revision: https://reviews.freebsd.org/D904
Submitted by: Conrad Meyer <conrad.meyer@isilon.com>
Reviewed by: jhibbits (earlier version)
Sponsored by: EMC / Isilon Storage Division
was possible for a regular user to setup the dump device if he had write access
to the given device. In theory it is a security issue as user might get access
to kernel's memory after provoking kernel crash, but in practise it is not
recommended to give regular users direct access to storage devices.
Rework the code so that we do privileges check within the set_dumper() function
to avoid similar problems in the future.
Discussed with: secteam
This commit does not add error returns to minidumpsys() or
textdump_dumpsys(); those can also be added later.
Submitted by: Conrad Meyer (EMC / Isilon storage division)
These changes prevent sysctl(8) from returning proper output,
such as:
1) no output from sysctl(8)
2) erroneously returning ENOMEM with tools like truss(1)
or uname(1)
truss: can not get etype: Cannot allocate memory
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.
Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.
MFC after: 2 weeks
Sponsored by: Mellanox Technologies
Right now, init(8) cannot distinguish between an ACPI power button press
or a Ctrl+Alt+Del sequence on the keyboard. This is because
shutdown_nice() sends SIGINT to init(8) unconditionally, but later
modifies the arguments to reboot(2) to force a certain behaviour.
Instead of doing this, patch up the code to just forward the appropriate
signal to userspace. SIGUSR1 and SIGUSR2 can already be used to halt the
system.
While there, move waittime to the function where it's used; kern_reboot().
defaults to PANIC_REBOOT_WAIT_TIME (a long-existing kernel config
setting). Use this now-variable value in place of the defined constant
to control how long the system waits after a panic before rebooting.
* VM_OBJECT_LOCK and VM_OBJECT_UNLOCK are mapped to write operations
* VM_OBJECT_SLEEP() is introduced as a general purpose primitve to
get a sleep operation using a VM_OBJECT_LOCK() as protection
* The approach must bear with vm_pager.h namespace pollution so many
files require including directly rwlock.h
This is an ongoing effort to provide runtime debug information
useful in the field that does not panic existing installations.
This gives us the flexibility needed when shipping images to a
potentially large audience with WITNESS enabled without worrying
about formerly non-fatal LORs hurting a release.
Sponsored by: iXsystems
This is to allow debug images to be used without taking down the
system when non-fatal asserts are hit.
The following sysctls are added:
debug.kassert.warn_only: 1 = log, 0 = panic
debug.kassert.do_ktr: set to a ktr mask for logging via KTR
debug.kassert.do_log: 1 = log, 0 = quiet
debug.kassert.warnings: stats, number of kasserts hit
debug.kassert.log_panic_at:
number of kasserts before we actually panic, 0 = never
debug.kassert.log_pps_limit: pps limit for log messages
debug.kassert.log_mute_at: stop warning after N kasserts, 0 = never stop
debug.kassert.kassert: set this sysctl to trigger a kassert
Discussed with: scottl, gnn, marcel
Sponsored by: iXsystems
There has not been any complaints about the default behavior, so there
is no need to keep a knob that enables the worse alternative.
Now that the hard-stopping of other CPUs is the only behavior, the panic_cpu
spinlock-like logic can be dropped, because only a single CPU is
supposed to win stop_cpus_hard(other_cpus) race and proceed past that
call.
MFC after: 1 month
this is a VNET-kernel or not. gcc used to put the static symbol into
the symbol table, clang does not. This fixes the 'netstat: no namelist'
error seen on clang+VNET systems.
primitives by breaking stop_scheduler into a per-thread variable.
Also, store the new td_stopsched very close to td_*locks members as
they will be accessed mostly in the same codepaths as td_stopsched and
this results in avoiding a further cache-line pollution, possibly.
STOP_SCHEDULER() was pondered to use a new 'thread' argument, in order to
take advantage of already cached curthread, but in the end there should
not really be a performance benefit, while introducing a KPI breakage.
In collabouration with: flo
Reviewed by: avg
MFC after: 3 months (or never)
X-MFC: r228424
Historical behavior of letting other CPUs merily go on is a default for
time being. The new behavior can be switched on via
kern.stop_scheduler_on_panic tunable and sysctl.
Stopping of the CPUs has (at least) the following benefits:
- more of the system state at panic time is preserved intact
- threads and interrupts do not interfere with dumping of the system
state
Only one thread runs uninterrupted after panic if stop_scheduler_on_panic
is set. That thread might call code that is also used in normal context
and that code might use locks to prevent concurrent execution of certain
parts. Those locks might be held by the stopped threads and would never
be released. To work around this issue, it was decided that instead of
explicit checks for panic context, we would rather put those checks
inside the locking primitives.
This change has substantial portions written and re-written by attilio
and kib at various times. Other changes are heavily based on the ideas
and patches submitted by jhb and mdf. bde has provided many insights
into the details and history of the current code.
The new behavior may cause problems for systems that use a USB keyboard
for interfacing with system console. This is because of some unusual
locking patterns in the ukbd code which have to be used because on one
hand ukbd is below syscons, but on the other hand it has to interface
with other usb code that uses regular mutexes/Giant for its concurrency
protection. Dumping to USB-connected disks may also be affected.
PR: amd64/139614 (at least)
In cooperation with: attilio, jhb, kib, mdf
Discussed with: arch@, bde
Tested by: Eugene Grosbein <eugen@grosbein.net>,
gnn,
Steven Hartland <killing@multiplay.co.uk>,
glebius,
Andrew Boyer <aboyer@averesystems.com>
(various versions of the patch)
MFC after: 3 months (or never)
The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.
patch modifies makesyscalls.sh to prefix all of the non-compatibility
calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel
entry points and all places in the code that use them. It also
fixes an additional name space collision between the kernel function
psignal and the libc function of the same name by renaming the kernel
psignal kern_psignal(). By introducing this change now we will ease future
MFCs that change syscalls.
Reviewed by: rwatson
Approved by: re (bz)
of the device boundry.
While this is generally ok, the problem is that all the consumers
handle similar cases (and expect to catch) ENOSPC for this (for a
reference look at minidumpsys() and dumpsys() constructions). That
ends up in consumers not recognizing the issue and amd64 failing to
retry if the number of pages grows up during minidump.
Fix this by returning ENOSPC in dump_write() and while here add some
more diagnostic on involved values.
Sponsored by: Sandvine Incorporated
In collabouration with: emaste
Approved by: re (kib)
MFC after: 10 days
- Axe out the SHOW_BUSYBUFS option and uses a tunable for selectively
enable/disable it, which is defaulted for not printing anything (0
value) but can be changed for printing (1 value) and be verbose (2
value)
- Improves the informations outputed: right now, there is no track of
the actual struct buf object or vnode which are referenced by the
shutdown process, but it is printed the related struct bufobj object
which is not really helpful
- Add more verbosity about the state of the struct buf lock and the
vnode informations, with the latter to be activated separately by the
sysctl
Sponsored by: Sandvine Incorporated
Reviewed by: emaste, kib
Approved by: re (ksmith)
MFC after: 10 days
This is done per request/suggestion from John Baldwin
who introduced the option. Trying to resume normal
system operation after a panic is very unpredictable
and dangerous. It will become even more dangerous
when we allow a thread in panic(9) to penetrate all
lock contexts.
I understand that the only purpose of this option was
for testing scenarios potentially resulting in panic.
Suggested by: jhb
Reviewed by: attilio, jhb
X-MFC-After: never
Approved by: re (kib)
interleaving.
Signal dumping to happen only for the first panic which should be the
most important.
Sponsored by: Sandvine Incorporated
Submitted by: Nima Misaghian (nmisaghian AT sandvine DOT com)
MFC after: 2 weeks