This is a stopgap measure to unbreak installer/VM/embedded boot issues
introduced (or at least exposed by) in r346250.
Add the new tunable, "security.stack_protect.permit_nonrandom_cookies," in
order to continue boot with insecure non-random stack cookies if the random
device is unavailable.
For now, enable it by default. This is NOT safe. It will be disabled by
default in a future revision.
There is follow-on work planned to use fast random sources (e.g., RDRAND on
x86 and DARN on Power) to seed when the early entropy file cannot be
provided, for whatever reason. Please see D19928.
Some better hacks may be used to make the non-random __stack_chk_guard
slightly less predictable (from delphij@ and mjg@); those suggestions are
left for a future revision. I think it may also be plausible to move stack
guard initialization far later in the boot process; potentially it could be
moved all the way to just before userspace is started.
Reported by: many
Reviewed by: delphij, emaste, imp (all w/ caveat: this is a stopgap fix)
Security: yes
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D19927
r176215 corrected readlink(2)'s return type and the type of the last
argument. readlink(2) was introduced in r177788 after being developed
as part of Google Summer of Code 2007; it appears to have inherited the
wrong return type.
Man pages and header files were already ssize_t; update syscalls.master
to match.
PR: 197915
Submitted by: Henning Petersen <henning.petersen@t-online.de>
MFC after: 2 weeks
read_random() is/was used, mostly without error checking, in a lot of
very sensitive places in the kernel -- including seeding the widely used
arc4random(9).
Most uses, especially arc4random(9), should block until the device is seeded
rather than proceeding with a bogus or empty seed. I did not spy any
obvious kernel consumers where blocking would be inappropriate (in the
sense that lack of entropy would be ok -- I did not investigate locking
angle thoroughly). In many instances, arc4random_buf(9) or that family
of APIs would be more appropriate anyway; that work was done in r345865.
A minor cleanup was made to the implementation of the READ_RANDOM function:
instead of using a variable-length array on the stack to temporarily store
all full random blocks sufficient to satisfy the requested 'len', only store
a single block on the stack. This has some benefit in terms of reducing
stack usage, reducing memcpy overhead and reducing devrandom output leakage
via the stack. Additionally, the stack block is now safely zeroed if it was
used.
One caveat of this change is that the kern.arandom sysctl no longer returns
zero bytes immediately if the random device is not seeded. This means that
FreeBSD-specific userspace applications which attempted to handle an
unseeded random device may be broken by this change. If such behavior is
needed, it can be replaced by the more portable getrandom(2) GRND_NONBLOCK
option.
On any typical FreeBSD system, entropy is persisted on read/write media and
used to seed the random device very early in boot, and blocking is never a
problem.
This change primarily impacts the behavior of /dev/random on embedded
systems with read-only media that do not configure "nodevice random". We
toggle the default from 'charge on blindly with no entropy' to 'block
indefinitely.' This default is safer, but may cause frustration. Embedded
system designers using FreeBSD have several options. The most obvious is to
plan to have a small writable NVRAM or NAND to persist entropy, like larger
systems. Early entropy can be fed from any loader, or by writing directly
to /dev/random during boot. Some embedded SoCs now provide a fast hardware
entropy source; this would also work for quickly seeding Fortuna. A 3rd
option would be creating an embedded-specific, more simplistic random
module, like that designed by DJB in [1] (this design still requires a small
rewritable media for forward secrecy). Finally, the least preferred option
might be "nodevice random", although I plan to remove this in a subsequent
revision.
To help developers emulate the behavior of these embedded systems on
ordinary workstations, the tunable kern.random.block_seeded_status was
added. When set to 1, it blocks the random device.
I attempted to document this change in random.4 and random.9 and ran into a
bunch of out-of-date or irrelevant or inaccurate content and ended up
rototilling those documents more than I intended to. Sorry. I think
they're in a better state now.
PR: 230875
Reviewed by: delphij, markm (earlier version)
Approved by: secteam(delphij), devrandom(markm)
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D19744
r340744 broke the NFSv4 client, because it replaced pfind_locked() with a
call to pfind(), since pfind() acquires the sx lock for the pid hash and
the NFSv4 already holds a mutex when it does the call.
The patch fixes the problem by recreating a pfind_any_locked() and adding the
functions pidhash_slockall() and pidhash_sunlockall to acquire/release
all of the pid hash locks.
These functions are then used by the NFSv4 client instead of acquiring
the allproc_lock and calling pfind().
Reviewed by: kib, mjg
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D19887
cache_lookup's documentation got dislocated by r324378. Relocate and expand
it.
Reviewed by: jhb, kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Provide a convenience function to avoid the hack with filling fake
struct vop_fsync_args and then calling vop_stdfsync().
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
descriptor, not the file descriptor. The file descriptor is used only for
verification so do not expect any additional capabilities on it.
Reported by: antoine
Tested by: antoine
Discussed with: kib, emaste, bapt
Sponsored by: Fudo Security
Such processes will be reparented to the reaper when the current
parent is done with them (i.e., ptrace detached), so p_oppid must be
updated accordingly.
Add a regression test to exercise this code path. Previously it
would not be possible to reap an orphan with a stale oppid.
Reviewed by: kib, mjg
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19825
'tos' is an index into an array and never holds a negative value. Correct
its signedness to match PCTRIE_LIMIT, which it is compared to in assertions.
No functional change (kills a warning).
the file associated with the given file descriptor.
Reviewed by: kib, asomers
Reviewed by: cem, jilles, brooks (they reviewed previous version)
Discussed with: pjd, and many others
Differential Revision: https://reviews.freebsd.org/D14567
It performs BUS_RESET_CHILD() on the parental bus and the specified
device.
Reviewed by: imp (previous version), jhb (previous version)
Sponsored by: Mellanox Technologies
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D19646
The methods BUS_RESET_PREPARE(), BUS_RESET(), and BUS_RESET_POST()
should be implemented by bus which can provide reset to a device. The
methods are described in inline doxygen comments.
Code only provides BUS_RESET_PREPARE() and BUS_RESET_POST() helpers
instead of default implementation, because actual bus needs to handle
device state around reset, while helpers provide the other half of
typical prepare/post code.
Reviewed by: imp (previous version), jhb (previous version)
Sponsored by: Mellanox Technologies
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D19646
Otherwise we might miss the last iteration where EOF appears below
unaligned noff.
Reported and reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D19811
In particular, elf32 FreeBSD binaries were not executed on LP64 hosts.
The interp_name_len value should account for the nul terminator. This
is needed for strncmp()s in brand checking code to work.
Reported by: andreast
Sponsored by: The FreeBSD Foundation
MFC after: 12 days (together with r345661)
In most cases kernel.bootfile is populated from the information
provided by loader(8). There are certain scenarios when loader
is not available, for instance when kernel is loaded by u-boot
or some other BootROM directly. In this case the default value
"/kernel" points to invalid location and breaks some functinality,
like using installkernel on self-hosted system or dtrace's CTF
lookup. This can be fixed by setting the value manually but the
default that reflects correct location is better than default that
points to invalid one.
Current default was set around FreeBSD 1, when "/kernel" was the
actual path. Transition to /boot/kernel/kernel happened circa FreeBSD 3.
PR: 221550
Reviewed by: ian, imp
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D18902
It makes the code slightly easier to follow, and might make
it easier to fix the resouce accounting to also account for
the interpreter.
The PROC_UNLOCK() is moved earlier - I don't see anything
it should protect; the lim_max() is a wrapper around lim_rlimit(),
and that, differently from lim_rlimit_proc(), doesn't require
the proc lock to be held.
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D19689
A sysid of 0 denotes the local system, and some handlers for remote
locking commands do not attempt to deal with local locks. Note that
F_SETLK_REMOTE is only available to privileged users as it is intended
to be used as a testing interface.
Reviewed by: kib
Reported by: syzbot+9c457a6ae014a3281eb8@syzkaller.appspotmail.com
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19702
There are only 19 bytes available for the name of an interrupt plus the
name(s) of handlers/drivers using it. There is a mechanism from the days of
shared interrupts that replaces some of the handler names with '+' when they
don't all fit into 19 bytes.
In modern times there is typically only one device on an interrupt, but long
device names are the norm, especially with embedded systems. Also, in systems
with multiple interrupt controllers, the names of the interrupts themselves
can be long. For example, 'gic0,s54: imx6_anatop0' doesn't fit, and
replacing the device driver name with a '+' provides no useful info at all.
When there is only one handler but its name was too long to fit, this
change truncates enough leading chars of the handler name (replacing them
with a '-' char to indicate that some chars are missing) to use all 19
bytes, preserving the unit number typically on the end of the name. Using
the prior example, this results in: 'gic0,s54:-6_anatop0' which provides
plenty of info to figure out which device is involved.
PR: 211946
Reviewed by: gonzo@ (prior version without the '-' char)
Differential Revision: https://reviews.freebsd.org/D19675
r343532 noted the difference between "hw.realmem" and "hw.physmem", which I
was previously unaware of. I discovered that neither sysctl had a
description visible via `sysctl -d', so I found where they were defined and
added suitable descriptions. While in the file, I went ahead and added
descriptions for all the others which lacked them. I also updated sysctl.3
accordingly
Reviewed by: kib, bcr
MFC after: 1 weeks
Sponsored by: Panasas
Differential Revision: https://reviews.freebsd.org/D19007
The line was misedited to change tt to st instead of
changing ut to st.
The use of st as the denominator in mul64_by_fraction() will lead
to an integer divide fault in the intr proc (the process holding
ithreads) where st will be 0. This divide by 0 happens after
the total runtime for all ithreads exceeds 76 hours.
Submitted by: bde
Add the infrastructure to allow MD procctl(2) commands, and use it to
introduce amd64 PTI control and reporting. PTI mode cannot be
modified for existing pmap, the knob controls PTI of the new vmspace
created on exec.
Requested by: jhb
Reviewed by: jhb, markj (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D19514
PTI mode for the process pmap on exec is activated iff P_MD_PTI is set.
On exec, the existing vmspace can be reused only if pti mode of the
pmap matches the P_MD_PTI flag of the process. Add MD
cpu_exec_vmspace_reuse() callback for exec_new_vmspace() which can
vetoed reuse of the existing vmspace.
MFC note: md_flags change struct proc KBI.
Reviewed by: jhb, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D19514
extended attributes, the kernel can panic with either "ffs_truncate3"
or with "softdep_deallocate_dependencies: dangling deps".
The problem arises because the flushbuflist() function which is
called to clear out buffers is passed either the V_NORMAL flag to
indicate that it should flush buffer associated with the contents
of the file or the V_ALT flag to indicate that it should flush the
buffers associated with the extended attribute data. The buffers
containing the extended attribute data are identified by having
their BX_ALTDATA flag set in the buffer's b_xflags field. The
BX_ALTDATA flag is set on the buffer when the extended attribute
block is first allocated or when its contents are read in from the
disk.
On a busy system, a buffer may be reused for another purpose, but
the contents of the block that it contained continues to be held
in the main page cache. Each physical page is identified as holding
the contents of a logical block within a specified file (identified
by a vnode). When a request is made to read a file, the kernel first
looks for the block in the existing buffers. If it is not found
there, it checks the page cache to see if it is still there. If
it is found in the page cache, then it is remapped into a new
buffer thus avoiding the need to read it in from the disk.
The bug is that when a buffer request made for an extended attribute
is fulfilled by reconstituting a buffer from the page cache rather
than reading it in from disk, the BX_ALTDATA flag was not being
set. Thus the flushbuflist() function would never clear it out and
the "ffs_truncate3" panic would occur because the vnode being cleared
still had buffers on its clean-buffer list. If the extended attribute
was being updated, it is first read, then updated, and finally
written. If the read is fulfilled by reconstituting the buffer
from the page cache the BX_ALTDATA flag was not set and thus the
dirty buffer would never be flushed by flushbuflist(). Eventually
the buffer would be recycled. Since it was never written it would
have an unfinished dependency which would trigger the
"softdep_deallocate_dependencies: dangling deps" panic.
The fix is to ensure that the BX_ALTDATA flag is set when a buffer
has been reconstituted from the page cache.
PR: 230962
Reported by: 2t8mr7kx9f@protonmail.com
Reviewed by: kib
Tested by: Peter Holm
MFC after: 1 week
Sponsored by: Netflix
When merging from Netflix's tree, resetting the carsize was dropped
accidentally. This fix fixes that revision by properly resetting how
many are in the car.
Noticed by: mav@
Research Unix, 7th Edition introduced TIMEZONE and DSTFLAG
compile-time constants in sys/param.h to communicate these values for
the machine. 4.2BSD moved from the compile-time to run-time and
introduced these variables and used for localtime() to return the
right offset from UTC (sometimes referred to as GMT, for this purpose
is the same). 4.4BSD migrated to using the tzdata code/database and
these variables were basically unused.
FreeBSD removed the real need for these with adjkerntz in
1995. However, some RTC clocks continued to use these variables,
though they were largely unused otherwise. Later, phk centeralized
most of the uses in utc_offset, but left it using both tz_minuteswest
and adjkerntz.
POSIX (IEEE Std 1003.1-2017) states in the gettimeofday specification
"If tzp is not a null pointer, the behavior is unspecified" so there's
no standards reason to retain it anymore. In fact, gettimeofday has
been marked as obsolecent, meaning it could be removed from a future
release of the standard. It is the only interface defined in POSIX
that references these two values. All other references come from the
tzdata database via tzset().
These were used to more faithfully implement early unix ABIs which
have been removed from FreeBSD. NetBSD has completely eliminated
these variables years ago. Linux has migrated to tzdata as well,
though these variables technically still exist for compatibility
with unspecified older programs.
So, there's no real reason to have them these days. They are a
historical vestige that's no longer used in any meaningful way.
Reviewed By: jhb@, brooks@
Differential Revision: https://reviews.freebsd.org/D19550
vnode pointer (b_vp). The value of b_vp can be used by "show vnode"
to print the vnode and "show vnodebufs" to print all the clean and
dirty buffers associated with the vnode (which should include this
buffer).
Sponsored by: Netflix
Note that only entries wired by userspace are shown as such. In
particular, entries transiently wired by sysctl_wire_old_buffer() are
not flagged as wired in procstat -v output.
Reviewed by: kib (previous version)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D19461
to call soabort() on a newborn socket created by sonewconn() in case
if further setup of PCB failed. Code in sofree() handles such socket
correctly.
Submitted by: jtl, rrs
MFC after: 3 weeks
In all of the architectures we have today, we always use PAGE_SIZE.
While in theory one could define different things, none of the
current architectures do, even the ones that have transitioned from
32-bit to 64-bit like i386 and arm. Some ancient mips binaries on
other systems used 8k instead of 4k, but we don't support running
those and likely never will due to their age and obscurity.
Reviewed by: imp (who also contributed the commit message)
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D19280
When the device is in attaching state, detach should return
EBUSY instead of success. In other case, there could be race
between attach and detach during the driver unloading.
If driver goes sleep and releases GIANT lock during attaching,
unloading module could start. In such case when attach continues
after module unload, page fault "supervisor read instruction,
page not present" occurred.
This patch works around the real issue, which is a locking
deficiency of the busses.
Submitted by: Rafal Kozik <rk@semihalf.com>
Reviewed by: imp
Obtained from: Semihalf
MFC after: 2 weeks
Sponsored by: Amazon, Inc.
Differential Revision: https://reviews.freebsd.org/D19375
They have no effect, as with filesystem file descriptors.
This improves compatibility with some existing userspace code.
Submitted by: Greg V <greg@unrelenting.technology>
Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D19330
Linux generates the content of procfs files using a mechanism prefixed with
seq_*. This in particular came up with recent gcov import.
Sponsored by: The FreeBSD Foundation
On platforms without a direct map (i.e., platforms without
UMA_MD_SMALL_ALLOC defined), the boundary tag allocator reserves a
number of tags for use when allocating a new slab of boundary tags,
as such platforms require free boundary tags in order to allocate
boundary tags. r327899 increased the number of boundary tags required
for a KVA allocation in the worst case, and the aforementioned
reservation was not updated accordingly. In some cases, this could
lead to a system hang. Fix the problem by increasing this reservation.
Also reduce KVA_QUANTUM on systems lacking superpage support.
The previous import quantum (4MB with a 4KB page size) was quite large
for systems with limited KVA, and fragmentation in kernel_arena could
cause kernel memory allocation failures even with a substantial amount
of free KVA.
Reported and tested by: jhibbits
Reviewed by: alc, kib
No objections: jeff
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19337
If an interrupt fires while writing the cmp entry we may have a partial
entry. Work around this by using atomic_cmpset to set the new index. If it
fails we need to set the previous index value and try again as the entry
may be in an inconsistent state.
This fixes messages similar to the following from syzkaller:
bad comp 224 type 2163727253
Reviewed by: tuexen
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D19287