Commit Graph

18633 Commits

Author SHA1 Message Date
Edward Tomasz Napierala
460b4b550d Implement unprivileged chroot
This builds on the recently introduced NO_NEW_PRIVS flag to implement
unprivileged chroot, enabled by `security.bsd.unprivileged_chroot`.
It allows non-root processes to chroot(2), provided they have the
NO_NEW_PRIVS flag set.

The chroot(8) utility gets a new flag, -n, which sets NO_NEW_PRIVS
before chrooting.

Reviewed By:	kib
Sponsored By:	EPSRC
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D30130

(cherry picked from commit a40cf4175c)
2022-02-14 18:42:21 +00:00
Mark Johnston
2a454b54bf Fix the build after commit 5fa005e915
Fixes:	5fa005e915 ("exec: Reimplement stack address randomization")
2022-02-16 13:32:18 -05:00
John Baldwin
1a9f14cfa5 Use vmspace->vm_stacktop in place of sv_usrstack in more places.
Reviewed by:	markj
Obtained from:	CheriBSD

(cherry picked from commit becaf6433b)
2022-02-16 11:55:37 -05:00
Mark Johnston
5fa005e915 exec: Reimplement stack address randomization
The approach taken by the stack gap implementation was to insert a
random gap between the top of the fixed stack mapping and the true top
of the main process stack.  This approach was chosen so as to avoid
randomizing the previously fixed address of certain process metadata
stored at the top of the stack, but had some shortcomings.  In
particular, mlockall(2) calls would wire the gap, bloating the process'
memory usage, and RLIMIT_STACK included the size of the gap so small
(< several MB) limits could not be used.

There is little value in storing each process' ps_strings at a fixed
location, as only very old programs hard-code this address; consumers
were converted decades ago to use a sysctl-based interface for this
purpose.  Thus, this change re-implements stack address randomization by
simply breaking the convention of storing ps_strings at a fixed
location, and randomizing the location of the entire stack mapping.
This implementation is simpler and avoids the problems mentioned above,
while being unlikely to break compatibility anywhere the default ASLR
settings are used.

The kern.elfN.aslr.stack_gap sysctl is renamed to kern.elfN.aslr.stack,
and is re-enabled by default.

PR:		260303
Reviewed by:	kib
Discussed with:	emaste, mw
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit 1811c1e957)
2022-02-16 11:55:03 -05:00
Mark Johnston
e3b852f99b ktls: Disallow transmitting empty frames outside of TLS 1.0/CBC mode
There was nothing preventing one from sending an empty fragment on an
arbitrary KTLS TX-enabled socket, but ktls_frame() asserts that this
could not happen.  Though the transmit path handles this case for TLS
1.0 with AES-CBC, we should be strict and allow empty fragments only in
modes where it is explicitly allowed.

Modify sosend_generic() to reject writes to a KTLS-enabled socket if the
number of data bytes is zero, so that userspace cannot trigger the
aforementioned assertion.

Add regression tests to exercise this case.

Reported by:	syzkaller
Reviewed by:	gallatin, jhb
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit 5de79eeddb)
2022-02-16 11:52:31 -05:00
Mark Johnston
7ac2a6354f file: Make fget*() and getvnode*() consistent about initializing *fpp
Most fget*() functions initialize the output parameter to NULL.  Make
the externally visible interface behave consistently, and make
fget_unlocked_seq() private to kern_descrip.c.

This fixes at least one bug in a consumer, _filemon_wrapper_openat(),
which assumes that getvnode() sets the output file pointer to NULL upon
an error.

Reported by:	syzbot+01c0459408f896a5933a@syzkaller.appspotmail.com
Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit 300cfb96fc)
2022-02-16 11:52:31 -05:00
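The convention described in commit 7ac2a6354f above can be illustrated with a minimal userspace sketch. It is not the kernel's code: `struct file_sketch` and `lookup_file()` are hypothetical names made up for this example. The point is only that the output pointer is defined on every return path, so a caller that inspects it after an error sees NULL and not stale data.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a file table entry. */
struct file_sketch {
	int	fs_fd;
};

/*
 * Hypothetical lookup routine.  *fpp is set to NULL before any check
 * can fail, so it is well defined even when an error is returned.
 */
static int
lookup_file(struct file_sketch *table, size_t n, int fd,
    struct file_sketch **fpp)
{
	assert(fpp != NULL);
	*fpp = NULL;
	if (fd < 0 || (size_t)fd >= n)
		return (EBADF);
	*fpp = &table[fd];
	return (0);
}
```

A caller can then safely write `if (fp != NULL)` in its cleanup path regardless of whether the lookup succeeded.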
Justin Hibbits
2053dee56a Fix gzip compressed core dumps on big endian architectures
The gzip trailer words (size and CRC) are both little-endian per the spec.

MFC after:	3 days
Sponsored by:	Juniper Networks, Inc.

(cherry picked from commit 6db44b0158)
2022-02-14 13:30:52 -06:00
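As a sketch of what commit 2053dee56a above is about: the gzip trailer is eight bytes, the CRC-32 followed by the uncompressed size, each stored least-significant byte first. Writing the bytes explicitly, as below, gives the same output on little- and big-endian hosts. The helper names `put_le32()` and `gzip_fill_trailer()` are hypothetical and are not taken from the kernel sources.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Store a 32-bit value least-significant byte first, on any host. */
static void
put_le32(uint8_t *dst, uint32_t v)
{
	dst[0] = (uint8_t)(v & 0xff);
	dst[1] = (uint8_t)((v >> 8) & 0xff);
	dst[2] = (uint8_t)((v >> 16) & 0xff);
	dst[3] = (uint8_t)((v >> 24) & 0xff);
}

/* Hypothetical helper: fill the 8-byte trailer (CRC-32, then size). */
static void
gzip_fill_trailer(uint8_t trailer[8], uint32_t crc, uint32_t isize)
{
	assert(trailer != NULL);
	put_le32(&trailer[0], crc);
	put_le32(&trailer[4], isize);
}
```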
Dimitry Andric
ae76550171 tty_info: Avoid warning by using logical instead of bitwise operators
Since TD_IS_RUNNING() and TD_ON_RUNQ() are defined as logical
expressions involving '==', clang 14 warns about them being checked with
a bitwise operator instead of a logical one:

```
sys/kern/tty_info.c:124:9: error: use of bitwise '|' with boolean operands [-Werror,-Wbitwise-instead-of-logical]
        runa = TD_IS_RUNNING(td) | TD_ON_RUNQ(td);
               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                 ||
sys/sys/proc.h:562:27: note: expanded from macro 'TD_IS_RUNNING'
                                ^
sys/kern/tty_info.c:124:9: note: cast one or both operands to int to silence this warning
sys/sys/proc.h:562:27: note: expanded from macro 'TD_IS_RUNNING'
                                ^
sys/kern/tty_info.c:129:9: error: use of bitwise '|' with boolean operands [-Werror,-Wbitwise-instead-of-logical]
        runb = TD_IS_RUNNING(td2) | TD_ON_RUNQ(td2);
               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                  ||
sys/sys/proc.h:562:27: note: expanded from macro 'TD_IS_RUNNING'
                                ^
sys/kern/tty_info.c:129:9: note: cast one or both operands to int to silence this warning
sys/sys/proc.h:562:27: note: expanded from macro 'TD_IS_RUNNING'
                                ^
```

Fix this by using logical operators instead. No functional change
intended.

Reviewed by:	cem, emaste, kevans, markj
MFC after:	3 days
Differential Revision: https://reviews.freebsd.org/D34186

(cherry picked from commit 7d8a4eb943)
2022-02-11 17:43:03 +01:00
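The pattern behind commit ae76550171 above can be reproduced in a small standalone sketch. The types and macros below (`struct thread_sketch`, `SK_IS_RUNNING()`, `SK_ON_RUNQ()`) are hypothetical stand-ins and are not the definitions from sys/sys/proc.h. Because both operands are side-effect-free comparisons yielding 0 or 1, `||` produces the same truth value as `|` would.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's thread states and macros. */
enum td_state_sketch { SK_INACTIVE, SK_RUNQ, SK_RUNNING };

struct thread_sketch {
	enum td_state_sketch	td_state;
};

#define	SK_IS_RUNNING(td)	((td)->td_state == SK_RUNNING)
#define	SK_ON_RUNQ(td)		((td)->td_state == SK_RUNQ)

static bool
thread_is_runnable(const struct thread_sketch *td)
{
	assert(td != NULL);
	/* Logical OR on boolean operands, instead of bitwise '|'. */
	return (SK_IS_RUNNING(td) || SK_ON_RUNQ(td));
}
```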
Colin Percival
baee6cc181 x86: Speed up clock calibration
Prior to this commit, the TSC and local APIC frequencies were calibrated
at boot time by measuring the clocks before and after a one-second sleep.
This was simple and effective, but had the disadvantage of *requiring a
one-second sleep*.

Rather than making two clock measurements (before and after sleeping) we
now perform many measurements; and rather than simply subtracting the
starting count from the ending count, we calculate a best-fit regression
between the target clock and the reference clock (for which the current
best available timecounter is used). While we do this, we keep track
of an estimate of the uncertainty in the regression slope (aka. the ratio
of clock speeds), and stop measuring when we believe the uncertainty is
less than 1 PPM.

In order to avoid the risk of aliasing resulting from the data-gathering
loop synchronizing with (a multiple of) the frequency of the reference
clock, we add some additional spinning depending upon the iteration number.

For numerical stability and simplicity of implementation, we make use of
floating-point arithmetic for the statistical calculations.

On the author's Dell laptop, this reduces the time spent in calibration
from 2000 ms to 29 ms; on an EC2 c5.xlarge instance, it is reduced from
2000 ms to 2.5 ms.

Reviewed by:	bde (previous version), kib
Sponsored by:	https://www.patreon.com/cperciva
Differential Revision:	https://reviews.freebsd.org/D33802

(cherry picked from commit c2705ceaeb)
2022-02-10 22:52:00 -08:00
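The core idea of commit baee6cc181 above, estimating the ratio of two clock rates as the slope of a best-fit line through many (reference, target) samples, can be sketched with an ordinary least-squares fit. This is only an illustration under simplifying assumptions: the function name `regression_slope()` is hypothetical, the samples are passed in as arrays, and the uncertainty estimate, the 1 PPM stopping rule, and the anti-aliasing spin described in the commit message are all omitted.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Least-squares slope of y against x.  With x holding reference-clock
 * readings and y holding target-clock readings, the slope estimates
 * the ratio of the two clock rates.  Requires at least two samples
 * with distinct x values.
 */
static double
regression_slope(const double *x, const double *y, size_t n)
{
	double mx = 0.0, my = 0.0, sxx = 0.0, sxy = 0.0;
	size_t i;

	assert(n >= 2);
	for (i = 0; i < n; i++) {
		mx += x[i];
		my += y[i];
	}
	mx /= (double)n;
	my /= (double)n;
	for (i = 0; i < n; i++) {
		sxx += (x[i] - mx) * (x[i] - mx);
		sxy += (x[i] - mx) * (y[i] - my);
	}
	return (sxy / sxx);
}
```

Multiplying the slope by the known frequency of the reference clock would then give an estimate of the target clock's frequency.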
Kyle Evans
00bc7bbde5 sched: separate out schedinit_ap()
schedinit_ap() sets up an AP for a later call to sched_throw(NULL).

Currently, ULE sets up some pcpu bits and fixes the idlethread lock with
a call to sched_throw(NULL); this results in a window where curthread is
set up in platforms' init_secondary(), but it has the wrong td_lock.
Typical platform AP startup procedure looks something like:

- Set up curthread
- ... other stuff, including cpu_initclocks_ap()
- Signal smp_started
- sched_throw(NULL) to enter the scheduler

cpu_initclocks_ap() may have callouts to process (e.g., nvme) and
attempt to sched_add() for this AP, but this attempt fails because
of the noted violated assumption leading to locking heartburn in
sched_setpreempt().

Interrupts are still disabled until cpu_throw() so we're not really at
risk of being preempted -- just let the scheduler in on it a little
earlier as part of setting up curthread.

(cherry picked from commit 589aed00e3)
2022-02-10 14:55:29 -06:00
Kyle Evans
7393eedb03 execve: disallow argc == 0
The manpage has contained the following verbiage on the matter for just
under 31 years:

"At least one argument must be present in the array"

Previous to this version, it had been prefaced with the weakening phrase
"By convention."

Carry through and document it the rest of the way.  Allowing argc == 0
has been a source of security issues in the past, and it's hard to
imagine a valid use-case for allowing it.  Toss back EINVAL if we ended
up not copying in any args for *execve().

The manpage change can be considered "Obtained from: OpenBSD"

(cherry picked from commit 773fa8cd13)
(cherry picked from commit c9afc7680f)
2022-02-10 14:21:59 -06:00
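A minimal sketch of the rule introduced by commit 7393eedb03 above: count the entries of a NULL-terminated argument vector and return EINVAL when there are none. The function `check_exec_args()` is a hypothetical userspace illustration and does not reproduce how the kernel copies in or counts the arguments.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical check: reject an argument vector with no entries.
 * argv must be a valid, NULL-terminated array.
 */
static int
check_exec_args(char *const argv[])
{
	size_t argc = 0;

	assert(argv != NULL);
	while (argv[argc] != NULL)
		argc++;
	return (argc == 0 ? EINVAL : 0);
}
```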
Hans Petter Selasky
22ba297076 mbuf(9): Assert receive mbufs don't carry a send tag.
Else we would start leaking reference counts.

Discussed with:	jhb@
Sponsored by:	NVIDIA Networking

(cherry picked from commit 17cbcf33c3)
2022-02-10 16:11:22 +01:00
Gordon Bergling
6a3607622e kern_racct: Fix a typo in a source code comment
- s/maxumum/maximum/

(cherry picked from commit a9bee9c77a)
2022-02-09 07:19:50 +01:00
Gordon Bergling
b9c307bc77 kern_fflock: Fix a typo in a source code comment
- s/foward/forward/

(cherry picked from commit 5a78ec9e7c)
2022-02-09 07:18:00 +01:00
Ed Maste
94e6d14488 Remove "All Rights Reserved" from FreeBSD Foundation sys/ copyrights
These ones were unambiguous cases where the Foundation was the only
listed copyright holder (in the associated license block).

Sponsored by:	The FreeBSD Foundation

(cherry picked from commit 9feff969a0)
2022-02-08 15:00:55 -05:00
Konstantin Belousov
15def34bd8 Add GB_NOWITNESS flag
(cherry picked from commit c02780b78c)
2022-02-07 11:38:50 +02:00
Konstantin Belousov
7782d71671 syncer VOP_FSYNC(): unlock syncer vnode around call to VFS_SYNC()
(cherry picked from commit 3d68c4e175)
2022-02-07 11:38:50 +02:00
Konstantin Belousov
4116ae3ece buf_alloc(): lock the buffer with LK_NOWAIT
(cherry picked from commit 5875b94c74)
2022-02-07 11:38:49 +02:00
Konstantin Belousov
78d27f25c7 Use dedicated lock name for pbufs
(cherry picked from commit 531f8cfea0)
2022-02-07 11:38:49 +02:00
Alexander Motin
c27237d62f Reduce bufdaemon/bufspacedaemon shutdown time.
Before this change, the bufdaemon and bufspacedaemon threads used
kthread_shutdown() to stop activity on system shutdown.  The problem is
that kthread_shutdown() has no idea about the wait channel and lock used
by a specific thread, so it cannot wake it up reliably.  As a result, up
to 9 threads could take up to 9 seconds to shut down for no good reason.

This change introduces specific shutdown functions that know how to
properly wake up specific threads, reducing the wait for those threads on
shutdown/reboot from an average of 4 seconds to effectively zero.

MFC after:	2 weeks
Reviewed by:	kib, markj
Differential Revision:  https://reviews.freebsd.org/D33936

(cherry picked from commit b7ff445ffa)
2022-02-01 19:53:10 -05:00
Mark Johnston
40d6b2a362 exec: Remove the stack gap implementation
ASLR stack randomization will reappear in a forthcoming commit.  Rather
than inserting a random gap into the stack mapping, the entire stack
mapping itself will be randomized in the same way that other mappings
are when ASLR is enabled.

No functional change intended, as the stack gap implementation is
currently disabled by default.

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit 758d98debe)
2022-01-31 09:48:57 -05:00
Mark Johnston
947e849150 sysent: Add a sv_psstringssz field to struct sysentvec
The size of the ps_strings structure varies between ABIs, so this is
useful for computing the address of the ps_strings structure relative to
the top of the stack when stack address randomization is enabled.

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit 3fc21fdd5f)
2022-01-31 09:48:11 -05:00
Mark Johnston
d247611467 exec: Introduce the PROC_PS_STRINGS() macro
Rather than fetching the ps_strings address directly from a process'
sysentvec, use this macro.  With stack address randomization the
ps_strings address is no longer fixed.

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit 706f4a81a8)
2022-01-31 09:46:57 -05:00
Konstantin Belousov
fbdc76539b Add security.bsd.allow_ptrace sysctl
(cherry picked from commit fe6db72708)
2022-01-29 03:10:44 +02:00
Konstantin Belousov
704d2103c6 p_candebug(), p_cansee(): always allow for curproc
(cherry picked from commit 55a0aa2162)
2022-01-29 03:10:44 +02:00
Jessica Clarke
f63a2e288c intrng: Use less confusing return value for intr_pic_add_handler
Currently intr_pic_add_handler either returns the PIC you gave it (which
is useless and risks causing confusion about whether it's creating
another PIC) or, on error, NULL. Instead, convert it to return an int
error code as one would expect.

Note that the only consumer of this API, arm64's gicv3_its, does not use
the return value, so no uses need updating to work with the revised API.

Reviewed by:	markj, mmel
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D33341

(cherry picked from commit a3e828c91d)
2022-01-24 23:59:55 +00:00
Jessica Clarke
3c7f332f71 Fix buffer overread in preloaded hostuuid parsing
Commit b6be9566d2 stopped prison0_init writing outside of the
preloaded hostuuid's bounds. However, the preloaded data will not
(normally) have a NUL in it, and so validate_uuid will walk off the end
of the buffer in its call to sscanf. Previously if there was any
whitespace in the string we'd at least know there's a NUL one past the
end due to the off-by-one error, but now no such byte is guaranteed.

Fix this by copying to a temporary buffer and explicitly adding a NUL.

Whilst here, change the strlcpy call to use a far less suspicious
argument for dstsize; in practice it's fine, but it's an unusual pattern
and not necessary.

Found by:	CHERI
Reviewed by:	emaste, kevans, jhb
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D33616

(cherry picked from commit d2ef377430)
2022-01-24 23:59:49 +00:00
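The fix described in commit 3c7f332f71 above follows a common pattern: when a buffer is not guaranteed to contain a NUL, copy it into a temporary buffer of known size and terminate it explicitly before handing it to a string parser such as sscanf(). The sketch below shows only that pattern; `copy_unterminated()` and `UUID_STR_LEN` are hypothetical names for this example, and the 36-character limit is the length of the textual UUID form.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define	UUID_STR_LEN	36	/* xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx */

/*
 * Copy srclen bytes from a possibly unterminated buffer into dst and
 * add the terminating NUL.  dst must hold UUID_STR_LEN + 1 bytes.
 * Returns 0 on success, -1 if the input does not fit.
 */
static int
copy_unterminated(char *dst, const char *src, size_t srclen)
{
	assert(dst != NULL && src != NULL);
	if (srclen > UUID_STR_LEN)
		return (-1);
	memcpy(dst, src, srclen);
	dst[srclen] = '\0';
	return (0);
}
```

After a successful call, `dst` is a proper C string, so a parser reading it cannot run past the end of the original buffer.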
Mark Johnston
c595625abe Revert "kern_exec: Add kern.stacktop sysctl."
The current ASLR stack gap feature will be removed, and with that the
need for the kern.stacktop sysctl is gone.  All consumers have been
removed.

This reverts commit a97d697122.

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit 1544f5add8)
2022-01-24 09:20:21 -05:00
Mark Johnston
1a97674b46 setrlimit: Remove special handling for RLIMIT_STACK with a stack gap
This will not be required with a forthcoming reimplementation of ASLR
stack randomization.  Moreover, this change was not sufficient to enable
the use of a stack size limit smaller than the stack gap itself.

PR:		260303
Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit 5a8413e779)
2022-01-24 09:16:52 -05:00
Mark Johnston
9795d85d2e posixshm: Report output buffer truncation from kern.ipc.posix_shm_list
PR:		240573
Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit dc7526170d)
2022-01-24 09:16:37 -05:00
Alexander Motin
70201cc45d Reduce minimum idle hardclock rate from 2Hz to 1Hz.
On an idle 80-thread system this improves package-level idle state
residency, and thus power consumption, by several percent.

MFC after:	2 weeks

(cherry picked from commit cb1f5d1136)
2022-01-23 21:35:58 -05:00
Alexander Motin
15e1d8f69b Fix inverse sleep logic in buf_daemon().
Before commit 3cec5c77d6, buf_daemon() went into a longer 1 s sleep if
numdirtybuffers <= lodirtybuffers.  After that commit, the new condition
!BIT_EMPTY(BUF_DOMAINS, &bdlodirty) became the opposite: true when one
or more domains are above lodirtybuffers.  As a result, on a freshly
booted system with no dirty buffers buf_daemon() wakes up 10 times
per second, and probably only once per second when there is actual
work to do.

MFC after:	1 week
Reviewed by:	kib, markj
Tested by:	pho
Differential revision:	https://reviews.freebsd.org/D33890

(cherry picked from commit e76c010899)
2022-01-23 14:57:35 -05:00
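The intended behaviour restored by commit 15e1d8f69b above can be summarised in a small sketch: sleep for a full second when no buffer domain is above its low watermark, and for a tenth of a second when at least one is. The function `bufdaemon_sleep_ticks()` and its parameters are hypothetical; the real code tests a per-domain bitset and sleeps directly, which is not modelled here.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: pick the sleep length in ticks.
 * hz is the number of ticks per second.
 */
static int
bufdaemon_sleep_ticks(bool any_domain_above_lo, int hz)
{
	assert(hz > 0);
	/* Work pending: poll often.  Nothing dirty: sleep longer. */
	return (any_domain_above_lo ? hz / 10 : hz);
}
```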
Michal Meloun
2ace1585b0 intrng: remove now redundant shadow variable.
Should not be a functional change.

Submitted by: 	ehem_freebsd@m5p.com
Discussed in:	https://reviews.freebsd.org/D29310
MFC after:	4 weeks

(cherry picked from commit e88c3b1b02)
2022-01-20 11:08:45 +01:00
Michal Meloun
a3ccd06dd9 intrng: Releasing interrupt source should clear interrupt table full state.
The first release of an interrupt in a situation where the interrupt table
is full should schedule a full table check the next time an interrupt is
allocated. A full check is necessary to ensure maximum separation between
the order of allocation and the order of release.

Submitted by:	ehem_freebsd@m5p.com (initial version)
Discussed in:	https://reviews.freebsd.org/D29310
MFC after:	4 weeks

(cherry picked from commit a49f208d94)
2022-01-20 11:07:44 +01:00
Mark Johnston
af30714ff4 fd: Avoid truncating output buffers for KERN_PROC_{CWD,FILEDESC}
These sysctls failed to return an error if the caller had provided too
short an output buffer.  Change them to return ENOMEM instead, to ensure
that callers can detect truncation in the face of a concurrently
changing fd table.

PR:		228432
Discussed with:	cem, jhb

(cherry picked from commit 36bd49ac4d)
2022-01-16 10:40:25 -05:00
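The behaviour adopted in commit af30714ff4 above, reporting ENOMEM when the output does not fit instead of silently succeeding with partial data, can be sketched as follows. `copy_out_all()` is a hypothetical userspace function, not the sysctl handler itself, and for simplicity it copies nothing when the buffer is too small.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical producer: copy datalen bytes to buf only if they fit
 * in *lenp bytes, and report the number of bytes written in *lenp.
 * Returns ENOMEM when the caller's buffer is too short.
 */
static int
copy_out_all(void *buf, size_t *lenp, const void *data, size_t datalen)
{
	assert(lenp != NULL);
	if (*lenp < datalen)
		return (ENOMEM);
	memcpy(buf, data, datalen);
	*lenp = datalen;
	return (0);
}
```

A caller that sees ENOMEM can enlarge its buffer and retry, which is what makes truncation detectable when the underlying table changes between calls.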
Konstantin Belousov
a5f6985995 Ignore debugger-injected signals left after detaching
PR:	261010

(cherry picked from commit a24afbb4e6)
2022-01-15 14:55:32 +02:00
Konstantin Belousov
b4889992d7 Add vfs_remount_ro()
(cherry picked from commit 4a4b059a97)
2022-01-14 20:11:02 +02:00
Stefan Eßer
54e1dc50ec sys/kern/sched_4bsd.c: fix typo introduced in previous commit
(cherry picked from commit ec3af9d0ca)
2022-01-14 18:17:31 +02:00
Stefan Eßer
a94baf23cf Restore variable aliasing in the context of cpu set operations
(cherry picked from commit a19bd8e30e)
2022-01-14 18:17:31 +02:00
Stefan Eßer
dc4114875e Make CPU_SET macros compliant with other implementations
(cherry picked from commit e2650af157)
2022-01-14 18:17:30 +02:00
Konstantin Belousov
6e676b5550 Regen
2022-01-14 18:17:30 +02:00
Konstantin Belousov
a48d9f1900 Add sched_getcpu()
(cherry picked from commit 77b2c2f814)
2022-01-14 18:17:29 +02:00
Mark Johnston
1562fe492a exec: Simplify sv_copyout_strings implementations a bit
Simplify control flow around handling of the execpath length and signal
trampoline.  Cache the sysentvec pointer in a local variable.

No functional change intended.

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit f04a096049)
2022-01-14 08:50:06 -05:00
Colin Percival
972796d007 vfs_mountroot: Check for root dev before waiting
If GEOM is idle but the root device is not yet present when we enter
vfs_mountroot_wait_if_necessary, we call vfs_mountroot_wait to wait
for root holds (e.g. CAM or USB initialization).  Upon returning from
vfs_mountroot_wait, we wait 100 ms at a time until the root device
shows up.

Since the root device most likely appeared during vfs_mountroot_wait
-- waiting for subsystems which may be responsible for the root
device is the whole purpose of that function -- it makes sense to
check if the device is now present rather than printing a warning
and pausing for 100 ms before checking.

Reviewed by:	trasz
Fixes: a3ba3d09c2 Make root mount wait mechanism smarter
Sponsored by:	https://www.patreon.com/cperciva
Differential Revision:	https://reviews.freebsd.org/D33593

(cherry picked from commit 33812d60b9)
2022-01-12 11:29:51 -08:00
Colin Percival
72f61780a4 vfs_mountroot: Wait for GEOM idle post root holds
In the case of a root hold related to the initialization of a disk
device, a flurry of GEOM tasting is likely to take place as soon as
the device is initialized and the root hold is released.  If we
don't wait for GEOM idle it's easy for vfs_mountroot to "win" the
race and proceed before the root filesystem GEOM is ready.

Reviewed by:	imp
Sponsored by:	https://www.patreon.com/cperciva
Differential Revision:	https://reviews.freebsd.org/D33592

(cherry picked from commit 19a172158c)
2022-01-12 11:29:48 -08:00
Colin Percival
d4724934f2 vfs_mountroot: Skip 'Root mount waiting' < 1 s
While the message is technically correct, it's not particularly
helpful in the case where we're only waiting a few ms; this case
occurs frequently on EC2 arm64 instances with CAM initialization
racing to release its root hold before vfs_mountroot reaches this
point.  Only print the message if we end up waiting for more than
one second.

Sponsored by:	https://www.patreon.com/cperciva
Differential Revision:	https://reviews.freebsd.org/D33591

(cherry picked from commit e6db5eb9ec)
2022-01-12 11:29:37 -08:00
Alexander Motin
034c2781d5 kern: Remove CTLFLAG_NEEDGIANT from some more sysctls.
MFC after:	2 weeks

(cherry picked from commit c6c52d8e39)
2022-01-09 19:30:09 -05:00
Alexander Motin
5ec6907c0a kern: Remove CTLFLAG_NEEDGIANT from some sysctls.
MFC after:	2 weeks

(cherry picked from commit fe27f1db5f)
2022-01-08 20:24:10 -05:00
Hans Petter Selasky
a889d262a7 Remove dead code.
The variable orig_resid is always set to zero right after the while loop
where it is cleared.

Reviewed by:	gallatin@ and glebius@
Differential Revision:	https://reviews.freebsd.org/D33589
Sponsored by:	NVIDIA Networking

(cherry picked from commit f9978339d1)
2022-01-07 14:08:59 +01:00
Alexander Motin
b7668d009e Make CPU children explicitly share parent unit numbers.
Before this, the device unit number match was coincidental and broke if I
disabled some CPU device(s).  Aside from cosmetics, for some drivers
(which may be considered broken) it caused talking to the wrong CPUs.

(cherry picked from commit d3a8f98acb)
2022-01-04 12:21:42 -05:00