1876 Commits

Author SHA1 Message Date
kevans
0b09e6adb7 Add SPDX tags to recently added files
Reported by:	Pawel Biernacki
2019-09-25 22:53:30 +00:00
kevans
ca07d63db5 posix_spawn(3): handle potential signal issues with vfork
Described in [1], signal handlers running in a vfork child have
opportunities to corrupt the parent's state. Address this by adding a new
rfork(2) flag, RFSPAWN, that has vfork(2) semantics but also resets signal
handlers in the child during creation.

x86 uses rfork_thread(3) instead of a direct rfork(2) because rfork with
RFMEM/RFSPAWN cannot work when the return address is stored on the stack --
further information about this problem is described under RFMEM in the
rfork(2) man page.

Addressing this has been identified as a prerequisite to using posix_spawn
in subprocess on FreeBSD [2].

[1] https://ewontfix.com/7/
[2] https://bugs.python.org/issue35823

Reviewed by:	jilles, kib
Differential Revision:	https://reviews.freebsd.org/D19058
2019-09-25 19:22:03 +00:00
kevans
245d8426fc rfork(2): add RFSPAWN flag
When RFSPAWN is passed, rfork exhibits vfork(2) semantics but also resets
signal handlers in the child during creation to avoid a point of corruption
of parent state from the child.

This flag will be used by posix_spawn(3) to handle potential signal issues.

Reviewed by:	jilles, kib
Differential Revision:	https://reviews.freebsd.org/D19058
2019-09-25 19:20:41 +00:00
kevans
1d9983b221 sysent: regenerate after r352705
This also implements it, fixes kdump, and removes no longer needed bits from
lib/libc/sys/shm_open.c for the interim.
2019-09-25 18:09:19 +00:00
kevans
575e351fdd Add linux-compatible memfd_create
memfd_create is effectively a SHM_ANON shm_open(2) mapping with optional
CLOEXEC and file sealing support. This is used by some mesa parts, some
linux libs, and qemu can also take advantage of it and uses the sealing to
prevent resizing the region.

This reimplements shm_open in terms of shm_open2(2) at the same time.

shm_open(2) will be moved to COMPAT12 shortly.

Reviewed by:	markj, kib
Differential Revision:	https://reviews.freebsd.org/D21393
2019-09-25 18:03:18 +00:00
kevans
7013010007 Update fcntl(2) after r352695 2019-09-25 17:33:12 +00:00
sef
f0e5ce5f10 Add two options to allow mount to avoid covering up existing mount points.
The two options are

* nocover/cover:  Prevent/allow mounting over an existing root mountpoint.
E.g., "mount -t ufs -o nocover /dev/sd1a /usr/local" will fail if /usr/local
is already a mountpoint.
* emptydir/noemptydir:  Prevent/allow mounting on a non-empty directory.
E.g., "mount -t ufs -o emptydir /dev/sd1a /usr" will fail.

Neither of these options is intended to be a default, for historical and
compatibility reasons.

Reviewed by:	allanjude, kib
Differential Revision:	https://reviews.freebsd.org/D21458
2019-09-23 04:28:07 +00:00
kib
c4ddd57b85 Return EISDIR when directory is opened with O_CREAT without O_DIRECTORY.
Reviewed by:	bcr (man page), emaste (previous version)
PR:	240452
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
DIfferential revision:	https://reviews.freebsd.org/D21634
2019-09-17 18:32:18 +00:00
asomers
af1c569e96 getsockopt.2: clarify that SO_TIMESTAMP is not 100% reliable
When SO_TIMESTAMP is set, the kernel will attempt to attach a timestamp as
ancillary data to each IP datagram that is received on the socket. However,
it may fail, for example due to insufficient memory. In that case the
packet will still be received but not timestamp will be attached.

Reviewed by:	kib
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D21607
2019-09-11 19:48:32 +00:00
mhorne
cac0e53d56 Fix cpuwhich_t column width
Not bumping .Dd since this is purely a format change.

Approved by:	markj (mentor)
2019-09-08 21:37:52 +00:00
kib
a788c25f19 Add procctl(PROC_STACKGAP_CTL)
It allows a process to request that stack gap was not applied to its
stacks, retroactively.  Also it is possible to control the gaps in the
process after exec.

PR:	239894
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D21352
2019-09-03 18:56:25 +00:00
mjg
286ae5bd6b Add sysctlbyname system call
Previously userspace would issue one syscall to resolve the sysctl and then
another one to actually use it. Do it all in one trip.

Fallback is provided in case newer libc happens to be running on an older
kernel.

Submitted by:	Pawel Biernacki
Reported by:	kib, brooks
Differential Revision:	https://reviews.freebsd.org/D17282
2019-09-03 04:16:30 +00:00
emaste
ac28fa5277 Add @generated tag to libc syscall asm wrappers
Although libc syscall wrappers do not get checked in this can aid in
finding the source of generated files when spelunking in the objdir.

Multiple tools use @generated to identify generated files (for example,
in a review Phabricator will by default hide diffs in generated files).
For consistency use the @generated tag in makesyscalls.sh as we've done
for other generated files, even though these wrappers aren't checked in
to the tree.
2019-08-16 14:14:57 +00:00
kib
75eca26976 wait(2): clarify reparenting of children of the exiting process.
Point to the existence of reapers and mention that init is the default
reaper.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2019-08-11 15:47:48 +00:00
kib
334dd0a5ca wait(2): split long line by using .Fo/.Fa instead of .Ft.
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2019-08-11 15:44:36 +00:00
bjk
c1b91fd7f9 Fix grammar nit in copy_file_range docs
Bytes are countable, so we have fewer of them, not less of them.
2019-07-25 15:43:15 +00:00
rmacklem
cd932f85db Add libc support for the copy_file_range(2) syscall added by r350315.
copy_file_range.2 is a new man page (content change).

Reviewed by:	kib, asomers
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D20584
2019-07-25 06:05:49 +00:00
jhb
bd67f2ec6b Add ptrace op PT_GET_SC_RET.
This ptrace operation returns a structure containing the error and
return values from the current system call.  It is only valid when a
thread is stopped during a system call exit (PL_FLAG_SCX is set).

The sr_error member holds the error value from the system call.  Note
that this error value is the native FreeBSD error value that has _not_
been translated to an ABI-specific error value similar to the values
logged to ktrace.

If sr_error is zero, then the return values of the system call will be
set in sr_retval[0] and sr_retval[1].

Reviewed by:	kib
MFC after:	1 month
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D20901
2019-07-15 21:48:02 +00:00
kib
dd4da0248c Document atomicity for read(2) and write(2).
Take part of the text from POSIX 2018 edition and describe the
atomicity requirements for read and write syscalls.  See p1003.1-2018,
Vol.2, 2.9.7 Threads interaction with Regular File Operations.

Reviewed by:	asomers
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
Differential revision:	https://reviews.freebsd.org/D20867
2019-07-06 20:31:37 +00:00
kib
5144f6086b Control implicit PROT_MAX() using procctl(2) and the FreeBSD note
feature bit.

In particular, allocate the bit to opt-out the image from implicit
PROTMAX enablement.  Provide procctl(2) verbs to set and query
implicit PROTMAX handling.  The knobs mimic the same per-image flag
and per-process controls for ASLR.

Reviewed by:	emaste, markj (previous version)
Discussed with:	brooks
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D20795
2019-07-02 19:07:17 +00:00
kib
e9829937e1 Typo.
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2019-06-28 16:42:44 +00:00
brooks
57acd8427d Add PROT_MAX to the HISTORY section.
In the case of mmap(), add a HISTORY section.  Mention that mmap() and
mprotect()'s documentation predates an implementation.  The
implementation first saw wide use in 4.3-Reno, but there seems to be no
easy way to express that in mdoc so stick with 4.4BSD.

Reviewed by:	emaste
Requested by:	cem
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D20713
2019-06-20 21:52:30 +00:00
brooks
0165ccc850 Extend mmap/mprotect API to specify the max page protections.
A new macro PROT_MAX() alters a protection value so it can be OR'd with
a regular protection value to specify the maximum permissions.  If
present, these flags specify the maximum permissions.

While these flags are non-portable, they can be used in portable code
with simple ifdefs to expand PROT_MAX() to 0.

This change allows (e.g.) a region that must be writable during run-time
linking or JIT code generation to be made permanently read+execute after
writes are complete.  This complements W^X protections allowing more
precise control by the programmer.

This change alters mprotect argument checking and returns an error when
unhandled protection flags are set.  This differs from POSIX (in that
POSIX only specifies an error), but is the documented behavior on Linux
and more closely matches historical mmap behavior.

In addition to explicit setting of the maximum permissions, an
experimental sysctl vm.imply_prot_max causes mmap to assume that the
initial permissions requested should be the maximum when the sysctl is
set to 1.  PROT_NONE mappings are excluded from this for compatibility
with rtld and other consumers that use such mappings to reserve
address space before mapping contents into part of the reservation.  A
final version this is expected to provide per-binary and per-process
opt-in/out options and this sysctl will go away in its current form.
As such it is undocumented.

Reviewed by:	emaste, kib (prior version), markj
Additional suggestions from:	alc
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D18880
2019-06-20 18:24:16 +00:00
asomers
ba20317d83 open(2): fix the description of O_FSYNC
The man page claims that with O_FSYNC (aka O_SYNC) the kernel will not cache
written data. However, that's not true. Nor does POSIX require it.
Perhaps it was true when that section of the man page was written in r69336
(I haven't checked). But it's not true now.  Now the effect is simply that
writes are sent to disk immediately and synchronously, but they're still
cached.

See also: https://pubs.opengroup.org/onlinepubs/9699919799/
See also: ffs_write in sys/ufs/ffs/ffs_vnops.c

Reviewed by:	cem
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20641
2019-06-14 20:35:37 +00:00
oshogbo
86dc1571ec unlink: add missing function to unlink.2 man page 2019-06-05 22:36:19 +00:00
asomers
ae332e926a Link fhlinkat(2) man page
Reviewed by:	kib
MFC after:	3 days
MFC-With:	r341689
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20339
2019-05-22 01:11:21 +00:00
markj
094736f08f Provide separate accounting for user-wired pages.
Historically we have not distinguished between kernel wirings and user
wirings for accounting purposes.  User wirings (via mlock(2)) were
subject to a global limit on the number of wired pages, so if large
swaths of physical memory were wired by the kernel, as happens with
the ZFS ARC among other things, the limit could be exceeded, causing
user wirings to fail.

The change adds a new counter, v_user_wire_count, which counts the
number of virtual pages wired by user processes via mlock(2) and
mlockall(2).  Only user-wired pages are subject to the system-wide
limit which helps provide some safety against deadlocks.  In
particular, while sources of kernel wirings typically support some
backpressure mechanism, there is no way to reclaim user-wired pages
shorting of killing the wiring process.  The limit is exported as
vm.max_user_wired, renamed from vm.max_wired, and changed from u_int
to u_long.

The choice to count virtual user-wired pages rather than physical
pages was done for simplicity.  There are mechanisms that can cause
user-wired mappings to be destroyed while maintaining a wiring of
the backing physical page; these make it difficult to accurately
track user wirings at the physical page layer.

The change also closes some holes which allowed user wirings to succeed
even when they would cause the system limit to be exceeded.  For
instance, mmap() may now fail with ENOMEM in a process that has called
mlockall(MCL_FUTURE) if the new mapping would cause the user wiring
limit to be exceeded.

Note that bhyve -S is subject to the user wiring limit, which defaults
to 1/3 of physical RAM.  Users that wish to exceed the limit must tune
vm.max_user_wired.

Reviewed by:	kib, ngie (mlock() test changes)
Tested by:	pho (earlier version)
MFC after:	45 days
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D19908
2019-05-13 16:38:48 +00:00
trasz
2dfe603ea5 .Xr protect(1) and proccontrol(1) from procctl(2).
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
2019-04-09 10:09:59 +00:00
oshogbo
20d273b44b Introduce funlinkat syscall that always us to check if we are removing
the file associated with the given file descriptor.

Reviewed by:	kib, asomers
Reviewed by:	cem, jilles, brooks (they reviewed previous version)
Discussed with:	pjd, and many others
Differential Revision:	https://reviews.freebsd.org/D14567
2019-04-06 09:34:26 +00:00
kib
9638d3e2e6 Fix initial exec TLS mode for dynamically loaded shared objects.
If dso uses initial exec TLS mode, rtld tries to allocate TLS in
static space. If there is no space left, the dlopen(3) fails. If space
if allocated, initial content from PT_TLS segment is distributed to
all threads' pcbs, which was missed and caused un-initialized TLS
segment for such dso after dlopen(3).

The mode is auto-detected either due to the relocation used, or if the
DF_STATIC_TLS dynamic flag is set.  In the later case, the TLS segment
is tried to allocate earlier, which increases chance of the dlopen(3)
to succeed.  LLD was recently fixed to properly emit the flag, ld.bdf
did it always.

Initial test by:	dumbbell
Tested by:	emaste (amd64), ian (arm)
Tested by:	Gerald Aryeetey <aryeeteygerald_rogers.com> (arm64)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D19072
2019-03-29 17:52:57 +00:00
emaste
c2bece9ede Use consistent struct stat arg name in stat man page
stat, lstat, and fstat use `*sb` as the stat struct pointer arg name,
while fstatat previously used `*buf`.

MFC after:	1 week
2019-03-13 15:18:14 +00:00
emaste
47733e8a6a poll.2: POLLNVAL is returned also for insufficient rights
Reported by:	"Bora Özarslan" <borako.ozarslan@gmail.com>
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2019-02-27 17:52:22 +00:00
kib
069606f4d2 procctl(2): document ASLR knobs.
Reviewed by:	0mp
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D19308
2019-02-26 17:41:41 +00:00
kib
7b81db9851 procctl(2): fix -width parameter to .Bl.
According to 0mp, macros are not expanded in the argument provided to
-width.  Use plain identifiers for width specification.

Noted and reviewed by:	0mp
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
Differential revision:	https://reviews.freebsd.org/D19308
2019-02-26 17:35:06 +00:00
glebius
8a60540f9e Imaginary cat jumped my keyboard! 2019-02-15 23:46:34 +00:00
glebius
d889424078 For 32-bit machines rollback the default number of vnode pager pbufs
back to the lever before r343030.  For 64-bit machines reduce it slightly,
too.  Together with r343030 I bumped the limit up to the value we use at
Netflix to serve 100 Gbit/s of sendfile traffic, and it probably isn't a
good default.

Provide a loader tunable to change vnode pager pbufs count. Document it.
2019-02-15 23:36:22 +00:00
pluknet
16482aa8e6 Document the ENOBUFS errno in setsockopt(2).
In particular, it is the case if SO_SNDBUF/SO_RCVBUF would exceed sb_max_adj.

PR:		200649
MFC after:	1 week
2019-02-09 21:33:32 +00:00
ngie
d1a640f492 Document that sendfile will return an invalid value for sbytes if provided an invalid address
This is meant to clarify the fact that the system call will not fail
with -1/EFAULT, as one might expect, when reading the sendfile(2)
manpage today.

While here, pet the mandoc linter, when dealing with the section that
describes valid values for `flags`.

PR:	232210
MFC after:	2 weeks
Approved by:	emaste (mentor)
Reviewed by:	glebius, 0mp
Differential Revision: https://reviews.freebsd.org/D18949
2019-01-25 19:56:02 +00:00
mckusick
72a21ba0f6 Create new EINTEGRITY error with message "Integrity check failed".
An integrity check such as a check-hash or a cross-correlation failed.
The integrity error falls between EINVAL that identifies errors in
parameters to a system call and EIO that identifies errors with the
underlying storage media. EINTEGRITY is typically raised by intermediate
kernel layers such as a filesystem or an in-kernel GEOM subsystem when
they detect inconsistencies. Uses include allowing the mount(8) command
to return a different exit value to automate the running of fsck(8)
during a system boot.

These changes make no use of the new error, they just add it. Later
commits will be made for the use of the new error number and it will
be added to additional manual pages as appropriate.

Reviewed by:    gnn, dim, brueffer, imp
Discussed with: kib, cem, emaste, ed, jilles
Differential Revision: https://reviews.freebsd.org/D18765
2019-01-17 06:35:45 +00:00
kib
f34dbcf7d0 Implement shmat(2) flag SHM_REMAP.
Based on the description in Linux man page.

Reviewed by:	markj, ngie (previous version)
Sponsored by:	Mellanox Technologies
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D18837
2019-01-16 05:15:57 +00:00
kib
54e59fff7b Add a tunable which changes mincore(2) algorithm to only report data
from the local mapping.

Enable the setting by default.
The article behind the change: https://arxiv.org/abs/1901.01161

Reviewed by:	markj
Discussed with:	emaste
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D18764
2019-01-07 22:10:48 +00:00
jilles
bbe7a81f56 thr_wake(2): Minor mdoc fixes
MFC after:	1 week
2019-01-06 21:34:05 +00:00
markj
85037a09f3 Support MSG_DONTWAIT in send*(2).
As it does for recv*(2), MSG_DONTWAIT indicates that the call should
not block, returning EAGAIN instead.  Linux and OpenBSD both implement
this, so the change makes porting easier, especially since we do not
return EINVAL or so when unrecognized flags are specified.

Submitted by:	Greg V <greg@unrelenting.technology>
Reviewed by:	tuexen
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D18728
2019-01-04 17:31:50 +00:00
kib
4045451f9e Remove special case handling for getfhat(fd, NULL, handle).
There is no reason for it to behave differently from openat(fd, NULL).
Also the handling did not worked because the substituted path was from
the system address space, causing EFAULT.

Submitted by:	Jack Halford <jack@gandi.net>
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D18501
2018-12-11 02:48:49 +00:00
kib
48d91fd889 Add new file handle system calls.
Namely, getfhat(2), fhlink(2), fhlinkat(2), fhreadlink(2).  The
syscalls are provided for a NFS userspace server (nfs-ganesha).

Submitted by:	Jack Halford <jack@gandi.net>
Sponsored by:	Gandi.net
Tested by:	pho
Feedback from:	brooks, markj
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D18359
2018-12-07 15:17:29 +00:00
asomers
1d5dbbadec stat(2): clarify which syscalls modify file timestamps
The list of syscalls that modify st_atim, st_mtim, and st_ctim was quite out
of date and probably not accurate to begin with.  Update it, and make it
clear that the list is open-ended.

Reviewed by:	kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D18410
2018-12-05 17:28:40 +00:00
asomers
0b42fa2478 fcntl.2: document an additional error condition
MFC after:	2 weeks
2018-11-15 16:13:25 +00:00
kib
7d33ec3750 Add d_off support for multiple filesystems.
The d_off field has been added to the dirent structure recently.
Currently filesystems don't support this feature.  Support has been
added and tested for zfs, ufs, ext2fs, fdescfs, msdosfs and unionfs.
A stub implementation is available for cd9660, nandfs, udf and
pseudofs but hasn't been tested.

Motivation for this feature: our usecase is for a userspace nfs server
(nfs-ganesha) with zfs.  At the moment we cache direntry offsets by
calling lseek once per entry, with this patch we can get the offset
directly from getdirentries(2) calls which provides a significant
speedup.

Submitted by:	Jack Halford <jack@gandi.net>
Reviewed by:	mckusick, pfg, rmacklem (previous versions)
Sponsored by:	Gandi.net
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D17917
2018-11-14 14:18:35 +00:00
kib
b2b4bc844b First draft of documentation for AT/O_BENEATH handling of the absolute
paths.

It was decided that committing the code and drafting of the man page
update is better than allowing the code to rot until wordsmithing
happens.

Reviewed by:	jilles (previous version)
Discussed with:	brooks, jilles, emaste
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D17714
2018-11-11 01:46:48 +00:00
cem
a559f5f5a0 kern_poll: Restore explanatory comment removed in r177374
The comment isn't stale.  The check is bogus in the sense that poll(2)
does not require pollfd entries to be unique in fd space, so there is no
reason there cannot be more pollfd entries than open or even allowed
fds.  The check is mostly a seatbelt against accidental misuse or
abuse.  FD_SETSIZE, while usually unrelated to poll, is used as an
arbitrary floor for systems with very low kern.maxfilesperproc.

Additionally, document this possible EINVAL condition in the poll.2
manual.

No functional change.

Reviewed by:	markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D17671
2018-11-01 23:46:23 +00:00