Unlike the other syscalls these two symbols were missing from the
version script. I noticed this while looking into the compiler-rt
runtime libraries for CHERI.
Reviewed by: brooks
Obtained from: https://github.com/CTSRD-CHERI/cheribsd/pull/1063
MFC after: 3 days
Add missing wrapper code to librt for these new functions so that
SIGEV_THREAD works. Without machinery to convert it to SIGEV_THREAD_ID,
you got EINVAL.
Reviewed by: asomers
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D31618
fspacectl(2) is a system call to provide space management support to
userspace applications. VOP_DEALLOCATE(9) is a VOP call to perform the
deallocation. vn_deallocate(9) is a public KPI for kmods' use.
The purpose of proposing a new system call, a KPI and a VOP call is to
allow bhyve or other hypervisor monitors to emulate the behavior of SCSI
UNMAP/NVMe DEALLOCATE on a plain file.
fspacectl(2) comprises of cmd and flags parameters to specify the
space management operation to be performed. Currently cmd has to be
SPACECTL_DEALLOC, and flags has to be 0.
fo_fspacectl is added to fileops.
VOP_DEALLOCATE(9) is added as a new VOP call. A trivial implementation
of VOP_DEALLOCATE(9) is provided.
Sponsored by: The FreeBSD Foundation
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D28347
Current POSIX standard requires fork() to be async-signal safe. Neither
our implementation, nor implementations in other operating systems are,
and practically it is impossible to make fork() async-signal safe without
too much efforts. Also, that would put undue requirement that all atfork
handlers should be async-signal safe as well, which contradicts its main
use.
As result, Austin Group dropped the requirement, and added a new function
_Fork() that should be async-signal safe, but it does not call atfork
handlers. Basically, _Fork() can be implemented as a raw syscall.
Release of glibc 2.34 added _Fork(), do the same for FreeBSD.
Clarify threading behavior for fork() in the manpage.
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D31378
Without wrapping, rtld services and malloc(3) are not guaranteed
to operate correctly in the forked child.
Reviewed by: markj
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28088
POSIX AIO is great, but it lacks vectored I/O functions. This commit
fixes that shortcoming by adding aio_writev and aio_readv. They aren't
part of the standard, but they're an obvious extension. They work just
like their synchronous equivalents pwritev and preadv.
It isn't yet possible to use vectored aiocbs with lio_listio, but that
could be added in the future.
Reviewed by: jhb, kib, bcr
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D27743
largepage shm objects.
And since we can, add memfd_create(MFD_HUGETLB) support, hopefully
close enough to the Linux feature.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D24652
Include a temporarily compatibility shim as well for kernels predating
close_range, since closefrom is used in some critical areas.
Reviewed by: markj (previous version), kib
Differential Revision: https://reviews.freebsd.org/D24399
close_range(min, max, flags) allows for a range of descriptors to be
closed. The Python folk have indicated that they would much prefer this
interface to closefrom(2), as the case may be that they/someone have special
fds dup'd to higher in the range and they can't necessarily closefrom(min)
because they don't want to hit the upper range, but relocating them to lower
isn't necessarily feasible.
sys_closefrom has been rewritten to use kern_close_range() using ~0U to
indicate closing to the end of the range. This was chosen rather than
requiring callers of kern_close_range() to hold FILEDESC_SLOCK across the
call to kern_close_range for simplicity.
The flags argument of close_range(2) is currently unused, so any flags set
is currently EINVAL. It was added to the interface in Linux so that future
flags could be added for, e.g., "halt on first error" and things of this
nature.
This patch is based on a syscall of the same design that is expected to be
merged into Linux.
Reviewed by: kib, markj, vangyzen (all slightly earlier revisions)
Differential Revision: https://reviews.freebsd.org/D21627
A new syscall sigfastblock(2) is added which registers a uint32_t
variable as containing the count of blocks for signal delivery. Its
content is read by kernel on each syscall entry and on AST processing,
non-zero count of blocks is interpreted same as the signal mask
blocking all signals.
The biggest downside of the feature that I see is that memory
corruption that affects the registered fast sigblock location, would
cause quite strange application misbehavior. For instance, the process
would be immune to ^C (but killable by SIGKILL).
With consumers (rtld and libthr added), benchmarks do not show a
slow-down of the syscalls in micro-measurements, and macro benchmarks
like buildworld do not demonstrate a difference. Part of the reason is
that buildworld time is dominated by compiler, and clang already links
to libthr. On the other hand, small utilities typically used by shell
scripts have the total number of syscalls cut by half.
The syscall is not exported from the stable libc version namespace on
purpose. It is intended to be used only by our C runtime
implementation internals.
Tested by: pho
Disscussed with: cem, emaste, jilles
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D12773
Add an atomic shm rename operation, similar in spirit to a file
rename. Atomically unlink an shm from a source path and link it to a
destination path. If an existing shm is linked at the destination
path, unlink it as part of the same atomic operation. The caller needs
the same permissions as shm_unlink to the shm being renamed, and the
same permissions for the shm at the destination which is being
unlinked, if it exists. If those fail, EACCES is returned, as with the
other shm_* syscalls.
truss support is included; audit support will come later.
This commit includes only the implementation; the sysent-generated
bits will come in a follow-on commit.
Submitted by: Matthew Bryan <matthew.bryan@isilon.com>
Reviewed by: jilles (earlier revision)
Reviewed by: brueffer (manpages, earlier revision)
Relnotes: yes
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D21423
memfd_create is effectively a SHM_ANON shm_open(2) mapping with optional
CLOEXEC and file sealing support. This is used by some mesa parts, some
linux libs, and qemu can also take advantage of it and uses the sealing to
prevent resizing the region.
This reimplements shm_open in terms of shm_open2(2) at the same time.
shm_open(2) will be moved to COMPAT12 shortly.
Reviewed by: markj, kib
Differential Revision: https://reviews.freebsd.org/D21393
Previously userspace would issue one syscall to resolve the sysctl and then
another one to actually use it. Do it all in one trip.
Fallback is provided in case newer libc happens to be running on an older
kernel.
Submitted by: Pawel Biernacki
Reported by: kib, brooks
Differential Revision: https://reviews.freebsd.org/D17282
copy_file_range.2 is a new man page (content change).
Reviewed by: kib, asomers
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D20584
the file associated with the given file descriptor.
Reviewed by: kib, asomers
Reviewed by: cem, jilles, brooks (they reviewed previous version)
Discussed with: pjd, and many others
Differential Revision: https://reviews.freebsd.org/D14567
No valid FreeBSD binary very called them (they would call lchown and
msync directly) and we haven't supported NetBSD binaries in ages.
This is a respin of r335983 with a workaround for the ancient BFD linker
in the libc stubs.
Reviewed by: kib
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D16193
No valid FreeBSD binary ever called them (they would call lchown and
msync directly) and we haven't supported NetBSD binaries in ages.
Reviewed by: kib
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D15814
The vadvise syscall (aka ovadvise) is undocumented and has always been
implmented as returning EINVAL. Put the syscall under COMPAT11 and
provide a userspace implementation.
Reviewed by: kib
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D15557
On older kernels, when userspace program disables SIGSYS, catch ENOSYS and
emulate getrandom(2) syscall with the kern.arandom sysctl (via existing
arc4_sysctl wrapper).
Special care is taken to faithfully emulate EFAULT on NULL pointers, because
sysctl(3) as used by kern.arandom ignores NULL oldp. (This was caught by
getentropy(3) ATF tests.)
Reported by: kib
Reviewed by: kib
Discussed with: delphij
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14785
The general idea here is to provide userspace programs with well-defined
sources of entropy, in a fashion that doesn't require opening a new file
descriptor (ulimits) or accessing paths (/dev/urandom may be restricted
by chroot or capsicum).
getrandom(2) is the more general API, and comes from the Linux world.
Since our urandom and random devices are identical, the GRND_RANDOM flag
is ignored.
getentropy(3) is added as a compatibility shim for the OpenBSD API.
truss(1) support is included.
Tests for both system calls are provided. Coverage is believed to be at
least as comprehensive as LTP getrandom(2) test coverage. Additionally,
instructions for running the LTP tests directly against FreeBSD are provided
in the "Test Plan" section of the Differential revision linked below. (They
pass, of course.)
PR: 194204
Reported by: David CARLIER <david.carlier AT hardenedbsd.org>
Discussed with: cperciva, delphij, jhb, markj
Relnotes: maybe
Differential Revision: https://reviews.freebsd.org/D14500
userspace to control NUMA policy administratively and programmatically.
Implement domainset based iterators in the page layer.
Remove the now legacy numa_* syscalls.
Cleanup some header polution created by having seq.h in proc.h.
Reviewed by: markj, kib
Discussed with: alc
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13403
Private functions like __aio_read and _aio_read were exposed in
FBSDprivate_1.0 by r169090, even though they've never been used outside of
librt. Also, remove some weak references from r156136 that have never
resolved.
Reviewed by: kib
MFC after: 3 weeks
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D11649
This change implements NOTE_ABSTIME flag for EVFILT_TIMER, which
specifies that the data field contains absolute time to fire the
event.
To make this useful, data member of the struct kevent must be extended
to 64bit. Using the opportunity, I also added ext members. This
changes struct kevent almost to Apple struct kevent64, except I did
not changed type of ident and udata, the later would cause serious API
incompatibilities.
The type of ident was kept uintptr_t since EVFILT_AIO returns a
pointer in this field, and e.g. CHERI is sensitive to the type
(discussed with brooks, jhb).
Unlike Apple kevent64, symbol versioning allows us to claim ABI
compatibility and still name the new syscall kevent(2). Compat shims
are provided for both host native and compat32.
Requested by: bapt
Reviewed by: bapt, brooks, ngie (previous version)
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D11025
Extend the ino_t, dev_t, nlink_t types to 64-bit ints. Modify
struct dirent layout to add d_off, increase the size of d_fileno
to 64-bits, increase the size of d_namlen to 16-bits, and change
the required alignment. Increase struct statfs f_mntfromname[] and
f_mntonname[] array length MNAMELEN to 1024.
ABI breakage is mitigated by providing compatibility using versioned
symbols, ingenious use of the existing padding in structures, and
by employing other tricks. Unfortunately, not everything can be
fixed, especially outside the base system. For instance, third-party
APIs which pass struct stat around are broken in backward and
forward incompatible ways.
Kinfo sysctl MIBs ABI is changed in backward-compatible way, but
there is no general mechanism to handle other sysctl MIBS which
return structures where the layout has changed. It was considered
that the breakage is either in the management interfaces, where we
usually allow ABI slip, or is not important.
Struct xvnode changed layout, no compat shims are provided.
For struct xtty, dev_t tty device member was reduced to uint32_t.
It was decided that keeping ABI compat in this case is more useful
than reporting 64-bit dev_t, for the sake of pstat.
Update note: strictly follow the instructions in UPDATING. Build
and install the new kernel with COMPAT_FREEBSD11 option enabled,
then reboot, and only then install new world.
Credits: The 64-bit inode project, also known as ino64, started life
many years ago as a project by Gleb Kurtsou (gleb). Kirk McKusick
(mckusick) then picked up and updated the patch, and acted as a
flag-waver. Feedback, suggestions, and discussions were carried
by Ed Maste (emaste), John Baldwin (jhb), Jilles Tjoelker (jilles),
and Rick Macklem (rmacklem). Kris Moore (kris) performed an initial
ports investigation followed by an exp-run by Antoine Brodin (antoine).
Essential and all-embracing testing was done by Peter Holm (pho).
The heavy lifting of coordinating all these efforts and bringing the
project to completion were done by Konstantin Belousov (kib).
Sponsored by: The FreeBSD Foundation (emaste, kib)
Differential revision: https://reviews.freebsd.org/D10439
Add a clock_nanosleep() syscall, as specified by POSIX.
Make nanosleep() a wrapper around it.
Attach the clock_nanosleep test from NetBSD. Adjust it for the
FreeBSD behavior of updating rmtp only when interrupted by a signal.
I believe this to be POSIX-compliant, since POSIX mentions the rmtp
parameter only in the paragraph about EINTR. This is also what
Linux does. (NetBSD updates rmtp unconditionally.)
Copy the whole nanosleep.2 man page from NetBSD because it is complete
and closely resembles the POSIX description. Edit, polish, and reword it
a bit, being sure to keep any relevant text from the FreeBSD page.
Reviewed by: kib, ngie, jilles
MFC after: 3 weeks
Relnotes: yes
Sponsored by: Dell EMC
Differential Revision: https://reviews.freebsd.org/D10020
The syscall is a trivial wrapper around new VOP_FDATASYNC(), sharing
code with fsync(2). For all filesystems, this commit provides the
implementation which delegates the work of VOP_FDATASYNC() to
VOP_FSYNC(). This is functionally correct but not efficient.
This is not yet POSIX-compliant implementation, because it does not
ensure that queued AIO requests are completed before returning.
Reviewed by: mckusick
Discussed with: avg (ZFS), jhb (AIO part)
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D7471
These entries should have never been present since they only exist for
compat with FreeBSD 6.x (and older) binaries. This was missed in r296572.
Technically this breaks the ABI by removing versioned symbols. However,
no binaries should be linked against these symbols. No release has
shipped with a header that contained a prototype for these functions.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D5615
wraps sendmsg(2) and recvmsg(2) into batch send and receive operation.
The goal of this implementation is only to provide API compatibility
with Linux.
The cancellation behaviour of the functions is not quite right, but
due to relative rare use of cancellation it is considered acceptable
comparing with the complexity of the correct implementation. If
functions are reimplemented as syscalls, the fix would come almost
trivial. The direct use of the syscall trampolines instead of libc
wrappers for sendmsg(2) and recvmsg(2) is to avoid data loss on
cancellation.
Submitted by: Boris Astardzhiev <boris.astardzhiev@gmail.com>
Discussed with: jilles (cancellation behaviour)
MFC after: 1 month
The comment in the libc/sys symbol map referenced the generated symbols
for the syscall trampolines. Such comment was out of place in the secure
symbol map so remove the stale comment and attempt to clarify the old one
to avoid risks of confusion.
Pointed out by: kib
As part of the code refactoring to support FORTIFY_SOURCE we want
a new subdirectory "secure" to keep the files related to security.
Move the stack protector functions to this new directory.
No functional change.
Differential Review: https://reviews.freebsd.org/D3333
This is based on work done by jeff@ and jhb@, as well as the numa.diff
patch that has been circulating when someone asks for first-touch NUMA
on -10 or -11.
* Introduce a simple set of VM policy and iterator types.
* tie the policy types into the vm_phys path for now, mirroring how
the initial first-touch allocation work was enabled.
* add syscalls to control changing thread and process defaults.
* add a global NUMA VM domain policy.
* implement a simple cascade policy order - if a thread policy exists, use it;
if a process policy exists, use it; use the default policy.
* processes inherit policies from their parent processes, threads inherit
policies from their parent threads.
* add a simple tool (numactl) to query and modify default thread/process
policities.
* add documentation for the new syscalls, for numa and for numactl.
* re-enable first touch NUMA again by default, as now policies can be
set in a variety of methods.
This is only relevant for very specific workloads.
This doesn't pretend to be a final NUMA solution.
The previous defaults in -HEAD (with MAXMEMDOM set) can be achieved by
'sysctl vm.default_policy=rr'.
This is only relevant if MAXMEMDOM is set to something other than 1.
Ie, if you're using GENERIC or a modified kernel with non-NUMA, then
this is a glorified no-op for you.
Thank you to Norse Corp for giving me access to rather large
(for FreeBSD!) NUMA machines in order to develop and verify this.
Thank you to Dell for providing me with dual socket sandybridge
and westmere v3 hardware to do NUMA development with.
Thank you to Scott Long at Netflix for providing me with access
to the two-socket, four-domain haswell v3 hardware.
Thank you to Peter Holm for running the stress testing suite
against the NUMA branch during various stages of development!
Tested:
* MIPS (regression testing; non-NUMA)
* i386 (regression testing; non-NUMA GENERIC)
* amd64 (regression testing; non-NUMA GENERIC)
* westmere, 2 socket (thankyou norse!)
* sandy bridge, 2 socket (thankyou dell!)
* ivy bridge, 2 socket (thankyou norse!)
* westmere-EX, 4 socket / 1TB RAM (thankyou norse!)
* haswell, 2 socket (thankyou norse!)
* haswell v3, 2 socket (thankyou dell)
* haswell v3, 2x18 core (thankyou scott long / netflix!)
* Peter Holm ran a stress test suite on this work and found one
issue, but has not been able to verify it (it doesn't look NUMA
related, and he only saw it once over many testing runs.)
* I've tested bhyve instances running in fixed NUMA domains and cpusets;
all seems to work correctly.
Verified:
* intel-pcm - pcm-numa.x and pcm-memory.x, whilst selecting different
NUMA policies for processes under test.
Review:
This was reviewed through phabricator (https://reviews.freebsd.org/D2559)
as well as privately and via emails to freebsd-arch@. The git history
with specific attributes is available at https://github.com/erikarn/freebsd/
in the NUMA branch (https://github.com/erikarn/freebsd/compare/local/adrian_numa_policy).
This has been reviewed by a number of people (stas, rpaulo, kib, ngie,
wblock) but not achieved a clear consensus. My hope is that with further
exposure and testing more functionality can be implemented and evaluated.
Notes:
* The VM doesn't handle unbalanced domains very well, and if you have an overly
unbalanced memory setup whilst under high memory pressure, VM page allocation
may fail leading to a kernel panic. This was a problem in the past, but it's
much more easily triggered now with these tools.
* This work only controls the path through vm_phys; it doesn't yet strongly/predictably
affect contigmalloc, KVA placement, UMA, etc. So, driver placement of memory
isn't really guaranteed in any way. That's next on my plate.
Sponsored by: Norse Corp, Inc.; Dell
The core kernel part is patch file utimes.2008.4.diff from
pluknet@FreeBSD.org. I updated the code for API changes, added the manual
page and added compatibility code for old kernels. There is also audit and
Capsicum support.
A new UTIME_* constant might allow setting birthtimes in future.
Differential Revision: https://reviews.freebsd.org/D1426
Submitted by: pluknet (partially)
Reviewed by: delphij, pluknet, rwatson
Relnotes: yes
though GOT, by staticizing and hiding. Add setter for
__error_selector to hide it as well.
Suggested and reviewed by: jilles
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
(or loading a dso linked to libthr.so into process which was not
linked against threading library).
- Remove libthr interposers of the libc functions, including
__error(). Instead, functions calls are indirected through the
interposing table, similar to how pthread stubs in libc are already
done. Libc by default points either to syscall trampolines or to
existing libc implementations. On libthr load, libthr rewrites the
pointers to the cancellable implementations already in libthr. The
interposition table is separate from pthreads stubs indirection
table to not pull pthreads stubs into static binaries.
- Postpone the malloc(3) internal mutexes initialization until libthr
is loaded. This avoids recursion between calloc(3) and static
pthread_mutex_t initialization.
- Reinstall signal handlers with wrapper on libthr load. The
_rtld_is_dlopened(3) is used to avoid useless calls to sigaction(2)
when libthr is statically referenced from the main binary.
In the process, fix openat(2), swapcontext(2) and setcontext(2)
interposing. The libc symbols were exported at different versions
than libthr interposers. Export both libc and libthr versions from
libc now, with default set to the higher version from libthr.
Remove unused and disconnected swapcontext(3) userspace implementation
from libc/gen.
No objections from: deischen
Tested by: pho, antoine (exp-run) (previous versions)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
exhausted.
- Add a new protect(1) command that can be used to set or revoke protection
from arbitrary processes. Similar to ktrace it can apply a change to all
existing descendants of a process as well as future descendants.
- Add a new procctl(2) system call that provides a generic interface for
control operations on processes (as opposed to the debugger-specific
operations provided by ptrace(2)). procctl(2) uses a combination of
idtype_t and an id to identify the set of processes on which to operate
similar to wait6().
- Add a PROC_SPROTECT control operation to manage the protection status
of a set of processes. MADV_PROTECT still works for backwards
compatability.
- Add a p_flag2 to struct proc (and a corresponding ki_flag2 to kinfo_proc)
the first bit of which is used to track if P_PROTECT should be inherited
by new child processes.
Reviewed by: kib, jilles (earlier version)
Approved by: re (delphij)
MFC after: 1 month
in the future in a backward compatible (API and ABI) way.
The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides place for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.
The structure definition looks like this:
struct cap_rights {
uint64_t cr_rights[CAP_RIGHTS_VERSION + 2];
};
The initial CAP_RIGHTS_VERSION is 0.
The top two bits in the first element of the cr_rights[] array contain total
number of elements in the array - 2. This means if those two bits are equal to
0, we have 2 array elements.
The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain array index. Only one bit is
used and bit position in this five-bits range defines array index. This means
there can be at most five array elements in the future.
To define new right the CAPRIGHT() macro must be used. The macro takes two
arguments - an array index and a bit to set, eg.
#define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL)
We still support aliases that combine few rights, but the rights have to belong
to the same array element, eg:
#define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL)
#define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL)
#define CAP_FCHMODAT (CAP_FCHMOD | CAP_LOOKUP)
There is new API to manage the new cap_rights_t structure:
cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
void cap_rights_set(cap_rights_t *rights, ...);
void cap_rights_clear(cap_rights_t *rights, ...);
bool cap_rights_is_set(const cap_rights_t *rights, ...);
bool cap_rights_is_valid(const cap_rights_t *rights);
void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);
Capability rights to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions are provided by
separating them with commas, eg:
cap_rights_t rights;
cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);
There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:
#define cap_rights_set(rights, ...) \
__cap_rights_set((rights), __VA_ARGS__, 0ULL)
void __cap_rights_set(cap_rights_t *rights, ...);
Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:
cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);
Providing several rights that belongs to the same array's element this way is
correct, but is not advised. It should only be used for aliases definition.
This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.
Sponsored by: The FreeBSD Foundation
The pipe2() function is similar to pipe() but allows setting FD_CLOEXEC and
O_NONBLOCK (on both sides) as part of the function.
If p points to two writable ints, pipe2(p, 0) is equivalent to pipe(p).
If the pointer is not valid, behaviour differs: pipe2() writes into the
array from the kernel like socketpair() does, while pipe() writes into the
array from an architecture-specific assembler wrapper.
Reviewed by: kan, kib
The accept4() function, compared to accept(), allows setting the new file
descriptor atomically close-on-exec and explicitly controlling the
non-blocking status on the new socket. (Note that the latter point means
that accept() is not equivalent to any form of accept4().)
The linuxulator's accept4 implementation leaves a race window where the new
file descriptor is not close-on-exec because it calls sys_accept(). This
implementation leaves no such race window (by using falloc() flags). The
linuxulator could be fixed and simplified by using the new code.
Like accept(), accept4() is async-signal-safe, a cancellation point and
permitted in capability mode.
int bindat(int fd, int s, const struct sockaddr *addr, socklen_t addrlen);
int connectat(int fd, int s, const struct sockaddr *name, socklen_t namelen);
which allow to bind and connect respectively to a UNIX domain socket with a
path relative to the directory associated with the given file descriptor 'fd'.
- Add manual pages for the new syscalls.
- Make the new syscalls available for processes in capability mode sandbox.
- Add capability rights CAP_BINDAT and CAP_CONNECTAT that has to be present on
the directory descriptor for the syscalls to work.
- Update audit(4) to support those two new syscalls and to handle path
in sockaddr_un structure relative to the given directory descriptor.
- Update procstat(1) to recognize the new capability rights.
- Document the new capability rights in cap_rights_limit(2).
Sponsored by: The FreeBSD Foundation
Discussed with: rwatson, jilles, kib, des
- Capability is no longer separate descriptor type. Now every descriptor
has set of its own capability rights.
- The cap_new(2) system call is left, but it is no longer documented and
should not be used in new code.
- The new syscall cap_rights_limit(2) should be used instead of
cap_new(2), which limits capability rights of the given descriptor
without creating a new one.
- The cap_getrights(2) syscall is renamed to cap_rights_get(2).
- If CAP_IOCTL capability right is present we can further reduce allowed
ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed
ioctls can be retrived with cap_ioctls_get(2) syscall.
- If CAP_FCNTL capability right is present we can further reduce fcntls
that can be used with the new cap_fcntls_limit(2) syscall and retrive
them with cap_fcntls_get(2).
- To support ioctl and fcntl white-listing the filedesc structure was
heavly modified.
- The audit subsystem, kdump and procstat tools were updated to
recognize new syscalls.
- Capability rights were revised and eventhough I tried hard to provide
backward API and ABI compatibility there are some incompatible changes
that are described in detail below:
CAP_CREATE old behaviour:
- Allow for openat(2)+O_CREAT.
- Allow for linkat(2).
- Allow for symlinkat(2).
CAP_CREATE new behaviour:
- Allow for openat(2)+O_CREAT.
Added CAP_LINKAT:
- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
- Allow to be target for renameat(2).
Added CAP_SYMLINKAT:
- Allow for symlinkat(2).
Removed CAP_DELETE. Old behaviour:
- Allow for unlinkat(2) when removing non-directory object.
- Allow to be source for renameat(2).
Removed CAP_RMDIR. Old behaviour:
- Allow for unlinkat(2) when removing directory.
Added CAP_RENAMEAT:
- Required for source directory for the renameat(2) syscall.
Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
- Allow for unlinkat(2) on any object.
- Required if target of renameat(2) exists and will be removed by this
call.
Removed CAP_MAPEXEC.
CAP_MMAP old behaviour:
- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
PROT_WRITE.
CAP_MMAP new behaviour:
- Allow for mmap(2)+PROT_NONE.
Added CAP_MMAP_R:
- Allow for mmap(PROT_READ).
Added CAP_MMAP_W:
- Allow for mmap(PROT_WRITE).
Added CAP_MMAP_X:
- Allow for mmap(PROT_EXEC).
Added CAP_MMAP_RW:
- Allow for mmap(PROT_READ | PROT_WRITE).
Added CAP_MMAP_RX:
- Allow for mmap(PROT_READ | PROT_EXEC).
Added CAP_MMAP_WX:
- Allow for mmap(PROT_WRITE | PROT_EXEC).
Added CAP_MMAP_RWX:
- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).
Renamed CAP_MKDIR to CAP_MKDIRAT.
Renamed CAP_MKFIFO to CAP_MKFIFOAT.
Renamed CAP_MKNODE to CAP_MKNODEAT.
CAP_READ old behaviour:
- Allow pread(2).
- Disallow read(2), readv(2) (if there is no CAP_SEEK).
CAP_READ new behaviour:
- Allow read(2), readv(2).
- Disallow pread(2) (CAP_SEEK was also required).
CAP_WRITE old behaviour:
- Allow pwrite(2).
- Disallow write(2), writev(2) (if there is no CAP_SEEK).
CAP_WRITE new behaviour:
- Allow write(2), writev(2).
- Disallow pwrite(2) (CAP_SEEK was also required).
Added convinient defines:
#define CAP_PREAD (CAP_SEEK | CAP_READ)
#define CAP_PWRITE (CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_R (CAP_MMAP | CAP_SEEK | CAP_READ)
#define CAP_MMAP_W (CAP_MMAP | CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_X (CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
#define CAP_MMAP_RW (CAP_MMAP_R | CAP_MMAP_W)
#define CAP_MMAP_RX (CAP_MMAP_R | CAP_MMAP_X)
#define CAP_MMAP_WX (CAP_MMAP_W | CAP_MMAP_X)
#define CAP_MMAP_RWX (CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
#define CAP_RECV CAP_READ
#define CAP_SEND CAP_WRITE
#define CAP_SOCK_CLIENT \
(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
#define CAP_SOCK_SERVER \
(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
CAP_SETSOCKOPT | CAP_SHUTDOWN)
Added defines for backward API compatibility:
#define CAP_MAPEXEC CAP_MMAP_X
#define CAP_DELETE CAP_UNLINKAT
#define CAP_MKDIR CAP_MKDIRAT
#define CAP_RMDIR CAP_UNLINKAT
#define CAP_MKFIFO CAP_MKFIFOAT
#define CAP_MKNOD CAP_MKNODAT
#define CAP_SOCK_ALL (CAP_SOCK_CLIENT | CAP_SOCK_SERVER)
Sponsored by: The FreeBSD Foundation
Reviewed by: Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with: rwatson, benl, jonathan
ABI compatibility discussed with: kib
system call, which has a nice property - it never fails, so it is a bit
easier to use. If there is no support for capability mode in the kernel
the function will return false (not in a sandbox). If the kernel is compiled
with the support for capability mode, the function will return true or false
depending if the calling process is in the capability mode sandbox or not
respectively.
Sponsored by: The FreeBSD Foundation