- Add macros to allow preinitialization of cap_rights_t.
- Convert most commonly used code paths to use preinitialized cap_rights_t.
A 3.6% speedup in fstat was measured with this change.
Reported by: mjg
Reviewed by: oshogbo
Approved by: sbruno
MFC after: 1 month
x dup_before
+ dup_after
+------------------------------------------------------------+
| x + |
|x x x x ++ ++|
| |____AM___| |AM||
+------------------------------------------------------------+
N Min Max Median Avg Stddev
x 5 1.514954e+08 1.5230351e+08 1.5206157e+08 1.5199371e+08 341205.71
+ 5 1.5494336e+08 1.5519569e+08 1.5511982e+08 1.5508323e+08 96232.829
Difference at 95.0% confidence
3.08952e+06 +/- 365604
2.03266% +/- 0.245071%
(Student's t, pooled s = 250681)
Reported by: mjg@
MFC after: 1 week
1. check if P_ADVLOCK is already set and if so, don't lock to set it
(stolen from DragonFly)
2. when trying for fast path unlock, check that we are doing unlock
first instead of taking the interlock for no reason (e.g. if we want
to *lock*). whilere make it more likely that falling fast path will
not take the interlock either by checking for state
Note the code is severely pessimized both single- and multithreaded.
fget_cap() tries to do a cheaper snapshot of a file descriptor without
holding the file descriptor lock. This snapshot does not do a deep
copy of the ioctls capability array, but instead uses a different
return value to inform the caller to retry the copy with the lock
held. However, filecaps_copy() was returning 1 to indicate that a
retry was required, and fget_cap() was checking for 0 (actually
'!filecaps_copy()'). As a result, fget_cap() did not do a deep copy
of the ioctls array and just reused the original pointer. This cause
multiple file descriptor entries to think they owned the same pointer
and eventually resulted in duplicate frees.
The only code path that I'm aware of that triggers this is to create a
listen socket that has a restricted list of ioctls and then call
accept() which calls fget_cap() with a valid filecaps structure from
getsock_cap().
To fix, change the return value of filecaps_copy() to return true if
it succeeds in copying the caps and false if it fails because the lock
is required. I find this more intuitive than fixing the caller in
this case. While here, change the return type from 'int' to 'bool'.
Finally, make filecaps_copy() more robust in the failure case by not
copying any of the source filecaps structure over. This avoids the
possibility of leaking a pointer into a structure if a similar future
caller doesn't properly handle the return value from filecaps_copy()
at the expense of one more branch.
I also added a test case that panics before this change and now passes.
Reviewed by: kib
Discussed with: mjg (not a fan of the extra branch)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D15047
opt_compat.h is mentioned in nearly 180 files. In-progress network
driver compabibility improvements may add over 100 more so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.
Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c. A fake _COMPAT_LINUX option ensure opt_compat.h
is created on all architectures.
Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.
Reviewed by: kib, cem, jhb, jtl
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14941
pathconf(2) and fpathconf(2) both return a long. The kern_[f]pathconf()
functions now accept a pointer to a long value rather than modifying
td_retval directly. Instead, the system calls explicitly store the
returned long value in td_retval[0].
Requested by: bde
Reviewed by: kib
Sponsored by: Chelsio Communications
The method handles NAME_MAX and LINK_MAX explicitly. For all other
pathconf variables, the method passes the request down to the underlying
file descriptor. This requires splitting a kern_fpathconf() syscallsubr
routine out of sys_fpathconf(). Also, to avoid lock order reversals with
vnode locks, the fdescfs vnode is unlocked around the call to
kern_fpathconf(), but with the usecount of the vnode bumped.
MFC after: 1 month
Sponsored by: Chelsio Communications
Mainly focus on files that use BSD 3-Clause license.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
This makes ddb show files more descriptive and also adjusts the
whitespace to align the columns for non-32-bit architectures.
Reviewed by: cem (previous version), jhb
Approved by: markj (mentor)
Differential Revision: https://reviews.freebsd.org/D11061
inode number or link count for the ABI compat binaries.
Right now, and by default after the change, too large 64bit values are
silently truncated to 32 bits. Enabling the knob causes the system to
return EOVERFLOW for stat(2) family of compat syscalls when some
values cannot be completely represented by the old structures. For
getdirentries(2), knob skips the dirents which would cause non-trivial
truncation of d_ino.
EOVERFLOW error is specified by the X/Open 1996 LFS document
('Adding Support for Arbitrary File Sizes to the Single UNIX
Specification').
Based on the discussion with: bde
Sponsored by: The FreeBSD Foundation
Extend the ino_t, dev_t, nlink_t types to 64-bit ints. Modify
struct dirent layout to add d_off, increase the size of d_fileno
to 64-bits, increase the size of d_namlen to 16-bits, and change
the required alignment. Increase struct statfs f_mntfromname[] and
f_mntonname[] array length MNAMELEN to 1024.
ABI breakage is mitigated by providing compatibility using versioned
symbols, ingenious use of the existing padding in structures, and
by employing other tricks. Unfortunately, not everything can be
fixed, especially outside the base system. For instance, third-party
APIs which pass struct stat around are broken in backward and
forward incompatible ways.
Kinfo sysctl MIBs ABI is changed in backward-compatible way, but
there is no general mechanism to handle other sysctl MIBS which
return structures where the layout has changed. It was considered
that the breakage is either in the management interfaces, where we
usually allow ABI slip, or is not important.
Struct xvnode changed layout, no compat shims are provided.
For struct xtty, dev_t tty device member was reduced to uint32_t.
It was decided that keeping ABI compat in this case is more useful
than reporting 64-bit dev_t, for the sake of pstat.
Update note: strictly follow the instructions in UPDATING. Build
and install the new kernel with COMPAT_FREEBSD11 option enabled,
then reboot, and only then install new world.
Credits: The 64-bit inode project, also known as ino64, started life
many years ago as a project by Gleb Kurtsou (gleb). Kirk McKusick
(mckusick) then picked up and updated the patch, and acted as a
flag-waver. Feedback, suggestions, and discussions were carried
by Ed Maste (emaste), John Baldwin (jhb), Jilles Tjoelker (jilles),
and Rick Macklem (rmacklem). Kris Moore (kris) performed an initial
ports investigation followed by an exp-run by Antoine Brodin (antoine).
Essential and all-embracing testing was done by Peter Holm (pho).
The heavy lifting of coordinating all these efforts and bringing the
project to completion were done by Konstantin Belousov (kib).
Sponsored by: The FreeBSD Foundation (emaste, kib)
Differential revision: https://reviews.freebsd.org/D10439
This allows blind increment of relevant counters which under contention
is cheaper than inc-not-zero loops at least on amd64.
Use it in some of the places which are guaranteed to see already active
vnodes.
Reviewed by: kib (previous version)
always audit the file-descriptor number and vnode information for all
fnctl(2) commands, not just locking-related ones. This was likely an
oversight in the original adaptation of this code from XNU.
MFC after: 3 days
Sponsored by: DARPA, AFRL
If the kernel is not compiled with the CAPABILITIES kernel options
fget_unlocked doesn't return the sequence number so fd_modify will
always report modification, in that case we got infinity loop.
Reported by: br
Reviewed by: mjg
Tested by: br, def
fget_cap_locked returns a referenced file, but the fgetvp_rights does
not need it. Instead, due to the filedesc lock being held, it can
ref the vnode after the file was looked up.
Fix up fget_cap_locked to be consistent with other _locked helpers and not
ref the file.
This plugs a leak introduced in r306184.
Pointy hat to: mjg, oshogbo
read(2), write(2), dup(2), and mmap(2). This auditing is not
required by the Common Criteria (and hence was not being
performed), but is valuable in both contemporary live analysis
and forensic use cases.
MFC after: 3 days
Sponsored by: DARPA, AFRL
data (headers). Historically the size of the headers was not checked
against the socket buffer space. Application could easily overcommit the
socket buffer space.
With the new sendfile (r293439) the problem remained, but a KASSERT was
inserted that checked that amount of data written to the socket matches
its space. In case when size of headers is bigger that socket space,
KASSERT fires. Without INVARIANTS the new sendfile won't panic, but
would report incorrect amount of bytes sent.
o With this change, the headers copyin is moved down into the cycle, after
the sbspace() check. The uio size is trimmed by socket space there,
which fixes the overcommit problem and its consequences.
o The compatibility handling for FreeBSD 4 sendfile headers API is pushed
up the stack to syscall wrappers. This required a copy and paste of the
code, but in turn this allowed to remove extra stack carried parameter
from fo_sendfile_t, and embrace entire compat code into #ifdef. If in
future we got more fo_sendfile_t function, the copy and paste level would
even reduce.
Reviewed by: emax, gallatin, Maxim Dounin <mdounin mdounin.ru>
Tested by: Vitalij Satanivskij <satan ukr.net>
Sponsored by: Netflix
- Mark AIO system calls as STD and remove the helpers to dynamically
register them.
- Use COMPAT6 for the old system calls with the older sigevent instead of
an 'o' prefix.
- Simplify the POSIX configuration to note that AIO is always available.
- Handle AIO in the default VOP_PATHCONF instead of special casing it in
the pathconf() system call. fpathconf() is still hackish.
- Remove freebsd32_aio_cancel() as it just called the native one directly.
Reviewed by: kib
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D5589
The filedesc lock is only needed if ioctls caps are present, which is a
rare situation. This is a step towards reducing the scope of the filedesc
lock.
Coredump notes depend on being able to invoke dump routines twice; once
in a dry-run mode to get the size of the note, and another to actually
emit the note to the corefile.
When a note helper emits a different length section the second time
around than the length it requested the first time, the kernel produces
a corrupt coredump.
NT_PROCSTAT_FILES output length, when packing kinfo structs, is tied to
the length of filenames corresponding to vnodes in the process' fd table
via vn_fullpath. As vnodes may move around during dump, this is racy.
So:
- Detect badly behaved notes in putnote() and pad underfilled notes.
- Add a fail point, debug.fail_point.fill_kinfo_vnode__random_path to
exercise the NT_PROCSTAT_FILES corruption. It simply picks random
lengths to expand or truncate paths to in fo_fill_kinfo_vnode().
- Add a sysctl, kern.coredump_pack_fileinfo, to allow users to
disable kinfo packing for PROCSTAT_FILES notes. This should avoid
both FILES note corruption and truncation, even if filenames change,
at the cost of about 1 kiB in padding bloat per open fd. Document
the new sysctl in core.5.
- Fix note_procstat_files to self-limit in the 2nd pass. Since
sometimes this will result in a short write, pad up to our advertised
size. This addresses note corruption, at the risk of sometimes
truncating the last several fd info entries.
- Fix NT_PROCSTAT_FILES consumers libutil and libprocstat to grok the
zero padding.
With suggestions from: bjk, jhb, kib, wblock
Approved by: markj (mentor)
Relnotes: yes
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D3548
Originally it was added in order to prevent trashing of objects with
INVARIANTS enabled. The same effect is now provided with mere UMA_ZONE_NOFREE.
This reverts r286921.
Discussed with: kib
zero. The file_zone if no-free, but r284861 added trashing of the
freed memory. Most visible manifestation of the issue were 'memory
modified after free' panics for the file zone, triggered from
falloc_noinstall().
Add UMA_ZONE_ZINIT flag to turn off trashing. Mjg noted that it makes
sense to not trash freed memory for any non-free zone, which will be
done later.
Reported and tested by: pho
Discussed with: mjg
Sponsored by: The FreeBSD Foundation
falloc_noinstall() followed by finstall() allows you to create and
install file descriptors with custom capabilities. Add falloc_caps()
that can do both of these actions in one go.
This will be used by CloudABI to create pipes with custom capabilities.
Reviewed by: mjg
Summary:
In a runtime that is purely based on capability-based security, there is
a strong emphasis on how programs start their execution. We need to make
sure that we execute an new program with an exact set of file
descriptors, ensuring that credentials are not leaked into the process
accidentally.
Providing the right file descriptors is just half the problem. There
also needs to be a framework in place that gives meaning to these file
descriptors. How does a CloudABI mail server know which of the file
descriptors corresponds to the socket that receives incoming emails?
Furthermore, how will this mail server acquire its configuration
parameters, as it cannot open a configuration file from a global path on
disk?
CloudABI solves this problem by replacing traditional string command
line arguments by tree-like data structure consisting of scalars,
sequences and mappings (similar to YAML/JSON). In this structure, file
descriptors are treated as a first-class citizen. When calling exec(),
file descriptors are passed on to the new executable if and only if they
are referenced from this tree structure. See the cloudabi-run(1) man
page for more details and examples (sysutils/cloudabi-utils).
Fortunately, the kernel does not need to care about this tree structure
at all. The C library is responsible for serializing and deserializing,
but also for extracting the list of referenced file descriptors. The
system call only receives a copy of the serialized data and a layout of
what the new file descriptor table should look like:
int proc_exec(int execfd, const void *data, size_t datalen, const int *fds,
size_t fdslen);
This change introduces a set of fd*_remapped() functions:
- fdcopy_remapped() pulls a copy of a file descriptor table, remapping
all of the file descriptors according to the provided mapping table.
- fdinstall_remapped() replaces the file descriptor table of the process
by the copy created by fdcopy_remapped().
- fdescfree_remapped() frees the table in case we aborted before
fdinstall_remapped().
We then add a function exec_copyin_data_fds() that builds on top these
functions. It copies in the data and constructs a new remapped file
descriptor. This is used by cloudabi_sys_proc_exec().
Test Plan:
cloudabi-run(1) is capable of spawning processes successfully, providing
it data and file descriptors. procstat -f seems to confirm all is good.
Regular FreeBSD processes also work properly.
Reviewers: kib, mjg
Reviewed By: mjg
Subscribers: imp
Differential Revision: https://reviews.freebsd.org/D3079
Previously several places were doing it on its own, partially
incorrectly (e.g. without the filedesc locked) or even actively harmful
by populating jdir or assigning rootvnode without vrefing it.
Reviewed by: kib
- make mode enum start from 0 so that the assertion covers all cases [1]
- rename prefix _CLOEXEC flag with _FLAG
- postpone fhold on the old file descriptor, which eliminates the need to fdrop
in error cases.
- fixup FDDUP_FCNTL check missed in the previous commit
This removes 'fp == oldfde->fde_file' assertion which had little value. kern_dup
only calls fd-related functions which cannot drop the lock or a whole lot of
races would be introduced.
Noted by: kib [1]