Refcounting was added to combat a race between selfdfree and doselwakup,
but it adds avoidable overhead.
selfdfree detects it can free the object by ->sf_si == NULL, thus we can
ensure that the condition only holds after all accesses are completed.
On a sample box vmstat -z shows:
ITEM SIZE LIMIT USED FREE REQ
64: 64, 0, 1043784, 4367538,3698187229
selfd: 64, 0, 1520, 13726,182729008
But at the same time:
vm.uma.selfd.keg.domain.1.pages: 121
vm.uma.selfd.keg.domain.0.pages: 121
Thus 242 pages got pulled even though the malloc zone would likely accomodate
the load without using extra memory.
It keeps recalculated way more often than it is needed.
Provide a routine (fdlastfile) to get it if necessary.
Consumers may be better off with a bitmap iterator instead.
Hide debug print showing use of sign extended ioctl command argument
under INVARIANTS. The print is available to all and can easily fill
up the logs.
No functional change intended.
MFC after: 1 week
Sponsored by: Mellanox Technologies
This opens the door for other descriptor types to implement
posix_fallocate(2) as needed.
Reviewed by: kib, bcr (manpages)
Differential Revision: https://reviews.freebsd.org/D23042
vn_write already checks for vnode type to see if bwillwrite should be called.
This effectively reverts r244643.
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21905
fget_unlocked() and fhold().
On sufficiently large machine, f_count can be legitimately very large,
e.g. malicious code can dup same fd up to the per-process
filedescriptors limit, and then fork as much as it can.
On some smaller machine, I see
kern.maxfilesperproc: 939132
kern.maxprocperuid: 34203
which already overflows u_int. More, the malicious code can create
transient references by sending fds over unix sockets.
I realized that this check is missed after reading
https://secfault-security.com/blog/FreeBSD-SA-1902.fd.html
Reviewed by: markj (previous version), mjg
Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D20947
The comment isn't stale. The check is bogus in the sense that poll(2)
does not require pollfd entries to be unique in fd space, so there is no
reason there cannot be more pollfd entries than open or even allowed
fds. The check is mostly a seatbelt against accidental misuse or
abuse. FD_SETSIZE, while usually unrelated to poll, is used as an
arbitrary floor for systems with very low kern.maxfilesperproc.
Additionally, document this possible EINVAL condition in the poll.2
manual.
No functional change.
Reviewed by: markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D17671
Some of the poll code used 'fds' and some used 'ufds' to refer to the
uap->fds userspace pointer that was passed around to subroutines. Some of
the poll code used 'fds' to refer to the kernel memory pollfd arrays, which
seemed unnecessarily confusing.
Unify on 'ufds' to refer to the userspace pollfd array.
Additionally, 'bits' is not an accurate description of the kernel pollfd
array in kern_poll, so rename that to 'kfds'. Finally, clean up some logic
with mallocarray() and nitems().
No functional change.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D17670
- Add macros to allow preinitialization of cap_rights_t.
- Convert most commonly used code paths to use preinitialized cap_rights_t.
A 3.6% speedup in fstat was measured with this change.
Reported by: mjg
Reviewed by: oshogbo
Approved by: sbruno
MFC after: 1 month
opt_compat.h is mentioned in nearly 180 files. In-progress network
driver compabibility improvements may add over 100 more so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.
Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c. A fake _COMPAT_LINUX option ensure opt_compat.h
is created on all architectures.
Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.
Reviewed by: kib, cem, jhb, jtl
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14941
A correct destruction is important for WITNESS(4) and LOCK_PROFILING(9).
Submitted by: Sebastian Huber <sebastian.huber@embedded-brains.de>
MFC after: 1 week
Mainly focus on files that use BSD 3-Clause license.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
of their sys_*() counterparts. The svr4 is left unchanged.
Reviewed by: kib@
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D9379
read(2), write(2), dup(2), and mmap(2). This auditing is not
required by the Common Criteria (and hence was not being
performed), but is valuable in both contemporary live analysis
and forensic use cases.
MFC after: 3 days
Sponsored by: DARPA, AFRL
Previously, freebsd32 binaries could submit read/write requests with lengths
greater than INT_MAX that a native kernel would have rejected.
Reviewed by: kib
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D5788
- Set td_errno so that ktrace and dtrace can obtain the syscall error
number in the usual way.
- Pass negative error numbers directly to the syscall layer, as they're
not intended to be returned to userland.
Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D5425
loop finds the selfd entry and clears its sf_si pointer, which is
handled by selfdfree() in parallel, NULL sf_si makes selfdfree() free
the memory. The result is the race and accesses to the freed memory.
Refcount the selfd ownership. One reference is for the sf_link
linkage, which is unconditionally dereferenced by selfdfree().
Another reference is for sf_threads, both selfdfree() and
doselwakeup() race to deref it, the winner unlinks and than frees the
selfd entry.
Reported by: Larry Rosenman <ler@lerctr.org>
Tested by: Larry Rosenman <ler@lerctr.org>, pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
pwrite(2) syscalls are wrapped to provide compatibility with pre-7.x
kernels which required padding before the off_t parameter. The
fcntl(2) contains compatibility code to handle kernels before the
struct flock was changed during the 8.x CURRENT development. The
shims were reasonable to allow easier revert to the older kernel at
that time.
Now, two or three major releases later, shims do not serve any
purpose. Such old kernels cannot handle current libc, so revert the
compatibility code.
Make padded syscalls support conditional under the COMPAT6 config
option. For COMPAT32, the syscalls were under COMPAT6 already.
Remove WITHOUT_SYSCALL_COMPAT build option, which only purpose was to
(partially) disable the removed shims.
Reviewed by: jhb, imp (previous versions)
Discussed with: peter
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Introduce fget_fcntl which performs appropriate checks when needed.
This removes a branch from fget_unlocked.
Introduce fget_mmap dealing with cap_rights_to_vmprot conversion.
This removes a branch from _fget.
Modify fget_unlocked to pass sequence counter to interested callers so
that they can perform their own checks and make sure the result was
otained from stable & current state.
Reviewed by: silence on -hackers
especially for platforms where unaligned access is not allowed. Make
it possible to override the small buffer size.
A simple continuous read string test using libusb showed a reduction
in CPU usage from roughly 10% to less than 1% using a dual-core GHz
CPU, when the malloc() operation was skipped for small buffers.
MFC after: 2 weeks
fd_lastfile is guaranteed to be the biggest open fd, so when the intent
is to iterate over active fds or lookup one, there is no point in looking
beyond that limit.
Few places are left unpatched for now.
MFC after: 1 week
SBT_MAX, to make it more robust in case internal type representation will
change in the future. All the consumers were migrated to SBT_MAX and
every new consumer (if any) should from now use this interface.
Requested by: bapt, jmg, Ryan Lortie (implictly)
Reviewed by: mav, bde
further refinement is required as some device drivers intended to be
portable over FreeBSD versions rely on __FreeBSD_version to decide whether
to include capability.h.
MFC after: 3 weeks
a very hard time to fully understand) with much more intuitive rights:
CAP_EVENT - when set on descriptor, the descriptor can be monitored
with syscalls like select(2), poll(2), kevent(2).
CAP_KQUEUE_EVENT - When set on a kqueue descriptor, the kevent(2)
syscall can be called on this kqueue to with the eventlist
argument set to non-NULL value; in other words the given
kqueue descriptor can be used to monitor other descriptors.
CAP_KQUEUE_CHANGE - When set on a kqueue descriptor, the kevent(2)
syscall can be called on this kqueue to with the changelist
argument set to non-NULL value; in other words it allows to
modify events monitored with the given kqueue descriptor.
Add alias CAP_KQUEUE, which allows for both CAP_KQUEUE_EVENT and
CAP_KQUEUE_CHANGE.
Add backward compatibility define CAP_POLL_EVENT which is equal to CAP_EVENT.
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
devfs_iosize_max_clamp sysctl, which allows/disables SSIZE_MAX-sized
i/o requests on the devfs files.
Sponsored by: The FreeBSD Foundation
Reminded by: Dmitry Sivachenko <trtrmitya@gmail.com>
MFC after: 1 week
in the future in a backward compatible (API and ABI) way.
The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides place for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.
The structure definition looks like this:
struct cap_rights {
uint64_t cr_rights[CAP_RIGHTS_VERSION + 2];
};
The initial CAP_RIGHTS_VERSION is 0.
The top two bits in the first element of the cr_rights[] array contain total
number of elements in the array - 2. This means if those two bits are equal to
0, we have 2 array elements.
The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain array index. Only one bit is
used and bit position in this five-bits range defines array index. This means
there can be at most five array elements in the future.
To define new right the CAPRIGHT() macro must be used. The macro takes two
arguments - an array index and a bit to set, eg.
#define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL)
We still support aliases that combine few rights, but the rights have to belong
to the same array element, eg:
#define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL)
#define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL)
#define CAP_FCHMODAT (CAP_FCHMOD | CAP_LOOKUP)
There is new API to manage the new cap_rights_t structure:
cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
void cap_rights_set(cap_rights_t *rights, ...);
void cap_rights_clear(cap_rights_t *rights, ...);
bool cap_rights_is_set(const cap_rights_t *rights, ...);
bool cap_rights_is_valid(const cap_rights_t *rights);
void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);
Capability rights to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions are provided by
separating them with commas, eg:
cap_rights_t rights;
cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);
There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:
#define cap_rights_set(rights, ...) \
__cap_rights_set((rights), __VA_ARGS__, 0ULL)
void __cap_rights_set(cap_rights_t *rights, ...);
Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:
cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);
Providing several rights that belongs to the same array's element this way is
correct, but is not advised. It should only be used for aliases definition.
This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.
Sponsored by: The FreeBSD Foundation
- Reconnect with some minor modifications, in particular now selsocket()
internals are adapted to use sbintime units after recent'ish calloutng
switch.
Fix kern_select() and sys_poll() so that they can handle sub-tick
precision for timeouts (in the same fashion it was done for nanosleep()
in r247797).
Sponsored by: Google Summer of Code 2012, iXsystems inc.
Tested by: flo, marius, ian, markj, Fabian Keil