exec_map_first_page() would unconditionally free an unbacked, invalid
page from the executable image. However, it is possible that the page
is wired, in which case it is incorrect to free the page, so check for
additional wirings first.
Reported by: syzkaller
Tested by: pho
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21767
Add an atomic shm rename operation, similar in spirit to a file
rename. Atomically unlink an shm from a source path and link it to a
destination path. If an existing shm is linked at the destination
path, unlink it as part of the same atomic operation. The caller needs
the same permissions as shm_unlink to the shm being renamed, and the
same permissions for the shm at the destination which is being
unlinked, if it exists. If those fail, EACCES is returned, as with the
other shm_* syscalls.
truss support is included; audit support will come later.
This commit includes only the implementation; the sysent-generated
bits will come in a follow-on commit.
Submitted by: Matthew Bryan <matthew.bryan@isilon.com>
Reviewed by: jilles (earlier revision)
Reviewed by: brueffer (manpages, earlier revision)
Relnotes: yes
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D21423
We have two ways to check if kenv variable exists - either we check return
value from TUNABLE_INT_FETCH, or we pre-initialize the variable and check
if this value did change. In terminal_init() it is more convinient to
use pre-initialized variables.
Problem was revealed by older loader.efi, which did not set teken.* variables.
Reported by: tuexen
Use of CPU_FFS() to implement CPUSET_FOREACH() allows to save up to ~0.5%
of CPU time on 72-thread SMT system doing 80K IOPS to NVMe from one thread.
MFC after: 1 month
Sponsored by: iXsystems, Inc.
I've noticed that I missed intr check at one more SCHED_AFFINITY(),
so instead of adding one more branching I prefer to remove few.
Profiler shows the function CPU time reduction from 0.24% to 0.16%.
MFC after: 1 month
Sponsored by: iXsystems, Inc.
When RFSPAWN is passed, rfork exhibits vfork(2) semantics but also resets
signal handlers in the child during creation to avoid a point of corruption
of parent state from the child.
This flag will be used by posix_spawn(3) to handle potential signal issues.
Reviewed by: jilles, kib
Differential Revision: https://reviews.freebsd.org/D19058
properly nested and warns about recursive entrances. Unlike with locks,
there is nothing fundamentally wrong with such use, the intent of tracer
is to help to review complex epoch-protected code paths, and we mean the
network stack here.
Reviewed by: hselasky
Sponsored by: Netflix
Pull Request: https://reviews.freebsd.org/D21610
shm_open2 allows a little more flexibility than the original shm_open.
shm_open2 doesn't enforce CLOEXEC on its callers, and it has a separate
shmflag argument that can be expanded later. Currently the only shmflag is
to allow file sealing on the returned fd.
shm_open and memfd_create will both be implemented in libc to use this new
syscall.
__FreeBSD_version is bumped to indicate the presence.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D21393
Now that flags may be set on posixshm, add an argument to kern_shm_open()
for the initial seals. To maintain past behavior where callers of
shm_open(2) are guaranteed to not have any seals applied to the fd they're
given, apply F_SEAL_SEAL for existing callers of kern_shm_open. A special
flag could be opened later for shm_open(2) to indicate that sealing should
be allowed.
We currently restrict initial seals to F_SEAL_SEAL. We cannot error out if
F_SEAL_SEAL is re-applied, as this would easily break shm_open() twice to a
shmfd that already existed. A note's been added about the assumptions we've
made here as a hint towards anyone wanting to allow other seals to be
applied at creation.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D21392
File sealing applies protections against certain actions
(currently: write, growth, shrink) at the inode level. New fileops are added
to accommodate seals - EINVAL is returned by fcntl(2) if they are not
implemented.
Reviewed by: markj, kib
Differential Revision: https://reviews.freebsd.org/D21391
Check for teken.fg_color and teken.bg_color and prepare the color
attributes accordingly.
When white background is used, make it light to improve visibility.
When black background is used, make kernel messages light.
Doing some tests with very high interrupt rates I've noticed that one of
conditions I added in r232207 to make interrupt threads in most cases
run on local CPU never worked as expected (worked only if previous time
it was executed on some other CPU, that is quite opposite). It caused
additional CPU usage to run full CPU search and could schedule interrupt
threads to some other CPU.
This patch removes that code and instead reuses existing non-interrupt
code path with some tweaks for interrupt case:
- On SMT systems, if current thread is idle, don't look on other threads.
Even if they are busy, it may take more time to do fill search and bounce
the interrupt thread to other core then execute it locally, even sharing
CPU resources. It is other threads should migrate, not bound interrupts.
- Try hard to keep interrupt threads within LLC of their original CPU.
This improves scheduling cost and supposedly cache and memory locality.
On a test system with 72 threads doing 2.2M IOPS to NVMe this saves few
percents of CPU time while adding few percents to IOPS.
MFC after: 1 month
Sponsored by: iXsystems, Inc.
is a completely separate TCP stack (tcp_bbr.ko) that will be built only if
you add the make options WITH_EXTRA_TCP_STACKS=1 and also include the option
TCPHPTS. You can also include the RATELIMIT option if you have a NIC interface that
supports hardware pacing, BBR understands how to use such a feature.
Note that this commit also adds in a general purpose time-filter which
allows you to have a min-filter or max-filter. A filter allows you to
have a low (or high) value for some period of time and degrade slowly
to another value has time passes. You can find out the details of
BBR by looking at the original paper at:
https://queue.acm.org/detail.cfm?id=3022184
or consult many other web resources you can find on the web
referenced by "BBR congestion control". It should be noted that
BBRv1 (which this is) does tend to unfairness in cases of small
buffered paths, and it will usually get less bandwidth in the case
of large BDP paths(when competing with new-reno or cubic flows). BBR
is still an active research area and we do plan on implementing V2
of BBR to see if it is an improvement over V1.
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D21582
- track the total count of hot entries
- pre-read the lock when shrinking since it is typically already taken
- place the lock in its own cacheline
- shorten the hold time of hot lock list when zapping
Sponsored by: The FreeBSD Foundation
This is required for DPCPU and VNET data variable definitions to work when
KLDs are linked as DSOs. R_X86_64_RELATIVE relocations should not appear
in object files, so assert this in elf_relocaddr().
Reviewed by: kib
MFC after: 1 month
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21755
The two options are
* nocover/cover: Prevent/allow mounting over an existing root mountpoint.
E.g., "mount -t ufs -o nocover /dev/sd1a /usr/local" will fail if /usr/local
is already a mountpoint.
* emptydir/noemptydir: Prevent/allow mounting on a non-empty directory.
E.g., "mount -t ufs -o emptydir /dev/sd1a /usr" will fail.
Neither of these options is intended to be a default, for historical and
compatibility reasons.
Reviewed by: allanjude, kib
Differential Revision: https://reviews.freebsd.org/D21458
by issuing delayed-writes (bdwrite()) until a non-sequential block
is written or the maximum cluster size is reached. At that point
it collects the delayed buffers together (using bread()) to write
them in a single operation. The assumption was that since we just
looked at them they will still be in memory so there is no need to
check for a read error from bread(). Very occationally (apparently
every 10-hours or so when being pounded by Peter Holm's tests)
this assumption is wrong.
The fix is to check for errors from bread() and fail the cluster
write thus falling back to the default individual flushing of any
still dirty buffers.
Reported by: Peter Holm and Chuck Silvers
Reviewed by: kib
MFC after: 3 days
This one is very hard to run into. If the filesystem is being unmounted or
the mount point is freed the vfs_op_thread_enter will fail. For it to
succeed the mount point itself would have to be reallocated in the time
window between the initial read and the attempt to enter.
Sponsored by: The FreeBSD Foundation
The breakage was added after all the testing and the testing which followed
was not sufficient to find it.
Reported by: pho
Sponsored by: The FreeBSD Foundation
There are 3 counters modified all the time in this structure - one for
keeping the structure alive, one for preventing unmount and one for
tracking active writers. Exact values of these counters are very rarely
needed, which makes them a prime candidate for conversion to a per-cpu
scheme, resulting in much better performance.
Sample benchmark performing fstatfs (modifying 2 out of 3 counters) on
a 104-way 2 socket Skylake system:
before: 852393 ops/s
after: 76682077 ops/s
Reviewed by: kib, jeff
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21637
New primitive is introduced to denote sections can operate locklessly
on aspects of struct mount, but which can also be disabled if necessary.
This provides an opportunity to start scaling common case modifications
while providing stable state of the struct when facing unmount, write
suspendion or other events.
mnt_ref is the first counter to start being managed in this manner with
the intent to make it per-cpu.
Reviewed by: kib, jeff
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21425
A future change to posixshm to add file sealing (in DIFF_21391[0] and child)
will move locking out of shm_dotruncate as kern_shm_open() will require the
lock to be held across the dotruncate until the seal is actually applied.
For this, the cookie is passed into shm_dotruncate_locked which asserts
RCA_WLOCKED.
[0] Name changed to protect the innocent, hopefully, from getting autoclosed
due to this reference...
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D21628
1. If we release the last usecount we take ownership of the hold count, which
means the vnode will remain allocated until we vdrop it.
2. If someone else vrefs they will find no usecount and will proceed to add
their own hold count.
3. No code has a problem with v_usecount transitioning to 0 without the
interlock
These facts combined mean we can fetchadd instead of having a cmpset loop.
Reviewed by: kib (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21528
cpuset_getroot() is guaranteed to return a non-NULL pointer.
Reported by: Mark Millard <marklmi@yahoo.com>
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Refcount waiting will set some flag bits in the refcount value.
Make sure these bits get cleared by using the REFCOUNT_COUNT()
macro to obtain the actual refcount.
Differential Revision: https://reviews.freebsd.org/D21620
Reviewed by: kib@, markj@
MFC after: 1 week
Sponsored by: Mellanox Technologies
As I like to forget: static kenv var formatting is actually such that an
empty environment would be double null bytes. We should make sure that a
non-zero buffer has at least enough for this, though most of the current
usage is with a 4k buffer.
Garbage in the passed-in buffer can cause problems if any attempts to read
the kenv are inadvertently made between init_static_kenv and the first
kern_setenv -- assuming there is one.
This is cheap and easy, so do it. This also helps rule out some class of
bugs as one tries to debug; tunables fetch from the static environment up
until SI_SUB_KMEM + 1, and many of these buffers are global ~4k buffers that
rely on BSS clearing while others just grab a page of free memory and use it
(e.g. xen).
Setting the B_INVALONERR flag before a synchronous write causes the buf
cache to forcibly invalidate contents if the write fails (BIO_ERROR).
This is intended to be used to allow layers above the buffer cache to make
more informed decisions about when discarding dirty buffers without
successful write is acceptable.
As a proof of concept, use in msdosfs to handle failures to mark the on-disk
'dirty' bit during rw mount or ro->rw update.
Extending this to other filesystems is left as future work.
PR: 210316
Reviewed by: kib (with objections)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D21539
Due to lock ordering issues (bucket lock held, vnode locks wanted) the code
starts with trylocking which in face of contention often fails. Prior to
the change it would loop back with a possible yield.
Instead note we know what locks are needed and can take them in the right
order, avoiding retries. Then we can safely re-lookup and see if the entry
we are looking for is still there.
On a 104-way box poudriere would result in constant retries during an 11h
run as seen in the vfs.cache.zap_and_exit_bucket_fail counter.
before: 408866592
after : 0
However, a new stat reports:
vfs.cache.zap_and_exit_bucket_relock_success: 32638
Note this is only a bandaid over current design issues.
Tested by: pho
Sponsored by: The FreeBSD Foundation
It used to be mp_ncpus * 64, but this gives unnecessarily big values for small
machines and at the same time constraints bigger ones. In particular this helps
on a 104-way box for which the count is now doubled.
While here make cache_purgevfs less likely. Currently it is not efficient in
face of contention due to lock ordering issues. These are fixable but not worth
it at the moment.
Sponsored by: The FreeBSD Foundation