Commit Graph

17441 Commits

Author SHA1 Message Date
Mark Johnston
569a186d02 Remove UNP_NASCENT, reverting r303855.
unp_connectat() no longer holds the link lock across calls to
sonewconn(), so the recursion described in r303855 can no longer occur.
No functional change intended.

Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2020-03-20 16:17:54 +00:00
Mark Johnston
429537caeb kern_dup(): Call filecaps_free_prep() in a write section.
filecaps_free_prep() bzeros the capabilities structure and we need to be
careful to synchronize with unlocked readers, which expect a consistent
rights structure.

Reviewed by:	kib, mjg
Reported by:	syzbot+5f30b507f91ddedded21@syzkaller.appspotmail.com
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D24120
2020-03-19 15:40:05 +00:00
Mark Johnston
2d896b816b Enter a write sequence when updating rights.
The Capsicum system calls modify file descriptor table entries.  To
ensure that readers observe a consistent snapshot of descriptor writes,
the system calls need to signal to unlocked readers that an update is
pending.

Note that ioctl rights are always checked with the descriptor table lock
held, so it is not strictly necessary to signal unlocked readers.
However, we probably want to enable lockless ioctl checks eventually, so
use seqc_write_begin() in kern_cap_ioctls_limit() too.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D24119
2020-03-19 15:39:45 +00:00
Brandon Bergren
3069380898 [PowerPC][Book-E] Fix missing load base in elf_cpu_parse_dynamic().
When I implemented MD DYNAMIC parsing, I was originally passing a
linker_file_t so that the MD code could relocate pointers.

However, it turns out this isn't even filled in until later, so it was
always 0.

Just pass the load base (ef->address) directly, as that's really the only
thing we were interested in in the first place.

This fixes a crash on RB800 where it was trying to write to an unmapped
address when updating the GOT.

Reviewed by:	jhibbits
Sponsored by:	Tag1 Consulting, Inc.
Differential Revision:	https://reviews.freebsd.org/D24105
2020-03-18 02:58:18 +00:00
Conrad Meyer
34086d5bda Implement sysctl kern.boot_id
Boot IDs are random, opaque 128-bit identifiers that distinguish distinct
system boots.  A new ID is generated each time the system boots.  Unlike
kern.boottime, the value is not modified by NTP adjustments.  It remains fixed
until the machine is restarted.

PR:		244867
Reported by:	Ricardo Fraile <rfraile AT rfraile.eu>
MFC after:	I do not intend to, but feel free
2020-03-17 22:27:16 +00:00
Conrad Meyer
a99c321802 Remove misleading / redundant bzero in callout_callwheel_init
The intent seems to be zeroing all of the cc_cpu array, or its singleton on
such platforms.  The assumption made is that the BSP is always zero.  The
code smell was introduced in r326218, which changed the prior explicit zero
to 'curcpu'.  The change is only valid if curcpu continues to be zero,
contrary to the aim expressed in that commit message.

So, more succinctly, the expression could be: memset(cc_cpu,0,sizeof(cc_cpu)).

However, there's no point.  cc_cpu lives in the data section and has a zero
initial value already.  So this revision just removes the problematic
statement.

No functional change.  Appeases a (false positive, ish) Coverity CID.

CID:		1383567
Reported by:	Puneeth Jothaiah <puneethkumar.jothaia AT dell.com>
Reviewed by:	kib
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D24089
2020-03-16 22:25:25 +00:00
Bjoern A. Zeeb
1b786d0191 kern_jail: missing \0 termination check on osrelease parameter
If a user spplies a non-\0 terminated osrelease parameter reading it back
may disclose kernel memory.
This is a problem in case of nested jails (children.max > 0, which is not
the default).  Otherwise root outside the jail has access to kernel memory
by other means and root inside a jail cannot create a child jail.

Add the proper \0 check at the end of a supplied osrelease parameter and
make sure any copies of the field will be \0-terminated.

Submitted by:	Hans Christian Woithe (chwoithe yahoo.com)
MFC after:	3 days
2020-03-14 14:04:55 +00:00
Michael Tuexen
db4493f7b6 sendfile() does currently not support SCTP sockets.
Therefore, fail the call.

Reviewed by:		markj@
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D24059
2020-03-13 18:38:28 +00:00
Conrad Meyer
7a119578a4 kern_shutdown: Add missing EKCD ifdef
Submitted by:	Puneeth Jothaiah <puneethkumar.jothaia AT dell.com>
Reviewed by:	bdrewery
Sponsored by:	Dell EMC Isilon
2020-03-12 21:26:36 +00:00
Konstantin Belousov
00ebd80972 Fix signal delivery might be on sigfastblock clearing.
When clearing sigfastblock, either by sigfastblock(UNSETPTR) call or
implicitly on execve(2), kernel must check for pending signals and
reschedule them if needed.

E.g. on execve, all other threads are terminated, and current thread
fast block pointer is cleaned.  If any signal was left pending, it can
now be delivered to the current thread, and we should prepare for
ast() on return to userspace to notice the signals.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
2020-03-10 20:25:03 +00:00
Konstantin Belousov
0bc52b0bdb Return reschedule_signals() to being static again.
It was used after sigfastblock_setpend() call in in ast() when current
thread fast-blocks signals.  Add a flag to sigfastblock_setpend() to
request reschedule, and remove the direct use of the function from
subr_trap.c

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2020-03-10 20:04:38 +00:00
Konstantin Belousov
2d3c083fd7 pipe: explain why not deallocating inode number is fine.
Suggested and reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24009
2020-03-09 23:40:25 +00:00
Konstantin Belousov
c6d3d601c9 Preallocate pipe buffers on pipe creation.
Return ENOMEM if one of the buffer cannot be created even with the
minimal size.  This should avoid subsequent spurious ENOMEM errors
from write(2) when buffer cannot be allocated on the fly, after we
reported that the pipe was create succesfully.

Reported by:	Keno Fischer <keno@juliacomputing.com>
Reviewed by:	markj (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D23993
2020-03-09 21:55:26 +00:00
Konstantin Belousov
1213de28f8 Style.
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D23993
2020-03-09 19:46:28 +00:00
Andrew Gallatin
98085bae8c make lacp's use_numa hashing aware of send tags
When I did the use_numa support, I missed the fact that there is
a separate hash function for send tag nic selection. So when
use_numa is enabled, ktls offload does not work properly, as it
does not reliably allocate a send tag on the proper egress nic
since different egress nics are selected for send-tag allocation
and packet transmit. To fix this, this change:

- refectors lacp_select_tx_port_by_hash() and
     lacp_select_tx_port() to make lacp_select_tx_port_by_hash()
     always called by lacp_select_tx_port()

-   pre-shifts flowids to convert them to hashes when calling lacp_select_tx_port_by_hash()

-   adds a numa_domain field to if_snd_tag_alloc_params

-   plumbs the numa domain into places where we allocate send tags

In testing with NIC TLS setup on a NUMA machine, I see thousands
of output errors before the change when enabling
kern.ipc.tls.ifnet.permitted=1. After the change, I see no
errors, and I see the NIC sysctl counters showing active TLS
offload sessions.

Reviewed by:	rrs, hselasky, jhb
Sponsored by:	Netflix
2020-03-09 13:44:51 +00:00
Mateusz Guzik
d2222aa0e9 fd: use smr for managing struct pwd
This has a side effect of eliminating filedesc slock/sunlock during path
lookup, which in turn removes contention vs concurrent modifications to the fd
table.

Reviewed by:	markj, kib
Differential Revision:	https://reviews.freebsd.org/D23889
2020-03-08 00:23:36 +00:00
Mark Johnston
d869a17e62 Use COUNTER_U64_DEFINE_EARLY() in places where it simplifies things.
Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D23978
2020-03-06 19:10:00 +00:00
Mark Johnston
fffcb56f7a Add COUNTER_U64_SYSINIT() and COUNTER_U64_DEFINE_EARLY().
The aim is to reduce the boilerplate needed today to define and
initialize global counters.  Also add SI_SUB_COUNTER to the sysinit
ordering.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D23977
2020-03-06 19:09:01 +00:00
Chuck Silvers
f15ccf8836 Add a new "mntfs" pseudo file system which provides private device vnodes for
file systems to safely access their disk devices, and adapt FFS to use it.
Also add a new BO_NOBUFS flag to allow enforcing that file systems using
mntfs vnodes do not accidentally use the original devfs vnode to create buffers.

Reviewed by:	kib, mckusick
Approved by:	imp (mentor)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D23787
2020-03-06 18:41:37 +00:00
Konstantin Belousov
695e0701a0 buffer pager: deref ucred immediately after read.
Ucred is passed to bread(9) so that non-local filesystems use proper
credentials.  But, since clean buffer might be cached unless
buf_pager_relbuf is not enabled, it makes credentials to have extra
reference until buffer is reclaimed.  Ucred reference would prevent
jail from destroying if creds are jailed.

Dereferencing the read credentials on the valid buffer avoid that, and
should be fine because the buffer is valid and does not need re-read.

PR:	238032
Reported by:	bz
Reproduced and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D23775
2020-03-05 15:52:34 +00:00
Mateusz Guzik
8d4d271e92 execve: use LOCKSHARED when looking up the interpreter
Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D23956
2020-03-04 19:52:34 +00:00
Chuck Silvers
37bf88e790 if vm_pager_get_pages_async() returns an error, release the sfio->nios
refcount that we took earlier that represents the I/O that ended up
not being started.

Reviewed by:	glebius
Approved by:	imp (mentor)
Sponsored by:	Netflix
2020-03-04 00:22:50 +00:00
Bjoern A. Zeeb
a2fba2a700 upic_ktrls: make RSS compile again here
The results of ktls_get_cpu() are stored in u_int and NETISR_CPUID_NONE
requires u_int.  Adjust uint16_t to uint_t in order to make RSS kernels
compile some more again.

HPTS still has to be fixed, which is a bit more complicated.

Reviewed by:	jhb, gallatin, rrs
Differential Revision:	https://reviews.freebsd.org/D23726
2020-03-03 14:07:44 +00:00
Mark Johnston
4cf919edb9 Fix the malloc type used in sys_shm_unlink() after r354808.
PR:		244563
Reported by:	swills
2020-03-03 00:28:37 +00:00
Pawel Biernacki
b05ca4290c sys/: Document few more sysctls.
Submitted by:	Antranig Vartanian <antranigv@freebsd.am>
Reviewed by:	kaktus
Commented by:	jhb
Approved by:	kib (mentor)
Sponsored by:	illuria security
Differential Revision:	https://reviews.freebsd.org/D23759
2020-03-02 15:30:52 +00:00
Mateusz Guzik
2f423bce54 vfs: stop taking additional refs on root vnode during lookup
They are spurious since introduction of struct pwd, which provides them
implicitly.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D23885
2020-03-01 21:54:28 +00:00
Mateusz Guzik
8d03b99b9d fd: move vnodes out of filedesc into a dedicated structure
The new structure is copy-on-write. With the assumption that path lookups are
significantly more frequent than chdirs and chrooting this is a win.

This provides stable root and jail root vnodes without the need to reference
them on lookup, which in turn means less work on globally shared structures.
Note this also happens to fix a bug where jail vnode was never referenced,
meaning subsequent access on lookup could run into use-after-free.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D23884
2020-03-01 21:53:46 +00:00
Mateusz Guzik
8243063f9b fd: make fgetvp_rights work without the filedesc lock
Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D23883
2020-03-01 21:50:13 +00:00
Mark Johnston
5aa5420ff2 Ensure that arm64 thread structures are allocated from the direct map.
Otherwise we can fail to handle translation faults on curthread, leading
to a panic.

Reviewed by:	alc, rlibby
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D23895
2020-02-29 18:41:48 +00:00
Jeff Roberson
6be21eb778 Provide a lock free alternative to resolve bogus pages. This is not likely
to be much of a perf win, just a nice code simplification.

Reviewed by:	markj, kib
Differential Revision:	https://reviews.freebsd.org/D23866
2020-02-28 21:42:48 +00:00
Jeff Roberson
7aaf252c96 Convert a few triviail consumers to the new unlocked grab API.
Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D23847
2020-02-28 20:34:30 +00:00
Jeff Roberson
f72eaaeb03 Use unlocked grab for uipc_shm/tmpfs.
Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D23865
2020-02-28 20:33:28 +00:00
Mark Johnston
46994ec2b1 Fix standalone builds of systrace.ko after r357912.
Sponsored by:	The FreeBSD Foundation
2020-02-28 17:05:04 +00:00
Mark Johnston
c99d0c5801 Add a blocking counter KPI.
refcount(9) was recently extended to support waiting on a refcount to
drop to zero, as this was needed for a lockless VM object
paging-in-progress counter.  However, this adds overhead to all uses of
refcount(9) and doesn't really match traditional refcounting semantics:
once a counter has dropped to zero, the protected object may be freed at
any point and it is not safe to dereference the counter.

This change removes that extension and instead adds a new set of KPIs,
blockcount_*, for use by VM object PIP and busy.

Reviewed by:	jeff, kib, mjg
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D23723
2020-02-28 16:05:18 +00:00
Jeff Roberson
561af25fa7 Simplify lazy advance with a 64bit atomic cmpset.
This provides the potential to force a lazy (tick based) SMR to advance
when there are blocking waiters by decoupling the wr_seq value from the
ticks value.

Add some missing compiler barriers.

Reviewed by:	rlibby
Differential Revision:	https://reviews.freebsd.org/D23825
2020-02-27 19:05:26 +00:00
Warner Losh
729ea680be Remove trailing white space. 2020-02-26 16:22:28 +00:00
Pawel Biernacki
7029da5c36 Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE.  All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT

Approved by:	kib (mentor, blanket)
Commented by:	kib, gallatin, melifaro
Differential Revision:	https://reviews.freebsd.org/D23718
2020-02-26 14:26:36 +00:00
Gleb Smirnoff
6bc27f086a Generalize resources freeing in sendfile with different scenarios.
Now we execute sendfile_iodone() in all possible cases, which
guarantees that vm_object_pip_wakeup() is called and sfio structure
is freed.

At the beginning of sendfile initialize sfio->m to NULL, that would
indicate that the mbuf chain either doesn't exist, or belongs to the
syscall (not to I/O completion).  Fill sfio->m only at a point when
we are positive that there are I/Os ongoing and before releasing
syscall's reference on sfio.

In sendfile_iodone() perform vm_object_pip_wakeup() once last
reference is released, then check for sfio->m.  NULL pointer
indicates that we need only to free the memory.

Reviewed by:	jtl, gallatin
2020-02-25 19:29:05 +00:00
Gleb Smirnoff
f85e1a806b Make ktls_frame() never fail. Caller must supply correct mbufs.
This makes sendfile code a bit simplier.
2020-02-25 19:26:40 +00:00
Gleb Smirnoff
69302907d6 When sendfile_swapin() sweeps through pages in search for a bogus page
skip first and last pages.  This is a micro optimisation.
2020-02-25 19:11:20 +00:00
Ryan Libby
fe20aaec0a sys/kern: quiet -Wwrite-strings
Quiet a variety of Wwrite-strings warnings in sys/kern at low-impact
sites.  This patch avoids addressing certain others which would need to
plumb const through structure definitions.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D23798
2020-02-23 03:32:16 +00:00
Ryan Libby
2782c00c04 vfs: quiet -Wwrite-strings
Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D23797
2020-02-23 03:32:11 +00:00
Ryan Libby
eaa17d4291 sys/vm: quiet -Wwrite-strings
Discussed with:	kib
Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D23796
2020-02-23 03:32:04 +00:00
Konstantin Belousov
04869b812b Add td_pflags2, yet another thread-private flags word.
There is no more free bits in td_pflags.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2020-02-22 20:43:04 +00:00
Jeff Roberson
226dd6db47 Add an atomic-free tick moderated lazy update variant of SMR.
This enables very cheap read sections with free-to-use latencies and memory
overhead similar to epoch.  On a recent AMD platform a read section cost
1ns vs 5ns for the default SMR.  On Xeon the numbers should be more like 1
ns vs 11.  The memory consumption should be proportional to the product
of the free rate and 2*1/hz while normal SMR consumption is proportional
to the product of free rate and maximum read section time.

While here refactor the code to make future additions more
straightforward.

Name the overall technique Global Unbound Sequences (GUS) and adjust some
comments accordingly.  This helps distinguish discussions of the general
technique (SMR) vs this specific implementation (GUS).

Discussed with:	rlibby, markj
2020-02-22 03:44:10 +00:00
Mateusz Guzik
721a81c369 vfs: stop duplicating vnode work in audit during path lookup
Duplicating the work was putting an avoidable requirement that the filedesc
lock is held across the entire operation (otherwise by the time audit reads
vnode pointers another thread in the same process can chdir somewhere else,
making audit log things using different vnode than the one which will be
used for actual lookup).

Do the obvious thing and pass down vnodes which will be used.
2020-02-21 01:44:31 +00:00
Eric van Gyzen
3cd1f28e4a clamp kernel dump compression level when using gzip
If the configured compression level for kernel dumps
it outside the supported range, clamp it to the closest
supported level.  Previously, dumpon would fail.

zstd already does this internally, so the compressor
needs no change.

Reviewed by:	cem markj
MFC after:	2 weeks
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D23765
2020-02-20 23:53:48 +00:00
Konstantin Belousov
74cb9a5333 Fix a bug in r358168, do not call sigfastblock_setpend() under a mutex.
PR:	244250
Reported and tested by:	lwhsu
Sponsored by:	The FreeBSD Foundation
2020-02-20 21:25:12 +00:00
Mateusz Guzik
65cdfb4caa make sysent for r358172 ("vfs: add realpathat syscall") 2020-02-20 16:58:57 +00:00
Mateusz Guzik
0573d0a9b8 vfs: add realpathat syscall
realpath(3) is used a lot e.g., by clang and is a major source of getcwd
and fstatat calls. This can be done more efficiently in the kernel.

This works by performing a regular lookup while saving the name and found
parent directory. If the terminal vnode is a directory we can resolve it using
usual means. Otherwise we can use the name saved by lookup and resolve the
parent.

See the review for sample syscall counts.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D23574
2020-02-20 16:58:19 +00:00
Konstantin Belousov
a113b17f10 Do not read sigfastblock word on syscall entry.
On machines with SMAP, fueword executes two serializing instructions
which can be seen in microbenchmarks.

As a measure to restore microbenchmark numbers, only read the word on
the attempt to deliver signal in ast().  If the word is set, signal is
not delivered and word is kept, preventing interruption of
interruptible sleeps by signals until userspace calls
sigfastblock(UNBLOCK) which clears the word.

This way, the spurious EINTR that userspace can see while in critical
section is on first interruptible sleep, if a signal is pending, and
on signal posting.  It is believed that it is not important for rtld
and lbithr critical sections.  It might be visible for the application
code e.g. for the callback of dl_iterate_phdr(3), but again the belief
is that the non-compliance is acceptable.  Most important is that the
retry of the sleeping syscall does not interrupt unless additional
signal is posted.

For now I added the knob kern.sigfastblock_fetch_always to enable the
word read on syscall entry to be able to diagnose possible issues due
to spurious EINTR.

While there, do some code restructuting to have all sigfastblock()
handling located in kern_sig.c.

Reviewed by:	jeff
Discussed with:	mjg
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D23622
2020-02-20 15:34:02 +00:00
Jeff Roberson
6c5f36ff30 Eliminate some unnecessary uses of UMA_ZONE_VM. Only zones involved in
virtual address or physical page allocation need to be marked with this
flag.

Reviewed by:	markj
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D23712
2020-02-19 08:17:27 +00:00
Mateusz Guzik
d8a84f08e8 refcount: update comments about fencing when releasing counts after r357989
Requested by:	kib
Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D23719
2020-02-16 18:20:09 +00:00
Mateusz Guzik
3403d5245e vfs: fix vlrureclaim ->v_object access
The routine was checking for ->v_type == VBAD. Since vgone drops the interlock
early sets this type at the end of the process of dooming a vnode, this opens
a time window where it can clear the pointer while the inerlock-holders is
accessing it.

Another note is that the code was:
	   (vp->v_object != NULL &&
	   vp->v_object->resident_page_count > trigger)

With the compiler being fully allowed to emit another read to get the pointer,
and in fact it did on the kernel used by pho.

Use atomic_load_ptr and remember the result.

Note that this depends on type-safety of vm_object.

Reported by:	pho
2020-02-16 03:33:34 +00:00
Mateusz Guzik
c615009461 vfs: check early for VCHR in vput_final to short-circuit in the common case
Otherwise the compiler inlines v_decr_devcount which keps getting jumped over
in the common case of not dealing with a device.
2020-02-16 03:16:28 +00:00
Matt Macy
45035becfe Add zfree to zero allocation before free
Key and cookie management typically wants to
avoid information leaks by explicitly zeroing
before free. This routine simplifies that by
permitting consumers to do so without carrying
the size around.

Reviewed by:	jeff@, jhb@
MFC after:	1 week
Sponsored by:	Rubicon Communications, LLC (Netgate)
Differential Revision:	https://reviews.freebsd.org/D22790
2020-02-16 00:12:53 +00:00
Konstantin Belousov
a7b61c0af1 sem_remove(): fix the loop that compacts sem array on semaphores removal.
As written now, it copies random kernel memory from beyond the bounds
of the array.

Reported and tested by:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation (kib)
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D23694
2020-02-15 23:19:23 +00:00
Konstantin Belousov
4cb6ea7e8e sem_remove(): add some asserts.
Assert that sema[idx] allocation from sem[] is sane.
Also assert that sem_mtx is owned, it protects the SEM_ALLOC flag.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation (kib)
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D23694
2020-02-15 23:18:02 +00:00
Konstantin Belousov
8095050846 Use designated initializers for seminfo.
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation (kib)
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D23694
2020-02-15 23:15:42 +00:00
Pawel Biernacki
e0d69c5a88 Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (1 of many)
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked). Use it in
preparation for a general review of all nodes.
This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Reviewed by:	kib, trasz
Approved by:	kib (mentor)
Differential Revision:	https://reviews.freebsd.org/D23640
2020-02-15 18:48:38 +00:00
Mateusz Guzik
074ad60a4c vfs: make write suspension mandatory
At the time opt-in was introduced adding yourself as a writer was esrializing
across the mount point. Nowadays it is fully per-cpu, the only impact being
a small single-threaded hit on top of what's there right now.

Vast majority of the overhead stems from the call to VOP_GETWRITEMOUNT which
has is done regardless.

Should someone want to microoptimize this single-threaded they can coalesce
looking the mount up with adding a write to it.
2020-02-15 13:00:39 +00:00
Mateusz Guzik
eb40664d83 capsicum: use new helpers 2020-02-15 01:30:27 +00:00
Mateusz Guzik
445faddf7f kqueue: use new capsicum helpers 2020-02-15 01:30:13 +00:00
Mateusz Guzik
32a86c44ee fd: use new capsicum helpers 2020-02-15 01:28:55 +00:00
Mateusz Guzik
e126c5a3e8 vfs: use new capsicum helpers 2020-02-15 01:28:42 +00:00
Konstantin Belousov
6cf2362e2c Consolidate read code for timecounters and fix possible overflow in
bintime()/binuptime().

The algorithm to read the consistent snapshot of current timehand is
repeated in each accessor, including the details proper rollup
detection and synchronization with the writer.  In fact there are only
two different kind of readers: one for bintime()/binuptime() which has
to do the in-place calculation, and another kind which fetches some
member from struct timehand.

Extract the logic into type-checked macros, GETTHBINTIME() for bintime
calculation, and GETTHMEMBER() for safe read of a structure' member.
This way, the synchronization is only written in bintime_off() and
getthmember().

In bintime_off(), use overflow-safe calculation of th_scale *
delta(timecounter).  In tc_windup, pre-calculate the min delta value
which overflows and require slow algorithm, into the new timehands
th_large_delta member.

This part with overflow fix was written by Bruce Evans.

Reported by:	Mark Millard <marklmi@yahoo.com> (the overflow issue)
Tested by:	pho
Discussed with:	emaste
Sponsored by:	The FreeBSD Foundation (kib)
MFC after:	3 weeks
2020-02-14 23:27:45 +00:00
Mateusz Guzik
df0d5a2a85 vfs: remove no longer needed atomic_load_ptr casts 2020-02-14 23:18:32 +00:00
Mateusz Guzik
8f86349f8b fd: remove no longer needed atomic_load_ptr casts 2020-02-14 23:18:22 +00:00
Mateusz Guzik
5bc6a91f54 kcov: remove no longer needed atomic_load_ptr casts 2020-02-14 23:18:03 +00:00
Mateusz Guzik
2f7292437d Merge audit and systrace checks
This further shortens the syscall routine by not having to re-check after
the system call.
2020-02-14 13:09:41 +00:00
Mateusz Guzik
0e84a878c0 Annotate branches in the syscall path
This in particular significantly shortens amd64_syscall, which otherwise
keeps jumping forward over 2KB of code in total.

Note some of these branches should be either eliminated altogether or
coalesced.
2020-02-14 13:08:46 +00:00
Mateusz Guzik
ba8dd40bb1 lockmgr: add a change missed in r357907 2020-02-14 11:56:50 +00:00
Mateusz Guzik
6ed30ea4c0 fd: annotate finstall with prediction branches 2020-02-14 11:22:12 +00:00
Mateusz Guzik
c1b57fa7d3 lockmgr: rename lock_fast_path to lock_flags
The routine is not much of a fast path and the flags name better describes
its purpose.
2020-02-14 11:21:28 +00:00
Mateusz Guzik
943c4932f3 lockmgr: retire the unused lockmgr_unlock_fast_path routine 2020-02-14 11:20:25 +00:00
Kyle Evans
0f5f49eff7 u_char -> vm_prot_t in a couple of places, NFC
The latter is a typedef of the former; the typedef exists and these bits are
representing vmprot values, so use the correct type.

Submitted by:	sigsys@gmail.com
MFC after:	3 days
2020-02-14 02:22:08 +00:00
Mateusz Guzik
6ebab6bad2 vfs: use mac fastpath for lookup, open, read, write, mmap 2020-02-13 22:22:55 +00:00
Mateusz Guzik
7b2ff0dcb2 Partially decompose priv_check by adding priv_check_cred_vfs_generation
During buildkernel there are very frequent calls to priv_check and they
all are for PRIV_VFS_GENERATION (coming from stat/fstat).

This results in branching on several potential privileges checking if
perhaps that's the one which has to be evaluated.

Instead of the kitchen-sink approach provide a way to have commonly used
privs directly evaluated.
2020-02-13 22:22:15 +00:00
Mateusz Guzik
e6081fe899 Inline jailed().
It is constantly called from priv_check.
2020-02-13 22:16:30 +00:00
Mateusz Guzik
8bdcfb10d3 Annotate suser_enabled as __read_mostly
It is read a lot in priv code.
2020-02-13 22:16:02 +00:00
Jeff Roberson
1f2a6b8501 Since r357804 pcpu zones are required to use zalloc_pcpu(). Prior to this
it was only required if you were zeroing.  Switch to these interfaces.

Reviewed by:	mjg
2020-02-13 21:10:17 +00:00
Jeff Roberson
a4d50e49da Add more precise SMR entry asserts. 2020-02-13 20:50:21 +00:00
Kyle Evans
b30ab6d8fe sys/kern sysent: re-add dependency on capabilities.conf
r356868 inadvertently removed this, so changes to capabilities.conf were no
longer considered for being outdated.
2020-02-12 19:06:34 +00:00
Ed Maste
fe16bad415 regen sysent after r357831, r357838
Capability mode changes allowing fdatasync and getloginclass.

Sponsored by:	The FreeBSD Foundation
2020-02-12 19:05:10 +00:00
Ed Maste
e953765f15 Allow getloginclass in capability mode
As with e.g. getgroups and getlogin it allows querying current process
credential state.

Reported by:	sigsys@gmail.com via kevans
Sponsored by:	The FreeBSD Foundation
2020-02-12 18:59:00 +00:00
Ed Maste
9cdfb2d69a Allow fdatasync in capability mode
fdatasync is essentially a subset of fsync (and may be exactly fsync,
depending on filesystem and development effort) and operates only on
a provided fd.

MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2020-02-12 17:12:26 +00:00
Mateusz Guzik
4602214772 vfs: refactor vputx and add more comment
Reviewed by:	jeff (previous version)
Tested by:	pho (previous version)
Differential Revision:	https://reviews.freebsd.org/D23530
2020-02-12 11:19:07 +00:00
Mateusz Guzik
ed67a63c39 vfs: drop remaining zpcpu casts 2020-02-12 11:18:12 +00:00
Mateusz Guzik
123c519731 vfs: switch to smp_rendezvous_cpus_retry for vfs_op_thread_enter/exit
In particular on amd64 this eliminates an atomic op in the common case,
trading it for IPIs in the uncommon case of catching CPUs executing the
code while the filesystem is getting suspended or unmounted.
2020-02-12 11:17:45 +00:00
Mateusz Guzik
00ac9d2632 rms: use smp_rendezvous_cpus_retry instead of a hand-rolled variant 2020-02-12 11:17:18 +00:00
Mateusz Guzik
e4f584971b Add smp_rendezvous_cpus_retry
This is a wrapper around smp_rendezvous_cpus which enables use of IPI
handlers which can fail and require retrying.

wait_func argument is added to to provide a routine which can be used to
poll CPU of interest for when the IPI can be retried.

Handlers which succeed must call smp_rendezvous_cpus_done to denote that
fact.

Discussed with:	 jeff
Differential Revision:	https://reviews.freebsd.org/D23582
2020-02-12 11:16:55 +00:00
Mateusz Guzik
3acb6572fc Store offset into zpcpu allocations in the per-cpu area.
This shorten zpcpu_get and allows more optimizations.

Reviewed by:	jeff
Differential Revision:	https://reviews.freebsd.org/D23570
2020-02-12 11:11:22 +00:00
Mateusz Guzik
48baf00f54 epoch: convert zpcpu_get_cpua(.., curcpu) to zpcpu_get 2020-02-12 11:10:10 +00:00
Gleb Smirnoff
4426b2e64b Add flag to struct task to mark the task as requiring network epoch.
When processing a taskqueue and a task has associated epoch, then
enter for duration of the task.  If consecutive tasks belong to the
same epoch, batch them.  Now we are talking about the network epoch
only.

Shrink the ta_priority size to 8-bits.  No current consumers use
a priority that won't fit into 8 bits.  Also complexity of
taskqueue_enqueue() is a square of maximum value of priority, so
we unlikely ever want to go over UCHAR_MAX here.

Reviewed by:	hselasky
Differential Revision:	https://reviews.freebsd.org/D23518
2020-02-11 18:48:07 +00:00
Mateusz Guzik
57349a4f41 vfs: fix vhold race in mnt_vnode_next_lazy_relock
vdrop can set the hold count to 0 and wait for the ->mnt_listmtx held by
mnt_vnode_next_lazy_relock caller. The routine incorrectly asserted the
count has to be > 0.

Reported by:	pho
Tested by:	pho
2020-02-11 18:19:56 +00:00
Mateusz Guzik
1b853b62f3 capsicum: restore the cap_rights_contains symbol
It is expected to be provided by libc.

PR:		244033
Reported by:	 Jan Kokemueller
2020-02-11 18:13:53 +00:00
Mateusz Guzik
2e57c8fde7 vfs: fix device count leak on vrele racing with vgone
The race is:

CPU1                                CPU2
                                    devfs_reclaim_vchr
make v_usecount 0
                                      VI_LOCK
                                      sees v_usecount == 0, no updates
                                      vp->v_rdev = NULL;
                                      ...
                                      VI_UNLOCK
VI_LOCK
v_decr_devcount
  sees v_rdev == NULL, no updates

In this scenario si_devcount decrement is not performed.

Note this can only happen if the vnode lock is not held.

Reviewed by:	kib
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D23529
2020-02-10 22:28:54 +00:00
Li-Wen Hsu
37d4ece7c5 Restore the behavior of allowing empty string in a string sysctl
Added as a special case to avoid unnecessary memory operations.

Reviewed by:	delphij
Sponsored by:	The FreeBSD Foundation
2020-02-10 20:53:59 +00:00
Hans Petter Selasky
f912e8f2ff Fix for unbalanced EPOCH(9) usage in the generic kernel interrupt
handler.

Interrupt handlers are removed via intr_event_execute_handlers() when
IH_DEAD is set. The thread removing the interrupt is woken up, and
calls intr_event_update(). When this happens, the ie_hflags are
cleared and re-built from all the remaining handlers sharing the
event. When the last IH_NET handler is removed, the IH_NET flag will
be cleared from ih_hflags (or ie_hflags may still be being rebuilt in
a different context), and the ithread_execute_handlers() may return
with ie_hflags missing IH_NET. This can lead to a scenario where
IH_NET was present before calling ithread_execute_handlers, and is not
present at its return, meaning the need for epoch must be cached
locally.

This can happen when loading and unloading network drivers. Also make
sure the ie_hflags is not cleared before being updated.

This is a regression issue after r357004.

Backtrace:
panic()
# trying to access epoch tracker on stack of dead thread
_epoch_enter_preempt()
ifunit_ref()
ifioctl()
fo_ioctl()
kern_ioctl()
sys_ioctl()
syscallenter()
amd64_syscall()

Differential Revision:	https://reviews.freebsd.org/D23483
Reviewed by:	glebius@, gallatin@, mav@, jeff@ and kib@
Sponsored by:	Mellanox Technologies
2020-02-10 20:23:08 +00:00
Mateusz Guzik
cd951a0d8e vfs: fix lock recursion in vrele
vrele is supposed to be called with an unlocked vnode, but this was never
asserted for if v_usecount was > 0. For such counts the lock is never touched
by the routine. As a result the kernel has several consumers which expect
vunref semantics and get away with calling vrele since they happen to never do
it when this is the last reference (and for some of them this may happen to be
a guarantee).

Work around the problem by changing vrele semantics to tolerate being called
with a lock. This eliminates a possible bug where the lock is already held and
vputx takes it anyway.

Reviewed by:	kib
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D23528
2020-02-10 13:54:34 +00:00
Konstantin Belousov
48fcb46311 Add sysctl kern.proc.sigfastblk for reporting sigfastblock word address.
Tested by:	pho
Disscussed with:	cem, emaste, jilles
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D12773
2020-02-09 12:29:51 +00:00
Konstantin Belousov
944cf37bb5 Add AT_BSDFLAGS auxv entry.
The intent is to provide bsd-specific flags relevant to interpreter
and C runtime.  I did not want to reuse AT_FLAGS which is common ELF
auxv entry.

Use bsdflags to report kernel support for sigfastblock(2).  This
allows rtld and libthr to safely infer the syscall presence without
SIGSYS.  The tunable kern.elf{32,64}.sigfastblock blocks reporting.

Tested by:	pho
Disscussed with:	cem, emaste, jilles
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D12773
2020-02-09 12:10:37 +00:00
Konstantin Belousov
f88c67a625 Regen. 2020-02-09 11:53:37 +00:00
Konstantin Belousov
146fc63fce Add a way to manage thread signal mask using shared word, instead of syscall.
A new syscall sigfastblock(2) is added which registers a uint32_t
variable as containing the count of blocks for signal delivery.  Its
content is read by kernel on each syscall entry and on AST processing,
non-zero count of blocks is interpreted same as the signal mask
blocking all signals.

The biggest downside of the feature that I see is that memory
corruption that affects the registered fast sigblock location, would
cause quite strange application misbehavior. For instance, the process
would be immune to ^C (but killable by SIGKILL).

With consumers (rtld and libthr added), benchmarks do not show a
slow-down of the syscalls in micro-measurements, and macro benchmarks
like buildworld do not demonstrate a difference. Part of the reason is
that buildworld time is dominated by compiler, and clang already links
to libthr. On the other hand, small utilities typically used by shell
scripts have the total number of syscalls cut by half.

The syscall is not exported from the stable libc version namespace on
purpose.  It is intended to be used only by our C runtime
implementation internals.

Tested by:	pho
Disscussed with:	cem, emaste, jilles
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D12773
2020-02-09 11:53:12 +00:00
Mateusz Guzik
2f7f11b7de vfs: tidy up vget_finish and vn_lock
- remove assertion which duplicates vn_lock
- use VNPASS instead of retyping the failure
- report what flags were passed if panicking on them
2020-02-08 15:52:20 +00:00
Mateusz Guzik
3eb6b656c2 vfs: remove now useless ENODEV handling from vn_fullpath consumers
Noted by:	ngie
2020-02-08 15:51:08 +00:00
Konstantin Belousov
300b525d29 Correct the function name in the comment.
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2020-02-08 15:06:06 +00:00
Mateusz Guzik
ea77ce6ef9 rms: use newly added zpcpu routines instead of direct access where appropriate 2020-02-07 22:44:41 +00:00
Jeff Roberson
a40068e524 Fix a race in smr_advance() that could result in unnecessary poll calls.
This was relatively harmless but surprising to see in counters.  The
race occurred when rd_seq was read after the goal was updated and we
incorrectly calculated the delta between them.

Reviewed by:	rlibby
Differential Revision:	https://reviews.freebsd.org/D23464
2020-02-06 20:51:46 +00:00
Jeff Roberson
8d7f16a5db Add some global counters for SMR. These may eventually become per-smr
counters.  In my stress test there is only one poll for every 15,000
frees.  This means we are effectively amortizing the cache coherency
overhead even with very high write rates (3M/s/core).

Reviewed by:	markj, rlibby
Differential Revision:	https://reviews.freebsd.org/D23463
2020-02-06 20:10:21 +00:00
Pawel Biernacki
210176ad76 sysctl(9): add CTLFLAG_NEEDGIANT flag
Add CTLFLAG_NEEDGIANT flag (modelled after D_NEEDGIANT) that will be used to
mark sysctls that still require locking Giant.

Rewrite sysctl_handle_string() to use internal locking instead of locking
Giant.

Mark SYSCTL_STRING, SYSCTL_OPAQUE and their variants as MPSAFE.

Add infrastructure support for enforcing proper use of CTLFLAG_NEEDGIANT
and CTLFLAG_MPSAFE flags with SYSCTL_PROC and SYSCTL_NODE, not enabled yet.

Reviewed by:	kib (mentor)
Approved by:	kib (mentor)
Differential Revision:	https://reviews.freebsd.org/D23378
2020-02-06 12:45:58 +00:00
Mark Johnston
d3631aa582 Avoid releasing object PIP in vn_sendfile() if no pages were grabbed.
sendfile(2) optionally takes a set of headers that get prepended to the
file data.  If the request length is less than that of the headers,
sendfile may not allocate an sfio structure, in which case its pointer
is null and we should be careful not to dereference.  This was
introduced in r356902.

Reported by:	syzkaller
Sponsored by:	The FreeBSD Foundation
2020-02-05 16:09:21 +00:00
Leandro Lupori
eb5a41cf2f Add SYSCTL to get KERNBASE and relocated KERNBASE
This change adds 2 new SYSCTLs, to retrieve the original and relocated KERNBASE
values. This provides an easy, architecture independent way to calculate the
running kernel displacement (current/load address minus original base address).

The initial goal for this change is to add a new libkvm function that returns
the kernel displacement, both for live kernels and crashdumps. This would in
turn be used by kgdb to find out how to relocate kernel symbols (if needed).

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D23284
2020-02-05 11:34:10 +00:00
Mateusz Guzik
1a9fe4528b fd: always nullify *fdp in fget* routines
Some consumers depend on the pointer being NULL if an error is returned.

The guarantee got broken in r357469.

Reported by:	https://syzkaller.appspot.com/bug?extid=0c9b05e2b727aae21eef
Noted by:	markj
2020-02-05 00:20:26 +00:00
Ryan Libby
10c8fb47d9 uma: convert mbuf_jumbo_alloc to UMA_ZONE_CONTIG & tag others
Remove mbuf_jumbo_alloc and let large mbuf zones use the new uma default
contig allocator (a copy of mbuf_jumbo_alloc).  Tag other zones which
require contiguous objects, even if they don't use the new default
contig allocator, so that uma knows about their constraints.

Reviewed by:	jeff, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D23238
2020-02-04 22:40:23 +00:00
Konstantin Belousov
0783b70974 Remove unneeded assert for curproc. Simplify.
Reported by:	syzkaller by markj
Sponsored by:	The FreeBSD Foundation
2020-02-04 21:02:08 +00:00
Mark Johnston
60185d649b Correct the malloc tag used when freeing the temporary semop(2) buffer.
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2020-02-04 20:00:45 +00:00
Dmitry Chagin
cbc1089190 For code reuse in Linuxulator rename get_proccess_cputime()
and get_thread_cputime() and add prototypes for it to <sys/syscallsubr.h>.

As both functions become a public interface add process lock assert
to ensure that the process is not exiting under it.

Fix whitespace nit while here.

Reviewed by:		kib
Differential Revision:	https://reviews.freebsd.org/D23340
MFC after		2 weeks
2020-02-04 05:25:51 +00:00
Jeff Roberson
bc6509845d Implement a deferred write advancement feature that can be used to further
amortize shared cacheline writes.

Discussed with: rlibby
Differential Revision:	https://reviews.freebsd.org/D23462
2020-02-04 02:44:52 +00:00
Jeff Roberson
c8ea36e881 Fix a recursion on the thread lock by acquiring it after call rtp_to_pri().
Reported by:	swills
Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D23495
2020-02-04 02:42:54 +00:00
Mark Johnston
e489450589 Fix the !SMP case in sched_add() after r355779.
If the thread's lock is already that of the runqueue, don't recurse on
the queue lock.

Reviewed by:	jeff, kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D23492
2020-02-03 22:49:05 +00:00
Mateusz Guzik
8151b6e92a fd: partially unengrish the previous commit 2020-02-03 22:34:50 +00:00
Mateusz Guzik
e10f063b30 fd: streamline fget_unlocked
clang has the unfortunate property of paying little attention to prediction
hints when faced with a loop spanning the majority of the rotuine.

In particular fget_unlocked has an unlikely corner case where it starts almost
from scratch. Faced with this clang generates a maze of taken jumps, whereas
gcc produces jump-free code (in the expected case).

Work around the problem by providing a variant which only tries once and
resorts to calling the original code if anything goes wrong.

While here note that the 'seq' parameter is almost never passed, thus the
seldom users are redirected to call it directly.
2020-02-03 22:32:49 +00:00
Mateusz Guzik
52604ed792 fd: remove the seq argument from fget_unlocked
It is almost always NULL.
2020-02-03 22:27:55 +00:00
Mateusz Guzik
7f1566f884 fd: remove the seq argument from fget routines
It is almost always NULL.
2020-02-03 22:27:03 +00:00
Mateusz Guzik
0a1427c5ab ktrace: provide ktrstat_error
This eliminates a branch from its consumers trading it for an extra call
if ktrace is enabled for curthread. Given that this is almost never true,
the tradeoff is worth it.
2020-02-03 22:26:00 +00:00
Gleb Smirnoff
0017b2adac Couple protocol drain routines (frag6_drain and sctp_drain) may send
packets.  An unexpected behaviour for memory reclamation routine.
Anyway, we need enter the network epoch for doing that.
2020-02-03 20:48:57 +00:00
Kyle Evans
3d62f685d5 namei: preserve errors from fget_cap_locked
Most notably, we want to make sure we don't clobber any capabilities-related
errors. This is a regression from r357412 (O_SEARCH) that was picked up by
the capsicum tests.

PR:		243839
Reviewed by:	kib (committed form recommended by)
Tested by:	lwhsu
Differential Revision:	https://reviews.freebsd.org/D23479
2020-02-03 18:59:07 +00:00
Warner Losh
58aa35d429 Remove sparc64 kernel support
Remove all sparc64 specific files
Remove all sparc64 ifdefs
Removee indireeect sparc64 ifdefs
2020-02-03 17:35:11 +00:00
Mateusz Guzik
bcd1cf4f03 capsicum: faster cap_rights_contains
Instead of doing a 2 iteration loop (determined at runeimt), take advantage
of the fact that the size is already known.

While here provdie cap_check_inline so that fget_unlocked does not have to
do a function call.

Verified with the capsicum suite /usr/tests.
2020-02-03 17:08:11 +00:00
Mateusz Guzik
fee204544e fd: fix f_count acquire in fget_unlocked
The code was using a hand-rolled fcmpset loop, while in other places the same
count is manipulated with the refcount API.

This transferred from a stylistic issue into a bug after the API got extended
to support flags. As a result the hand-rolled loop could bump the count high
enough to set the bit flag. Another bump + refcount_release would then free
the file prematurely.

The bug is only present in -CURRENT.
2020-02-03 14:28:31 +00:00
Mateusz Guzik
f1fa1ba3d0 Fix up various vnode-related asserts which did not dump the used vnode 2020-02-03 14:25:32 +00:00
Kyle Evans
6a5abb1ee5 Provide O_SEARCH
O_SEARCH is defined by POSIX [0] to open a directory for searching, skipping
permissions checks on the directory itself after the initial open(). This is
close to the semantics we've historically applied for O_EXEC on a directory,
which is UB according to POSIX. Conveniently, O_SEARCH on a file is also
explicitly undefined behavior according to POSIX, so O_EXEC would be a fine
choice. The spec goes on to state that O_SEARCH and O_EXEC need not be
distinct values, but they're not defined to be the same value.

This was pointed out as an incompatibility with other systems that had made
its way into libarchive, which had assumed that O_EXEC was an alias for
O_SEARCH.

This defines compatibility O_SEARCH/FSEARCH (equivalent to O_EXEC and FEXEC
respectively) and expands our UB for O_EXEC on a directory. O_EXEC on a
directory is checked in vn_open_vnode already, so for completeness we add a
NOEXECCHECK when O_SEARCH has been specified on the top-level fd and do not
re-check that when descending in namei.

[0] https://pubs.opengroup.org/onlinepubs/9699919799/

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D23247
2020-02-02 16:34:57 +00:00
Mateusz Guzik
2568d5bb79 fd: sprinkle some predits around fget
clang inlines fget -> _fget into kern_fstat and eliminates several checkes,
but prior to this change it would assume fget_unlocked was likely to fail
and consequently avoidable jumps got generated.
2020-02-02 09:38:40 +00:00
Mateusz Guzik
da4f45ea5c fd: use atomic_load_ptr instead of hand-rolled cast through volatile
No change in assembly.
2020-02-02 09:37:16 +00:00
Mateusz Guzik
6698e11f4b vfs: remove the now empty vop_unlock_post 2020-02-02 09:36:32 +00:00
Mateusz Guzik
7739d92766 cache: replace kern___getcwd with vn_getcwd
The previous routine was resulting in extra data copies most notably in
linux_getcwd.
2020-02-01 20:38:38 +00:00
Mateusz Guzik
921e7210f8 cache: return the total length from vn_fullpath1
This removes strlen from getcwd.
2020-02-01 20:37:11 +00:00
Mateusz Guzik
4511dd9d41 cache: remove vnode -> path lookup disablement
It seems to be of little to no use even when debugging.

Interested parties can resurrect it and gate compilation with a macro.
2020-02-01 20:36:35 +00:00
Mateusz Guzik
45757984f8 vfs: consistently use size_t for buflen around VOP_VPTOCNP 2020-02-01 20:34:43 +00:00
Mateusz Guzik
643656cfaf vfs: replace VOP_MARKATIME with VOP_MMAPPED
The routine is only provided by ufs and is only used on mmap and exec.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D23422
2020-02-01 06:46:55 +00:00
Mateusz Guzik
90f4ec3328 vfs: save on atomics on the root vnode for absolute lookups
There are 2 back-to-back atomics on the vnode, but we can check upfront if one
is sufficient. Similarly we can handle relative lookups where current working
directory == root directory.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D23427
2020-02-01 06:40:35 +00:00
Mateusz Guzik
21c4f1041e vfs: add vrefactn
Differential Revision:	https://reviews.freebsd.org/D23427
2020-02-01 06:39:49 +00:00
Jeff Roberson
915c367e8e Add two missing fences with comments describing them. These were found by
inspection and after a lengthy discussion with jhb and kib.  They have not
produced test failures.

Don't pointer chase through cpu0's smr.  Use cpu correct smr even when not
in a critical section to reduce the likelihood of false sharing.
2020-01-31 22:21:15 +00:00
Mark Johnston
1c29da0279 Reimplement stack capture of running threads on i386 and amd64.
After r355784 the td_oncpu field is no longer synchronized by the thread
lock, so the stack capture interrupt cannot be delievered precisely.
Fix this using a loop which drops the thread lock and restarts if the
wrong thread was sampled from the stack capture interrupt handler.

Change the implementation to use a regular interrupt instead of an NMI.
Now that we drop the thread lock, there is no advantage to the latter.

Simplify the KPIs.  Remove stack_save_td_running() and add a return
value to stack_save_td().  On platforms that do not support stack
capture of running threads, stack_save_td() returns EOPNOTSUPP.  If the
target thread is running in user mode, stack_save_td() returns EBUSY.

Reviewed by:	kib
Reported by:	mjg, pho
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D23355
2020-01-31 15:43:33 +00:00
Mateusz Guzik
0f4d8b77c0 vfs: revert the overzealous assert added in r357285 to vgone
The intent was to make it more likely to catch filesystems with custom
need_inactive routines which fail to call vn_need_pageq_flush (or do an
equivalent).

One immediate case which is missed is vgone from called by inactive itself.

A better assertion may land later. The routine is not added to vputx because
it is of no use to tmpfs et al.

Reported by:	syzbot+5f697ec11f89b60941db@syzkaller.appspotmail.com
2020-01-31 11:31:14 +00:00
Mateusz Guzik
1a78ac2416 Add rms_try_rlock and rms_wowned. 2020-01-31 08:36:49 +00:00
Mateusz Guzik
cedad2916e Remove an overzealous assert from rms_runlock. 2020-01-31 08:36:23 +00:00
Jeff Roberson
da6e9935e4 Don't use "All rights reserved" in new copyrights.
Requested by:	rgrimes
2020-01-31 02:08:09 +00:00
Jeff Roberson
d4665eaa66 Implement a safe memory reclamation feature that is tightly coupled with UMA.
This is in the same family of algorithms as Epoch/QSBR/RCU/PARSEC but is
a unique algorithm.  This has 3x the performance of epoch in a write heavy
workload with less than half of the read side cost.  The memory overhead
is significantly lessened by limiting the free-to-use latency.  A synthetic
test uses 1/20th of the memory vs Epoch.  There is significant further
discussion in the comments and code review.

This code should be considered experimental.  I will write a man page after
it has settled.  After further validation the VM will begin using this
feature to permit lockless page lookups.

Both markj and cperciva tested on arm64 at large core counts to verify
fences on weaker ordering architectures.  I will commit a stress testing
tool in a follow-up.

Reviewed by:	mmacy, markj, rlibby, hselasky
Discussed with:	sbahara
Differential Revision:	https://reviews.freebsd.org/D22586
2020-01-31 00:49:51 +00:00
Mateusz Guzik
3ff65f71cb Remove duplicated empty lines from kern/*.c
No functional changes.
2020-01-30 20:05:05 +00:00
Mateusz Guzik
2823710f05 Tidy up 2 comments in smp_rendezvous_cpus. 2020-01-30 20:02:14 +00:00
Mateusz Guzik
7ab99925fd Assert that smp_rendezvous_cpus is called with interrupts enabled. 2020-01-30 19:38:51 +00:00
Mateusz Guzik
d53d924f60 vfs: keep the mount point referenced across sys_quotactl
Otherwise we risk running into use-after-free.

In particular this codepath ends up dropping all protection before
suspending writes:

ufs_quotactl -> quotaoff_inchange -> vfs_write_suspend_umnt

Reported by:	pho
2020-01-30 19:38:12 +00:00
John Baldwin
fbb9879c0c Fix use of an uninitialized variable.
ctx (and thus ctx.flags) is stack garbage at the start of this
function, so initialize ctx.flags to an explicit value instead of
using binary operations on the garbage.

Reported by:	gcc9
Reviewed by:	imp
Differential Revision:	https://reviews.freebsd.org/D23368
2020-01-30 18:28:02 +00:00
Mateusz Guzik
c2ef6aa3d5 vfs: assert that doomed vnodes don't need to call vm_object_page_clean
... after the optional inactive processing.
2020-01-30 04:59:08 +00:00
Mateusz Guzik
07c6e2f4ab vfs: unlazy before dooming the vnode
With this change having the listmtx lock held postpones dooming the vnode.
Use this fact to simplify iteration over the lazy list. It also allows
filters to safely access ->v_data.

Reviewed by:	kib (early version)
Differential Revision:	https://reviews.freebsd.org/D23397
2020-01-30 02:12:52 +00:00
Gleb Smirnoff
79674264df Fix text format definition for kern.maxvnodes, vfs.wantfreevnodes. This
is a regression from r356642, r356645.
2020-01-30 00:18:00 +00:00
Conrad Meyer
07a65f9d38 hwpstate_intel(4): Silence/fix Coverity reports
These were all introduced in the initial import of hwpstate_intel(4).

Reported by:	Coverity
CIDs:		1413161, 1413164, 1413165, 1413167
X-MFC-With:	r357002
2020-01-29 03:15:34 +00:00
Warner Losh
42ec4f05a3 Make mqueue objects work across a fork again.
In r110908 (2003) alfred added DFLAG_PASSABLE to tag those types of FD
that can be passed via unix pipes, but mqueuefs didn't exist
yet. Later, in r152825 (2005) davidxu neglected to include
DFLAG_PASSABLE since people don't normally pass these things via unix
sockets (it's a FreeBSD implementation detail that it's a file
descriptor, nobody noticed). Then r223866 (2011) by jonathan used the
new flag in fdcopy, which fork uses. Due to that, mqueuefs actually
broke mqueue objects being propagated by fork. No mention of mqueuefs
was made in r223866, so I think it was an unintended consequence.

Fix this by tagging mqueuefs as passable as well. They were prior to
alfred's change (and it's clear there's no intent in his change to
change this behavior), and POSIX requires this to be the case as well.

PR: 243103
Reviewed by: kib@, jiles@
Differential Revision: https://reviews.freebsd.org/D23038
2020-01-27 22:36:54 +00:00
John Baldwin
425e5f9dcf Revert accidental change from r357146. 2020-01-26 14:23:27 +00:00
John Baldwin
c73222d0e6 Fix some misleading indentation warnings reported by recent clang.
These should not be any functional change.  While the change in
emul10kx-pcm.c looks like a real bug fix (as opposed to inconsistent
whitespace), the extra statements were not harmful.

Reviewed by:	kib
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D23363
2020-01-26 14:20:57 +00:00
Mateusz Guzik
1513f80391 vfs: do an unlocked check before iterating the lazy list
For most filesystems it is expected to be empty most of the time.
2020-01-26 07:06:18 +00:00
Mateusz Guzik
cd0e46c66b vfs: remove vop loop from vop_sigdefer
All ops are guaranteed to be present since r357131.
2020-01-26 07:05:06 +00:00
Mateusz Guzik
6d69e665dd vfs: fix freevnodes count update race against preemption
vdbatch_process leaves the critical section too early, openign a time
window where another thread can get scheduled and modify vd->freevnodes.
Once it the preempted thread gets back it overrides the value with 0.

Just move critical_exit to the end of the function.
2020-01-26 00:40:27 +00:00
Mateusz Guzik
dc9a1cb60b vfs: predict vn_lock failure as unlikely in vget 2020-01-26 00:34:57 +00:00
Jason A. Harmening
a9aa06f7b1 Implement cycle-detecting garbage collector for AF_UNIX sockets
The existing AF_UNIX socket garbage collector destroys any socket
which may potentially be in a cycle, as indicated by its file reference
count being equal to its enqueue count. However, this can produce false
positives for in-flight sockets which aren't part of a cycle but are
part of one or more SCM_RIGHTS mssages and which have been closed
on the sending side. If the garbage collector happens to run at
exactly the wrong time, destruction of these sockets will render them
unusable on the receiving side, such that no previously-written data
may be read.

This change rewrites the garbage collector to precisely detect cycles:

1. The existing check of msgcount==f_count is still used to determine
   whether the socket is potentially in a cycle.
2. The socket is now placed on a local "dead list", which is used to
   reduce iteration time (and therefore contention on the global
   unp_link_rwlock).
3. The first pass through the dead list removes each potentially-dead
   socket's outgoing references from the graph of potentially-dead
   sockets, using a gc-specific copy of the original reference count.
4. The second series of passes through the dead list removes from the
   list any socket whose remaining gc refcount is non-zero, as this
   indicates the socket is actually accessible outside of any possible
   cycle.  Iteration is repeated until no further sockets are removed
   from the dead list.
5. Sockets remaining in the dead list are destroyed as before.

PR:		227285
Submitted by:	jan.kokemueller@gmail.com (prior version)
Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D23142
2020-01-25 08:57:26 +00:00
Mark Johnston
a89c2c8c34 Revert r357050.
It seems to have introduced a couple of regressions.

Reported by:	cy, pho
2020-01-24 14:58:02 +00:00
Edward Tomasz Napierala
b3fb13eb55 Add kern_unmount() and use in Linuxulator. No functional changes.
Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22646
2020-01-24 11:57:55 +00:00
Mateusz Guzik
28eb39a5ab vfs: allow v_usecount to transition 0->1 without the interlock
There is nothing to do but to bump the count even during said transition.
There are 2 places which can do it:
- vget only does this after locking the vnode, meaning there is no change in
  contract versus inactive or reclamantion
- vref only ever did it with the interlock held which did not protect against
  either (that is, it would always succeed)

VCHR vnodes retain special casing due to the need to maintain dev use count.

Reviewed by:	jeff, kib
Tested by:	pho (previous version)
Differential Revision:	https://reviews.freebsd.org/D23185
2020-01-24 07:47:44 +00:00
Mateusz Guzik
d93762b94d vfs: stop handling VI_OWEINACT in vget
vget is almost always called with LK_SHARED, meaning the flag (if present) is
almost guaranteed to get cleared. Stop handling it in the first place and
instead let the thread which wanted to do inactive handle the bumepd usecount.

Reviewed by:	jeff
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D23184
2020-01-24 07:45:59 +00:00
Mateusz Guzik
74c4b7cc60 vfs: stop unlocking the vnode upfront in vput
Doing so runs into races with filesystems which make half-constructed vnodes
visible to other users, while depending on the chain vput -> vinactive ->
vrecycle to be executed without dropping the vnode lock.

Impediments for making this work got cleared up (notably vop_unlock_post now
does not do anything and lockmgr stops touching the lock after the final
write). Stacked filesystems keep vhold/vdrop across unlock, which arguably can
now be eliminated.

Reviewed by:	jeff
Differential Revision:	https://reviews.freebsd.org/D23344
2020-01-24 07:44:25 +00:00
Mateusz Guzik
c00115f108 lockmgr: don't touch the lock past unlock
This evens it up with other locking primitives.

Note lock profiling still touches the lock, which again is in line with the
rest.

Reviewed by:	jeff
Differential Revision:	https://reviews.freebsd.org/D23343
2020-01-24 07:42:57 +00:00
Mark Johnston
1bfca40c57 Set td_oncpu before dropping the thread lock during a switch.
After r355784 we no longer hold a thread's thread lock when switching it
out.  Preserve the previous synchronization protocol for td_oncpu by
setting it together with td_state, before dropping the thread lock
during a switch.

Reported and tested by:	pho
Reviewed by:	kib
Discussed with:	jeff
Differential Revision:	https://reviews.freebsd.org/D23270
2020-01-23 16:24:51 +00:00
Jeff Roberson
91e31c3c08 Consistently use busy and vm_page_valid() rather than touching page bits
directly.  This improves API compliance, asserts, etc.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D23283
2020-01-23 04:54:49 +00:00
Jeff Roberson
1eb13fce84 Block the thread lock in sched_throw() and use cpu_switch() to unblock
it.  The introduction of lockless switch in r355784 created a race to
re-use the exiting thread that was only possible to hit on a hypervisor.

Reported/Tested by:	rlibby
Discussed with:	rlibby, jhb
2020-01-23 03:36:50 +00:00
Gleb Smirnoff
ad3980121b DEVICE_POLLING is an alternative to network interrupts and also
needs to enter epoch.  Assert that in the netisr_poll() and do
the work for the idle poll routine.
2020-01-23 01:30:50 +00:00
Gleb Smirnoff
511d1afb6b Enter the network epoch for interrupt handlers of INTR_TYPE_NET.
Provide tunable to limit how many times handlers may be executed
without reentering epoch.

Differential Revision:	https://reviews.freebsd.org/D23242
2020-01-23 01:24:47 +00:00
Gleb Smirnoff
c4eb66309f Add ie_hflags to struct intr_event, which accumulates flags from all
handlers on this event.  For now handle only IH_ENTROPY in that manner.
2020-01-23 01:20:59 +00:00
Conrad Meyer
4577cf3744 cpufreq(4): Add support for Intel Speed Shift
Intel Speed Shift is Intel's technology to control frequency in hardware,
with hints from software.

Let's get a working version of this in the tree and we can refine it from
here.

Submitted by:	bwidawsk, scottph
Reviewed by:	bcr (manpages), myself
Discussed with:	jhb, kib (earlier versions)
With feedback from:	Greg V, gallatin, freebsdnewbie AT freenet.de
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D18028
2020-01-22 23:28:42 +00:00
Hans Petter Selasky
1f69a50940 Make sure the VNET is properly set when calling tcp_drop() from
the ktls taskqueue callback function.

A valid VNET is needed when updating statistics.

panic()
tcp_state_change()
tcp_drop()
ktls_reset_send_tag()
taskqueue_run_locked()
taskqueue_thread_loop()

Sponsored by:	Mellanox Technologies
2020-01-21 11:43:25 +00:00
Mateusz Guzik
6403455301 cache: revert r352613 now that vhold does not take locks 2020-01-20 19:52:23 +00:00
Mateusz Guzik
8bba93c7e0 cache: make numcachehv use counter(9) on all archs
Requested by:	kib
2020-01-20 14:42:11 +00:00
Jeff Roberson
d6e13f3b4d Don't hold the object lock while calling getpages.
The vnode pager does not want the object lock held.  Moving this out allows
further object lock scope reduction in callers.  While here add some missing
paging in progress calls and an assert.  The object handle is now protected
explicitly with pip.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D23033
2020-01-19 23:47:32 +00:00
Mateusz Guzik
a9099e5b10 vfs: switch vop_stdunlock to call lockmgr_unlock
Since the flags argument is now alawys 0 the new call provides the same
behavior.
2020-01-19 21:41:34 +00:00
Jeff Roberson
811d05fcb7 Provide an API for interlocked refcount sleeps.
Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D22908
2020-01-19 18:18:17 +00:00
Mateusz Guzik
28479aaae2 vfs: allow v_holdcnt to transition 0->1 without the interlock
Since r356672 ("vfs: rework vnode list management") there is nothing to do
apart from altering freevnodes count, but this much can be safely done based
on the result of atomic_fetchadd.

Reviewed by:	kib
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D23186
2020-01-19 17:47:04 +00:00
Mateusz Guzik
059cb4843b cache: counter_u64_add_protected -> counter_u64_add
Fixes booting on RISC-V where it does happen to not be equivalent.

Reported by:	lwhsu
2020-01-19 17:05:26 +00:00
Mateusz Guzik
1399033590 cache: convert numcachehv to counter(9) on 64-bit platforms 2020-01-19 05:37:27 +00:00
Mateusz Guzik
512fa9a4e0 vfs: plug a conditional assigment of lo_name in getnewvnode
It only matters for witness. No functional changes.
2020-01-19 05:36:45 +00:00
Kyle Evans
05d7dd739c sysent targets: further cleanup and deduplication
r355473 vastly improved the readability and cleanliness of these Makefiles.
Every single one of them follows the same pattern and duplicates the exact
same logic.

Now that we have GENERATED/SRCS, split SRCS up into the two parameters we'll
use for ${MAKESYSCALLS} rather than assuming a specific ordering of SRCS and
include a common sysent.mk to handle the rest. This makes it less tedious to
make sweeping changes.

Some default values are provided for GENERATED/SYSENT_*; almost all of these
just use a 'syscalls.master' and 'syscalls.conf' in cwd, and they all use
effectively the same filenames with an arbitrary prefix. Most ABIs will be
able to get away with just setting GENERATED_PREFIX and including
^/sys/conf/sysent.mk, while others only need light additions. kern/Makefile
is the notable exception, as it doesn't take a SYSENT_CONF and the generated
files are spread out between ^/sys/kern and ^/sys/sys, but it otherwise fits
the pattern enough to use the common version.

Reviewed by:	brooks, imp
Nice!:		emaste
Differential Revision:	https://reviews.freebsd.org/D23197
2020-01-18 20:37:45 +00:00
Mateusz Guzik
2d0c620272 vfs: distribute freevnodes counter per-cpu
It gets rolled up to the global when deferred requeueing is performed.
A dedicated read routine makes sure to return a value only off by a certain
amount.

This soothes a global serialisation point for all 0<->1 hold count transitions.

Reviewed by:	jeff
Differential Revision:	https://reviews.freebsd.org/D23235
2020-01-18 01:29:02 +00:00
Mateusz Guzik
d3cc535474 vfs: provide F_ISUNIONSTACK as a kludge for libc
Prior to introduction of this op libc's readdir would call fstatfs(2), in
effect unnecessarily copying kilobytes of data just to check fs name and a
mount flag.

Reviewed by:	kib (previous version)
Differential Revision:	https://reviews.freebsd.org/D23162
2020-01-17 14:42:25 +00:00
Mateusz Guzik
1ad72b270c vfs: shorten lock hold time in vdbatch_process 2020-01-17 14:39:00 +00:00
Gleb Smirnoff
66c6c556b6 Change argument order of epoch_call() to more natural, first function,
then its argument.

Reviewed by:	imp, cem, jhb
2020-01-17 06:10:24 +00:00
Mateusz Guzik
66f67d5e5e vfs: increment numvnodes without the vnode list lock unless under pressure
The vnode list lock is only needed to reclaim free vnodes or kick the vnlru
thread (or to block and not miss a wake up (but note the sleep has a timeout so
this would not be a correctness issue)). Try to get away without the lock by
just doing an atomic increment.

The lock is contended e.g., during poudriere -j 104 where about half of all
acquires come from vnode allocation code.

Note the entire scheme needs a rewrite, the above just reduces it's SMP impact.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D23140
2020-01-16 21:45:21 +00:00
Mateusz Guzik
b7f50b9ad1 vfs: refcator vnode allocation
Semantics are almost identical. Some code is deduplicated and there are
fewer memory accesses.

Reviewed by:	kib, jeff
Differential Revision:	https://reviews.freebsd.org/D23158
2020-01-16 21:43:13 +00:00
Mateusz Guzik
875cfc082d vfs: reimplement vlrureclaim to actually use LRU
Take advantage of global ordering introduced in r356672.

Reviewed by:	mckusick (previous version)
Differential Revision:	https://reviews.freebsd.org/D23067
2020-01-16 10:44:02 +00:00
Jeff Roberson
a81c400e75 Simplify VM and UMA startup by eliminating boot pages. Instead use careful
ordering to allocate early pages in the same way boot pages were but only
as needed.  After the KVA allocator has started up we allocate the KVA that
we consumed during boot.  This also makes the boot pages freeable since they
have vm_page structures allocated with the rest of memory.

Parts of this patch were written and tested by markj.

Reviewed by:	glebius, markj
Differential Revision:	https://reviews.freebsd.org/D23102
2020-01-16 05:01:21 +00:00
Kirk McKusick
bbb1e07d65 Peter Holm reports that his test that does an umount(8) on an active
mount point while numerous tests are running that are writing to
files on that mount point cause the unmount(8) to hang forever.

The unmount(8) system call is handled in the kernel by the dounmount()
function. The cause of the hang is that prior to dounmount() calling
VFS_UNMOUNT() it is calling VFS_SYNC(mp, MNT_WAIT). The MNT_WAIT
flag indicates that VFS_SYNC() should not return until all the dirty
buffers associated with the mount point have been written to disk.
Because user processes are allowed to continue writing and can do
so faster than the data can be written to disk, the call to VFS_SYNC()
can never finish.

Unlike VFS_SYNC(), the VFS_UNMOUNT() routine can suspend all processes
when they request to do a write thus having a finite number of dirty
buffers to write that cannot be expanded. There is no need to call
VFS_SYNC() before calling VFS_UNMOUNT(), because VFS_UNMOUNT() needs
to flush everything again anyway after suspending writes, to catch
anything that was dirtied between the VFS_SYNC() and writes being
suspended.

The fix is to simply remove the unnecessary call to VFS_SYNC() from
dounmount().

Reported by:  Peter Holm
Analysis by:  Chuck Silvers
Tested by:    Peter Holm
MFC after:    7 days
Sponsored by: Netflix
2020-01-15 18:53:32 +00:00