17514 Commits

Author SHA1 Message Date
Mateusz Guzik
b1f910e02c vfs: short-circuit the common case NDFREE calls
Almost all consumers use the NDF_ONLY_PNBUF macro, which makes them branch
needlessly in the NDFREE routine. Also note that most of them should not need
to call any cleanup anyway, as they don't request HASBUF.
2020-07-30 15:47:41 +00:00
Mateusz Guzik
404927357d vfs: add support for WANTPARENT and LOCKPARENT to lockless lookup
This makes the realpath syscall operational with the new lookup. Note that the
walk to obtain the full path name still takes locks.

Tested by:      pho
Differential Revision:	https://reviews.freebsd.org/D23917
2020-07-30 15:45:11 +00:00
Mateusz Guzik
8230d29357 vfs: support negative entry promotion in lockless lookup
Tested by:	pho
2020-07-30 15:44:10 +00:00
Mateusz Guzik
4057e3eaaa vfs: add NOMACCHECK and AUDITVNODE2 to lockless lookup
They are both nops since lookup does not progress with either mac or audit enabled.

Tested by:	pho
2020-07-30 15:43:16 +00:00
Mateusz Guzik
d3e63e8eb2 vfs: make sure startdir_used is always assigned to before use
CID:	1431070
2020-07-30 07:11:08 +00:00
Mark Johnston
1b778ba260 Fix a logic error in uipc_ready_scan().
When processing the last record in a socket buffer, take care to avoid a
NULL pointer dereference when advancing the record iterator.

Reported by:	syzbot+6a689cc9c27bd265237a@syzkaller.appspotmail.com
Fixes:		r359778
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2020-07-30 00:52:37 +00:00
John Baldwin
0f70a1489d Properly handle a closed TLS socket with pending receive data.
If the remote end closes a TLS socket and the socket buffer still
contains not-yet-decrypted TLS records but no decrypted TLS records,
soreceive needs to block or fail with EWOULDBLOCK.  Previously it was
trying to return data and dereferencing a NULL pointer.

Reviewed by:	np
Sponsored by:	Chelsio
Differential Revision:	https://reviews.freebsd.org/D25838
2020-07-29 23:24:32 +00:00
Mateusz Guzik
fad6dd772d vfs: elide MAC-induced locking on rename if there are no relevant hooks 2020-07-29 17:05:31 +00:00
Mateusz Guzik
fd8c6a48ab vfs: honor error code returned by mac_vnode_check_rename_from
MFC after:	3 days
2020-07-29 17:04:33 +00:00
Yoshihiro Takahashi
8f11c99715 - Cleanups related to sparc64 removal.
- Remove remains of sparc64 files.

Reviewed by:	imp
Differential Revision:	https://reviews.freebsd.org/D25831
2020-07-28 10:58:37 +00:00
Kyle Evans
fd35bfaecf makesyscalls.sh: improve the 'this is going away' message
Reported by:	Ronald Klop, rgrimes
2020-07-28 01:05:40 +00:00
Kyle Evans
bb97350f28 makesyscalls.sh: spit out a deprecation notice to stderr
This has been replaced for a while by makesyscalls.lua in the stock FreeBSD
build.  Ensure downstreams get some notice that it's going away if they're
reliant on it, maybe.
2020-07-27 03:13:23 +00:00
Doug Moore
00fd73d2da Fix an overflow bug in the blist allocator that needlessly capped max
swap size by dividing a value, which was always a multiple of 64, by
64.  Remove the code that reduced max swap size down to that cap.

Eliminate the distinction between BLIST_BMAP_RADIX and
BLIST_META_RADIX.  Call them both BLIST_RADIX.

Make improvements to the blist self-test code to silence compiler
warnings and to test larger blists.

Reported by:	jmallett
Reviewed by:	alc
Discussed with:	kib
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D25736
2020-07-25 18:29:10 +00:00
Mateusz Guzik
e914224af1 fd: put back FILEDESC_SUNLOCK to pwd_hold lost during rebase
Reported by:	pho
2020-07-25 15:34:29 +00:00
Alexander Motin
aba10e131f Allow swi_sched() to be called from NMI context.
To handle hardware errors reported via NMIs I need a way to escape NMI
context, which is too restrictive to do anything significant.

To do so, this change introduces a new swi_sched() flag, SWI_FROMNMI, which
makes it careful about the KPIs it uses.  On platforms allowing IPI sending
from NMI context (x86 for now) it immediately wakes clk_intr_event via the new
IPI_SWI; otherwise it works just like SWI_DELAY.  To handle the delayed SWIs
this patch calls clk_intr_event on every hardclock() tick.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
Differential Revision:	https://reviews.freebsd.org/D25754
2020-07-25 15:19:38 +00:00
Mateusz Guzik
9dbd12fb52 vfs: add support for !LOCKLEAF to lockless lookup
Tested by:      pho (in a patchset)
Differential Revision:	https://reviews.freebsd.org/D23916
2020-07-25 10:40:38 +00:00
Mateusz Guzik
c42b77e694 vfs: lockless lookup
Provides full scalability as long as all visited filesystems support the
lookup and terminal vnodes are different.

Inner workings are explained in the comment above cache_fplookup.

Capabilities and fd-relative lookups are not supported and will result in
immediate fallback to regular code.

Symlinks, ".." in the path, mount points without support for lockless lookup
and mismatched counters will result in an attempt to get a reference to the
directory vnode and continue in regular lookup. If this fails, the entire
operation is aborted and regular lookup starts from scratch. However, care is
taken that data is not copied again from userspace.

Sample benchmark:
incremental -j 104 bzImage on tmpfs:
before: 142.96s user 1025.63s system 4924% cpu 23.731 total
after: 147.36s user 313.40s system 3216% cpu 14.326 total

Sample microbenchmark: access calls to separate files in /tmpfs, 104 workers, ops/s:
before:   2165816
after:  151216530

Reviewed by:    kib
Tested by:      pho (in a patchset)
Differential Revision:	https://reviews.freebsd.org/D25578
2020-07-25 10:37:15 +00:00
Mateusz Guzik
07d2145a17 vfs: add the infrastructure for lockless lookup
Reviewed by:    kib
Tested by:      pho (in a patchset)
Differential Revision:	https://reviews.freebsd.org/D25577
2020-07-25 10:32:45 +00:00
Mateusz Guzik
0379ff6ae3 vfs: introduce vnode sequence counters
Modified on each permission change and link/unlink.

Reviewed by:	kib
Tested by:	pho (in a patchset)
Differential Revision:	https://reviews.freebsd.org/D25573
2020-07-25 10:31:52 +00:00
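The sequence-counter idea behind commit 0379ff6ae3 can be illustrated with a minimal sketch in C11 atomics. This is not the FreeBSD seqc implementation: the struct, the field names and the helper functions below are hypothetical, and default (sequentially consistent) atomics are used for simplicity. A writer makes the counter odd while it modifies the object and even again when it is done; a lockless reader remembers the counter it started with and discards its work if the counter has changed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical object guarded by a sequence counter. */
struct obj {
	atomic_uint seq;	/* odd while a writer is mid-update */
	atomic_int mode;	/* data that lockless readers inspect */
};

/* Reader: wait until no writer is active and return the counter value. */
static unsigned
obj_read_begin(struct obj *o)
{
	unsigned s;

	while (((s = atomic_load(&o->seq)) & 1) != 0)
		;	/* a writer is in progress, retry */
	return (s);
}

/* Reader: true if nothing was modified since obj_read_begin(). */
static bool
obj_read_consistent(struct obj *o, unsigned s)
{
	return (atomic_load(&o->seq) == s);
}

/* Writer: callers are assumed to be serialized by an existing lock. */
static void
obj_write_begin(struct obj *o)
{
	atomic_fetch_add(&o->seq, 1);	/* counter becomes odd */
}

static void
obj_write_end(struct obj *o)
{
	atomic_fetch_add(&o->seq, 1);	/* counter becomes even again */
}
```

A reader that finds obj_read_consistent() returning false would fall back to a locked path, which mirrors the "mismatched counters" fallback described for the lockless lookup.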
Mateusz Guzik
d1385ab26e Guard sbcompress_ktls_rx with KERN_TLS
Fixes a compilation warning after r363464
2020-07-25 07:15:23 +00:00
Mateusz Guzik
bf71b96c69 Do a lockless check in kthread_suspend_check
Otherwise an idle system running lockstat sleep 10 reports contention on
process lock coming from bufdaemon.

While here fix a style nit.
2020-07-25 07:14:33 +00:00
Conrad Meyer
81dc6c2c61 Use gbincore_unlocked for unprotected incore()
Reviewed by:	markj
Sponsored by:	Isilon
Differential Revision:	https://reviews.freebsd.org/D25790
2020-07-24 17:34:44 +00:00
Conrad Meyer
68ee1dda06 Add unlocked/SMR fast path to getblk()
Convert the bufobj tries to an SMR zone/PCTRIE and add a gbincore_unlocked()
API wrapping this functionality.  Use it for a fast path in getblkx(),
falling back to locked lookup if we raced a thread changing the buf's
identity.

Reported by:	Attilio
Reviewed by:	kib, markj
Testing:	pho (in progress)
Sponsored by:	Isilon
Differential Revision:	https://reviews.freebsd.org/D25782
2020-07-24 17:34:04 +00:00
Conrad Meyer
3c30b23519 Use SMR to provide safe unlocked lookup for pctries from SMR zones
Adapt r358130, for the almost identical vm_radix, to the pctrie subsystem.
Like that change, the tree is kept correct for readers with store barriers
and careful ordering.  Existing locks serialize writers.

Add a PCTRIE_DEFINE_SMR() wrapper that takes an additional smr_t parameter
and instantiates a FOO_PCTRIE_LOOKUP_UNLOCKED() function, in addition to the
usual definitions created by PCTRIE_DEFINE().

Interface consumers will be introduced in later commits.

As future work, it might be nice to add vm_radix algorithms missing from
generic pctrie to the pctrie interface, and then adapt vm_radix to use
pctrie.

Reported by:	Attilio
Reviewed by:	markj
Sponsored by:	Isilon
Differential Revision:	https://reviews.freebsd.org/D25781
2020-07-24 17:32:10 +00:00
Mateusz Guzik
138698898f lockmgr: add missing 'continue' to account for spuriously failed fcmpset
PR:		248245
Reported by:	gbe
Noted by:	markj
Fixes:		r363415 ("lockmgr: add adaptive spinning")
2020-07-24 17:28:24 +00:00
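The class of bug fixed in 138698898f can be sketched with C11's atomic_compare_exchange_weak, which, like a weak compare-and-set in general, may fail even when the expected value matches. This is an illustration only: try_acquire() and LOCK_UNLOCKED are hypothetical names and the code is not the lockmgr implementation. The point is that a failed exchange must lead back to re-evaluating the freshly observed value rather than being treated as contention.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define LOCK_UNLOCKED	((uintptr_t)0)	/* hypothetical "free" lock value */

/*
 * Try to take the lock without sleeping.  Returns true on success and
 * false only if the lock was genuinely observed as held.
 */
static bool
try_acquire(_Atomic uintptr_t *lock, uintptr_t owner)
{
	uintptr_t v = atomic_load(lock);

	for (;;) {
		if (v != LOCK_UNLOCKED)
			return (false);	/* really held by someone */
		if (atomic_compare_exchange_weak(lock, &v, owner))
			return (true);
		/*
		 * The exchange failed and v now holds the observed value.
		 * The failure may be spurious (v can still be
		 * LOCK_UNLOCKED), so loop and re-check instead of
		 * falling through to a slow path.
		 */
		continue;
	}
}
```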
John Baldwin
3c0e568505 Add support for KTLS RX via software decryption.
Allow TLS records to be decrypted in the kernel after being received
by a NIC.  At a high level this is somewhat similar to software KTLS
for the transmit path except in reverse.  Protocols enqueue mbufs
containing encrypted TLS records (or portions of records) into the
tail of a socket buffer and the KTLS layer decrypts those records
before returning them to userland applications.  However, there is an
important difference:

- In the transmit case, the socket buffer is always a single "record"
  holding a chain of mbufs.  Not-yet-encrypted mbufs are marked not
  ready (M_NOTREADY) and released to protocols for transmit by marking
  mbufs ready once their data is encrypted.

- In the receive case, incoming (encrypted) data appended to the
  socket buffer is still a single stream of data from the protocol,
  but decrypted TLS records are stored as separate records in the
  socket buffer and read individually via recvmsg().

Initially I tried to make this work by marking incoming mbufs as
M_NOTREADY, but there didn't seem to be a non-gross way to deal with
picking a portion of the mbuf chain and turning it into a new record
in the socket buffer after decrypting the TLS record it contained
(along with prepending a control message).  Also, such mbufs would
need to be "pinned" in some way while they are being decrypted
such that a concurrent sbcut() wouldn't free them out from under the
thread performing decryption.

As such, I settled on the following solution:

- Socket buffers now contain an additional chain of mbufs (sb_mtls,
  sb_mtlstail, and sb_tlscc) containing encrypted mbufs appended by
  the protocol layer.  These mbufs are still marked M_NOTREADY, but
  soreceive*() generally don't know about them (except that they will
  block waiting for data to be decrypted for a blocking read).

- Each time a new mbuf is appended to this TLS mbuf chain, the socket
  buffer peeks at the TLS record header at the head of the chain to
  determine the encrypted record's length.  If enough data is queued
  for the TLS record, the socket is placed on a per-CPU TLS workqueue
  (reusing the existing KTLS workqueues and worker threads).

- The worker thread loops over the TLS mbuf chain decrypting records
  until it runs out of data.  Each record is detached from the TLS
  mbuf chain while it is being decrypted to keep the mbufs "pinned".
  However, a new sb_dtlscc field tracks the character count of the
  detached record and sbcut()/sbdrop() is updated to account for the
  detached record.  After the record is decrypted, the worker thread
  first checks to see if sbcut() dropped the record.  If so, it is
  freed (can happen when a socket is closed with pending data).
  Otherwise, the header and trailer are stripped from the original
  mbufs, a control message is created holding the decrypted TLS
  header, and the decrypted TLS record is appended to the "normal"
  socket buffer chain.

(Side note: the SBCHECK() infrastructure was very useful as I was
 able to add assertions there about the TLS chain that caught several
 bugs during development.)

Tested by:	rmacklem (various versions)
Relnotes:	yes
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D24628
2020-07-23 23:48:18 +00:00
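The step in 3c0e568505 where the socket buffer peeks at the TLS record header to decide whether a full record is queued can be sketched as follows. This is a standalone illustration over a flat byte buffer, not the kernel code, which works on mbuf chains: tls_record_ready() and TLS_HDR_LEN are hypothetical names. It relies only on the standard TLS record layout, a 5-byte header holding a 1-byte content type, a 2-byte version and a 2-byte big-endian payload length.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TLS_HDR_LEN	5	/* type (1) + version (2) + length (2) */

/*
 * Given the first bytes of the encrypted stream and the number of bytes
 * currently queued, report whether a complete TLS record is available.
 * On return *reclen holds the full record size (header plus payload)
 * whenever at least a full header was queued.
 */
static bool
tls_record_ready(const uint8_t *buf, size_t queued, size_t *reclen)
{
	size_t payload;

	if (queued < TLS_HDR_LEN)
		return (false);		/* cannot even read the header yet */
	payload = ((size_t)buf[3] << 8) | buf[4];	/* big-endian length */
	*reclen = TLS_HDR_LEN + payload;
	return (queued >= *reclen);
}
```

In the design described above, a true result corresponds to the point where the socket would be handed to a KTLS worker thread for decryption.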
Mateusz Guzik
c795344ff7 locks: fix a long standing bug for primitives with kdtrace but without spinning
In such a case the second argument to lock_delay_arg_init was NULL which was
immediately causing a null pointer deref.

Since the structure is only used for spin count, provide a dedicated routine
initializing it.

Reported by:	andrew
2020-07-23 17:26:53 +00:00
Brooks Davis
5a01eca698 Use SI_ORDER_(FOURTH|FIFTH) rather than bespoke versions.
No functional change.

When these SYSINITs were added these macros didn't exist.

Reviewed by:	imp
Obtained from:	CheriBSD
MFC after:	1 week
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D25758
2020-07-22 23:35:41 +00:00
Mateusz Guzik
31ad4050fe lockmgr: add adaptive spinning
It is very conservative: it spins only when LK_ADAPTIVE is passed, only on an
exclusive lock and never when any waiters are present. The buffer cache still
does not spin.

This reduces total sleep times during buildworld etc., but it does not shorten
total real time (culprits are contention in the vm subsystem along with slock +
upgrade which is not covered).

For microbenchmarks: open3_processes -t 52 (open/close of the same file for
writing) ops/s:
before: 258845
after: 801638

Reviewed by:	kib
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D25753
2020-07-22 12:30:31 +00:00
Mitchell Horne
dc42509049 INTRNG: only shuffle for !EARLY_AP_STARTUP
During device attachment, all interrupt sources will bind to the BSP,
as it is the only processor online. This means interrupts must be
redistributed ("shuffled") later, during SI_SUB_SMP.

For the EARLY_AP_STARTUP case, this is no longer true. SI_SUB_SMP will
execute much earlier, meaning APs will be online and available before
devices begin attachment, and there will therefore be nothing to
shuffle.

All PIC-conforming interrupt controllers will handle this early
distribution properly, except for RISC-V's PLIC. Make the necessary
tweak to the PLIC driver.

While here, convert irq_assign_cpu from a boolean_t to a bool.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D25693
2020-07-21 22:47:02 +00:00
Mateusz Guzik
4aff9f5d99 lockmgr: denote recursion with a bit in lock value
This reduces excessive reads from the lock.

Tested by:	pho
2020-07-21 14:42:22 +00:00
Mateusz Guzik
f6b091fbbd lockmgr: rewrite upgrade to stop always dropping the lock
This matches rw and sx locks.
2020-07-21 14:41:25 +00:00
Mateusz Guzik
bdb6d824f4 lockmgr: add a helper for reading the lock value 2020-07-21 14:39:20 +00:00
Adrian Chadd
f7d38a13a8 [net80211] Add new privileges; restrict what can be done in a jail.
Split the MANAGE privilege into MANAGE, SETMAC and CREATE_VAP.

+ VAP_MANAGE is everything but setting the MAC and creating a VAP.
+ VAP_SETMAC is setting the MAC address of the VAP.
  Typically you wouldn't want the jail to be able to modify this.
+ CREATE_VAP is to create a new VAP. Again, you don't want to be doing
  this in a jail, but this DOES stop being able to run some corner
  cases like Dynamic WDS (DWDS) AP in a jail/vnet. We can figure this
  bit out later.

This allows me to run wpa_supplicant in a jail after transferring
a STA VAP into it. I unfortunately can't currently set the wlan
debugging inside the jail; that would be super useful!

Reviewed by:	bz
Differential Revision:	https://reviews.freebsd.org/D25630
2020-07-19 15:16:27 +00:00
Mateusz Guzik
7cd4443fb1 Short-circuit tdfind when looking for the calling thread.
Common occurrence with cpuset and other places.
2020-07-18 00:14:43 +00:00
Mateusz Guzik
3ea3fbe685 vfs: fix vn_poll performance with either MAC or AUDIT
The code would unconditionally lock the vnode to audit or call the
mac hook, even if neither wants to do anything. Pre-check the state
to avoid locking in the common case of nothing to do.

Note this code should not normally be executed anyway, as vnodes
always return ready. However, poll1/2 from will-it-scale use regular
files for benchmarking, presumably to focus on the interface itself,
as the vnode handler is supposed to do almost nothing.

This in particular fixes poll2 which passes 128 fds.

$ ./poll2_processes -s 10
before: 134411
after:  271572
2020-07-16 14:09:18 +00:00
Mateusz Guzik
ab06a30517 vfs: fix MAC/AUDIT mismatch in vn_poll
Auditing would not be performed without MAC compiled in.
2020-07-16 14:04:28 +00:00
Mateusz Guzik
b1607c8727 poll: factor fd lookup out of scan and rescan 2020-07-15 10:24:39 +00:00
Mateusz Guzik
d8bc2a17a5 fd: remove fd_lastfile
It keeps getting recalculated far more often than needed.

Provide a routine (fdlastfile) to get it if necessary.

Consumers may be better off with a bitmap iterator instead.
2020-07-15 10:24:04 +00:00
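The idea in d8bc2a17a5 of computing the last used descriptor on demand instead of caching it can be sketched as a reverse scan over an allocation bitmap. This is an assumption-laden illustration, not the fdlastfile routine from the commit: bitmap_last_set() is a hypothetical name and the bitmap is assumed to be an array of 64-bit words with bit n of word i standing for descriptor i * 64 + n.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return the index of the highest set bit in the bitmap, or -1 if no
 * bit is set.  Words are scanned from the end so the common case stops
 * at the first non-zero word.
 */
static int
bitmap_last_set(const uint64_t *map, size_t nwords)
{
	size_t i;
	int bit;

	for (i = nwords; i-- > 0;) {
		if (map[i] == 0)
			continue;
		/* Find the highest set bit inside this non-zero word. */
		for (bit = 63; (map[i] & (UINT64_C(1) << bit)) == 0; bit--)
			;
		return ((int)(i * 64) + bit);
	}
	return (-1);
}
```

The cost moves from every open and close to the few callers that actually need the value.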
Mateusz Guzik
7177149a4d fd: add obvious branch predictions to fdalloc 2020-07-15 10:14:00 +00:00
Mateusz Guzik
29f3e5ea41 cache: make negative shrinker round robin on all lists every time
Previously it would check 4, 3, 2, 1 lists. In practice by the time
it is getting called all lists have some elements and consequently
this does not result in new evictions.

Nonetheless, the code is clearer.

Tested by:	pho
2020-07-14 21:19:33 +00:00
Mateusz Guzik
a110fa2ee1 cache: remove numcalls
The counter is not very useful and if necessary the value can be
found by summing up other counters.
2020-07-14 21:17:46 +00:00
Mateusz Guzik
4516c7eed9 cache: count dropped entries 2020-07-14 21:17:08 +00:00
Mateusz Guzik
654e644e80 cache: remove neg_locked argument from cache_zap_locked
Tested by:	pho
2020-07-14 21:16:48 +00:00
Mateusz Guzik
ffb0abddf1 cache: remove a useless argument from cache_negative_insert 2020-07-14 21:16:07 +00:00
Mateusz Guzik
9f8d452173 cache: create a dedicated struct for negative entries
.. and stuff it into the unused target vnode field

This gets rid of concurrent nc_flag modifications racing with the
shrinker and consequently fixes a bug where such a change could have
been missed when cache_ncp_invalidate was being issued.

Reported by:	zeising
Tested by:	pho, zeising
Fixes:	r362828 ("cache: lockless forward lookup with smr")
2020-07-14 21:14:59 +00:00
Mateusz Guzik
373278a7f6 fd: stop looping in pwd_hold
We don't expect to fail acquiring the reference unless running into a corner
case. Just in case ensure forward progress by taking the lock.

Reviewed by:	kib, markj
Differential Revision: https://reviews.freebsd.org/D25616
2020-07-11 21:57:03 +00:00
Mateusz Guzik
74f61caed5 vfs: fix early termination of kern_getfsstat
The kernel would unlock an already unlocked mutex if the buffer got filled up
before the mount list ended.

Reported by:	pho
Fixes:	r363069 ("vfs: depessimize getfsstat when only the count is requested")
2020-07-10 09:24:27 +00:00
Mateusz Guzik
422f38d8ea vfs: fix trivial whitespace issues which don't interfere with blame
.. even without the -w switch
2020-07-10 09:01:36 +00:00
Mateusz Guzik
6c69e69724 vfs: depessimize getfsstat when only the count is requested
This avoids relocking mountlist_mtx for each entry.
2020-07-10 06:47:58 +00:00