17848 Commits

Author SHA1 Message Date
Konstantin Belousov
f10845877e Suspend all writeable local filesystems on power suspend.
This ensures that no writes are pending in memory, either metadata or
user data, but not including dirty pages not yet converted to fs writes.

Only filesystems declared local are suspended.

Note that this does not guarantee absence of the metadata errors or
leaks if resume is not done: for instance, on UFS unlinked but opened
inodes are leaked and require fsck to gc.

Reviewed by:	markj
Discussed with:	imp
Tested by:	imp (previous version), pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D27054
2020-11-05 20:52:49 +00:00
Mateusz Guzik
16b971ed6d malloc: add a helper returning size allocated for given request
Sample usage: kernel modules can decide whether to stick to malloc or
create their own zone.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D27097
2020-11-05 16:21:21 +00:00
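A rough user-space sketch of what such a helper answers: given a request size, how many bytes does the allocator really hand out? The bucket table and the names `bucket_size_for` and `waste_percent` are assumptions for illustration, not the kernel's table or API.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative bucket sizes only; the real kernel table may differ. */
static const size_t bucket_sizes[] = {
	16, 32, 64, 128, 256, 384, 512, 1024, 2048
};
#define NBUCKETS (sizeof(bucket_sizes) / sizeof(bucket_sizes[0]))

/*
 * Hypothetical helper: return the number of bytes a request of `size`
 * would occupy, or 0 if it does not fit in any bucket.
 */
static size_t
bucket_size_for(size_t size)
{
	for (size_t i = 0; i < NBUCKETS; i++) {
		if (size <= bucket_sizes[i])
			return (bucket_sizes[i]);
	}
	return (0);
}

/* Share of the bucket left unused by a request, in whole percent. */
static unsigned
waste_percent(size_t size)
{
	size_t got = bucket_size_for(size);

	assert(got != 0);
	return ((unsigned)(((got - size) * 100) / got));
}
```

A module could compare `waste_percent()` for its object size against some threshold before deciding whether a dedicated zone is worth the extra memory.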
Mateusz Guzik
2dee296a3d Rationalize per-cpu zones.
The 2 provided zones had inconsistent naming between each other
("int" and "64") and other allocator zones (which use bytes).

Follow malloc by naming them "pcpu-" + size in bytes.

This is a step towards replacing ad-hoc per-cpu zones with
general slabs.
2020-11-05 15:08:56 +00:00
Mateusz Guzik
ea33cca971 poll/select: change selfd_zone into a malloc type
On a sample box vmstat -z shows:

ITEM                   SIZE  LIMIT     USED     FREE      REQ
64:                      64,      0, 1043784, 4367538,3698187229
selfd:                   64,      0,    1520,   13726,182729008

But at the same time:
vm.uma.selfd.keg.domain.1.pages: 121
vm.uma.selfd.keg.domain.0.pages: 121

Thus 242 pages got pulled even though the malloc zone would likely accommodate
the load without using extra memory.
2020-11-05 12:24:37 +00:00
Mateusz Guzik
2fbb45c601 vfs: change nt_zone into a malloc type
Elements are small in size and allocated for short periods.
2020-11-05 12:06:50 +00:00
Kyle Evans
df69035d7f imgact_binmisc: fix up some minor nits
- Removed a bunch of redundant headers
- Don't explicitly initialize to 0
- The !error check prior to setting imgp->interpreter_name is redundant, all
  error paths should and do return or go to 'done'. We have larger problems
  otherwise.
2020-11-05 04:19:48 +00:00
Mateusz Guzik
3c50616fc1 fd: make all f_count uses go through refcount_*
2020-11-05 02:12:33 +00:00
Mateusz Guzik
d737e9eaf5 fd: hide _fdrop 0 count check behind INVARIANTS
While here use refcount_load and make sure to report the tested value.
2020-11-05 02:12:08 +00:00
Mateusz Guzik
331c21dd5e pipe: whitespace nit in previous
2020-11-04 23:17:41 +00:00
Mateusz Guzik
c22ba7bb06 pipe: fix POLLHUP handling if no events were specified
Linux allows polling without any events specified and it happens to be the case
in FreeBSD as well. POLLHUP has to be delivered regardless of the event mask
and this works fine if the condition is already present. However, if it is
missing, selrecord is only called if the eventmask has relevant bits set. This
in particular leads to a condition where pipe_poll can return 0 events and
neglect to selrecord, while kern_poll takes it as an indication it has to go to
sleep, but then there is nobody to wake it up.

While the problem seems systemic to *_poll handlers the least we can do is fix
it up for pipes.

Reported by:	Jeremie Galarneau <jeremie.galarneau at efficios.com>
Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D27094
2020-11-04 23:11:54 +00:00
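From user space, polling with an empty event mask looks like the sketch below. It covers only the simple case where the hangup is already present before poll(2) is called, which the message says worked even before the fix; the function name is made up for illustration.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/*
 * Poll the read end of a pipe with an empty event mask after the write
 * end has been closed.  Returns the revents reported by poll(2), or -1
 * on error.
 */
static int
poll_closed_pipe_no_events(void)
{
	int fds[2];
	struct pollfd pfd;
	int n;

	if (pipe(fds) != 0)
		return (-1);
	close(fds[1]);			/* hang up the writer first */

	pfd.fd = fds[0];
	pfd.events = 0;			/* no events requested */
	pfd.revents = 0;
	n = poll(&pfd, 1, 1000);	/* bounded wait, in milliseconds */
	close(fds[0]);

	if (n < 0)
		return (-1);
	return (pfd.revents);
}
```

The failing case described above differs in timing only: the writer closes after the poller has gone to sleep, so the wakeup depends on selrecord having been called.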
Mateusz Guzik
6fc2b069ca rms: fixup concurrent writer handling and add more features
Previously the code had one wait channel for all pending writers.
This could result in a buggy scenario where, after a writer switches
the lock mode from readers to writers and goes off CPU, another writer
queues itself and then the last reader wakes up the latter instead
of the former.

Use a separate channel.

While here add features to reliably detect whether curthread has
the lock write-owned. This will be used by ZFS.
2020-11-04 21:18:08 +00:00
Mark Johnston
f7db0c9532 vmspace: Convert to refcount(9)
This is mostly mechanical except for vmspace_exit().  There, use the new
refcount_release_if_last() to avoid switching to vmspace0 unless other
processes are sharing the vmspace.  In that case, upon switching to
vmspace0 we can unconditionally release the reference.

Remove the volatile qualifier from vm_refcnt now that accesses are
protected using refcount(9) KPIs.

Reviewed by:	alc, kib, mmel
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27057
2020-11-04 16:30:56 +00:00
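The "release only if last" idea can be modelled in user space with C11 atomics. This is a sketch under assumptions: `struct ref`, `ref_release_if_last` and `ref_release` are hypothetical stand-ins, not the refcount(9) implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for a refcount(9) counter. */
struct ref {
	atomic_uint count;
};

/*
 * Drop the reference only if the caller holds the last one (1 -> 0).
 * Returns true if the reference was released; otherwise the count is
 * left untouched and the caller still owns its reference.
 */
static bool
ref_release_if_last(struct ref *r)
{
	unsigned expected = 1;

	return (atomic_compare_exchange_strong(&r->count, &expected, 0));
}

/* Unconditional release; returns true when the count reaches zero. */
static bool
ref_release(struct ref *r)
{
	unsigned old = atomic_fetch_sub(&r->count, 1);

	assert(old > 0);
	return (old == 1);
}
```

In the pattern the message describes, the exit path would try `ref_release_if_last()` first, and only when that fails (the object is shared) switch away and then call the unconditional release.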
Brooks Davis
19647e76fc sysvshm: pass relevant uap members as arguments
Alter shmget_allocate_segment and shmget_existing to take the values
they want from struct shmget_args rather than passing the struct
around.  In general, uap structures should only be the interface to
sys_<foo> functions.

This makes one small functional change and records the allocated space
rather than the requested space.  If this turns out to be a problem (e.g.
if software tries to find undersized segments by exact size rather than
using keys), we can correct that easily.

Reviewed by:	kib
Obtained from:	CheriBSD
MFC after:	1 week
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D27077
2020-11-03 19:14:03 +00:00
Conrad Meyer
2de07e4096 unix(4): Add SOL_LOCAL:LOCAL_CREDS_PERSISTENT
This option is intended to be semantically identical to Linux's
SOL_SOCKET:SO_PASSCRED.  For now, it is mutually exclusive with the
pre-existing sockopt SOL_LOCAL:LOCAL_CREDS.

Reviewed by:	markj (penultimate version)
Differential Revision:	https://reviews.freebsd.org/D27011
2020-11-03 01:17:45 +00:00
Mateusz Guzik
e1b6a7f83f malloc: prefix zones with malloc-
Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D27038
2020-11-02 17:39:15 +00:00
Mateusz Guzik
828afdda17 malloc: export kernel zones instead of relying on them being power-of-2
Reviewed by:	markj (previous version)
Differential Revision:	https://reviews.freebsd.org/D27026
2020-11-02 17:38:08 +00:00
Stefan Eßer
1ebef47735 Make sysctl user.localbase a tunable that can be written at run-time
This sysctl value had been provided as a read-only variable that is
compiled into the C library based on the value of _PATH_LOCALBASE in
paths.h.

After this change, the value is compiled into the kernel as an empty
string, which is translated to _PATH_LOCALBASE by the C library.

This empty string can be overridden at boot time or by a privileged
user at run time and will then be returned by sysctl.

When set to an empty string, the value returned by sysctl reverts to
_PATH_LOCALBASE.

This update does not change the behavior on any system that does
not modify the default value of user.localbase.

I consider this change as experimental and would prefer if the run-time
write permission was reconsidered and the sysctl variable defined with
CTLFLAG_RDTUN instead to restrict it to be set at boot time.

MFC after:	1 month
2020-10-31 23:48:41 +00:00
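The fallback rule described above (an empty kernel value means "use the compiled-in default") reduces to a few lines. This is only a model: `DEFAULT_LOCALBASE` stands in for `_PATH_LOCALBASE` with an assumed value, and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for _PATH_LOCALBASE from paths.h; the value is an assumption. */
#define DEFAULT_LOCALBASE "/usr/local"

/*
 * Hypothetical model of the lookup: an empty kernel value means
 * "not overridden", so the compiled-in default is returned instead.
 */
static const char *
effective_localbase(const char *kernel_value)
{
	assert(kernel_value != NULL);
	if (kernel_value[0] == '\0')
		return (DEFAULT_LOCALBASE);
	return (kernel_value);
}

/* True when the effective value equals the compiled-in default. */
static int
uses_default_localbase(const char *kernel_value)
{
	return (strcmp(effective_localbase(kernel_value),
	    DEFAULT_LOCALBASE) == 0);
}
```

Writing an empty string back therefore behaves like a reset, which matches the statement that the returned value reverts to the default.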
Mateusz Guzik
82c174a3b4 malloc: delegate M_EXEC handling to dedicated routines
It is almost never needed and adds an avoidable branch.

While here do minor clean ups in preparation for larger changes.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D27019
2020-10-30 20:02:32 +00:00
Stefan Eßer
147eea393f Add read only sysctl variable user.localbase
The value is provided by the C library as for other sysctl variables in
the user tree. It is compiled in and returns the value of _PATH_LOCALBASE
defined in paths.h.

Reviewed by:	imp, scottl
Differential Revision:	https://reviews.freebsd.org/D27009
2020-10-30 18:48:09 +00:00
Mateusz Guzik
0685574968 vfs: change vnode poll to just a malloc type
The size is 120, close fit for 128 and rarely used. The infrequent use
avoidably populates per-CPU caches and ends up with more memory.
2020-10-30 14:02:56 +00:00
Mateusz Guzik
4bfebc8d2c cache: add cache_vop_mkdir and rename cache_rename to cache_vop_rename
2020-10-30 10:46:35 +00:00
John Baldwin
36e0a362ac Add m_snd_tag_alloc() as a wrapper around if_snd_tag_alloc().
This gives a more uniform API for send tag life cycle management.

Reviewed by:	gallatin, hselasky
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D27000
2020-10-29 23:28:39 +00:00
Mateusz Guzik
62568e886a vfs: add NAMEI_DBG_HADSTARTDIR handling lost in rewrite
Noted by:	rpokala
2020-10-29 18:43:37 +00:00
Mateusz Guzik
eebc2e450f vfs: add NDREINIT to facilitate repeated namei calls
struct nameidata mixes caller arguments, internal state and output, which
can be quite error prone.

Recent addition of validating ni_resflags uncovered a caller which could
repeatedly call namei, effectively operating on partially populated state.

Add bare minimum validation that this does not happen. The real fix would
decouple aforementioned state.

Reported by:	pho
Tested by:	pho (different variant)
2020-10-29 12:56:02 +00:00
John Baldwin
521eac97f3 Support hardware rate limiting (pacing) with TLS offload.
- Add a new send tag type for a send tag that supports both rate
  limiting (packet pacing) and TLS offload (mostly similar to D22669
  but adds a separate structure when allocating the new tag type).

- When allocating a send tag for TLS offload, check to see if the
  connection already has a pacing rate.  If so, allocate a tag that
  supports both rate limiting and TLS offload rather than a plain TLS
  offload tag.

- When setting an initial rate on an existing ifnet KTLS connection,
  set the rate in the TCP control block inp and then reset the TLS
  send tag (via ktls_output_eagain) to reallocate a TLS + ratelimit
  send tag.  This allocates the TLS send tag asynchronously from a
  task queue, so the TLS rate limit tag alloc is always sleepable.

- When modifying a rate on a connection using KTLS, look for a TLS
  send tag.  If the send tag is only a plain TLS send tag, assume we
  failed to allocate a TLS ratelimit tag (either during the
  TCP_TXTLS_ENABLE socket option, or during the send tag reset
  triggered by ktls_output_eagain) and ignore the new rate.  If the
  send tag is a ratelimit TLS send tag, change the rate on the TLS tag
  and leave the inp tag alone.

- Lock the inp lock when setting sb_tls_info for a socket send buffer
  so that the routines in tcp_ratelimit can safely dereference the
  pointer without needing to grab the socket buffer lock.

- Add an IFCAP_TXTLS_RTLMT capability flag and associated
  administrative controls in ifconfig(8).  TLS rate limit tags are
  only allocated if this capability is enabled.  Note that TLS offload
  (whether unlimited or rate limited) always requires IFCAP_TXTLS[46].

Reviewed by:	gallatin, hselasky
Relnotes:	yes
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D26691
2020-10-29 00:23:16 +00:00
Konstantin Belousov
3cbf9dc81c Check for process group change in tty_wait_background().
The calling process's process group can change between PROC_UNLOCK(p)
and PGRP_LOCK(pg) in tty_wait_background(), e.g. by a setpgid() call
from another process. If that happens, the signal is not sent to the
calling process, even if the prior checks determine that one should be
sent.  Re-check that the process group hasn't changed after acquiring
the pgrp lock, and if it has, redo the checks.

PR:	250701
Submitted by:	Jakub Piecuch <j.piecuch96@gmail.com>
MFC after:	2 weeks
2020-10-28 22:12:47 +00:00
Edward Tomasz Napierala
bdc0cb4e2c Add local variable to store the sysent pointer. Just a cleanup,
no functional changes.

Reviewed by:	kib (earlier version)
MFC after:	2 weeks
Sponsored by:	EPSRC
Differential Revision:	https://reviews.freebsd.org/D26977
2020-10-28 14:43:38 +00:00
Edward Tomasz Napierala
bce7ee9d41 Drop "All rights reserved" from all my stuff. This includes
Foundation copyrights, approved by emaste@.  It does not include
files which carry other people's copyrights; if you're one
of those people, feel free to make similar change.

Reviewed by:	emaste, imp, gbe (manpages)
Differential Revision:	https://reviews.freebsd.org/D26980
2020-10-28 13:46:11 +00:00
Mateusz Guzik
11743b6e47 vfs: tidy up vnlru_free
Apart from cosmetic changes, make sure to only decrease the recycled counter
if vtryrecycle succeeded.

Tested by:	pho
2020-10-27 18:13:09 +00:00
Mateusz Guzik
68ac2b804c vfs: fix vnode reclaim races against getnewvnode
All vnodes allocated by UMA are present on the global list used by
vnlru. getnewvnode modifies the state of the vnode (most notably
altering v_holdcnt) but never locks it. Moreover filesystems also
modify it in arbitrary manners sometimes before taking the vnode
lock or adding any other indicator that the vnode can be used.

Picking up such a vnode by vnlru would be problematic.

To that end there are 2 fixes:
- vlrureclaim, not recycling v_holdcnt == 0 vnodes, takes the
interlock and verifies that v_mount has been set. It is an
invariant that the vnode lock is held by that point, providing
the necessary serialisation against locking after vhold.
- vnlru_free_locked, only wanting to free v_holdcnt == 0 vnodes,
now makes sure to only transition the count 0->1 and newly allocated
vnodes start with v_holdcnt == VHOLD_NO_SMR. getnewvnode will only
transition VHOLD_NO_SMR->1, once more making the hold fail.

Tested by:	pho
2020-10-27 18:12:07 +00:00
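The second fix relies on two compare-and-swap transitions that cannot both succeed on a fresh object. A minimal user-space model with C11 atomics follows; `struct obj`, the function names, and the sentinel value `HOLD_NOT_READY` are all hypothetical and only play the roles of the vnode, the kernel routines, and VHOLD_NO_SMR.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical sentinel playing the role of VHOLD_NO_SMR. */
#define HOLD_NOT_READY	(1u << 29)

struct obj {
	atomic_uint holdcnt;
};

/* A freshly allocated object is not yet eligible for the recycler. */
static void
obj_init(struct obj *o)
{
	atomic_init(&o->holdcnt, HOLD_NOT_READY);
}

/* Allocator side: publish the object by moving sentinel -> 1. */
static bool
obj_first_hold(struct obj *o)
{
	unsigned expected = HOLD_NOT_READY;

	return (atomic_compare_exchange_strong(&o->holdcnt, &expected, 1));
}

/* Recycler side: succeed only on the exact 0 -> 1 transition. */
static bool
obj_hold_if_unused(struct obj *o)
{
	unsigned expected = 0;

	return (atomic_compare_exchange_strong(&o->holdcnt, &expected, 1));
}
```

Because the recycler accepts only a count of exactly 0, it fails both on a not-yet-published object and on one the allocator has just claimed.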
Mateusz Guzik
d681c51d36 cache: add missing NIRES_ABS handling
2020-10-26 18:01:18 +00:00
Alexander Motin
3c0177b887 Enable bioq 'car limit' added at r335066 at 128 bios.
Without the 'car limit' enabled (before this), running sequential ZFS scrub
on HDD without command queuing support, I've measured latency on concurrent
random reads reaching 4 seconds (surprised that not more).  Enabling this
reduced the latency to 65 milliseconds, while scrub still doing ~180MB/s.

For disks with command queuing this does not make much difference (if any),
since most time all the requests are queued down to the disk or HBA, leaving
nothing in the queue to sort.  And even if something does not fit, staying on
the queue, it is likely not for long.  To not limit sorting in such bursty
scenarios I've added batched counter zeroing when the queue is getting empty.

The internal scheduler of the SAS HDD I was testing seems to be even more
loyal to random I/O, reducing the scrub speed to ~120MB/s.  So in case
somebody is worried this limit is too strict -- it actually looks relaxed.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2020-10-26 04:04:06 +00:00
Alexander Motin
8b220f8915 Fix asymmetry in devstat(9) calls by GEOM.
Before this GEOM passed bio pointer to transaction start, but not end.
It was irrelevant until devstat(9) got DTrace hooks, that appeared to
provide bio pointer on I/O completion, but not on submission.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2020-10-24 21:07:10 +00:00
Ruslan Bukin
f32f0095e9 o Add iommu de-initialization method for MSI interface.
o Add iommu_unmap_msi() to release the msi GAS entry.
o Provide default implementations for iommu init/deinit methods.

Reviewed by:	kib
Sponsored by:	Innovate DSbD
Differential Revision:	https://reviews.freebsd.org/D26906
2020-10-24 20:09:27 +00:00
Ryan Moeller
e58483c4fb sysctl+kern_sysctl: Honor SKIP for descendant nodes
Ensure we also skip descendants of SKIP nodes when iterating through children
of an explicitly specified node.

Reported by:	np
Reviewed by:	np
MFC after:	1 week
Sponsored by:	iXsystems, Inc.
Differential Revision:	https://reviews.freebsd.org/D26833
2020-10-24 16:17:07 +00:00
Ryan Moeller
0595c12484 kern_sysctl: Misc code cleanup
Remove unused oidpp parameter from sysctl_sysctl_next_ls and
add high level comments to describe how it works.

No functional change.

Reviewed by:	imp
MFC after:	1 week
Sponsored by:	iXsystems, Inc.
Differential Revision:	https://reviews.freebsd.org/D26854
2020-10-24 14:46:38 +00:00
Kyle Evans
275c821d3d audit: correct reporting of *execve(2) success
r326145 corrected do_execve() to return EJUSTRETURN upon success so that
important registers are not clobbered. This had the side effect of tapping
out 'failures' for all *execve(2) audit records, which is less than useful
for auditing purposes.

Audit exec returns earlier, where we can know for sure that EJUSTRETURN
translates to success. Note that this unsets TDP_AUDITREC as we commit the
audit record, so the usual audit in the syscall return path will do nothing.

PR:		249179
Reported by:	Eirik Oeverby <ltning-freebsd anduin net>
Reviewed by:	csjp, kib
MFC after:	1 week
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D26922
2020-10-24 14:39:17 +00:00
Mateusz Guzik
eb65cde4f5 cache: assorted typo fixes
2020-10-24 13:31:40 +00:00
Mateusz Guzik
029cfccc71 cache: add the missing NC_NOMAKEENTRY and NC_KEEPPOSENTRY to lockless lookup
They are de facto ignored.
2020-10-24 13:31:25 +00:00
Mateusz Guzik
7cc1718613 vfs: fix a race where reclaim vholds freed vnodes
Reported by:	pho
Tested by:	pho (previous version)
Fixes:	r366974 ("vfs: stop taking the interlock in vnode reclaim")
2020-10-24 13:30:37 +00:00
Mateusz Guzik
acb41008f3 cache: batch updates to numcache in case of mass removal
2020-10-24 01:14:52 +00:00
Mateusz Guzik
208cb7c4b6 cache: refactor alloc/free
This in particular centralizes manipulation of numcache.
2020-10-24 01:14:17 +00:00
Mateusz Guzik
1d44405690 cache: fold branch prediction into cache_ncp_canuse
2020-10-24 01:13:47 +00:00
Mateusz Guzik
c13d7d1f98 cache: fix some typos
2020-10-24 01:13:16 +00:00
Mateusz Guzik
f878526f20 cache: drop write-only vars
2020-10-24 01:13:02 +00:00
Ruslan Bukin
9729b14985 Move the iommu stubs to a generic place, so they are available on all the
platforms.

This lets the AHCI driver avoid depending on the IOMMU macro.

Requested by:	kib
Suggested by:	andrew
Reviewed by:	kib
Sponsored by:	Innovate DSbD
Differential Revision:	https://reviews.freebsd.org/D26887
2020-10-23 21:27:48 +00:00
Mateusz Guzik
3862838921 cache: reduce memory waste in struct namecache
The previous scheme for calculating the total size was doing sizeof
on the struct and then adding the wanted space for the buffer.

nc_name is at offset 58 while sizeof(struct namecache) is 64.
With CACHE_PATH_CUTOFF of 39 bytes and 1 byte of padding we were
allocating 104 bytes for the entry and never accounting for the 6
byte padding, wasting that space.
2020-10-23 15:56:22 +00:00
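The difference between the two sizing schemes can be reproduced with a small stand-in structure. The layout below is hypothetical and merely arranged so that the name buffer starts at offset 58; on common 64-bit ABIs its `sizeof` is 64, which gives the 104-byte figure quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PATH_CUTOFF 39	/* mirrors the CACHE_PATH_CUTOFF value quoted above */

/* Hypothetical layout: 58 bytes of fixed fields, then the name buffer. */
struct entry {
	uint64_t	links[7];	/* 56 bytes */
	uint8_t		flags;
	uint8_t		nlen;
	char		name[];		/* flexible array member at offset 58 */
};

/* Old scheme: whole struct, trailing padding included, plus the buffer. */
static size_t
entry_size_sizeof(size_t namelen)
{
	return (sizeof(struct entry) + namelen + 1);
}

/* New scheme: start counting where the name actually begins. */
static size_t
entry_size_offsetof(size_t namelen)
{
	return (offsetof(struct entry, name) + namelen + 1);
}
```

With a 39-byte name the second function yields 58 + 39 + 1 = 98 bytes, so the trailing padding of the struct is reused for the name instead of being allocated on top of it.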
Mateusz Guzik
703f3fafa5 vfs: stop taking the interlock in vnode reclaim
It no longer protects any of the tested fields, keeping all the checks racy.

While here make vtryrecycle drop the vnode on its own. Avoids an additional
lock trip.
2020-10-23 15:49:18 +00:00
Mateusz Guzik
c7520caa4f vfs: prevent avoidable evictions on mkdir of existing directories
mkdir -p /foo/bar/baz will mkdir each path component and ignore EEXIST.

The NOCACHE lookup will make the namecache unnecessarily evict the existing entry,
and then fall back to the fs lookup routine, eventually leading namei to return an
error as the directory is already there.

For invocations like mkdir -p /usr/obj/usr/src/sys/GENERIC/modules this triggers
fallbacks to the slowpath for concurrently executing lookups.

Tested by:	pho
Discussed with:	kib
2020-10-22 19:28:12 +00:00
Mateusz Guzik
54f09403a3 cache: assert the created entry does not point to itself
2020-10-22 19:22:34 +00:00