Commit Graph

19592 Commits

Author SHA1 Message Date
Mateusz Guzik
8495fa49ea vfs: whack spurious comments from syncer's vop_vector 2023-04-06 15:20:40 +00:00
Konstantin Belousov
11cdffc603 Regen 2023-04-04 16:19:08 +03:00
Konstantin Belousov
dac3102488 Rename kqueue1(2) to kqueuex(2) to avoid compat issues with NetBSD
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D39377
2023-04-04 16:19:08 +03:00
Randall Stewart
73ee5756de Fixes in the tcp infrastructure with respect to stack changes as well as other infrastructure updates for incoming rack features.
So stack switching as always been a bit of a issue. We currently use a break before make setup which means that
if something goes wrong you have to try to get back to a stack. This patch among a lot of other things changes that so
that it is a make before break. We also expand some of the function blocks in prep for new features in rack that will allow
more controlled pacing. We also add other abilities such as the pathway for a stack to query a previous stack to acquire from
it critical state information so things in flight don't get dropped or mis-handled when switching stacks. We also add the
concept of a timer granularity. This allows an alternate stack to change from the old ticks granularity to microseconds and
of course this even gives us a pathway to go to nanosecond timekeeping if we need to (something for the data center to consider
for sure).

Once all this lands I will then update rack to begin using all these new features.

Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D39210
2023-04-01 01:46:38 -04:00
Mark Johnston
cab1056105 kdb: Modify securelevel policy
Currently, sysctls which enable KDB in some way are flagged with
CTLFLAG_SECURE, meaning that you can't modify them if securelevel > 0.
This is so that KDB cannot be used to lower a running system's
securelevel, see commit 3d7618d8bf.  However, the newer mac_ddb(4)
restricts DDB operations which could be abused to lower securelevel
while retaining some ability to gather useful debugging information.

To enable the use of KDB (specifically, DDB) on systems with a raised
securelevel, change the KDB sysctl policy: rather than relying on
CTLFLAG_SECURE, add a check of the current securelevel to kdb_trap().
If the securelevel is raised, only pass control to the backend if MAC
specifically grants access; otherwise simply check to see if mac_ddb
vetoes the request, as before.

Add a new secure sysctl, debug.kdb.enter_securelevel, to override this
behaviour.  That is, the sysctl lets one enter a KDB backend even with a
raised securelevel, so long as it is set before the securelevel is
raised.

Reviewed by:	mhorne, stevek
MFC after:	1 month
Sponsored by:	Juniper Networks
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D37122
2023-03-30 10:45:00 -04:00
Mateusz Guzik
80cf427b8d proc: shave a lock trip on exit if possible
... which happens to be vast majority of the time
2023-03-29 09:19:03 +00:00
Mateusz Guzik
37337709d3 cred: convert the refcount from int to long
On 64-bit platforms this sorts out worries about mitigating bugs which
overflow the counter, all while not pessimizng anything -- most notably
it avoids whacking per-thread operation in favor of refcount(9) API.

The struct already had two instances of 4 byte padding with 256 bytes in
size, cr_flags gets moved around to avoid growing it.

32-bit platforms could also get the extended counter, but I did not do
it as one day(tm) the mutex protecting centralized operation should be
replaced with atomics and 64-bit ops on 32-bit platforms remain quite
penalizing.

While worries of counter overflow are addressed, the following is not
(just like it would not be with conversion to refcount(9)):
- counter *underflows*
- buffer overruns from adjacent allocations
- UAF due to stale cred pointer
- .. and other goodies

As such, while lipstick was placed, the pig should not be participating
in any beauty pageants.

Prodded by:	emaste
Differential Revision:	https://reviews.freebsd.org/D39220
2023-03-29 05:02:32 +00:00
Konstantin Belousov
6a0a634590 Regen 2023-03-28 02:39:26 +03:00
Konstantin Belousov
61194e9852 Add kqueue1() syscall
It takes the flags argument.  Immediate use is to provide the KQUEUE_CLOEXEC
flag for kqueue(2).

Reviewed by:	emaste, jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D39271
2023-03-28 02:39:26 +03:00
Alexander V. Chernikov
04f75b9802 netlink: allow netlink sockets in non-vnet jails.
This change allow to open Netlink sockets in the non-vnet jails, even for
 unpriviledged processes.
The security model largely follows the existing one. To be more specific:
* by default, every `NETLINK_ROUTE` command is **NOT** allowed in non-VNET
 jail UNLESS `RTNL_F_ALLOW_NONVNET_JAIL` flag is specified in the command
 handler.
* All notifications are **disabled** for non-vnet jails (requests to
 subscribe for the notifications are ignored). This will change to be more
 fine-grained model once the first netlink provider requiring this gets
 committed.
* Listing interfaces (RTM_GETLINK) is **allowed** w/o limits (**including**
 interfaces w/o any addresses attached to the jail). The value of this is
 questionable, but it follows the existing approach.
* Listing ARP/NDP neighbours is **forbidden**. This is a **change** from the
 current approach - currently we list static ARP/ND entries belonging to the
 addresses attached to the jail.
* Listing interface addresses is **allowed**, but the addresses are filtered
 to match only ones attached to the jail.
* Listing routes is **allowed**, but the routes are filtered to provide only
 host routes matching the addresses attached to the jail.
* By default, every `NETLINK_GENERIC` command is **allowed** in non-VNET jail
 (as sub-families may be unrelated to network at all).
 It is the goal of the family author to implement the restriction if
 necessary.

Differential Revision: https://reviews.freebsd.org/D39206
MFC after:	1 month
2023-03-26 08:44:09 +00:00
Mateusz Guzik
22eb66d961 vfs cache: always assert on ndp->ni_resflags 2023-03-25 21:57:55 +00:00
Mateusz Guzik
138a5dafba vfs: trylock vnode requeue
The quasi-LRU still gets in the way for example when doing an
incremental bzImage build, with vnode_list lock being at the
top of the profile. Further damage control the problem by trylocking.

Note the entire mechanism desperately wants to be reaped out in favor
of something(tm) which both scales in a multicore setting and provides
sensible replacement policy.

With this change everything vfs almost disappears from the on CPU
flamegraph, what is left is tons of contention in the VM.
2023-03-25 13:42:27 +00:00
Mateusz Guzik
245767c278 vfs: flip deferred_inact to atomic
Turns out it is very rarely triggered, making a per-cpu
counter a waste.

Examples from real life boxes:
uptime		counter
135 days	847
138 days	2190
141 days	1
2023-03-25 13:42:27 +00:00
Mateusz Guzik
e5eb1d298f vfs: replace some spelled out VNASSERTs with VNPASS
nfc
2023-03-25 13:42:27 +00:00
Kyle Evans
89c52f9d59 arm64: add KASAN support
This entails:
- Marking some obvious candidates for __nosanitizeaddress
- Similar trap frame markings as amd64, for similar reasons
- Shadow map implementation

The shadow map implementation is roughly similar to what was done on
amd64, with some exceptions.  Attempting to use available space at
preinit_map_va + PMAP_PREINIT_MAPPING_SIZE (up to the end of that range,
as depicted in the physmap) results in odd failures, so we instead
search the physmap for free regions that we can carve out, fragmenting
the shadow map as necessary to try and fit as much as we need for the
initial kernel map.  pmap_bootstrap_san() is thus after
pmap_bootstrap(), which still included some technically reserved areas
of the memory map that needed to be included in the DMAP.

The odd failure noted above may be a bug, but I haven't investigated it
all that much.

Initial work by mhorne with additional fixes from kevans and markj.

Reviewed by:	andrew, markj
Sponsored by:	Juniper Networks, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D36701
2023-03-23 16:34:33 -05:00
John Baldwin
d2dab20c2a ktls: Drop all the INET and INET6 compile-time guards.
Consistent with 9fd0d9b16e, KERN_TLS is
not supported on kernels without any INET support.

Reviewed by:	gallatin, hselasky
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D39232
2023-03-23 14:29:07 -07:00
Mateusz Guzik
c16c4ea6d3 vfs cache: return ENOTDIR for not_a_dir/{.,..} lookups
Reported by:	Oliver Kiddle
PR:	270419
MFC:	3 days
2023-03-23 19:31:18 +00:00
Mateusz Guzik
b5d43972e3 vfs: decouple freevnodes from vnode batching
In principle one cpu can keep vholding vnodes, while another vdrops
them. In this case it may be the local count will keep growing in an
unbounded manner. Roll it up after a threshold instead.

While here move it out of dpcpu into struct pcpu.

Reviewed by:	kib (previous version)
Differential Revision:	https://reviews.freebsd.org/D39195
2023-03-22 23:57:25 +00:00
Mark Johnston
b4b33821fa ktls: Fix interlocking between ktls_enable_rx() and listen(2)
The TCP_TXTLS_ENABLE and TCP_RXTLS_ENABLE socket option handlers check
whether the socket is listening socket and fail if so, but this check is
racy.  Since we have to lock the socket buffer later anyway, defer the
check to that point.

ktls_enable_tx() locks the send buffer's I/O lock, which will fail if
the socket is a listening socket, so no explicit checks are needed.  In
ktls_enable_rx(), which does not acquire the I/O lock (see the review
for some discussion on this), use an explicit SOLISTENING() check after
locking the recv socket buffer.

Otherwise, a concurrent solisten_proto() call can trigger crashes and
memory leaks by wiping out socket buffers as ktls_enable_*() is
modifying them.

Also make sure that a KTLS-enabled socket can't be converted to a
listening socket, and use SOCK_(SEND|RECV)BUF_LOCK macros instead of the
old ones while here.

Add some simple regression tests involving listen(2).

Reported by:	syzkaller
MFC after:	2 weeks
Reviewed by:	gallatin, glebius, jhb
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D38504
2023-03-21 16:04:00 -04:00
Mitchell Horne
8965b3033e callout(9): adopt old references to timeout(9)
timeout(9) was removed a couple of years ago; all consumers now use the
callout(9) interface.

Explicitly do not bump .Dd anywhere, as this is not a content or
semantic change.

Reviewed by:	markj, jhb, Pau Amma <pauamma@gundo.com>
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D39136
2023-03-20 17:12:12 -03:00
Mark Johnston
c3179891f8 kerneldump: Inline dump_savectx() into its callers
The callers of dump_savectx() (i.e., doadump() and livedump_start())
subsequently call dumpsys()/minidumpsys(), which dump the calling
thread's stack when writing the dump.  If dump_savectx() gets its own
stack frame, that frame might be clobbered when its caller later calls
dumpsys()/minidumpsys(), making it difficult for debuggers to unwind the
stack.

Fix this by making dump_savectx() a macro, so that savectx() is always
called directly by the function which subsequently calls
dumpsys()/minidumpsys().

This fixes stack unwinding for the panicking thread from arm64
minidumps.  The same happened to work on amd64, but kgdb reports the
dump_savectx() calls as coming from dumpsys(), so in that case it
appears to work by accident.

Fixes:	c9114f9f86 ("Add new vnode dumper to support live minidumps")
Reviewed by:	mhorne, jhb
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D39151
2023-03-20 14:16:28 -04:00
Mateusz Guzik
62a573d953 vfs: retire KERN_VNODE
It got disabled in 2003:

commit acb18acfec
Author: Poul-Henning Kamp <phk@FreeBSD.org>
Date:   Sun Feb 23 18:09:05 2003 +0000

    Bracket the kern.vnode sysctl in #ifdef notyet because it results
    in massive locking issues on diskless systems.

    It is also not clear that this sysctl is non-dangerous in its
    requirements for locked down memory on large RAM systems.

There does not seem to be practical use for it and the disabled routine
does not work anyway.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D39127
2023-03-17 16:21:45 +00:00
Mina Galić
0b0ae2e4cd jail: convert several functions from int to bool
these functions exclusively return (0) and (1), so convert them to bool

We also convert some networking related jail functions from int to bool
some of which were returning an error that was never used.

Differential Revision: https://reviews.freebsd.org/D29659
Reviewed by: imp, jamie (earlier version)
Pull Request: https://github.com/freebsd/freebsd-src/pull/663
2023-03-14 21:05:33 -06:00
Mark Johnston
cd133525fa smr: Remove the return value from smr_wait()
This is supposed to be a blocking version of smr_poll(), so there's no
need for a return value.  No functional change intended.

MFC after:	1 week
2023-03-13 10:45:35 -04:00
Kyle Evans
cc0fe048ec kern: physmem: don't create a new exregion for different flags...
... if the region we're adding is an exact match to one that we already
have.  Simply extend the flags of the existing entry as needed so that
we don't end up with duplicate regions.

It could be that we got the exclusion through two different means, e.g.,
FDT memreserve and the EFI memory map, and we may derive different
characteristics from each.  Apply the most restrictive set to the
region.

Reported by:	Mark Millard <marklmi yahoo com>
Reviewed by:	mhorne
2023-03-09 23:27:39 -06:00
Justin Hibbits
084846271a ktls: Use IfAPI accessors to get capabilities
Summary:
Avoid referencing the ifnet struct directly, and use the IfAPI accessors
instead.

Reviewed by:	gallatin
Sponsored by:	Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D38932
2023-03-07 09:47:00 -05:00
Mark Johnston
831601773e deadlkres: Make parameters settable with tunables
MFC after:	1 week
Sponsored by:	Klara, Inc.
Sponsored by:	Juniper Networks, Inc.
2023-03-03 11:16:41 -05:00
Rick Macklem
cbbb22031f kern_jail.c: Remove #ifdefs for VNET_NFSD
The consensus was that VNET_NFSD was not needed.
This patch removes it from kern_jail.c.

With this patch, support for the "allow.nfsd"
jail parameter is enabled in the kernel for
kernels built with "options VIMAGE".

Reviewed by:	markj
MFC after:	3 months
Differential Revision:	https://reviews.freebsd.org/D38808
2023-03-02 13:13:24 -08:00
Rick Macklem
4bbbd5875d vfs_mount.c: Allow mountd(8) to do exports in a vnet prison
To run mountd in a vnet prison, three checks in vfs_domount()
and vfs_domount_update() related to doing exports needed
to be changed, so that a file system visible within the
prison but mounted outside the prison can be exported.

I did all three in a minimal way, only changing the checks for
the specific case of a process (typically mountd) doing exports
within a vnet prison and not updating the mount point in other
ways.  The changes are:
- Ignore the error return from vfs_suser(), since the file
  system being mounted outside the prison will cause it to fail.
- Use the priv_check(PRIV_NFS_DAEMON) for this specific case
  within a prison.
- Skip the call to VFS_MOUNT(), since it will return an error,
  due to the "from" argument not being set correctly.  VFS_MOUNT()
  does not appear to do anything for the case of doing exports only.

Reviewed by:	markj
MFC after:	3 months
Differential Revision:	https://reviews.freebsd.org/D37741
2023-03-02 13:09:01 -08:00
Mark Johnston
bcd8cd859e buf: Make buf_daemon_shutdown() a no-op after a panic
As in commit 9d7cc536e2, there is no need to do anything in this
context.

MFC after:	1 week
2023-03-01 10:15:54 -05:00
Mateusz Guzik
a357112938 kern: whack __mips__ leftover
Sponsored by:	Rubicon Communications, LLC ("Netgate")
2023-03-01 11:05:12 +00:00
Zhenlei Huang
2c33b456ff jail: Improve readability
No functional change intended.

Reviewed by:	melifaro
Differential Revision:	https://reviews.freebsd.org/D37890
2023-02-28 18:20:07 +08:00
Zhenlei Huang
500f82d6c3 jail: Use flexible array member within struct prison_ip
Current implementation utilize off-by-one struct prison_ip to access the
IPv[46] addresses. It is error prone and hence comes the regression fix
21ad3e27fa and ddbf879d79. Use flexible array member so that compiler
will catch such errors and it will also be easier to review.

No functional change intended.

Reviewed by:	melifaro, glebius
Differential Revision:	https://reviews.freebsd.org/D37874
2023-02-28 18:20:06 +08:00
Sebastian Huber
28ed159f26 pps: Round to closest integer in pps_event()
The comment above bintime2timespec() says:

  When converting between timestamps on parallel timescales of differing
  resolutions it is historical and scientific practice to round down.

However, the delta_nsec value is a time difference and not a timestamp.  Also
the rounding errors accumulate in the frequency accumulator, see hardpps().
So, rounding to the closest integer is probably slightly better.

Reviewed by: imp
Pull Request: https://github.com/freebsd/freebsd-src/pull/604
2023-02-27 15:10:55 -07:00
Sebastian Huber
1e48d9d336 pps: Simplify the nsec calculation in pps_event()
Let A be the current calculation of the frequency accumulator (pps_fcount)
update in pps_event()

  scale = (uint64_t)1 << 63;
  scale /= captc->tc_frequency;
  scale *= 2;
  bt.sec = 0;
  bt.frac = 0;
  bintime_addx(&bt, scale * tcount);
  bintime2timespec(&bt, &ts);
  hardpps(tsp, ts.tv_nsec + 1000000000 * ts.tv_sec);

and hardpps(..., delta_nsec):

  u_nsec = delta_nsec;
  if (u_nsec > (NANOSECOND >> 1))
          u_nsec -= NANOSECOND;
  else if (u_nsec < -(NANOSECOND >> 1))
          u_nsec += NANOSECOND;
  pps_fcount += u_nsec;

This change introduces a new calculation which is slightly simpler and more
straight forward.  Name it B.

Consider the following sample values with a tcount of 2000000100 and a
tc_frequency of 2000000000 (2GHz).

For A, the scale is 9223372036.  Then scale * tcount is 18446744994337203600
which is larger than UINT64_MAX (= 18446744073709551615).  The result is
920627651984 == 18446744994337203600 % UINT64_MAX.  Since all operands are
unsigned the result is well defined through modulo arithmetic.  The result of
bintime2timespec(&bt, &ts) is 49.  This is equal to the correct result
1000000049 % NANOSECOND.

In hardpps(), both conditional statements are not executed and pps_fcount is
incremented by 49.

For the new calculation B, we have 1000000000 * tcount is 2000000100000000000
which is less than UINT64_MAX. This yields after the division with tc_frequency
the correct result of 1000000050 for delta_nsec.

In hardpps(), the first conditional statement is executed and pps_fcount is
incremented by 50.

This shows that both methods yield roughly the same results.  However, method B
is easier to understand and requires fewer conditional statements.

Reviewed by: imp
Pull Request: https://github.com/freebsd/freebsd-src/pull/604
2023-02-27 15:10:55 -07:00
Sebastian Huber
8a142484d4 pps: Directly assign the timestamps in pps_event()
Reviewed by: imp
Pull Request: https://github.com/freebsd/freebsd-src/pull/604
2023-02-27 15:10:55 -07:00
Sebastian Huber
0448501f2b pps: Move pcount assignment in pps_event()
Move the pseq increment.  This makes it possible to reuse registers earlier.

Reviewed by: imp
Pull Request: https://github.com/freebsd/freebsd-src/pull/604
2023-02-27 15:10:55 -07:00
Sebastian Huber
fd88f4e190 pps: Simplify capture and event processing
Use local variables for the captured timehand and timecounter in pps_event().
This fixes a potential issue in the nsec preparation for hardpps().  Here the
timecounter was accessed through the captured timehand after the generation was
checked.

Make a snapshot of the relevent timehand values early in pps_event().  Check
the timehand generation only once during the capture and event processing.  Use
atomic_thread_fence_acq() similar to the other readers.

Reviewed by: imp
Pull Request: https://github.com/freebsd/freebsd-src/pull/604
2023-02-27 15:10:55 -07:00
Sebastian Huber
cb2a028b15 pps: Load timecounter once in pps_capture()
This ensures that the timecounter and the tc_get_timecount handler belong
together.

Reviewed by: imp
Pull Request: https://github.com/freebsd/freebsd-src/pull/604
2023-02-27 15:10:54 -07:00
Mateusz Guzik
1ebec3806e vfs: s/ppsratecheck/eventratecheck
nfc
2023-02-24 19:30:49 +00:00
Mateusz Guzik
83158c6893 time: s/ppsratecheck/eventratecheck
The routine is used as a general event-limiting routine in places which
have nothing to do with packets.

Provide a define to keep everything happy.

Reviewed by:	rew
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D38746
2023-02-24 19:26:36 +00:00
Mark Johnston
9d7cc536e2 buf: Make bufspace_daemon_shutdown() a no-op after a panic
This function doesn't need to do anything in that context, and calling
wakeup() can lead to recursive panics.

Discussed with:	mhorne
MFC after:	1 week
2023-02-23 21:56:36 -05:00
Mitchell Horne
9a7f7c26c5 lockmgr: upgrade panic return checks
We short-circuit lockmgr functions in the face of a kernel panic. Other
lock implementations do this with a SCHEDULER_STOPPED() check, which
covers the additional case where the debugger is active but the system
has not panicked. Update this code to match that behaviour.

Reviewed by:	mjg, kib, markj
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D38655
2023-02-22 11:12:22 -04:00
Rick Macklem
88175af8b7 vfs_export: Add mnt_exjail to control exports done in prisons
If there are multiple instances of mountd(8) (in different
prisons), there will be confusion if they manipulate the
exports of the same file system.  This patch adds mnt_exjail
to "struct mount" so that the credentials (and, therefore,
the prison) that did the exports for that file system can
be recorded.  If another prison has already exported the
file system, vfs_export() will fail with an error.
If mnt_exjail == NULL, the file system has not been exported.
mnt_exjail is checked by the NFS server, so that exports done
from within a different prison will not be used.

The patch also implements vfs_exjail_destroy(), which is
called from prison_cleanup() to release all the mnt_exjail
credential references, so that the prison can be removed.
Mainly to avoid doing a scan of the mountlist for the case
where there were no exports done from within the prison,
a count of how many file systems have been exported from
within the prison is kept in pr_exportcnt.

Reviewed by:	markj
Discussed with:	jamie
Differential Revision:	https://reviews.freebsd.org/D38371
MFC after:	3 months
2023-02-21 13:00:42 -08:00
Gleb Smirnoff
71e70c25c0 Revert "unix/dgram: return EAGAIN instead of ENOBUFS when O_NONBLOCK set"
This API change led to unexpected consequences with Go runtime. The
Go runtime emulates blocking sockets over non-blocking sockets and
for that uses available event dispatcher on the target OS, which is
kevent(2) if availabe, with OS independent layer on top.  It expects
that if whatever O_NONBLOCK socket returned ever EAGAIN, then it is
supposed to be reported as writable by the event dispatcher. kevent(2)
would never report a unix/dgram socket, since they never change their
state, they always are writeable.  The expectations of Go are not
literally specified by SUS, however they are in its spirit.  The SUS
specifies EAGAIN for send(2) as "The socket's file descriptor is marked
O_NONBLOCK and the requested operation would block" [1].  This doesn't
apply to FreeBSD unix/dgram socket, it never blocks on send(2).

Thus, changing API trying to mimic Linux was a mistake.  But what about
the problem we tried to fix? Discussed that with Max Dounin of nginx,
and we agreed that the log bomb described shall be fixed on nginx side,
and it actually isn't specific to FreeBSD, may happen with nginx on any
non-Linux system with a certain configuration.

[1] https://pubs.opengroup.org/onlinepubs/9699919799/functions/send.html

This reverts commit 65572cade3.
2023-02-21 08:50:07 -08:00
Zhenlei Huang
b2d76b52fd jail: Fix redoing ip restricting
`prison_ip_restrict()` is called in loop FOREACH_PRISON_DESCENDANT_LOCKED.
While under low memory, it is still possible that in subsequent rounds
`prison_ip_restrict()` succeed and `redo_ip[46]` flip over from true to
false, thus leave some prisons's IPv[46] addresses unrestricted.

Reviewed by:	jamie
Fixes:		8bce8d28ab jail: Avoid multipurpose return value of function prison_ip_restrict()
Differential Revision:	https://reviews.freebsd.org/D38697
2023-02-21 23:43:25 +08:00
Konstantin Belousov
836e4b371b kern/sysv_ipc.c: use ANSI C function definition
Also remove pointless return's.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2023-02-21 16:02:46 +02:00
Mateusz Guzik
de709b1455 sx: whack set-but-not-used warn in _sx_slock_hard
Sponsored by:	Rubicon Communications, LLC ("Netgate")
2023-02-21 13:49:14 +00:00
Mateusz Guzik
dbcd7e7e32 vfs cache: whack set-but-not-used warn in cache_purgevfs
Reported by:	kib
Sponsored by:	Rubicon Communications, LLC ("Netgate")
2023-02-21 13:48:35 +00:00
Kyle Evans
c32946d8be kern: physmem: fix the format string again, i is a size_t
Fixes the riscv LINT build.

Fixes:	7b5cb32fca ("kern: physmem: properly cast %jx [...]")
2023-02-20 23:39:38 -06:00
Kyle Evans
7b5cb32fca kern: physmem: properly cast %jx arguments to uintmax_t
While we're here, slap prfunc with a __printflike to get compiler
checking on args to catch silly mistakes like this.

Reported by:	jrtc27
2023-02-20 16:12:55 -06:00
Kyle Evans
cd73914b01 kern: physmem: don't truncate addresses in DEBUG output
Make it consistent with the above region printing, otherwise it appears
to be somewhat confusing.
2023-02-20 12:55:04 -06:00
Elliott Mitchell
3aed0ffc15 kern/clock: remove interrupt reporting from watchdog_fire()
The interrupt counts may have been valuable in the past, but now DDB can
readily provide them via 'show intrcnt'. This is one of the only
consumers of these counter arrays outside of the interrupt code itself,
and this should be avoided.

Reviewed by:	mhorne, fuz
Differential Revision: https://reviews.freebsd.org/D37870
2023-02-16 17:24:29 -04:00
John Baldwin
98844e99d4 aio: Fix more synchronization issues in aio_biowakeup.
- Use atomic_store to set job->error.  atomic_set does an or
  operation, not assignment.

- Use refcount_* to manage job->nbio.

  This ensures proper memory barriers are present so that the last bio
  won't see a possibly stale value of job->error.

- Don't re-read job->error after reading it via atomic_load.

Reported by:	markj (1)
Reviewed by:	mjg, markj
Differential Revision:	https://reviews.freebsd.org/D38611
2023-02-15 13:32:52 -08:00
John Baldwin
cca6d6160f aio_biowakeup: Various style fixes. 2023-02-15 10:57:08 -08:00
Keith Reynolds
40734fc57e aio: Fix a test and set race in aio_biowakeup.
Use atomic_fetchadd in place of separate atomic_subtract / atomic_load.

Reviewed by:	markj
Sponsored by:	HPE TidalScale
Differential Revision:	https://reviews.freebsd.org/D38559
2023-02-15 10:56:39 -08:00
Mitchell Horne
28137bdb19 intrng: track counter allocation with a bitmap
Crucially, this allows releasing counters, and interrupt sources by
extension. Where before we were incrementing intrcnt_index with atomics,
now we protect the bitmap using the existing isrc_table_lock mutex.

Reviewed by:	mmel
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D38437
2023-02-14 14:06:00 -04:00
Mitchell Horne
82e846df5b intrng: sort includes
MFC after:	3 days
2023-02-14 14:06:00 -04:00
Mark Johnston
636b19ead4 tcp: Disallow re-connection of a connected socket
soconnectat() tries to ensure that one cannot connect a connected
socket.  However, the check is racy and does not really prevent two
threads from attempting to connect the same TCP socket.

Modify tcp_connect() and tcp6_connect() to perform the check again, this
time synchronized by the inpcb lock, under which we call
soisconnecting().

Reported by:	syzkaller
Reviewed by:	glebius
MFC after:	2 weeks
Sponsored by:	Klara, Inc.
Sponsored by:	Modirum MDPay
Differential Revision:	https://reviews.freebsd.org/D38507
2023-02-14 10:07:19 -05:00
Konstantin Belousov
020e8a4d06 allocbuf(): convert direct panic() calls to KASSERT()s
Also do minor style adjustments.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D38549
2023-02-14 00:28:42 +02:00
Mateusz Guzik
a066bba2da ntptime: ansify
Sponsored by:	Rubicon Communications, LLC ("Netgate")
2023-02-13 18:24:13 +00:00
Mateusz Guzik
00343b4adc uipc: ansify
Sponsored by:	Rubicon Communications, LLC ("Netgate")
2023-02-13 18:20:29 +00:00
Mitchell Horne
78919798e7 kern_poll: include sys/sched.h
For sched_relinquish(). This fixes the build for some kernel configs.

Reported by:	Jenkins
Fixes:		1029dab634 ("mi_switch(): clean up switch types and their usage")
2023-02-09 17:13:02 -04:00
Andrew Gallatin
d24b032bec ktls: Fix comments & whitespace issues with c0e4090e3d
Address some last minute review feedback on c0e4090e3d
by fixing spacing around comments, and clarifying that the
newly added destroy_task is not related to tls 1.0.
No functional change intended.

Pointed out by: jhb
Sponsored by: Netflix
2023-02-09 14:11:24 -05:00
Andrew Gallatin
c0e4090e3d ktls: Accurately track if ifnet ktls is enabled
This allows us to avoid spurious calls to ktls_disable_ifnet()

When we implemented ifnet kTLSe, we set a flag in the tx socket
buffer (SB_TLS_IFNET) to indicate ifnet kTLS.  This flag meant that
now, or in the past, ifnet ktls was active on a socket.  Later,
I added code to switch ifnet ktls sessions to software in the case
of lossy TCP connections that have a high retransmit rate.
Because TCP was using SB_TLS_IFNET to know if it needed to do math
to calculate the retransmit ratio and potentially call into
ktls_disable_ifnet(), it was doing unneeded work long after
a session was moved to software.

This patch carefully tracks whether or not ifnet ktls is still enabled
on a TCP connection.  Because the inp is now embedded in the tcpcb, and
because TCP is the most frequent accessor of this state, it made sense to
move this from the socket buffer flags to the tcpcb. Because we now need
reliable access to the tcbcb, we take a ref on the inp when creating a tx
ktls session.

While here, I noticed that rack/bbr were incorrectly implementing
tfb_hwtls_change(), and applying the change to all pending sends,
when it should apply only to future sends.

This change reduces spurious calls to  ktls_disable_ifnet() by 95% or so
in a Netflix CDN environment.

Reviewed by: markj, rrs
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D38380
2023-02-09 12:44:44 -05:00
Mitchell Horne
1029dab634 mi_switch(): clean up switch types and their usage
Overall, this is a non-functional change, except for kernels built with
SCHED_STATS. However, the switch types are useful for communicating the
intent of the caller.

1. Ensure that every caller provides a type. In most cases, we upgrade
   the basic yield to sched_relinquish() aka SWT_RELINQUISH.
2. The case of sched_bind() is distinct, so add a new switch type SWT_BIND.
3. Remove the two unused types, SWT_PREEMPT and SWT_SLEEPQTIMO.
4. Remove SWT_NONE altogether and assert that callers always provide
   a type flag.
5. Reference the mi_switch(9) man page in the comments, as these flags
   will be documented there.

Reviewed by:	kib, markj
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D38184
2023-02-09 12:01:32 -04:00
Mitchell Horne
bff02948ed sched_4bsd: use the same switch flags as ULE
ULE uses the more specific SWT_REMOTEPREEMPT and SWT_REMOTEWAKEIDLE
switch types, let's do that here as well. SWT_PREEMPT is somewhat
redundant when we also have the SW_PREEMPT flag.

This only has an effect for kernels built with SCHED_STATS.

Reviewed by:	kib, markj
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D38183
2023-02-09 12:01:32 -04:00
Mitchell Horne
dc9b13736f Use maybe_yield() in a few more places
Reviewed by:	kib, markj
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D38186
2023-02-09 11:58:06 -04:00
Mitchell Horne
d570418bd8 Boolify should_yield()
Do this ahead of adding a man page that describes the function. No
functional change.

Reviewed by:	kib, markj
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D38181
2023-02-09 11:58:06 -04:00
Mitchell Horne
a7a452fedc Update comments referencing create_thread()
The equivalent function is now named thread_create(). Mention
kthread_add() where it is also relevant.

Reviewed by:	kib, markj
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D38180
2023-02-09 11:58:06 -04:00
Mitchell Horne
e6cf1a0826 physmem: add ram0 pseudo-driver
Its purpose is to reserve all I/O space belonging to physical memory
from nexus, preventing it from being handed out by bus_alloc_resource()
to callers such as xenpv_alloc_physmem(), which looks for the first
available free range it can get. This mimics the existing pseudo-driver
on x86.

If needed, the device can be disabled with hint.ram.0.disabled="1" in
/boot/device.hints.

Reviewed by:	imp
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D32343
2023-02-08 16:50:46 -04:00
Mateusz Guzik
08d357287b sysv: ansify
Reported by:	clang 15
Sponsored by:	Rubicon Communications, LLC ("Netgate")
2023-02-08 00:11:10 +00:00
Mateusz Guzik
8377575772 vfs: ansify
Reported by:	clang 15
Sponsored by:	Rubicon Communications, LLC ("Netgate")
2023-02-07 23:03:20 +00:00
Mark Johnston
27202b98dc jail: Use atomic(9) instead of CK atomics
There's no reason to use one over the other here, let's prefer the
interface that's used elsewhere in the kernel.

No functional change intended.

Reviewed by:	mjg
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D38360
2023-02-07 15:10:24 -05:00
Val Packett
4a1c4de232 Allow sysctl hw.machine/hw.machine_arch in capability mode
There's no harm in reading strings like 'amd64'.

Reviewed by: emaste, manu
Sponsored by: https://www.patreon.com/valpackett
Differential Revision: https://reviews.freebsd.org/D28703
2023-02-06 14:00:52 -05:00
Justin Hibbits
6472761966 IfAPI: use IfAPI in mbuf
Sponsored by:	Juniper Networks, Inc.
2023-02-06 12:32:04 -05:00
Justin Hibbits
1e6131bad6 IfAPI: Add needed APIs for mbuf support
Summary:
Add 2 new APIs for supporting recent mbuf changes:
* 36e0a362ac added the m_snd_tag_alloc() wrapper around
  if_snd_tag_alloc().  Push this down to the ifnet level.
* 4d7a1361ef adds the m_rcvif_serialize()/m_rcvif_restore() KPIs to
  serialize and restore an ifnet pointer.  Add the necessary wrapper to
  get the index generation for this.

Reviewed By:	jhb
Sponsored by:	Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D38340
2023-02-06 12:32:04 -05:00
Rick Macklem
db5655124c vfs_mount.c: Free exports structures in vfs_destroy_mount()
During testing of exporting file systems in jails, I
noticed that the export structures on a mount
were not being free'd when the mount is dismounted.

This bug appears to have been in the system for a
very long time.  It would have resulted in a slow memory
leak when exported file systems were dismounted.

Prior to r362158, freeing the structures during dismount
would not have been safe, since VFS_CHECKEXP() returned
a pointer into an export structure, which might still have been
used by the NFS server for an in-progress RPC when the file system
is dismounted.  r362158 fixed this, so it should now be safe
to free the structures in vfs_mount_destroy(), which is what
this patch does.

Reviewed by:	kib
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D38385
2023-02-04 14:45:23 -08:00
Rick Macklem
d94e0bdc14 Revert "vfs_export: Add checks for correct prison when updating exports"
This reverts commit 7926a01ed7.

A new patch in D38371 is being considered for doing this.
2023-02-04 14:38:32 -08:00
Konstantin Belousov
3b6056204d FIOSEEKHOLE/FIOSEEKDATA: correct consistency for bmap-based implementation
Writes on UFS through a mapped region do not allocate disk blocks in
holes immediately. The blocks are allocated when the pages are paged out
first time.

This breaks the algorithm in vn_bmap_seekhole() and ufs_bmap_seekdata(),
because VOP_BMAP() reports hole for the place which already contains a
valid data.

Clean the pages before doing VOP_BMAP() in the affected functions.  In
principle, we could clean less by only requesting clean starting from
the offset, but it is probably not very important.

PR:	269261
Reported by:	asomers
Reviewed by:	asomers, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D38379
2023-02-04 20:32:07 +02:00
Pawel Jakub Dawidek
c54d240eb1 kern_prot.c p_candebug(): Remove single-use variable.
Reviewed by:		allanjude, oshogbo
Approved by:		allanjude, oshogbo
Differential Revision:	https://reviews.freebsd.org/D38288
2023-02-02 17:00:24 -08:00
Brooks Davis
5c274b3622 whitespace: rewrap to match case directly above
It's easier to visually diff the two case blocks if there aren't
gratutious whitespace differences.

Sponsored by:	DARPA
2023-02-03 00:37:31 +00:00
Rick Macklem
7926a01ed7 vfs_export: Add checks for correct prison when updating exports
mountd(8) basically does the following:
getmntinfo()
for each mount
      delete_exports
using nmount(2) to do the creation/deletion of individual exports.

For prison0 (and for other prisons if enforce_statfs == 0) getmntinfo()
returns all mount points, including ones being used within other prisons.
This can cause confusion if the same file system is specified in the
exports(5) file for multiple prisons.

This patch adds a perminent identifier to each prison
and marks which prison did the exports in a field of
the mount structure called mnt_exjail.  This field can
then be compared to the perminent identifier for the
prison that the thread's credentials is in.
Also required was a new function called prison_isalive_permid()
which returns if the prison is alive, so that the check can be
ignored for prisons that have been removed.

This prepares the system to allow mountd(8) to run in multiple
prisons, including prison0.

Future commits will complete the modifications to allow mountd(8)
to run in vnet prisons.  Until then, these changes should not affect
semantics.

Reviewed by:	markj
MFC after:	3 months
Differential Revision:	https://reviews.freebsd.org/D38144
2023-02-02 16:20:58 -08:00
Dag-Erling Smørgrav
69d94f4c76 Add tarfs, a filesystem backed by tarballs.
Sponsored by:	Juniper Networks, Inc.
Sponsored by:	Klara, Inc.
Reviewed by:	pauamma, imp
Differential Revision:	https://reviews.freebsd.org/D37753
2023-02-02 18:19:29 +01:00
Rick Macklem
99187c3a44 prison_check_nfsd: Add check for enforce_statfs != 0
Since mountd(8) will not be able to do exports
when running in a vnet prison if enforce_statfs is
set to 0, add a check for this to prison_check_nfsd().

Reviewed by:	jamie, markj
MFC after:	2 months
Differential Revision:	https://reviews.freebsd.org/D38189
2023-02-01 16:02:20 -08:00
Konstantin Belousov
2555f175b3 Move kstack_contains() and GET_STACK_USAGE() to MD machine/stack.h
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D38320
2023-02-02 00:59:26 +02:00
Gleb Smirnoff
a0102dee34 sockets: in sousrsend() pass down the error to aio(4)
This somewhat undermines the initial goal of sousrsend() to have all
the special error handling for a write on a socket in a single place.
The aio(4) needs to see EWOULDBLOCK to re-schedule the job.  Because
aio(4) handles return from soreceive() and sousrsend() with the same
code, we can't check for (error == 0 && done < job_nbytes).  Keeping
this exclusion for aio(4) seems a lesser evil.

Fixes:	7a2c93b86e
2023-02-01 13:03:10 -08:00
Gleb Smirnoff
fd53298799 unix: add myself to the copyright notice
for the new implementation of PF_UNIX/SOCK_DGRAM
2023-02-01 09:39:28 -08:00
Justin Hibbits
9507d03bfe IfAPI: Use the ifnet APIs in kern_poll()
The only API used is if_name().

Sponsored by:	Juniper Networks, Inc.
2023-01-31 15:02:16 -05:00
Sebastian Huber
c7c53e3ca6 Clarify hardpps() parameter name and comment
Since 32c203577a by phk in 1999 (Make even more of the PPSAPI
implementations generic), the "nsec" parameter of hardpps() is a time
difference and no longer a time point.  Change the name to "delta_nsec"
and adjust the comment.

Remove comment about a clock tick adjustment which is no longer in the code.

Pull Request: https://github.com/freebsd/freebsd-src/pull/640
Reviewed by: imp
2023-01-30 11:07:40 -07:00
Jose Luis Duran
df949e762c kern_environment: Partially apply style(9)
Sort include files, remove duplicates and remove trailing whitespce.

Pull Request:	https://github.com/freebsd/freebsd-src/pull/589
Reviewed by:	imp
2023-01-30 10:47:56 -07:00
Dmitry Chagin
2058f075b4 cpuset: Handle CPU_WHICH_TIDPID wherever cpuset_which() is called.
cpuset_which() resolves the argument pair which and id and returns references
to an appropriate resources. To avoid leaking resources or accessing unresolved
references to a resources handle new which CPU_WHICH_TIDPID wherever
cpuset_which() is called.
To avoid code duplication cpuset_which2() has been added.

Reported by:		syzbot+331e8402e0f7347f0f2a@syzkaller.appspotmail.com
Reviewed by:		kib
Differential Revision:	https://reviews.freebsd.org/D38272
MFC after:		2 weeks
2023-01-30 19:28:54 +03:00
Dmitry Chagin
e4754c8036 subr_smp: Trim trailing whitespaces.
MFC after:		1 week
2023-01-29 16:18:17 +03:00
Dmitry Chagin
c21b080f3d cpuset: Fix sched_[g|s]etaffinity() for better compatibility with Linux.
Under Linux to sched_[g|s]etaffinity() functions the value returned from a call
to gettid(2) (thread id) can be passed in the argument pid. Specifying pid as 0
will set the attribute for the calling thread, and passing the value returned
from a call to getpid(2) (process id) will set the attribute for the main thread
of the thread group.

Native cpuset(2) family of system calls has "which" argument to determine how
the value of id argument is interpreted, i.e., CPU_WHICH_TID is used to pass
a thread id and CPU_WHICH_PID - to pass a process id.

For now native sched_[g|s]etaffinity() implementation is wrong as uses "which"
CPU_WHICH_PID to pass both (process and thread id) to the kernel. To fix this
adding a new "which" CPU_WHICH_TIDPID intended to handle both id's.

Reviewed by:		kib
Differential Revision:	https://reviews.freebsd.org/D38209
MFC after:		1 week
2023-01-29 16:17:33 +03:00
Dmitry Chagin
01f74ccd5a libthr: Fix pthread_attr_[g|s]etaffinity_np to match it's manual and the kernel.
Since f35093f8 semantics of a thread affinity functions is changed to be a
compatible with Linux:

In case of getaffinity(), the minimum cpuset_t size that the kernel permits is
the maximum CPU id, present in the system, / NBBY bytes, the maximum size is not
limited.
In case of setaffinity(), the kernel does not limit the size of the user-provided
cpuset_t, internally using only the meaningful part of the set, where the upper
bound is the maximum CPU id, present in the system, no larger than the size of
the kernel cpuset_t.

To match pthread_attr_[g|s]etaffinity_np checks of the user-provided cpusets to
the kernel behavior export the minimum cpuset_t size allowed by running kernel
via new sysctl kern.sched.cpusetsizemin and use it in checks.

Reviewed by:
Differential Revision:	https://reviews.freebsd.org/D38112
MFC after:		1 week
2023-01-29 15:35:18 +03:00
Allan Jude
5ff13fbc19 MFV: zstd 1.5.2
Merge commit 'b3392d84da5bf2162baf937c77e0557f3fd8a52b' into zstd_1.5.2

full changelog: https://github.com/facebook/zstd/compare/v1.4.8...v1.5.2

Updated sys/kern/subr_compressor.c to new API

MFC after:	3 days
Relnotes:	yes
Sponsored by:	Klara, Inc.
2023-01-27 17:22:31 +00:00
Gleb Smirnoff
f394d9c0a4 sysctl: use correct types and names in sysctl_*sec_to_sbintime
The functions are intended to report kernel variables that are
stored as sbintime_t (pointed to by arg1) as human readable
nanoseconds or milliseconds (reported via sysctl_handle_64).
The variable types and names were reversed.  I guess there is
no functional change here, as all types flipped around were
signed 64.  Note that these function aren't used yet anywhere
in the kernel.

Reviewed by:		mav
Differential revision:	https://reviews.freebsd.org/D38217
2023-01-27 07:09:22 -08:00
Mitchell Horne
627ca221c3 kern_reboot: unconditionally call shutdown_reset()
Currently shutdown_reset() is registered as the final entry of the
shutdown_final event handler. However, if a panic occurs early in boot
before the event is registered (SI_SUB_INTRINSIC), we may end up
spinning in the subsequent infinite for loop and failing to reset
altogether. Instead we can simply call this function unconditionally.

Reviewed by:	markj
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D37981
2023-01-23 15:10:24 -04:00
Jiajie Chen
dec7db4960 Add kf_file_nlink field to kf_file and populate it
This will allow user-space programs (e.g. lsof) to locate deleted files
whose nlink equals zero. Prior to this commit, programs has to use
stat(kf_path) to get nlink, but that will fail if the file is deleted.

[mjg: s/fail/file in the commit message]

Reviewed by:	mjg
Differential Revision:  https://reviews.freebsd.org/D38169
2023-01-23 17:09:52 +00:00
Konstantin Belousov
456f05756b Handle int rank issues in in vn_getsize_locked() and vn_seek()
In vn_getsize_locked(), when storing vattr.va_size of type u_quad_t into
off_t size, we must avoid overflow.

Then, the check for fsize < 0, introduced in the commit
f45feecfb2 'vfs: add vn_getsize', is nop [1].

Reported and reviewed by:	jhb
Coverity CID:	1502346
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D38133
2023-01-20 23:56:29 +02:00