19556 Commits

Author SHA1 Message Date
Gordon Bergling
c159f76713 kern: remove a double word in a KASSERT in subr_trap
- s/with with/with/

MFC after:	5 days
2023-04-13 20:03:37 +02:00
Ed Maste
2ef2c26f3f link_elf: fix SysV hash function overflow
Quoting from https://maskray.me/blog/2023-04-12-elf-hash-function:

The System V Application Binary Interface (generic ABI) specifies the
ELF object file format. When producing an output executable or shared
object needing a dynamic symbol table (.dynsym), a linker generates a
.hash section with type SHT_HASH to hold a symbol hash table. A DT_HASH
tag is produced to hold the address of .hash.

The function is supposed to return a value no larger than 0x0fffffff.
Unfortunately, there is a bug. When unsigned long consists of more than
32 bits, the return value may be larger than UINT32_MAX. For instance,
elf_hash((const unsigned char *)"\xff\x0f\x0f\x0f\x0f\x0f\x12") returns
0x100000002, which is clearly unintended, as the function should behave
the same way regardless of whether long represents a 32-bit integer or
a 64-bit integer.

Reviewed by:	kib, Fangrui Song
Sponsored by:	The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D39517
2023-04-12 15:33:55 -04:00
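The overflow quoted in the commit above can be reproduced in userland. The following is a minimal sketch, not the kernel change itself: `elf_hash_wide` and `elf_hash32` are illustrative names, and the 64-bit accumulator stands in for an LP64 `unsigned long`.

```c
#include <stdint.h>

/*
 * Historical form of the SysV hash, with the accumulator widened to 64 bits
 * to mimic an LP64 "unsigned long".  Bits above bit 31 are never cleared.
 */
static uint64_t
elf_hash_wide(const unsigned char *name)
{
	uint64_t h = 0, g;

	while (*name != '\0') {
		h = (h << 4) + *name++;
		if ((g = h & 0xf0000000) != 0)
			h ^= g >> 24;
		h &= ~g;
	}
	return (h);
}

/* Same loop with a 32-bit accumulator: the result always fits in 28 bits. */
static uint32_t
elf_hash32(const unsigned char *name)
{
	uint32_t h = 0, g;

	while (*name != '\0') {
		h = (h << 4) + *name++;
		g = h & 0xf0000000;
		h ^= g >> 24;
		h &= ~g;
	}
	return (h);
}
```

For the input quoted in the message, the wide variant yields 0x100000002, while the 32-bit variant wraps to 0x2 and stays within 0x0fffffff.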
Konstantin Belousov
c53e990b8d DEBUG_VFS_LOCKS: restore diagnostic for the witness use case
Reviewed by:	jah, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D39477
2023-04-11 15:59:55 +03:00
Konstantin Belousov
75fc6f86c3 Add witness_is_owned(9)
which returns an indicator if the current thread owns the specified
lock.

Reviewed by:	jah, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D39477
2023-04-11 15:59:49 +03:00
Konstantin Belousov
afa8f8971b vn_start_write(): consistently set *mpp to NULL on error or after failed sleep
This ensures that *mpp != NULL iff vn_finished_write() should be
called, regardless of the returned error, except for V_NOWAIT.
The only exception that must be maintained is the case where
vn_start_write(V_NOWAIT) is called with the intent of later dropping
other locks and then doing vn_start_write(V_XSLEEP), which needs the mp
value calculated from the non-waitable call above it.

Also note that V_XSLEEP is not supported by vn_start_secondary_write().

Reviewed by:	markj, mjg (previous version), rmacklem (previous version)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D39441
2023-04-11 15:59:46 +03:00
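The invariant described in the commit above (`*mpp != NULL` iff `vn_finished_write()` should be called) lets callers use a single cleanup path. The sketch below is a userland mock under stated assumptions: `struct mount_sketch`, `start_write_sketch()`, and `finished_write_sketch()` are hypothetical stand-ins, not the kernel functions, and the V_NOWAIT exception is not modelled.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mount; it only counts active writers. */
struct mount_sketch {
	int writers;
};

/*
 * Mock of the contract: on any failure *mpp is reset to NULL; on success it
 * points at the mount whose writer count was bumped.
 */
static int
start_write_sketch(struct mount_sketch *mp, int simulate_error,
    struct mount_sketch **mpp)
{
	*mpp = NULL;
	if (simulate_error != 0)
		return (simulate_error);
	mp->writers++;
	*mpp = mp;
	return (0);
}

static void
finished_write_sketch(struct mount_sketch *mp)
{
	mp->writers--;
}

/* Caller: the cleanup decision depends only on mp, not on the error value. */
static int
do_write_sketch(struct mount_sketch *target, int simulate_error)
{
	struct mount_sketch *mp;
	int error;

	error = start_write_sketch(target, simulate_error, &mp);
	/* ... the actual write would happen here when error == 0 ... */
	if (mp != NULL)
		finished_write_sketch(mp);
	return (error);
}
```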
Konstantin Belousov
b2f3288747 vn_start_write(): minor style
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D39441
2023-04-11 15:59:39 +03:00
Konstantin Belousov
7b6fe2428a DEBUG_VFS_LOCKS: use witness if available
The assert_vop_locked messages are ignored, and file/line information
is not too useful. Fixing this without changing both witness and VFS
asserts KPIs is not possible.

Reviewed by:	markj (previous version)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D39464
2023-04-10 00:34:12 +03:00
Konstantin Belousov
bb24eaea49 vn_lock_pair(): allow to request shared locking
If either of the vnodes is shared-locked, the lock must not be recursed.

Requested by:	rmacklem
Reviewed by:	markj, rmacklem
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D39444
2023-04-08 01:58:26 +03:00
Mateusz Guzik
02e6e8d218 vfs: extend vn_printf with vop vector 2023-04-07 20:39:06 +00:00
Mateusz Guzik
26b9648750 vfs: more informative panic for missing fplookup ops 2023-04-07 20:39:06 +00:00
Mateusz Guzik
f87a9f51ef vfs: validate that a mount point with FPLOOKUP has vop_fplookup ops 2023-04-06 15:20:41 +00:00
Mateusz Guzik
e237e2ba5f vfs: only allow doomed vnodes to return EOPNOTSUPP for fplookup vops
This helps assert that the ops are provided by the filesystems which
indicate that they provide them.
2023-04-06 15:20:41 +00:00
Mateusz Guzik
5f6df17775 vfs: validate that vop vectors provide all or none fplookup vops
In order to prevent later surprises.
2023-04-06 15:20:41 +00:00
Mateusz Guzik
0baef43ed0 vfs: add missing vop_fplookup ops to syncer 2023-04-06 15:20:41 +00:00
Mateusz Guzik
8495fa49ea vfs: whack spurious comments from syncer's vop_vector 2023-04-06 15:20:40 +00:00
Konstantin Belousov
11cdffc603 Regen 2023-04-04 16:19:08 +03:00
Konstantin Belousov
dac3102488 Rename kqueue1(2) to kqueuex(2) to avoid compat issues with NetBSD
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D39377
2023-04-04 16:19:08 +03:00
Randall Stewart
73ee5756de Fixes in the tcp infrastructure with respect to stack changes as well as other infrastructure updates for incoming rack features.
Stack switching has always been a bit of an issue. We currently use a break-before-make setup, which means that
if something goes wrong you have to try to get back to a stack. This patch, among a lot of other things, changes that so
that it is make-before-break. We also expand some of the function blocks in prep for new features in rack that will allow
more controlled pacing. We also add other abilities, such as a pathway for a stack to query a previous stack to acquire
critical state information from it, so things in flight don't get dropped or mishandled when switching stacks. We also add the
concept of a timer granularity. This allows an alternate stack to change from the old ticks granularity to microseconds, and
of course this even gives us a pathway to go to nanosecond timekeeping if we need to (something for the data center to consider
for sure).

Once all this lands I will then update rack to begin using all these new features.

Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D39210
2023-04-01 01:46:38 -04:00
Mark Johnston
cab1056105 kdb: Modify securelevel policy
Currently, sysctls which enable KDB in some way are flagged with
CTLFLAG_SECURE, meaning that you can't modify them if securelevel > 0.
This is so that KDB cannot be used to lower a running system's
securelevel, see commit 3d7618d8bf.  However, the newer mac_ddb(4)
restricts DDB operations which could be abused to lower securelevel
while retaining some ability to gather useful debugging information.

To enable the use of KDB (specifically, DDB) on systems with a raised
securelevel, change the KDB sysctl policy: rather than relying on
CTLFLAG_SECURE, add a check of the current securelevel to kdb_trap().
If the securelevel is raised, only pass control to the backend if MAC
specifically grants access; otherwise simply check to see if mac_ddb
vetoes the request, as before.

Add a new secure sysctl, debug.kdb.enter_securelevel, to override this
behaviour.  That is, the sysctl lets one enter a KDB backend even with a
raised securelevel, so long as it is set before the securelevel is
raised.

Reviewed by:	mhorne, stevek
MFC after:	1 month
Sponsored by:	Juniper Networks
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D37122
2023-03-30 10:45:00 -04:00
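One reading of the policy described in the commit above can be written as a small decision function. This is a sketch under assumptions, not the code in kdb_trap(): `kdb_may_enter_sketch()` and its parameters are hypothetical, the MAC hooks are reduced to two booleans, and debug.kdb.enter_securelevel is modelled as a plain on/off override, which may be simpler than the real sysctl.

```c
#include <stdbool.h>

/*
 * securelevel:        current securelevel of the system
 * enter_securelevel:  override set before the securelevel was raised
 * mac_grants:         a MAC policy explicitly grants backend access
 * mac_vetoes:         a MAC policy (e.g. mac_ddb) vetoes the request
 */
static bool
kdb_may_enter_sketch(int securelevel, bool enter_securelevel,
    bool mac_grants, bool mac_vetoes)
{
	/* Raised securelevel without override: MAC must explicitly grant. */
	if (securelevel > 0 && !enter_securelevel)
		return (mac_grants);
	/* Otherwise behave as before: allowed unless MAC vetoes. */
	return (!mac_vetoes);
}
```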
Mateusz Guzik
80cf427b8d proc: shave a lock trip on exit if possible
... which happens to be the vast majority of the time
2023-03-29 09:19:03 +00:00
Mateusz Guzik
37337709d3 cred: convert the refcount from int to long
On 64-bit platforms this sorts out worries about mitigating bugs which
overflow the counter, all while not pessimizing anything -- most notably
it avoids whacking per-thread operation in favor of refcount(9) API.

The struct, 256 bytes in size, already had two instances of 4-byte
padding; cr_flags gets moved around to avoid growing it.

32-bit platforms could also get the extended counter, but I did not do
it as one day(tm) the mutex protecting centralized operation should be
replaced with atomics and 64-bit ops on 32-bit platforms remain quite
penalizing.

While worries of counter overflow are addressed, the following is not
(just like it would not be with conversion to refcount(9)):
- counter *underflows*
- buffer overruns from adjacent allocations
- UAF due to stale cred pointer
- .. and other goodies

As such, while lipstick was placed, the pig should not be participating
in any beauty pageants.

Prodded by:	emaste
Differential Revision:	https://reviews.freebsd.org/D39220
2023-03-29 05:02:32 +00:00
Konstantin Belousov
6a0a634590 Regen 2023-03-28 02:39:26 +03:00
Konstantin Belousov
61194e9852 Add kqueue1() syscall
It takes the flags argument.  Immediate use is to provide the KQUEUE_CLOEXEC
flag for kqueue(2).

Reviewed by:	emaste, jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D39271
2023-03-28 02:39:26 +03:00
Alexander V. Chernikov
04f75b9802 netlink: allow netlink sockets in non-vnet jails.
This change allows opening Netlink sockets in non-vnet jails, even for
 unprivileged processes.
The security model largely follows the existing one. To be more specific:
* by default, every `NETLINK_ROUTE` command is **NOT** allowed in non-VNET
 jail UNLESS `RTNL_F_ALLOW_NONVNET_JAIL` flag is specified in the command
 handler.
* All notifications are **disabled** for non-vnet jails (requests to
 subscribe to the notifications are ignored). This will change to a more
 fine-grained model once the first netlink provider requiring this gets
 committed.
* Listing interfaces (RTM_GETLINK) is **allowed** w/o limits (**including**
 interfaces w/o any addresses attached to the jail). The value of this is
 questionable, but it follows the existing approach.
* Listing ARP/NDP neighbours is **forbidden**. This is a **change** from the
 current approach - currently we list static ARP/ND entries belonging to the
 addresses attached to the jail.
* Listing interface addresses is **allowed**, but the addresses are filtered
 to match only ones attached to the jail.
* Listing routes is **allowed**, but the routes are filtered to provide only
 host routes matching the addresses attached to the jail.
* By default, every `NETLINK_GENERIC` command is **allowed** in non-VNET jail
 (as sub-families may be unrelated to network at all).
 It is up to the family author to implement the restriction if
 necessary.

Differential Revision: https://reviews.freebsd.org/D39206
MFC after:	1 month
2023-03-26 08:44:09 +00:00
Mateusz Guzik
22eb66d961 vfs cache: always assert on ndp->ni_resflags 2023-03-25 21:57:55 +00:00
Mateusz Guzik
138a5dafba vfs: trylock vnode requeue
The quasi-LRU still gets in the way for example when doing an
incremental bzImage build, with vnode_list lock being at the
top of the profile. Apply further damage control by trylocking.

Note the entire mechanism desperately wants to be reaped out in favor
of something(tm) which both scales in a multicore setting and provides
sensible replacement policy.

With this change almost everything vfs disappears from the on-CPU
flamegraph; what is left is tons of contention in the VM.
2023-03-25 13:42:27 +00:00
Mateusz Guzik
245767c278 vfs: flip deferred_inact to atomic
Turns out it is very rarely triggered, making a per-cpu
counter a waste.

Examples from real life boxes:
uptime		counter
135 days	847
138 days	2190
141 days	1
2023-03-25 13:42:27 +00:00
Mateusz Guzik
e5eb1d298f vfs: replace some spelled out VNASSERTs with VNPASS
nfc
2023-03-25 13:42:27 +00:00
Kyle Evans
89c52f9d59 arm64: add KASAN support
This entails:
- Marking some obvious candidates for __nosanitizeaddress
- Similar trap frame markings as amd64, for similar reasons
- Shadow map implementation

The shadow map implementation is roughly similar to what was done on
amd64, with some exceptions.  Attempting to use available space at
preinit_map_va + PMAP_PREINIT_MAPPING_SIZE (up to the end of that range,
as depicted in the physmap) results in odd failures, so we instead
search the physmap for free regions that we can carve out, fragmenting
the shadow map as necessary to try and fit as much as we need for the
initial kernel map.  pmap_bootstrap_san() is thus after
pmap_bootstrap(), which still included some technically reserved areas
of the memory map that needed to be included in the DMAP.

The odd failure noted above may be a bug, but I haven't investigated it
all that much.

Initial work by mhorne with additional fixes from kevans and markj.

Reviewed by:	andrew, markj
Sponsored by:	Juniper Networks, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D36701
2023-03-23 16:34:33 -05:00
John Baldwin
d2dab20c2a ktls: Drop all the INET and INET6 compile-time guards.
Consistent with 9fd0d9b16e, KERN_TLS is
not supported on kernels without any INET support.

Reviewed by:	gallatin, hselasky
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D39232
2023-03-23 14:29:07 -07:00
Mateusz Guzik
c16c4ea6d3 vfs cache: return ENOTDIR for not_a_dir/{.,..} lookups
Reported by:	Oliver Kiddle
PR:	270419
MFC:	3 days
2023-03-23 19:31:18 +00:00
Mateusz Guzik
b5d43972e3 vfs: decouple freevnodes from vnode batching
In principle one cpu can keep vholding vnodes, while another vdrops
them. In this case it may be the local count will keep growing in an
unbounded manner. Roll it up after a threshold instead.

While here move it out of dpcpu into struct pcpu.

Reviewed by:	kib (previous version)
Differential Revision:	https://reviews.freebsd.org/D39195
2023-03-22 23:57:25 +00:00
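The roll-up idea in the commit above can be sketched as a per-CPU delta that is folded into a global counter once its magnitude reaches a threshold. All names and the threshold value here are hypothetical, and a real kernel would need a lock or atomics around the global update.

```c
#include <stdlib.h>

#define ROLLUP_THRESHOLD_SKETCH	32	/* assumed value, for illustration */

/* Hypothetical per-CPU state: a signed local delta of free vnodes. */
struct pcpu_sketch {
	long local_free;
};

static long global_free_sketch;

/*
 * Adjust the local count and fold it into the global counter once it
 * drifts far enough in either direction, so that one CPU which only
 * increments (or only decrements) cannot grow its delta without bound.
 */
static void
freevnodes_adjust_sketch(struct pcpu_sketch *pc, long delta)
{
	pc->local_free += delta;
	if (labs(pc->local_free) >= ROLLUP_THRESHOLD_SKETCH) {
		global_free_sketch += pc->local_free;
		pc->local_free = 0;
	}
}
```

With one CPU only incrementing and another only decrementing, each local delta stays below the threshold while the sum of the global and local counts remains exact.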
Mark Johnston
b4b33821fa ktls: Fix interlocking between ktls_enable_rx() and listen(2)
The TCP_TXTLS_ENABLE and TCP_RXTLS_ENABLE socket option handlers check
whether the socket is a listening socket and fail if so, but this check is
racy.  Since we have to lock the socket buffer later anyway, defer the
check to that point.

ktls_enable_tx() locks the send buffer's I/O lock, which will fail if
the socket is a listening socket, so no explicit checks are needed.  In
ktls_enable_rx(), which does not acquire the I/O lock (see the review
for some discussion on this), use an explicit SOLISTENING() check after
locking the recv socket buffer.

Otherwise, a concurrent solisten_proto() call can trigger crashes and
memory leaks by wiping out socket buffers as ktls_enable_*() is
modifying them.

Also make sure that a KTLS-enabled socket can't be converted to a
listening socket, and use SOCK_(SEND|RECV)BUF_LOCK macros instead of the
old ones while here.

Add some simple regression tests involving listen(2).

Reported by:	syzkaller
MFC after:	2 weeks
Reviewed by:	gallatin, glebius, jhb
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D38504
2023-03-21 16:04:00 -04:00
Mitchell Horne
8965b3033e callout(9): adopt old references to timeout(9)
timeout(9) was removed a couple of years ago; all consumers now use the
callout(9) interface.

Explicitly do not bump .Dd anywhere, as this is not a content or
semantic change.

Reviewed by:	markj, jhb, Pau Amma <pauamma@gundo.com>
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D39136
2023-03-20 17:12:12 -03:00
Mark Johnston
c3179891f8 kerneldump: Inline dump_savectx() into its callers
The callers of dump_savectx() (i.e., doadump() and livedump_start())
subsequently call dumpsys()/minidumpsys(), which dump the calling
thread's stack when writing the dump.  If dump_savectx() gets its own
stack frame, that frame might be clobbered when its caller later calls
dumpsys()/minidumpsys(), making it difficult for debuggers to unwind the
stack.

Fix this by making dump_savectx() a macro, so that savectx() is always
called directly by the function which subsequently calls
dumpsys()/minidumpsys().

This fixes stack unwinding for the panicking thread from arm64
minidumps.  The same happened to work on amd64, but kgdb reports the
dump_savectx() calls as coming from dumpsys(), so in that case it
appears to work by accident.

Fixes:	c9114f9f86 ("Add new vnode dumper to support live minidumps")
Reviewed by:	mhorne, jhb
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D39151
2023-03-20 14:16:28 -04:00
Mateusz Guzik
62a573d953 vfs: retire KERN_VNODE
It got disabled in 2003:

commit acb18acfec
Author: Poul-Henning Kamp <phk@FreeBSD.org>
Date:   Sun Feb 23 18:09:05 2003 +0000

    Bracket the kern.vnode sysctl in #ifdef notyet because it results
    in massive locking issues on diskless systems.

    It is also not clear that this sysctl is non-dangerous in its
    requirements for locked down memory on large RAM systems.

There does not seem to be practical use for it and the disabled routine
does not work anyway.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D39127
2023-03-17 16:21:45 +00:00
Mina Galić
0b0ae2e4cd jail: convert several functions from int to bool
These functions exclusively return (0) and (1), so convert them to bool.

We also convert some networking-related jail functions from int to bool,
some of which were returning an error that was never used.

Differential Revision: https://reviews.freebsd.org/D29659
Reviewed by: imp, jamie (earlier version)
Pull Request: https://github.com/freebsd/freebsd-src/pull/663
2023-03-14 21:05:33 -06:00
Mark Johnston
cd133525fa smr: Remove the return value from smr_wait()
This is supposed to be a blocking version of smr_poll(), so there's no
need for a return value.  No functional change intended.

MFC after:	1 week
2023-03-13 10:45:35 -04:00
Kyle Evans
cc0fe048ec kern: physmem: don't create a new exregion for different flags...
... if the region we're adding is an exact match to one that we already
have.  Simply extend the flags of the existing entry as needed so that
we don't end up with duplicate regions.

It could be that we got the exclusion through two different means, e.g.,
FDT memreserve and the EFI memory map, and we may derive different
characteristics from each.  Apply the most restrictive set to the
region.

Reported by:	Mark Millard <marklmi yahoo com>
Reviewed by:	mhorne
2023-03-09 23:27:39 -06:00
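The behaviour described in the commit above, merging flags into an existing exclusion entry on an exact match instead of adding a duplicate, might look roughly like the sketch below. The table, the flag names, and `exclude_region_sketch()` are hypothetical, and "most restrictive" is modelled as the union of the exclusion flags, which is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

#define EXFLAG_NODUMP_SKETCH	0x01	/* hypothetical flag values */
#define EXFLAG_NOALLOC_SKETCH	0x02
#define MAX_EXREGIONS_SKETCH	8

struct exregion_sketch {
	uint64_t addr;
	uint64_t size;
	uint32_t flags;
};

static struct exregion_sketch exregions_sketch[MAX_EXREGIONS_SKETCH];
static size_t excnt_sketch;

/* Returns 0 on success, -1 if the table is full. */
static int
exclude_region_sketch(uint64_t addr, uint64_t size, uint32_t flags)
{
	size_t i;

	/* Exact match: extend the existing entry's flags, add nothing. */
	for (i = 0; i < excnt_sketch; i++) {
		if (exregions_sketch[i].addr == addr &&
		    exregions_sketch[i].size == size) {
			exregions_sketch[i].flags |= flags;
			return (0);
		}
	}
	if (excnt_sketch == MAX_EXREGIONS_SKETCH)
		return (-1);
	exregions_sketch[excnt_sketch].addr = addr;
	exregions_sketch[excnt_sketch].size = size;
	exregions_sketch[excnt_sketch].flags = flags;
	excnt_sketch++;
	return (0);
}
```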
Justin Hibbits
084846271a ktls: Use IfAPI accessors to get capabilities
Summary:
Avoid referencing the ifnet struct directly, and use the IfAPI accessors
instead.

Reviewed by:	gallatin
Sponsored by:	Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D38932
2023-03-07 09:47:00 -05:00
Mark Johnston
831601773e deadlkres: Make parameters settable with tunables
MFC after:	1 week
Sponsored by:	Klara, Inc.
Sponsored by:	Juniper Networks, Inc.
2023-03-03 11:16:41 -05:00
Rick Macklem
cbbb22031f kern_jail.c: Remove #ifdefs for VNET_NFSD
The consensus was that VNET_NFSD was not needed.
This patch removes it from kern_jail.c.

With this patch, support for the "allow.nfsd"
jail parameter is enabled in the kernel for
kernels built with "options VIMAGE".

Reviewed by:	markj
MFC after:	3 months
Differential Revision:	https://reviews.freebsd.org/D38808
2023-03-02 13:13:24 -08:00
Rick Macklem
4bbbd5875d vfs_mount.c: Allow mountd(8) to do exports in a vnet prison
To run mountd in a vnet prison, three checks in vfs_domount()
and vfs_domount_update() related to doing exports needed
to be changed, so that a file system visible within the
prison but mounted outside the prison can be exported.

I did all three in a minimal way, only changing the checks for
the specific case of a process (typically mountd) doing exports
within a vnet prison and not updating the mount point in other
ways.  The changes are:
- Ignore the error return from vfs_suser(), since the file
  system being mounted outside the prison will cause it to fail.
- Use the priv_check(PRIV_NFS_DAEMON) for this specific case
  within a prison.
- Skip the call to VFS_MOUNT(), since it will return an error,
  due to the "from" argument not being set correctly.  VFS_MOUNT()
  does not appear to do anything for the case of doing exports only.

Reviewed by:	markj
MFC after:	3 months
Differential Revision:	https://reviews.freebsd.org/D37741
2023-03-02 13:09:01 -08:00
Mark Johnston
bcd8cd859e buf: Make buf_daemon_shutdown() a no-op after a panic
As in commit 9d7cc536e2, there is no need to do anything in this
context.

MFC after:	1 week
2023-03-01 10:15:54 -05:00
Mateusz Guzik
a357112938 kern: whack __mips__ leftover
Sponsored by:	Rubicon Communications, LLC ("Netgate")
2023-03-01 11:05:12 +00:00
Zhenlei Huang
2c33b456ff jail: Improve readability
No functional change intended.

Reviewed by:	melifaro
Differential Revision:	https://reviews.freebsd.org/D37890
2023-02-28 18:20:07 +08:00
Zhenlei Huang
500f82d6c3 jail: Use flexible array member within struct prison_ip
The current implementation uses an off-by-one struct prison_ip pointer to
access the IPv[46] addresses. This is error-prone, hence the regression fixes
21ad3e27fa and ddbf879d79. Use a flexible array member so that the compiler
will catch such errors; it will also be easier to review.

No functional change intended.

Reviewed by:	melifaro, glebius
Differential Revision:	https://reviews.freebsd.org/D37874
2023-02-28 18:20:06 +08:00
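For readers unfamiliar with the technique named in the commit above, here is a minimal userland sketch of a flexible array member. The struct layout and helper names are hypothetical and much simpler than the kernel's struct prison_ip; addresses are plain 32-bit integers here.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Header followed by a variable number of addresses.  With a flexible array
 * member the storage is named, so accesses read as pip->addrs[i] rather
 * than as pointer arithmetic of the form (uint32_t *)(pip + 1).
 */
struct prison_ip_sketch {
	unsigned int	ips;		/* number of entries in addrs[] */
	uint32_t	addrs[];	/* flexible array member */
};

static struct prison_ip_sketch *
prison_ip_alloc_sketch(unsigned int n)
{
	struct prison_ip_sketch *pip;
	unsigned int i;

	/* sizeof(*pip) does not include the flexible array itself. */
	pip = malloc(sizeof(*pip) + n * sizeof(pip->addrs[0]));
	if (pip == NULL)
		return (NULL);
	pip->ips = n;
	for (i = 0; i < n; i++)
		pip->addrs[i] = 0;
	return (pip);
}

/* Bounds-checked read; returns 0 for an out-of-range index. */
static uint32_t
prison_ip_get_sketch(const struct prison_ip_sketch *pip, unsigned int i)
{
	return (i < pip->ips ? pip->addrs[i] : 0);
}
```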
Sebastian Huber
28ed159f26 pps: Round to closest integer in pps_event()
The comment above bintime2timespec() says:

  When converting between timestamps on parallel timescales of differing
  resolutions it is historical and scientific practice to round down.

However, the delta_nsec value is a time difference and not a timestamp.  Also
the rounding errors accumulate in the frequency accumulator, see hardpps().
So, rounding to the closest integer is probably slightly better.

Reviewed by: imp
Pull Request: https://github.com/freebsd/freebsd-src/pull/604
2023-02-27 15:10:55 -07:00
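The difference between rounding down and rounding to the closest integer, as discussed in the commit above, can be sketched with plain integer division. This is an illustration only: the function names are made up, and adding half the divisor before dividing is one common way to round to nearest, not necessarily the exact expression used in pps_event().

```c
#include <stdint.h>

#define NANOSECOND_SKETCH	1000000000ULL

/* Counter ticks to nanoseconds, rounding down. */
static uint64_t
delta_nsec_floor_sketch(uint64_t tcount, uint64_t freq)
{
	return (NANOSECOND_SKETCH * tcount / freq);
}

/*
 * Counter ticks to nanoseconds, rounding to the closest integer by adding
 * half the divisor first.  Like the variant above, this assumes that
 * NANOSECOND_SKETCH * tcount fits in 64 bits.
 */
static uint64_t
delta_nsec_nearest_sketch(uint64_t tcount, uint64_t freq)
{
	return ((NANOSECOND_SKETCH * tcount + freq / 2) / freq);
}
```

With a 3 Hz counter, two ticks are about 666666666.67 ns: the first variant truncates to 666666666 and the second returns 666666667. Over many pulses the truncation error would accumulate in one direction, which is the concern the message raises.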
Sebastian Huber
1e48d9d336 pps: Simplify the nsec calculation in pps_event()
Let A be the current calculation of the frequency accumulator (pps_fcount)
update in pps_event()

  scale = (uint64_t)1 << 63;
  scale /= captc->tc_frequency;
  scale *= 2;
  bt.sec = 0;
  bt.frac = 0;
  bintime_addx(&bt, scale * tcount);
  bintime2timespec(&bt, &ts);
  hardpps(tsp, ts.tv_nsec + 1000000000 * ts.tv_sec);

and hardpps(..., delta_nsec):

  u_nsec = delta_nsec;
  if (u_nsec > (NANOSECOND >> 1))
          u_nsec -= NANOSECOND;
  else if (u_nsec < -(NANOSECOND >> 1))
          u_nsec += NANOSECOND;
  pps_fcount += u_nsec;

This change introduces a new calculation which is slightly simpler and more
straightforward.  Name it B.

Consider the following sample values with a tcount of 2000000100 and a
tc_frequency of 2000000000 (2GHz).

For A, the scale is 9223372036.  Then scale * tcount is 18446744994337203600
which is larger than UINT64_MAX (= 18446744073709551615).  The result is
920627651984 == 18446744994337203600 % 2^64 (that is, modulo UINT64_MAX + 1).
Since all operands are unsigned the result is well defined through modulo
arithmetic.  The result of bintime2timespec(&bt, &ts) is 49.  This is equal to
the correct result 1000000049 % NANOSECOND.

In hardpps(), neither conditional statement is executed and pps_fcount is
incremented by 49.

For the new calculation B, 1000000000 * tcount is 2000000100000000000,
which is less than UINT64_MAX.  Dividing this by tc_frequency yields
the correct result of 1000000050 for delta_nsec.

In hardpps(), the first conditional statement is executed and pps_fcount is
incremented by 50.

This shows that both methods yield roughly the same results.  However, method B
is easier to understand and requires fewer conditional statements.

Reviewed by: imp
Pull Request: https://github.com/freebsd/freebsd-src/pull/604
2023-02-27 15:10:55 -07:00
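The sample values worked through in the commit above can be checked with a small userland program. This is a sketch, not the kernel code: the function names are illustrative, the bintime machinery is reduced to its 64-bit fraction, and the fraction-to-nanoseconds step is assumed to match what bintime2timespec() did at the time.

```c
#include <stdint.h>

#define NANOSECOND_SKETCH	1000000000LL

/* Method A: scale the tick count into a 64-bit binary fraction of a second. */
static int64_t
delta_nsec_a_sketch(uint64_t tcount, uint64_t freq)
{
	uint64_t scale, frac;

	scale = (uint64_t)1 << 63;
	scale /= freq;
	scale *= 2;
	frac = scale * tcount;	/* wraps modulo 2^64; whole seconds are lost */
	/* Convert the upper 32 bits of the fraction to nanoseconds. */
	return ((int64_t)(((uint64_t)1000000000 * (uint32_t)(frac >> 32)) >> 32));
}

/* Method B: a direct multiply and divide, with no wrap for these inputs. */
static int64_t
delta_nsec_b_sketch(uint64_t tcount, uint64_t freq)
{
	return ((int64_t)((uint64_t)1000000000 * tcount / freq));
}

/* The folding step quoted from hardpps(). */
static int64_t
hardpps_fold_sketch(int64_t delta_nsec)
{
	int64_t u_nsec = delta_nsec;

	if (u_nsec > (NANOSECOND_SKETCH >> 1))
		u_nsec -= NANOSECOND_SKETCH;
	else if (u_nsec < -(NANOSECOND_SKETCH >> 1))
		u_nsec += NANOSECOND_SKETCH;
	return (u_nsec);
}
```

For tcount = 2000000100 and a 2 GHz counter, method A produces 49 and method B produces 1000000050, which the folding step reduces to 50.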
Sebastian Huber
8a142484d4 pps: Directly assign the timestamps in pps_event()
Reviewed by: imp
Pull Request: https://github.com/freebsd/freebsd-src/pull/604
2023-02-27 15:10:55 -07:00