freebsd-dev

Author	SHA1	Message	Date
Mark Johnston	cab1056105	kdb: Modify securelevel policy Currently, sysctls which enable KDB in some way are flagged with CTLFLAG_SECURE, meaning that you can't modify them if securelevel > 0. This is so that KDB cannot be used to lower a running system's securelevel, see commit `3d7618d8bf`. However, the newer mac_ddb(4) restricts DDB operations which could be abused to lower securelevel while retaining some ability to gather useful debugging information. To enable the use of KDB (specifically, DDB) on systems with a raised securelevel, change the KDB sysctl policy: rather than relying on CTLFLAG_SECURE, add a check of the current securelevel to kdb_trap(). If the securelevel is raised, only pass control to the backend if MAC specifically grants access; otherwise simply check to see if mac_ddb vetoes the request, as before. Add a new secure sysctl, debug.kdb.enter_securelevel, to override this behaviour. That is, the sysctl lets one enter a KDB backend even with a raised securelevel, so long as it is set before the securelevel is raised. Reviewed by: mhorne, stevek MFC after: 1 month Sponsored by: Juniper Networks Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D37122	2023-03-30 10:45:00 -04:00
Mateusz Guzik	80cf427b8d	proc: shave a lock trip on exit if possible ... which happens to be vast majority of the time	2023-03-29 09:19:03 +00:00
Mateusz Guzik	37337709d3	cred: convert the refcount from int to long On 64-bit platforms this sorts out worries about mitigating bugs which overflow the counter, all while not pessimizing anything -- most notably it avoids whacking per-thread operation in favor of refcount(9) API. The struct already had two instances of 4 byte padding with 256 bytes in size, cr_flags gets moved around to avoid growing it. 32-bit platforms could also get the extended counter, but I did not do it as one day(tm) the mutex protecting centralized operation should be replaced with atomics and 64-bit ops on 32-bit platforms remain quite penalizing. While worries of counter overflow are addressed, the following is not (just like it would not be with conversion to refcount(9)): - counter underflows - buffer overruns from adjacent allocations - UAF due to stale cred pointer - .. and other goodies As such, while lipstick was placed, the pig should not be participating in any beauty pageants. Prodded by: emaste Differential Revision: https://reviews.freebsd.org/D39220	2023-03-29 05:02:32 +00:00
Konstantin Belousov	6a0a634590	Regen	2023-03-28 02:39:26 +03:00
Konstantin Belousov	61194e9852	Add kqueue1() syscall It takes the flags argument. Immediate use is to provide the KQUEUE_CLOEXEC flag for kqueue(2). Reviewed by: emaste, jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D39271	2023-03-28 02:39:26 +03:00
Alexander V. Chernikov	04f75b9802	netlink: allow netlink sockets in non-vnet jails. This change allows opening Netlink sockets in non-vnet jails, even for unprivileged processes. The security model largely follows the existing one. To be more specific: * by default, every `NETLINK_ROUTE` command is NOT allowed in non-VNET jail UNLESS `RTNL_F_ALLOW_NONVNET_JAIL` flag is specified in the command handler. * All notifications are disabled for non-vnet jails (requests to subscribe for the notifications are ignored). This will change to a more fine-grained model once the first netlink provider requiring this gets committed. * Listing interfaces (RTM_GETLINK) is allowed w/o limits (including interfaces w/o any addresses attached to the jail). The value of this is questionable, but it follows the existing approach. * Listing ARP/NDP neighbours is forbidden. This is a change from the current approach - currently we list static ARP/ND entries belonging to the addresses attached to the jail. * Listing interface addresses is allowed, but the addresses are filtered to match only ones attached to the jail. * Listing routes is allowed, but the routes are filtered to provide only host routes matching the addresses attached to the jail. * By default, every `NETLINK_GENERIC` command is allowed in non-VNET jail (as sub-families may be unrelated to network at all). It is the goal of the family author to implement the restriction if necessary. Differential Revision: https://reviews.freebsd.org/D39206 MFC after: 1 month	2023-03-26 08:44:09 +00:00
Mateusz Guzik	22eb66d961	vfs cache: always assert on ndp->ni_resflags	2023-03-25 21:57:55 +00:00
Mateusz Guzik	138a5dafba	vfs: trylock vnode requeue The quasi-LRU still gets in the way for example when doing an incremental bzImage build, with vnode_list lock being at the top of the profile. Further damage control the problem by trylocking. Note the entire mechanism desperately wants to be reaped out in favor of something(tm) which both scales in a multicore setting and provides sensible replacement policy. With this change everything vfs almost disappears from the on CPU flamegraph, what is left is tons of contention in the VM.	2023-03-25 13:42:27 +00:00
Mateusz Guzik	245767c278	vfs: flip deferred_inact to atomic Turns out it is very rarely triggered, making a per-cpu counter a waste. Examples from real life boxes (uptime: counter): 135 days: 847; 138 days: 2190; 141 days: 1	2023-03-25 13:42:27 +00:00
Mateusz Guzik	e5eb1d298f	vfs: replace some spelled out VNASSERTs with VNPASS nfc	2023-03-25 13:42:27 +00:00
Kyle Evans	89c52f9d59	arm64: add KASAN support This entails: - Marking some obvious candidates for __nosanitizeaddress - Similar trap frame markings as amd64, for similar reasons - Shadow map implementation The shadow map implementation is roughly similar to what was done on amd64, with some exceptions. Attempting to use available space at preinit_map_va + PMAP_PREINIT_MAPPING_SIZE (up to the end of that range, as depicted in the physmap) results in odd failures, so we instead search the physmap for free regions that we can carve out, fragmenting the shadow map as necessary to try and fit as much as we need for the initial kernel map. pmap_bootstrap_san() is thus after pmap_bootstrap(), which still included some technically reserved areas of the memory map that needed to be included in the DMAP. The odd failure noted above may be a bug, but I haven't investigated it all that much. Initial work by mhorne with additional fixes from kevans and markj. Reviewed by: andrew, markj Sponsored by: Juniper Networks, Inc. Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D36701	2023-03-23 16:34:33 -05:00
John Baldwin	d2dab20c2a	ktls: Drop all the INET and INET6 compile-time guards. Consistent with `9fd0d9b16e`, KERN_TLS is not supported on kernels without any INET support. Reviewed by: gallatin, hselasky MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D39232	2023-03-23 14:29:07 -07:00
Mateusz Guzik	c16c4ea6d3	vfs cache: return ENOTDIR for not_a_dir/{.,..} lookups Reported by: Oliver Kiddle PR: 270419 MFC: 3 days	2023-03-23 19:31:18 +00:00
Mateusz Guzik	b5d43972e3	vfs: decouple freevnodes from vnode batching In principle one cpu can keep vholding vnodes, while another vdrops them. In this case it may be the local count will keep growing in an unbounded manner. Roll it up after a threshold instead. While here move it out of dpcpu into struct pcpu. Reviewed by: kib (previous version) Differential Revision: https://reviews.freebsd.org/D39195	2023-03-22 23:57:25 +00:00
Mark Johnston	b4b33821fa	ktls: Fix interlocking between ktls_enable_rx() and listen(2) The TCP_TXTLS_ENABLE and TCP_RXTLS_ENABLE socket option handlers check whether the socket is a listening socket and fail if so, but this check is racy. Since we have to lock the socket buffer later anyway, defer the check to that point. ktls_enable_tx() locks the send buffer's I/O lock, which will fail if the socket is a listening socket, so no explicit checks are needed. In ktls_enable_rx(), which does not acquire the I/O lock (see the review for some discussion on this), use an explicit SOLISTENING() check after locking the recv socket buffer. Otherwise, a concurrent solisten_proto() call can trigger crashes and memory leaks by wiping out socket buffers as ktls_enable_*() is modifying them. Also make sure that a KTLS-enabled socket can't be converted to a listening socket, and use SOCK_(SEND|RECV)BUF_LOCK macros instead of the old ones while here. Add some simple regression tests involving listen(2). Reported by: syzkaller MFC after: 2 weeks Reviewed by: gallatin, glebius, jhb Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D38504	2023-03-21 16:04:00 -04:00
Mitchell Horne	8965b3033e	callout(9): adopt old references to timeout(9) timeout(9) was removed a couple of years ago; all consumers now use the callout(9) interface. Explicitly do not bump .Dd anywhere, as this is not a content or semantic change. Reviewed by: markj, jhb, Pau Amma <pauamma@gundo.com> MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D39136	2023-03-20 17:12:12 -03:00
Mark Johnston	c3179891f8	kerneldump: Inline dump_savectx() into its callers The callers of dump_savectx() (i.e., doadump() and livedump_start()) subsequently call dumpsys()/minidumpsys(), which dump the calling thread's stack when writing the dump. If dump_savectx() gets its own stack frame, that frame might be clobbered when its caller later calls dumpsys()/minidumpsys(), making it difficult for debuggers to unwind the stack. Fix this by making dump_savectx() a macro, so that savectx() is always called directly by the function which subsequently calls dumpsys()/minidumpsys(). This fixes stack unwinding for the panicking thread from arm64 minidumps. The same happened to work on amd64, but kgdb reports the dump_savectx() calls as coming from dumpsys(), so in that case it appears to work by accident. Fixes: `c9114f9f86` ("Add new vnode dumper to support live minidumps") Reviewed by: mhorne, jhb MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D39151	2023-03-20 14:16:28 -04:00
Mateusz Guzik	62a573d953	vfs: retire KERN_VNODE It got disabled in 2003: commit `acb18acfec` Author: Poul-Henning Kamp <phk@FreeBSD.org> Date: Sun Feb 23 18:09:05 2003 +0000 Bracket the kern.vnode sysctl in #ifdef notyet because it results in massive locking issues on diskless systems. It is also not clear that this sysctl is non-dangerous in its requirements for locked down memory on large RAM systems. There does not seem to be practical use for it and the disabled routine does not work anyway. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D39127	2023-03-17 16:21:45 +00:00
Mina Galić	0b0ae2e4cd	jail: convert several functions from int to bool these functions exclusively return (0) and (1), so convert them to bool We also convert some networking related jail functions from int to bool some of which were returning an error that was never used. Differential Revision: https://reviews.freebsd.org/D29659 Reviewed by: imp, jamie (earlier version) Pull Request: https://github.com/freebsd/freebsd-src/pull/663	2023-03-14 21:05:33 -06:00
Mark Johnston	cd133525fa	smr: Remove the return value from smr_wait() This is supposed to be a blocking version of smr_poll(), so there's no need for a return value. No functional change intended. MFC after: 1 week	2023-03-13 10:45:35 -04:00
Kyle Evans	cc0fe048ec	kern: physmem: don't create a new exregion for different flags... ... if the region we're adding is an exact match to one that we already have. Simply extend the flags of the existing entry as needed so that we don't end up with duplicate regions. It could be that we got the exclusion through two different means, e.g., FDT memreserve and the EFI memory map, and we may derive different characteristics from each. Apply the most restrictive set to the region. Reported by: Mark Millard <marklmi yahoo com> Reviewed by: mhorne	2023-03-09 23:27:39 -06:00
Justin Hibbits	084846271a	ktls: Use IfAPI accessors to get capabilities Summary: Avoid referencing the ifnet struct directly, and use the IfAPI accessors instead. Reviewed by: gallatin Sponsored by: Juniper Networks, Inc. Differential Revision: https://reviews.freebsd.org/D38932	2023-03-07 09:47:00 -05:00
Mark Johnston	831601773e	deadlkres: Make parameters settable with tunables MFC after: 1 week Sponsored by: Klara, Inc. Sponsored by: Juniper Networks, Inc.	2023-03-03 11:16:41 -05:00
Rick Macklem	cbbb22031f	kern_jail.c: Remove #ifdefs for VNET_NFSD The consensus was that VNET_NFSD was not needed. This patch removes it from kern_jail.c. With this patch, support for the "allow.nfsd" jail parameter is enabled in the kernel for kernels built with "options VIMAGE". Reviewed by: markj MFC after: 3 months Differential Revision: https://reviews.freebsd.org/D38808	2023-03-02 13:13:24 -08:00
Rick Macklem	4bbbd5875d	vfs_mount.c: Allow mountd(8) to do exports in a vnet prison To run mountd in a vnet prison, three checks in vfs_domount() and vfs_domount_update() related to doing exports needed to be changed, so that a file system visible within the prison but mounted outside the prison can be exported. I did all three in a minimal way, only changing the checks for the specific case of a process (typically mountd) doing exports within a vnet prison and not updating the mount point in other ways. The changes are: - Ignore the error return from vfs_suser(), since the file system being mounted outside the prison will cause it to fail. - Use the priv_check(PRIV_NFS_DAEMON) for this specific case within a prison. - Skip the call to VFS_MOUNT(), since it will return an error, due to the "from" argument not being set correctly. VFS_MOUNT() does not appear to do anything for the case of doing exports only. Reviewed by: markj MFC after: 3 months Differential Revision: https://reviews.freebsd.org/D37741	2023-03-02 13:09:01 -08:00
Mark Johnston	bcd8cd859e	buf: Make buf_daemon_shutdown() a no-op after a panic As in commit `9d7cc536e2`, there is no need to do anything in this context. MFC after: 1 week	2023-03-01 10:15:54 -05:00
Mateusz Guzik	a357112938	kern: whack __mips__ leftover Sponsored by: Rubicon Communications, LLC ("Netgate")	2023-03-01 11:05:12 +00:00
Zhenlei Huang	2c33b456ff	jail: Improve readability No functional change intended. Reviewed by: melifaro Differential Revision: https://reviews.freebsd.org/D37890	2023-02-28 18:20:07 +08:00
Zhenlei Huang	500f82d6c3	jail: Use flexible array member within struct prison_ip The current implementation utilizes an off-by-one struct prison_ip to access the IPv[46] addresses. It is error prone, hence the regression fixes `21ad3e27fa` and `ddbf879d79`. Use a flexible array member so that the compiler will catch such errors and it will also be easier to review. No functional change intended. Reviewed by: melifaro, glebius Differential Revision: https://reviews.freebsd.org/D37874	2023-02-28 18:20:06 +08:00
Sebastian Huber	28ed159f26	pps: Round to closest integer in pps_event() The comment above bintime2timespec() says: When converting between timestamps on parallel timescales of differing resolutions it is historical and scientific practice to round down. However, the delta_nsec value is a time difference and not a timestamp. Also the rounding errors accumulate in the frequency accumulator, see hardpps(). So, rounding to the closest integer is probably slightly better. Reviewed by: imp Pull Request: https://github.com/freebsd/freebsd-src/pull/604	2023-02-27 15:10:55 -07:00
Sebastian Huber	1e48d9d336	pps: Simplify the nsec calculation in pps_event() Let A be the current calculation of the frequency accumulator (pps_fcount) update in pps_event() scale = (uint64_t)1 << 63; scale /= captc->tc_frequency; scale *= 2; bt.sec = 0; bt.frac = 0; bintime_addx(&bt, scale * tcount); bintime2timespec(&bt, &ts); hardpps(tsp, ts.tv_nsec + 1000000000 * ts.tv_sec); and hardpps(..., delta_nsec): u_nsec = delta_nsec; if (u_nsec > (NANOSECOND >> 1)) u_nsec -= NANOSECOND; else if (u_nsec < -(NANOSECOND >> 1)) u_nsec += NANOSECOND; pps_fcount += u_nsec; This change introduces a new calculation which is slightly simpler and more straight forward. Name it B. Consider the following sample values with a tcount of 2000000100 and a tc_frequency of 2000000000 (2GHz). For A, the scale is 9223372036. Then scale * tcount is 18446744994337203600 which is larger than UINT64_MAX (= 18446744073709551615). The result is 920627651984 == 18446744994337203600 % UINT64_MAX. Since all operands are unsigned the result is well defined through modulo arithmetic. The result of bintime2timespec(&bt, &ts) is 49. This is equal to the correct result 1000000049 % NANOSECOND. In hardpps(), both conditional statements are not executed and pps_fcount is incremented by 49. For the new calculation B, we have 1000000000 * tcount is 2000000100000000000 which is less than UINT64_MAX. This yields after the division with tc_frequency the correct result of 1000000050 for delta_nsec. In hardpps(), the first conditional statement is executed and pps_fcount is incremented by 50. This shows that both methods yield roughly the same results. However, method B is easier to understand and requires fewer conditional statements. Reviewed by: imp Pull Request: https://github.com/freebsd/freebsd-src/pull/604	2023-02-27 15:10:55 -07:00
Sebastian Huber	8a142484d4	pps: Directly assign the timestamps in pps_event() Reviewed by: imp Pull Request: https://github.com/freebsd/freebsd-src/pull/604	2023-02-27 15:10:55 -07:00
Sebastian Huber	0448501f2b	pps: Move pcount assignment in pps_event() Move the pseq increment. This makes it possible to reuse registers earlier. Reviewed by: imp Pull Request: https://github.com/freebsd/freebsd-src/pull/604	2023-02-27 15:10:55 -07:00
Sebastian Huber	fd88f4e190	pps: Simplify capture and event processing Use local variables for the captured timehand and timecounter in pps_event(). This fixes a potential issue in the nsec preparation for hardpps(). Here the timecounter was accessed through the captured timehand after the generation was checked. Make a snapshot of the relevant timehand values early in pps_event(). Check the timehand generation only once during the capture and event processing. Use atomic_thread_fence_acq() similar to the other readers. Reviewed by: imp Pull Request: https://github.com/freebsd/freebsd-src/pull/604	2023-02-27 15:10:55 -07:00
Sebastian Huber	cb2a028b15	pps: Load timecounter once in pps_capture() This ensures that the timecounter and the tc_get_timecount handler belong together. Reviewed by: imp Pull Request: https://github.com/freebsd/freebsd-src/pull/604	2023-02-27 15:10:54 -07:00
Mateusz Guzik	1ebec3806e	vfs: s/ppsratecheck/eventratecheck nfc	2023-02-24 19:30:49 +00:00
Mateusz Guzik	83158c6893	time: s/ppsratecheck/eventratecheck The routine is used as a general event-limiting routine in places which have nothing to do with packets. Provide a define to keep everything happy. Reviewed by: rew Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D38746	2023-02-24 19:26:36 +00:00
Mark Johnston	9d7cc536e2	buf: Make bufspace_daemon_shutdown() a no-op after a panic This function doesn't need to do anything in that context, and calling wakeup() can lead to recursive panics. Discussed with: mhorne MFC after: 1 week	2023-02-23 21:56:36 -05:00
Mitchell Horne	9a7f7c26c5	lockmgr: upgrade panic return checks We short-circuit lockmgr functions in the face of a kernel panic. Other lock implementations do this with a SCHEDULER_STOPPED() check, which covers the additional case where the debugger is active but the system has not panicked. Update this code to match that behaviour. Reviewed by: mjg, kib, markj MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D38655	2023-02-22 11:12:22 -04:00
Rick Macklem	88175af8b7	vfs_export: Add mnt_exjail to control exports done in prisons If there are multiple instances of mountd(8) (in different prisons), there will be confusion if they manipulate the exports of the same file system. This patch adds mnt_exjail to "struct mount" so that the credentials (and, therefore, the prison) that did the exports for that file system can be recorded. If another prison has already exported the file system, vfs_export() will fail with an error. If mnt_exjail == NULL, the file system has not been exported. mnt_exjail is checked by the NFS server, so that exports done from within a different prison will not be used. The patch also implements vfs_exjail_destroy(), which is called from prison_cleanup() to release all the mnt_exjail credential references, so that the prison can be removed. Mainly to avoid doing a scan of the mountlist for the case where there were no exports done from within the prison, a count of how many file systems have been exported from within the prison is kept in pr_exportcnt. Reviewed by: markj Discussed with: jamie Differential Revision: https://reviews.freebsd.org/D38371 MFC after: 3 months	2023-02-21 13:00:42 -08:00
Gleb Smirnoff	71e70c25c0	Revert "unix/dgram: return EAGAIN instead of ENOBUFS when O_NONBLOCK set" This API change led to unexpected consequences with Go runtime. The Go runtime emulates blocking sockets over non-blocking sockets and for that uses available event dispatcher on the target OS, which is kevent(2) if available, with OS independent layer on top. It expects that if any O_NONBLOCK socket ever returned EAGAIN, then it is supposed to be reported as writable by the event dispatcher. kevent(2) would never report a unix/dgram socket, since they never change their state, they always are writeable. The expectations of Go are not literally specified by SUS, however they are in its spirit. The SUS specifies EAGAIN for send(2) as "The socket's file descriptor is marked O_NONBLOCK and the requested operation would block" [1]. This doesn't apply to FreeBSD unix/dgram socket, it never blocks on send(2). Thus, changing API trying to mimic Linux was a mistake. But what about the problem we tried to fix? Discussed that with Max Dounin of nginx, and we agreed that the log bomb described shall be fixed on nginx side, and it actually isn't specific to FreeBSD, may happen with nginx on any non-Linux system with a certain configuration. [1] https://pubs.opengroup.org/onlinepubs/9699919799/functions/send.html This reverts commit `65572cade3`.	2023-02-21 08:50:07 -08:00
Zhenlei Huang	b2d76b52fd	jail: Fix redoing ip restricting `prison_ip_restrict()` is called in loop FOREACH_PRISON_DESCENDANT_LOCKED. While under low memory, it is still possible that in subsequent rounds `prison_ip_restrict()` succeeds and `redo_ip[46]` flips over from true to false, thus leaving some prisons' IPv[46] addresses unrestricted. Reviewed by: jamie Fixes: `8bce8d28ab` jail: Avoid multipurpose return value of function prison_ip_restrict() Differential Revision: https://reviews.freebsd.org/D38697	2023-02-21 23:43:25 +08:00
Konstantin Belousov	836e4b371b	kern/sysv_ipc.c: use ANSI C function definition Also remove pointless return's. Sponsored by: The FreeBSD Foundation MFC after: 3 days	2023-02-21 16:02:46 +02:00
Mateusz Guzik	de709b1455	sx: whack set-but-not-used warn in _sx_slock_hard Sponsored by: Rubicon Communications, LLC ("Netgate")	2023-02-21 13:49:14 +00:00
Mateusz Guzik	dbcd7e7e32	vfs cache: whack set-but-not-used warn in cache_purgevfs Reported by: kib Sponsored by: Rubicon Communications, LLC ("Netgate")	2023-02-21 13:48:35 +00:00
Kyle Evans	c32946d8be	kern: physmem: fix the format string again, i is a size_t Fixes the riscv LINT build. Fixes: `7b5cb32fca` ("kern: physmem: properly cast %jx [...]")	2023-02-20 23:39:38 -06:00
Kyle Evans	7b5cb32fca	kern: physmem: properly cast %jx arguments to uintmax_t While we're here, slap prfunc with a __printflike to get compiler checking on args to catch silly mistakes like this. Reported by: jrtc27	2023-02-20 16:12:55 -06:00
Kyle Evans	cd73914b01	kern: physmem: don't truncate addresses in DEBUG output Make it consistent with the above region printing, otherwise it appears to be somewhat confusing.	2023-02-20 12:55:04 -06:00
Elliott Mitchell	3aed0ffc15	kern/clock: remove interrupt reporting from watchdog_fire() The interrupt counts may have been valuable in the past, but now DDB can readily provide them via 'show intrcnt'. This is one of the only consumers of these counter arrays outside of the interrupt code itself, and this should be avoided. Reviewed by: mhorne, fuz Differential Revision: https://reviews.freebsd.org/D37870	2023-02-16 17:24:29 -04:00
John Baldwin	98844e99d4	aio: Fix more synchronization issues in aio_biowakeup. - Use atomic_store to set job->error. atomic_set does an or operation, not assignment. - Use refcount_* to manage job->nbio. This ensures proper memory barriers are present so that the last bio won't see a possibly stale value of job->error. - Don't re-read job->error after reading it via atomic_load. Reported by: markj (1) Reviewed by: mjg, markj Differential Revision: https://reviews.freebsd.org/D38611	2023-02-15 13:32:52 -08:00
John Baldwin	cca6d6160f	aio_biowakeup: Various style fixes.	2023-02-15 10:57:08 -08:00
Keith Reynolds	40734fc57e	aio: Fix a test and set race in aio_biowakeup. Use atomic_fetchadd in place of separate atomic_subtract / atomic_load. Reviewed by: markj Sponsored by: HPE TidalScale Differential Revision: https://reviews.freebsd.org/D38559	2023-02-15 10:56:39 -08:00
Mitchell Horne	28137bdb19	intrng: track counter allocation with a bitmap Crucially, this allows releasing counters, and interrupt sources by extension. Where before we were incrementing intrcnt_index with atomics, now we protect the bitmap using the existing isrc_table_lock mutex. Reviewed by: mmel MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D38437	2023-02-14 14:06:00 -04:00
Mitchell Horne	82e846df5b	intrng: sort includes MFC after: 3 days	2023-02-14 14:06:00 -04:00
Mark Johnston	636b19ead4	tcp: Disallow re-connection of a connected socket soconnectat() tries to ensure that one cannot connect a connected socket. However, the check is racy and does not really prevent two threads from attempting to connect the same TCP socket. Modify tcp_connect() and tcp6_connect() to perform the check again, this time synchronized by the inpcb lock, under which we call soisconnecting(). Reported by: syzkaller Reviewed by: glebius MFC after: 2 weeks Sponsored by: Klara, Inc. Sponsored by: Modirum MDPay Differential Revision: https://reviews.freebsd.org/D38507	2023-02-14 10:07:19 -05:00
Konstantin Belousov	020e8a4d06	allocbuf(): convert direct panic() calls to KASSERT()s Also do minor style adjustments. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D38549	2023-02-14 00:28:42 +02:00
Mateusz Guzik	a066bba2da	ntptime: ansify Sponsored by: Rubicon Communications, LLC ("Netgate")	2023-02-13 18:24:13 +00:00
Mateusz Guzik	00343b4adc	uipc: ansify Sponsored by: Rubicon Communications, LLC ("Netgate")	2023-02-13 18:20:29 +00:00
Mitchell Horne	78919798e7	kern_poll: include sys/sched.h For sched_relinquish(). This fixes the build for some kernel configs. Reported by: Jenkins Fixes: `1029dab634` ("mi_switch(): clean up switch types and their usage")	2023-02-09 17:13:02 -04:00
Andrew Gallatin	d24b032bec	ktls: Fix comments & whitespace issues with `c0e4090e3d` Address some last minute review feedback on `c0e4090e3d` by fixing spacing around comments, and clarifying that the newly added destroy_task is not related to tls 1.0. No functional change intended. Pointed out by: jhb Sponsored by: Netflix	2023-02-09 14:11:24 -05:00
Andrew Gallatin	c0e4090e3d	ktls: Accurately track if ifnet ktls is enabled This allows us to avoid spurious calls to ktls_disable_ifnet(). When we implemented ifnet kTLS, we set a flag in the tx socket buffer (SB_TLS_IFNET) to indicate ifnet kTLS. This flag meant that now, or in the past, ifnet ktls was active on a socket. Later, I added code to switch ifnet ktls sessions to software in the case of lossy TCP connections that have a high retransmit rate. Because TCP was using SB_TLS_IFNET to know if it needed to do math to calculate the retransmit ratio and potentially call into ktls_disable_ifnet(), it was doing unneeded work long after a session was moved to software. This patch carefully tracks whether or not ifnet ktls is still enabled on a TCP connection. Because the inp is now embedded in the tcpcb, and because TCP is the most frequent accessor of this state, it made sense to move this from the socket buffer flags to the tcpcb. Because we now need reliable access to the tcpcb, we take a ref on the inp when creating a tx ktls session. While here, I noticed that rack/bbr were incorrectly implementing tfb_hwtls_change(), and applying the change to all pending sends, when it should apply only to future sends. This change reduces spurious calls to ktls_disable_ifnet() by 95% or so in a Netflix CDN environment. Reviewed by: markj, rrs Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D38380	2023-02-09 12:44:44 -05:00
Mitchell Horne	1029dab634	mi_switch(): clean up switch types and their usage Overall, this is a non-functional change, except for kernels built with SCHED_STATS. However, the switch types are useful for communicating the intent of the caller. 1. Ensure that every caller provides a type. In most cases, we upgrade the basic yield to sched_relinquish() aka SWT_RELINQUISH. 2. The case of sched_bind() is distinct, so add a new switch type SWT_BIND. 3. Remove the two unused types, SWT_PREEMPT and SWT_SLEEPQTIMO. 4. Remove SWT_NONE altogether and assert that callers always provide a type flag. 5. Reference the mi_switch(9) man page in the comments, as these flags will be documented there. Reviewed by: kib, markj Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D38184	2023-02-09 12:01:32 -04:00
Mitchell Horne	bff02948ed	sched_4bsd: use the same switch flags as ULE ULE uses the more specific SWT_REMOTEPREEMPT and SWT_REMOTEWAKEIDLE switch types, let's do that here as well. SWT_PREEMPT is somewhat redundant when we also have the SW_PREEMPT flag. This only has an effect for kernels built with SCHED_STATS. Reviewed by: kib, markj Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D38183	2023-02-09 12:01:32 -04:00
Mitchell Horne	dc9b13736f	Use maybe_yield() in a few more places Reviewed by: kib, markj MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D38186	2023-02-09 11:58:06 -04:00
Mitchell Horne	d570418bd8	Boolify should_yield() Do this ahead of adding a man page that describes the function. No functional change. Reviewed by: kib, markj MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D38181	2023-02-09 11:58:06 -04:00
Mitchell Horne	a7a452fedc	Update comments referencing create_thread() The equivalent function is now named thread_create(). Mention kthread_add() where it is also relevant. Reviewed by: kib, markj MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D38180	2023-02-09 11:58:06 -04:00
Mitchell Horne	e6cf1a0826	physmem: add ram0 pseudo-driver Its purpose is to reserve all I/O space belonging to physical memory from nexus, preventing it from being handed out by bus_alloc_resource() to callers such as xenpv_alloc_physmem(), which looks for the first available free range it can get. This mimics the existing pseudo-driver on x86. If needed, the device can be disabled with hint.ram.0.disabled="1" in /boot/device.hints. Reviewed by: imp MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D32343	2023-02-08 16:50:46 -04:00
Mateusz Guzik	08d357287b	sysv: ansify Reported by: clang 15 Sponsored by: Rubicon Communications, LLC ("Netgate")	2023-02-08 00:11:10 +00:00
Mateusz Guzik	8377575772	vfs: ansify Reported by: clang 15 Sponsored by: Rubicon Communications, LLC ("Netgate")	2023-02-07 23:03:20 +00:00
Mark Johnston	27202b98dc	jail: Use atomic(9) instead of CK atomics There's no reason to use one over the other here, let's prefer the interface that's used elsewhere in the kernel. No functional change intended. Reviewed by: mjg Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D38360	2023-02-07 15:10:24 -05:00
Val Packett	4a1c4de232	Allow sysctl hw.machine/hw.machine_arch in capability mode There's no harm in reading strings like 'amd64'. Reviewed by: emaste, manu Sponsored by: https://www.patreon.com/valpackett Differential Revision: https://reviews.freebsd.org/D28703	2023-02-06 14:00:52 -05:00
Justin Hibbits	6472761966	IfAPI: use IfAPI in mbuf Sponsored by: Juniper Networks, Inc.	2023-02-06 12:32:04 -05:00
Justin Hibbits	1e6131bad6	IfAPI: Add needed APIs for mbuf support Summary: Add 2 new APIs for supporting recent mbuf changes: * `36e0a362ac` added the m_snd_tag_alloc() wrapper around if_snd_tag_alloc(). Push this down to the ifnet level. * `4d7a1361ef` adds the m_rcvif_serialize()/m_rcvif_restore() KPIs to serialize and restore an ifnet pointer. Add the necessary wrapper to get the index generation for this. Reviewed By: jhb Sponsored by: Juniper Networks, Inc. Differential Revision: https://reviews.freebsd.org/D38340	2023-02-06 12:32:04 -05:00
Rick Macklem	db5655124c	vfs_mount.c: Free exports structures in vfs_destroy_mount() During testing of exporting file systems in jails, I noticed that the export structures on a mount were not being free'd when the mount is dismounted. This bug appears to have been in the system for a very long time. It would have resulted in a slow memory leak when exported file systems were dismounted. Prior to r362158, freeing the structures during dismount would not have been safe, since VFS_CHECKEXP() returned a pointer into an export structure, which might still have been used by the NFS server for an in-progress RPC when the file system is dismounted. r362158 fixed this, so it should now be safe to free the structures in vfs_mount_destroy(), which is what this patch does. Reviewed by: kib MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D38385	2023-02-04 14:45:23 -08:00
Rick Macklem	d94e0bdc14	Revert "vfs_export: Add checks for correct prison when updating exports" This reverts commit `7926a01ed7`. A new patch in D38371 is being considered for doing this.	2023-02-04 14:38:32 -08:00
Konstantin Belousov	3b6056204d	FIOSEEKHOLE/FIOSEEKDATA: correct consistency for bmap-based implementation Writes on UFS through a mapped region do not allocate disk blocks in holes immediately. The blocks are allocated when the pages are paged out first time. This breaks the algorithm in vn_bmap_seekhole() and ufs_bmap_seekdata(), because VOP_BMAP() reports hole for the place which already contains a valid data. Clean the pages before doing VOP_BMAP() in the affected functions. In principle, we could clean less by only requesting clean starting from the offset, but it is probably not very important. PR: 269261 Reported by: asomers Reviewed by: asomers, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D38379	2023-02-04 20:32:07 +02:00
Pawel Jakub Dawidek	c54d240eb1	kern_prot.c p_candebug(): Remove single-use variable. Reviewed by: allanjude, oshogbo Approved by: allanjude, oshogbo Differential Revision: https://reviews.freebsd.org/D38288	2023-02-02 17:00:24 -08:00
Brooks Davis	5c274b3622	whitespace: rewrap to match case directly above It's easier to visually diff the two case blocks if there aren't gratuitous whitespace differences. Sponsored by: DARPA	2023-02-03 00:37:31 +00:00
Rick Macklem	7926a01ed7	vfs_export: Add checks for correct prison when updating exports mountd(8) basically does the following: getmntinfo() for each mount delete_exports using nmount(2) to do the creation/deletion of individual exports. For prison0 (and for other prisons if enforce_statfs == 0) getmntinfo() returns all mount points, including ones being used within other prisons. This can cause confusion if the same file system is specified in the exports(5) file for multiple prisons. This patch adds a permanent identifier to each prison and marks which prison did the exports in a field of the mount structure called mnt_exjail. This field can then be compared to the permanent identifier for the prison that the thread's credentials are in. Also required was a new function called prison_isalive_permid() which returns whether the prison is alive, so that the check can be ignored for prisons that have been removed. This prepares the system to allow mountd(8) to run in multiple prisons, including prison0. Future commits will complete the modifications to allow mountd(8) to run in vnet prisons. Until then, these changes should not affect semantics. Reviewed by: markj MFC after: 3 months Differential Revision: https://reviews.freebsd.org/D38144	2023-02-02 16:20:58 -08:00
Dag-Erling Smørgrav	69d94f4c76	Add tarfs, a filesystem backed by tarballs. Sponsored by: Juniper Networks, Inc. Sponsored by: Klara, Inc. Reviewed by: pauamma, imp Differential Revision: https://reviews.freebsd.org/D37753	2023-02-02 18:19:29 +01:00
Rick Macklem	99187c3a44	prison_check_nfsd: Add check for enforce_statfs != 0 Since mountd(8) will not be able to do exports when running in a vnet prison if enforce_statfs is set to 0, add a check for this to prison_check_nfsd(). Reviewed by: jamie, markj MFC after: 2 months Differential Revision: https://reviews.freebsd.org/D38189	2023-02-01 16:02:20 -08:00
Konstantin Belousov	2555f175b3	Move kstack_contains() and GET_STACK_USAGE() to MD machine/stack.h Reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D38320	2023-02-02 00:59:26 +02:00
Gleb Smirnoff	a0102dee34	sockets: in sousrsend() pass down the error to aio(4) This somewhat undermines the initial goal of sousrsend() to have all the special error handling for a write on a socket in a single place. The aio(4) needs to see EWOULDBLOCK to re-schedule the job. Because aio(4) handles return from soreceive() and sousrsend() with the same code, we can't check for (error == 0 && done < job_nbytes). Keeping this exclusion for aio(4) seems a lesser evil. Fixes: `7a2c93b86e`	2023-02-01 13:03:10 -08:00
Gleb Smirnoff	fd53298799	unix: add myself to the copyright notice for the new implementation of PF_UNIX/SOCK_DGRAM	2023-02-01 09:39:28 -08:00
Justin Hibbits	9507d03bfe	IfAPI: Use the ifnet APIs in kern_poll() The only API used is if_name(). Sponsored by: Juniper Networks, Inc.	2023-01-31 15:02:16 -05:00
Sebastian Huber	c7c53e3ca6	Clarify hardpps() parameter name and comment Since `32c203577a` by phk in 1999 (Make even more of the PPSAPI implementations generic), the "nsec" parameter of hardpps() is a time difference and no longer a time point. Change the name to "delta_nsec" and adjust the comment. Remove comment about a clock tick adjustment which is no longer in the code. Pull Request: https://github.com/freebsd/freebsd-src/pull/640 Reviewed by: imp	2023-01-30 11:07:40 -07:00
Jose Luis Duran	df949e762c	kern_environment: Partially apply style(9) Sort include files, remove duplicates and remove trailing whitespace. Pull Request: https://github.com/freebsd/freebsd-src/pull/589 Reviewed by: imp	2023-01-30 10:47:56 -07:00
Dmitry Chagin	2058f075b4	cpuset: Handle CPU_WHICH_TIDPID wherever cpuset_which() is called. cpuset_which() resolves the argument pair which and id and returns references to the appropriate resources. To avoid leaking resources or accessing unresolved references to resources, handle the new which CPU_WHICH_TIDPID wherever cpuset_which() is called. To avoid code duplication cpuset_which2() has been added. Reported by: syzbot+331e8402e0f7347f0f2a@syzkaller.appspotmail.com Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D38272 MFC after: 2 weeks	2023-01-30 19:28:54 +03:00
Dmitry Chagin	e4754c8036	subr_smp: Trim trailing whitespaces. MFC after: 1 week	2023-01-29 16:18:17 +03:00
Dmitry Chagin	c21b080f3d	cpuset: Fix sched_[g|s]etaffinity() for better compatibility with Linux. Under Linux, the value returned from a call to gettid(2) (thread id) can be passed to the sched_[g|s]etaffinity() functions in the argument pid. Specifying pid as 0 will set the attribute for the calling thread, and passing the value returned from a call to getpid(2) (process id) will set the attribute for the main thread of the thread group. The native cpuset(2) family of system calls has a "which" argument to determine how the value of the id argument is interpreted, i.e., CPU_WHICH_TID is used to pass a thread id and CPU_WHICH_PID to pass a process id. For now the native sched_[g|s]etaffinity() implementation is wrong, as it uses "which" CPU_WHICH_PID to pass both (process and thread id) to the kernel. To fix this, add a new "which" CPU_WHICH_TIDPID intended to handle both ids. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D38209 MFC after: 1 week	2023-01-29 16:17:33 +03:00
Dmitry Chagin	01f74ccd5a	libthr: Fix pthread_attr_[g|s]etaffinity_np to match its manual and the kernel. Since `f35093f8` the semantics of the thread affinity functions have been changed to be compatible with Linux: In case of getaffinity(), the minimum cpuset_t size that the kernel permits is the maximum CPU id, present in the system, / NBBY bytes, the maximum size is not limited. In case of setaffinity(), the kernel does not limit the size of the user-provided cpuset_t, internally using only the meaningful part of the set, where the upper bound is the maximum CPU id, present in the system, no larger than the size of the kernel cpuset_t. To match pthread_attr_[g|s]etaffinity_np checks of the user-provided cpusets to the kernel behavior, export the minimum cpuset_t size allowed by the running kernel via the new sysctl kern.sched.cpusetsizemin and use it in checks. Reviewed by: Differential Revision: https://reviews.freebsd.org/D38112 MFC after: 1 week	2023-01-29 15:35:18 +03:00
Allan Jude	5ff13fbc19	MFV: zstd 1.5.2 Merge commit 'b3392d84da5bf2162baf937c77e0557f3fd8a52b' into zstd_1.5.2 full changelog: https://github.com/facebook/zstd/compare/v1.4.8...v1.5.2 Updated sys/kern/subr_compressor.c to new API MFC after: 3 days Relnotes: yes Sponsored by: Klara, Inc.	2023-01-27 17:22:31 +00:00
Gleb Smirnoff	f394d9c0a4	sysctl: use correct types and names in sysctl_*sec_to_sbintime The functions are intended to report kernel variables that are stored as sbintime_t (pointed to by arg1) as human readable nanoseconds or milliseconds (reported via sysctl_handle_64). The variable types and names were reversed. I guess there is no functional change here, as all types flipped around were signed 64. Note that these function aren't used yet anywhere in the kernel. Reviewed by: mav Differential revision: https://reviews.freebsd.org/D38217	2023-01-27 07:09:22 -08:00
Mitchell Horne	627ca221c3	kern_reboot: unconditionally call shutdown_reset() Currently shutdown_reset() is registered as the final entry of the shutdown_final event handler. However, if a panic occurs early in boot before the event is registered (SI_SUB_INTRINSIC), we may end up spinning in the subsequent infinite for loop and failing to reset altogether. Instead we can simply call this function unconditionally. Reviewed by: markj MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D37981	2023-01-23 15:10:24 -04:00
Jiajie Chen	dec7db4960	Add kf_file_nlink field to kf_file and populate it This will allow user-space programs (e.g. lsof) to locate deleted files whose nlink equals zero. Prior to this commit, programs have to use stat(kf_path) to get nlink, but that will fail if the file is deleted. [mjg: s/fail/file in the commit message] Reviewed by: mjg Differential Revision: https://reviews.freebsd.org/D38169	2023-01-23 17:09:52 +00:00
Konstantin Belousov	456f05756b	Handle int rank issues in vn_getsize_locked() and vn_seek() In vn_getsize_locked(), when storing vattr.va_size of type u_quad_t into off_t size, we must avoid overflow. Then, the check for fsize < 0, introduced in the commit `f45feecfb2` 'vfs: add vn_getsize', is a nop [1]. Reported and reviewed by: jhb Coverity CID: 1502346 Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D38133	2023-01-20 23:56:29 +02:00
Konstantin Belousov	5657f49ef3	kern_umtx.c do_wait(): correct confusing indent Sponsored by: The FreeBSD Foundation MFC after: 3 days	2023-01-20 23:33:11 +02:00
Brooks Davis	fa1d803c0f	epoch: replace hand coded assertion The assertion is equivalent to kstack_contains() so use that rather than spelling it out. Suggested by: jhb Reviewed by: jhb MFC after: 1 week Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D38107	2023-01-20 18:04:40 +00:00
John Baldwin	846e4a206f	ktls_disable_ifnet_help: Set curvnet around sorele(). This is required in kernels with VIMAGE such as GENERIC. MFC after: 1 week Sponsored by: Chelsio Communications	2023-01-18 15:39:04 -08:00
Konstantin Belousov	0f80d5ebc8	Require INVARIANTS and WITNESS if DEBUG_VFS_LOCKS is set Reported by: pho Reviewed by: markj, mjg Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D38070	2023-01-16 05:55:47 +02:00

19588 Commits