freebsd-dev

Author	SHA1	Message	Date
Gordon Bergling	c159f76713	kern: remove a double word in a KASSERT in subr_trap - s/with with/with/ MFC after: 5 days	2023-04-13 20:03:37 +02:00
Ed Maste	2ef2c26f3f	link_elf: fix SysV hash function overflow Quoting from https://maskray.me/blog/2023-04-12-elf-hash-function: The System V Application Binary Interface (generic ABI) specifies the ELF object file format. When producing an output executable or shared object needing a dynamic symbol table (.dynsym), a linker generates a .hash section with type SHT_HASH to hold a symbol hash table. A DT_HASH tag is produced to hold the address of .hash. The function is supposed to return a value no larger than 0x0fffffff. Unfortunately, there is a bug. When unsigned long consists of more than 32 bits, the return value may be larger than UINT32_MAX. For instance, elf_hash((const unsigned char *)"\xff\x0f\x0f\x0f\x0f\x0f\x12") returns 0x100000002, which is clearly unintended, as the function should behave the same way regardless of whether long represents a 32-bit integer or a 64-bit integer. Reviewed by: kib, Fangrui Song Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D39517	2023-04-12 15:33:55 -04:00
Konstantin Belousov	c53e990b8d	DEBUG_VFS_LOCKS: restore diagnostic for the witness use case Reviewed by: jah, markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D39477	2023-04-11 15:59:55 +03:00
Konstantin Belousov	75fc6f86c3	Add witness_is_owned(9) which returns an indicator if the current thread owns the specified lock. Reviewed by: jah, markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D39477	2023-04-11 15:59:49 +03:00
Konstantin Belousov	afa8f8971b	vn_start_write(): consistently set mpp to NULL on error or after failed sleep This ensures that mpp != NULL iff vn_finished_write() should be called, regardless of the returned error, except for V_NOWAIT. The only exception that must be maintained is the case where vn_start_write(V_NOWAIT) is called with the intent of later dropping other locks and then doing vn_start_write(V_XSLEEP), which needs the mp value calculated from the non-waitable call above it. Also note that V_XSLEEP is not supported by vn_start_secondary_write(). Reviewed by: markj, mjg (previous version), rmacklem (previous version) Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D39441	2023-04-11 15:59:46 +03:00
Konstantin Belousov	b2f3288747	vn_start_write(): minor style Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D39441	2023-04-11 15:59:39 +03:00
Konstantin Belousov	7b6fe2428a	DEBUG_VFS_LOCKS: use witness if available The assert_vop_locked messages are ignored, and file/line information is not too useful. Fixing this without changing both witness and VFS asserts KPIs is not possible. Reviewed by: markj (previous version) Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D39464	2023-04-10 00:34:12 +03:00
Konstantin Belousov	bb24eaea49	vn_lock_pair(): allow to request shared locking If either of vnodes is shared locked, lock must not be recursed. Requested by: rmacklem Reviewed by: markj, rmacklem Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D39444	2023-04-08 01:58:26 +03:00
Mateusz Guzik	02e6e8d218	vfs: extend vn_printf with vop vector	2023-04-07 20:39:06 +00:00
Mateusz Guzik	26b9648750	vfs: more informative panic for missing fplookup ops	2023-04-07 20:39:06 +00:00
Mateusz Guzik	f87a9f51ef	vfs: validate that a mount point with FPLOOKUP has vop_fplookup ops	2023-04-06 15:20:41 +00:00
Mateusz Guzik	e237e2ba5f	vfs: only allow doomed vnodes to return EOPNOTSUPP for fplookup vops This helps asserting that they are provided by filesystems indicating they do it.	2023-04-06 15:20:41 +00:00
Mateusz Guzik	5f6df17775	vfs: validate that vop vectors provide all or none fplookup vops In order to prevent later surprises.	2023-04-06 15:20:41 +00:00
Mateusz Guzik	0baef43ed0	vfs: add missing vop_fplookup ops to syncer	2023-04-06 15:20:41 +00:00
Mateusz Guzik	8495fa49ea	vfs: whack spurious comments from syncer's vop_vector	2023-04-06 15:20:40 +00:00
Konstantin Belousov	11cdffc603	Regen	2023-04-04 16:19:08 +03:00
Konstantin Belousov	dac3102488	Rename kqueue1(2) to kqueuex(2) to avoid compat issues with NetBSD Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D39377	2023-04-04 16:19:08 +03:00
Randall Stewart	73ee5756de	Fixes in the tcp infrastructure with respect to stack changes as well as other infrastructure updates for incoming rack features. So stack switching has always been a bit of an issue. We currently use a break before make setup which means that if something goes wrong you have to try to get back to a stack. This patch among a lot of other things changes that so that it is a make before break. We also expand some of the function blocks in prep for new features in rack that will allow more controlled pacing. We also add other abilities such as the pathway for a stack to query a previous stack to acquire from it critical state information so things in flight don't get dropped or mis-handled when switching stacks. We also add the concept of a timer granularity. This allows an alternate stack to change from the old ticks granularity to microseconds and of course this even gives us a pathway to go to nanosecond timekeeping if we need to (something for the data center to consider for sure). Once all this lands I will then update rack to begin using all these new features. Reviewed by: tuexen Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D39210	2023-04-01 01:46:38 -04:00
Mark Johnston	cab1056105	kdb: Modify securelevel policy Currently, sysctls which enable KDB in some way are flagged with CTLFLAG_SECURE, meaning that you can't modify them if securelevel > 0. This is so that KDB cannot be used to lower a running system's securelevel, see commit `3d7618d8bf`. However, the newer mac_ddb(4) restricts DDB operations which could be abused to lower securelevel while retaining some ability to gather useful debugging information. To enable the use of KDB (specifically, DDB) on systems with a raised securelevel, change the KDB sysctl policy: rather than relying on CTLFLAG_SECURE, add a check of the current securelevel to kdb_trap(). If the securelevel is raised, only pass control to the backend if MAC specifically grants access; otherwise simply check to see if mac_ddb vetoes the request, as before. Add a new secure sysctl, debug.kdb.enter_securelevel, to override this behaviour. That is, the sysctl lets one enter a KDB backend even with a raised securelevel, so long as it is set before the securelevel is raised. Reviewed by: mhorne, stevek MFC after: 1 month Sponsored by: Juniper Networks Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D37122	2023-03-30 10:45:00 -04:00
Mateusz Guzik	80cf427b8d	proc: shave a lock trip on exit if possible ... which happens to be vast majority of the time	2023-03-29 09:19:03 +00:00
Mateusz Guzik	37337709d3	cred: convert the refcount from int to long On 64-bit platforms this sorts out worries about mitigating bugs which overflow the counter, all while not pessimizing anything -- most notably it avoids whacking per-thread operation in favor of refcount(9) API. The struct already had two instances of 4 byte padding with 256 bytes in size, cr_flags gets moved around to avoid growing it. 32-bit platforms could also get the extended counter, but I did not do it as one day(tm) the mutex protecting centralized operation should be replaced with atomics and 64-bit ops on 32-bit platforms remain quite penalizing. While worries of counter overflow are addressed, the following is not (just like it would not be with conversion to refcount(9)): - counter underflows - buffer overruns from adjacent allocations - UAF due to stale cred pointer - .. and other goodies As such, while lipstick was placed, the pig should not be participating in any beauty pageants. Prodded by: emaste Differential Revision: https://reviews.freebsd.org/D39220	2023-03-29 05:02:32 +00:00
Konstantin Belousov	6a0a634590	Regen	2023-03-28 02:39:26 +03:00
Konstantin Belousov	61194e9852	Add kqueue1() syscall It takes the flags argument. Immediate use is to provide the KQUEUE_CLOEXEC flag for kqueue(2). Reviewed by: emaste, jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D39271	2023-03-28 02:39:26 +03:00
Alexander V. Chernikov	04f75b9802	netlink: allow netlink sockets in non-vnet jails. This change allows opening Netlink sockets in non-vnet jails, even for unprivileged processes. The security model largely follows the existing one. To be more specific: * by default, every `NETLINK_ROUTE` command is NOT allowed in non-VNET jail UNLESS `RTNL_F_ALLOW_NONVNET_JAIL` flag is specified in the command handler. * All notifications are disabled for non-vnet jails (requests to subscribe for the notifications are ignored). This will change to a more fine-grained model once the first netlink provider requiring this gets committed. * Listing interfaces (RTM_GETLINK) is allowed w/o limits (including interfaces w/o any addresses attached to the jail). The value of this is questionable, but it follows the existing approach. * Listing ARP/NDP neighbours is forbidden. This is a change from the current approach - currently we list static ARP/ND entries belonging to the addresses attached to the jail. * Listing interface addresses is allowed, but the addresses are filtered to match only ones attached to the jail. * Listing routes is allowed, but the routes are filtered to provide only host routes matching the addresses attached to the jail. * By default, every `NETLINK_GENERIC` command is allowed in non-VNET jail (as sub-families may be unrelated to network at all). It is the goal of the family author to implement the restriction if necessary. Differential Revision: https://reviews.freebsd.org/D39206 MFC after: 1 month	2023-03-26 08:44:09 +00:00
Mateusz Guzik	22eb66d961	vfs cache: always assert on ndp->ni_resflags	2023-03-25 21:57:55 +00:00
Mateusz Guzik	138a5dafba	vfs: trylock vnode requeue The quasi-LRU still gets in the way for example when doing an incremental bzImage build, with vnode_list lock being at the top of the profile. Further damage control the problem by trylocking. Note the entire mechanism desperately wants to be reaped out in favor of something(tm) which both scales in a multicore setting and provides sensible replacement policy. With this change everything vfs almost disappears from the on CPU flamegraph, what is left is tons of contention in the VM.	2023-03-25 13:42:27 +00:00
Mateusz Guzik	245767c278	vfs: flip deferred_inact to atomic Turns out it is very rarely triggered, making a per-cpu counter a waste. Examples from real life boxes: uptime counter 135 days 847 138 days 2190 141 days 1	2023-03-25 13:42:27 +00:00
Mateusz Guzik	e5eb1d298f	vfs: replace some spelled out VNASSERTs with VNPASS nfc	2023-03-25 13:42:27 +00:00
Kyle Evans	89c52f9d59	arm64: add KASAN support This entails: - Marking some obvious candidates for __nosanitizeaddress - Similar trap frame markings as amd64, for similar reasons - Shadow map implementation The shadow map implementation is roughly similar to what was done on amd64, with some exceptions. Attempting to use available space at preinit_map_va + PMAP_PREINIT_MAPPING_SIZE (up to the end of that range, as depicted in the physmap) results in odd failures, so we instead search the physmap for free regions that we can carve out, fragmenting the shadow map as necessary to try and fit as much as we need for the initial kernel map. pmap_bootstrap_san() is thus after pmap_bootstrap(), which still included some technically reserved areas of the memory map that needed to be included in the DMAP. The odd failure noted above may be a bug, but I haven't investigated it all that much. Initial work by mhorne with additional fixes from kevans and markj. Reviewed by: andrew, markj Sponsored by: Juniper Networks, Inc. Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D36701	2023-03-23 16:34:33 -05:00
John Baldwin	d2dab20c2a	ktls: Drop all the INET and INET6 compile-time guards. Consistent with `9fd0d9b16e`, KERN_TLS is not supported on kernels without any INET support. Reviewed by: gallatin, hselasky MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D39232	2023-03-23 14:29:07 -07:00
Mateusz Guzik	c16c4ea6d3	vfs cache: return ENOTDIR for not_a_dir/{.,..} lookups Reported by: Oliver Kiddle PR: 270419 MFC: 3 days	2023-03-23 19:31:18 +00:00
Mateusz Guzik	b5d43972e3	vfs: decouple freevnodes from vnode batching In principle one cpu can keep vholding vnodes, while another vdrops them. In this case it may be the local count will keep growing in an unbounded manner. Roll it up after a threshold instead. While here move it out of dpcpu into struct pcpu. Reviewed by: kib (previous version) Differential Revision: https://reviews.freebsd.org/D39195	2023-03-22 23:57:25 +00:00
Mark Johnston	b4b33821fa	ktls: Fix interlocking between ktls_enable_rx() and listen(2) The TCP_TXTLS_ENABLE and TCP_RXTLS_ENABLE socket option handlers check whether the socket is listening socket and fail if so, but this check is racy. Since we have to lock the socket buffer later anyway, defer the check to that point. ktls_enable_tx() locks the send buffer's I/O lock, which will fail if the socket is a listening socket, so no explicit checks are needed. In ktls_enable_rx(), which does not acquire the I/O lock (see the review for some discussion on this), use an explicit SOLISTENING() check after locking the recv socket buffer. Otherwise, a concurrent solisten_proto() call can trigger crashes and memory leaks by wiping out socket buffers as ktls_enable_*() is modifying them. Also make sure that a KTLS-enabled socket can't be converted to a listening socket, and use SOCK_(SEND\|RECV)BUF_LOCK macros instead of the old ones while here. Add some simple regression tests involving listen(2). Reported by: syzkaller MFC after: 2 weeks Reviewed by: gallatin, glebius, jhb Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D38504	2023-03-21 16:04:00 -04:00
Mitchell Horne	8965b3033e	callout(9): adopt old references to timeout(9) timeout(9) was removed a couple of years ago; all consumers now use the callout(9) interface. Explicitly do not bump .Dd anywhere, as this is not a content or semantic change. Reviewed by: markj, jhb, Pau Amma <pauamma@gundo.com> MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D39136	2023-03-20 17:12:12 -03:00
Mark Johnston	c3179891f8	kerneldump: Inline dump_savectx() into its callers The callers of dump_savectx() (i.e., doadump() and livedump_start()) subsequently call dumpsys()/minidumpsys(), which dump the calling thread's stack when writing the dump. If dump_savectx() gets its own stack frame, that frame might be clobbered when its caller later calls dumpsys()/minidumpsys(), making it difficult for debuggers to unwind the stack. Fix this by making dump_savectx() a macro, so that savectx() is always called directly by the function which subsequently calls dumpsys()/minidumpsys(). This fixes stack unwinding for the panicking thread from arm64 minidumps. The same happened to work on amd64, but kgdb reports the dump_savectx() calls as coming from dumpsys(), so in that case it appears to work by accident. Fixes: `c9114f9f86` ("Add new vnode dumper to support live minidumps") Reviewed by: mhorne, jhb MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D39151	2023-03-20 14:16:28 -04:00
Mateusz Guzik	62a573d953	vfs: retire KERN_VNODE It got disabled in 2003: commit `acb18acfec` Author: Poul-Henning Kamp <phk@FreeBSD.org> Date: Sun Feb 23 18:09:05 2003 +0000 Bracket the kern.vnode sysctl in #ifdef notyet because it results in massive locking issues on diskless systems. It is also not clear that this sysctl is non-dangerous in its requirements for locked down memory on large RAM systems. There does not seem to be practical use for it and the disabled routine does not work anyway. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D39127	2023-03-17 16:21:45 +00:00
Mina Galić	0b0ae2e4cd	jail: convert several functions from int to bool these functions exclusively return (0) and (1), so convert them to bool We also convert some networking related jail functions from int to bool some of which were returning an error that was never used. Differential Revision: https://reviews.freebsd.org/D29659 Reviewed by: imp, jamie (earlier version) Pull Request: https://github.com/freebsd/freebsd-src/pull/663	2023-03-14 21:05:33 -06:00
Mark Johnston	cd133525fa	smr: Remove the return value from smr_wait() This is supposed to be a blocking version of smr_poll(), so there's no need for a return value. No functional change intended. MFC after: 1 week	2023-03-13 10:45:35 -04:00
Kyle Evans	cc0fe048ec	kern: physmem: don't create a new exregion for different flags... ... if the region we're adding is an exact match to one that we already have. Simply extend the flags of the existing entry as needed so that we don't end up with duplicate regions. It could be that we got the exclusion through two different means, e.g., FDT memreserve and the EFI memory map, and we may derive different characteristics from each. Apply the most restrictive set to the region. Reported by: Mark Millard <marklmi yahoo com> Reviewed by: mhorne	2023-03-09 23:27:39 -06:00
Justin Hibbits	084846271a	ktls: Use IfAPI accessors to get capabilities Summary: Avoid referencing the ifnet struct directly, and use the IfAPI accessors instead. Reviewed by: gallatin Sponsored by: Juniper Networks, Inc. Differential Revision: https://reviews.freebsd.org/D38932	2023-03-07 09:47:00 -05:00
Mark Johnston	831601773e	deadlkres: Make parameters settable with tunables MFC after: 1 week Sponsored by: Klara, Inc. Sponsored by: Juniper Networks, Inc.	2023-03-03 11:16:41 -05:00
Rick Macklem	cbbb22031f	kern_jail.c: Remove #ifdefs for VNET_NFSD The consensus was that VNET_NFSD was not needed. This patch removes it from kern_jail.c. With this patch, support for the "allow.nfsd" jail parameter is enabled in the kernel for kernels built with "options VIMAGE". Reviewed by: markj MFC after: 3 months Differential Revision: https://reviews.freebsd.org/D38808	2023-03-02 13:13:24 -08:00
Rick Macklem	4bbbd5875d	vfs_mount.c: Allow mountd(8) to do exports in a vnet prison To run mountd in a vnet prison, three checks in vfs_domount() and vfs_domount_update() related to doing exports needed to be changed, so that a file system visible within the prison but mounted outside the prison can be exported. I did all three in a minimal way, only changing the checks for the specific case of a process (typically mountd) doing exports within a vnet prison and not updating the mount point in other ways. The changes are: - Ignore the error return from vfs_suser(), since the file system being mounted outside the prison will cause it to fail. - Use the priv_check(PRIV_NFS_DAEMON) for this specific case within a prison. - Skip the call to VFS_MOUNT(), since it will return an error, due to the "from" argument not being set correctly. VFS_MOUNT() does not appear to do anything for the case of doing exports only. Reviewed by: markj MFC after: 3 months Differential Revision: https://reviews.freebsd.org/D37741	2023-03-02 13:09:01 -08:00
Mark Johnston	bcd8cd859e	buf: Make buf_daemon_shutdown() a no-op after a panic As in commit `9d7cc536e2`, there is no need to do anything in this context. MFC after: 1 week	2023-03-01 10:15:54 -05:00
Mateusz Guzik	a357112938	kern: whack __mips__ leftover Sponsored by: Rubicon Communications, LLC ("Netgate")	2023-03-01 11:05:12 +00:00
Zhenlei Huang	2c33b456ff	jail: Improve readability No functional change intended. Reviewed by: melifaro Differential Revision: https://reviews.freebsd.org/D37890	2023-02-28 18:20:07 +08:00
Zhenlei Huang	500f82d6c3	jail: Use flexible array member within struct prison_ip Current implementation utilize off-by-one struct prison_ip to access the IPv[46] addresses. It is error prone and hence comes the regression fix `21ad3e27fa` and `ddbf879d79`. Use flexible array member so that compiler will catch such errors and it will also be easier to review. No functional change intended. Reviewed by: melifaro, glebius Differential Revision: https://reviews.freebsd.org/D37874	2023-02-28 18:20:06 +08:00
Sebastian Huber	28ed159f26	pps: Round to closest integer in pps_event() The comment above bintime2timespec() says: When converting between timestamps on parallel timescales of differing resolutions it is historical and scientific practice to round down. However, the delta_nsec value is a time difference and not a timestamp. Also the rounding errors accumulate in the frequency accumulator, see hardpps(). So, rounding to the closest integer is probably slightly better. Reviewed by: imp Pull Request: https://github.com/freebsd/freebsd-src/pull/604	2023-02-27 15:10:55 -07:00
Sebastian Huber	1e48d9d336	pps: Simplify the nsec calculation in pps_event() Let A be the current calculation of the frequency accumulator (pps_fcount) update in pps_event() scale = (uint64_t)1 << 63; scale /= captc->tc_frequency; scale *= 2; bt.sec = 0; bt.frac = 0; bintime_addx(&bt, scale * tcount); bintime2timespec(&bt, &ts); hardpps(tsp, ts.tv_nsec + 1000000000 * ts.tv_sec); and hardpps(..., delta_nsec): u_nsec = delta_nsec; if (u_nsec > (NANOSECOND >> 1)) u_nsec -= NANOSECOND; else if (u_nsec < -(NANOSECOND >> 1)) u_nsec += NANOSECOND; pps_fcount += u_nsec; This change introduces a new calculation which is slightly simpler and more straight forward. Name it B. Consider the following sample values with a tcount of 2000000100 and a tc_frequency of 2000000000 (2GHz). For A, the scale is 9223372036. Then scale * tcount is 18446744994337203600 which is larger than UINT64_MAX (= 18446744073709551615). The result is 920627651984 == 18446744994337203600 % UINT64_MAX. Since all operands are unsigned the result is well defined through modulo arithmetic. The result of bintime2timespec(&bt, &ts) is 49. This is equal to the correct result 1000000049 % NANOSECOND. In hardpps(), both conditional statements are not executed and pps_fcount is incremented by 49. For the new calculation B, we have 1000000000 * tcount is 2000000100000000000 which is less than UINT64_MAX. This yields after the division with tc_frequency the correct result of 1000000050 for delta_nsec. In hardpps(), the first conditional statement is executed and pps_fcount is incremented by 50. This shows that both methods yield roughly the same results. However, method B is easier to understand and requires fewer conditional statements. Reviewed by: imp Pull Request: https://github.com/freebsd/freebsd-src/pull/604	2023-02-27 15:10:55 -07:00
Sebastian Huber	8a142484d4	pps: Directly assign the timestamps in pps_event() Reviewed by: imp Pull Request: https://github.com/freebsd/freebsd-src/pull/604	2023-02-27 15:10:55 -07:00
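
The arithmetic in the pps_event() commit (1e48d9d336) can be replayed in userland. The following is a minimal sketch, not the kernel code: the helper names `pps_nsec_a`, `pps_nsec_b` and `hardpps_fold` are hypothetical, and the `bintime_addx()`/`bintime2timespec()` steps are emulated inline under the assumption that the nanosecond field is derived from the top 32 bits of the binary fraction.

```c
#include <stdint.h>

#define NANOSECOND 1000000000LL

/*
 * Method A (old calculation). The 64-bit product scale * tcount wraps
 * modulo 2^64 before it is added to the zeroed bintime, so no carry ever
 * reaches the seconds field and only the sub-second part survives.
 */
static int64_t
pps_nsec_a(uint64_t tc_frequency, uint32_t tcount)
{
	uint64_t scale, frac;

	scale = (uint64_t)1 << 63;
	scale /= tc_frequency;
	scale *= 2;
	frac = scale * tcount;		/* emulates bintime_addx() on bt = 0 */
	/* Emulates bintime2timespec(): scale the top 32 fraction bits to ns. */
	return ((int64_t)(((uint64_t)1000000000 *
	    (uint32_t)(frac >> 32)) >> 32));
}

/* Method B (new calculation): one multiplication and one division. */
static int64_t
pps_nsec_b(uint64_t tc_frequency, uint32_t tcount)
{
	return ((int64_t)(((uint64_t)1000000000 * tcount) / tc_frequency));
}

/* The folding step quoted from hardpps() in the commit message. */
static int64_t
hardpps_fold(int64_t delta_nsec)
{
	int64_t u_nsec = delta_nsec;

	if (u_nsec > (NANOSECOND >> 1))
		u_nsec -= NANOSECOND;
	else if (u_nsec < -(NANOSECOND >> 1))
		u_nsec += NANOSECOND;
	return (u_nsec);
}
```

With the sample values from the commit message (tc_frequency 2000000000, tcount 2000000100), `pps_nsec_a` gives 49 and `pps_nsec_b` gives 1000000050; after `hardpps_fold` the accumulator increments are 49 and 50 respectively.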


19556 Commits
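
The overflow described in the link_elf commit (2ef2c26f3f) can be reproduced outside the kernel. The sketch below is an illustration under assumptions, not the committed code: `elf_hash_wide` is a hypothetical name for the traditional algorithm with `uint64_t` standing in for a 64-bit `unsigned long`, and `elf_hash_fixed` is one well-known 32-bit formulation that masks the result to 28 bits.

```c
#include <stdint.h>

/*
 * Traditional SysV hash with a 64-bit accumulator. Bits above bit 31 are
 * never cleared, so the result can exceed 0x0fffffff.
 */
static uint64_t
elf_hash_wide(const unsigned char *p)
{
	uint64_t h = 0, g;

	while (*p != '\0') {
		h = (h << 4) + *p++;
		if ((g = h & 0xf0000000) != 0)
			h ^= g >> 24;
		h &= ~g;
	}
	return (h);
}

/*
 * 32-bit formulation: the top nibble is folded into bits 4-7 and shifts
 * out on the next iteration; the final mask keeps the low 28 bits.
 */
static uint32_t
elf_hash_fixed(const unsigned char *p)
{
	uint32_t h = 0;

	while (*p != '\0') {
		h = (h << 4) + *p++;
		h ^= (h >> 24) & 0xf0;
	}
	return (h & 0x0fffffff);
}
```

For the input quoted in the commit message, `"\xff\x0f\x0f\x0f\x0f\x0f\x12"`, the wide version returns 0x100000002, while the 32-bit version returns 2, which stays within the 0x0fffffff bound.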