freebsd-dev

Author	SHA1	Message	Date
Gleb Smirnoff	4682ac697c	unix: turn check in unp_externalize() into assertion In this function we always work with mbufs that we previously created ourselves in unp_internalize(). They must be valid. Reviewed by: markj Differential revision: https://reviews.freebsd.org/D35319	2022-05-25 13:29:20 -07:00
Gleb Smirnoff	579b45e203	unix/*: check new control size in unp_internalize() Now that we call sbcreatecontrol() with M_WAITOK, we are expected to pass a valid size. Return same error code, we are returning for an oversized control from sockargs(). Reviewed by: markj Differential revision: https://reviews.freebsd.org/D35317	2022-05-25 13:29:13 -07:00
Gleb Smirnoff	d60ea9a10a	sockets: return EMSGSIZE if control part of message is too large Specification doesn't list an explicit error code for the control size specified by msg_control being too large. But it does list EMSGSIZE as error code for "message is too large to be sent all at once (as the socket requires)". It also lists EINVAL as code for the "The sum of the iov_len values overflows an ssize_t." Given how generic and uninformative EINVAL is, the EMSGSIZE is more appropriate. https://pubs.opengroup.org/onlinepubs/9699919799/functions/sendmsg.html Reviewed by: markj Differential revision: https://reviews.freebsd.org/D35316	2022-05-25 13:29:04 -07:00
Gleb Smirnoff	ad51c47fb4	sockbuf: fix assertion in sbcreatecontrol() Fixes: `6890b58814`	2022-05-25 00:19:41 -07:00
Mark Johnston	524dadf7a8	kevent: Fix an off-by-one in filt_timerexpire_l() Suppose a periodic kevent timer fires close to its deadline, so that now - kc->next is small. Then delta ends up being 1, and the next timer deadline is set to (delta + 1) * kc->to, where kc->to is the timer period. This means that the timer fires at half of the requested rate, and the value returned in kn_data is similarly inaccurate. PR: 264131 Fixes: `7cb40543e9` ("filt_timerexpire: do not iterate over the interval") Reviewed by: kib MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D35313	2022-05-24 20:14:33 -04:00
Mateusz Guzik	cdb337b097	vfs: fix copy-pasto in previous Reported by: dchagin	2022-05-20 20:58:11 +00:00
Mateusz Guzik	ec3c225711	vfs: call vn_truncate_locked from kern_truncate This fixes a bug where the syscall would not bump writecount. PR: 263999	2022-05-20 17:25:51 +00:00
Mateusz Guzik	6b715687bd	vfs: make sure truncate always calls NDFREE_* While here convert it to NDFREE_NOTHING.	2022-05-20 17:25:51 +00:00
Mark Johnston	4a3e51335e	cpuset: Fix the KASAN and KMSAN builds Rename the "copyin" and "copyout" fields of struct cpuset_copy_cb to something less generic, since sanitizers define interceptors for copyin() and copyout() using #define. Reported by: syzbot+2db5d644097fc698fb6f@syzkaller.appspotmail.com Fixes: `47a57144af` ("cpuset: Byte swap cpuset for compat32 on big endian architectures") Sponsored by: The FreeBSD Foundation	2022-05-20 10:34:25 -04:00
Dmitry Chagin	eca368ecb6	Retire sv_transtrap Call translate_traps directly from sendsig(). MFC after: 2 weeks	2022-05-20 14:54:03 +03:00
Dmitry Chagin	2479e381cd	kqueue: Trim trailing whitespace MFC after: 1 week	2022-05-19 19:52:02 +03:00
Justin Hibbits	47a57144af	cpuset: Byte swap cpuset for compat32 on big endian architectures Summary: BITSET uses long as its basic underlying type, which is dependent on the compile type, meaning on 32-bit builds the basic type is 32 bits, but on 64-bit builds it's 64 bits. On little endian architectures this doesn't matter, because the LSB is always at the low bit, so the words get effectively concatenated moving between 32-bit and 64-bit, but on big-endian architectures it throws a wrench in, as setting bit 0 in 32-bit mode is equivalent to setting bit 32 in 64-bit mode. To demonstrate: 32-bit mode: BIT_SET(foo, 0): 0x00000001 64-bit sees: 0x0000000100000000 cpuset is the only system interface that uses bitsets, so solve this by swapping the integer sub-components at the copyin/copyout points. Reviewed by: kib MFC after: 3 days Sponsored by: Juniper Networks, Inc. Differential Revision: https://reviews.freebsd.org/D35225	2022-05-19 10:49:55 -05:00
Andrew Turner	11a6ecd425	Handle cas failure when the compare succeeds When locking a priority inherit mutex we perform a compare and swap operation to try and acquire the mutex. This may fail even when the compare succeeds. Check and handle this case. PR: 263825 Reviewed by: kib, markj Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D35150	2022-05-19 11:30:21 +01:00
Gleb Smirnoff	6890b58814	sockbuf: improve sbcreatecontrol() o Constify memory pointer. Make length unsigned. o Make it never fail with M_WAITOK and assert that length is sane.	2022-05-17 10:10:42 -07:00
Gleb Smirnoff	b46667c63e	sockbuf: merge two versions of sbcreatecontrol() into one No functional change.	2022-05-17 10:10:42 -07:00
Gleb Smirnoff	eac7f0798b	unix: garbage collect unp_dispose_mbuf() for brevity	2022-05-17 10:10:41 -07:00
Gleb Smirnoff	2e5bf7c49f	unix: fix mbuf leak on close of socket with data Fixes: `1f32cef471`	2022-05-17 10:10:41 -07:00
Vladimir Kondratyev	b6f87b78b5	LinuxKPI: Implement kthread_worker related functions Kthread worker is a single thread workqueue which can be used in cases where specific kthread association is necessary, for example, when it should have RT priority or be assigned to certain cgroup. This change implements Linux v4.9 interface which mostly hides kthread internals from users thus allowing to use ordinary taskqueue(9) KPI. As kthread worker prohibits enqueueing of already pending or canceling tasks some minimal changes to taskqueue(9) were done. taskqueue_enqueue_flags() was added to taskqueue KPI which accepts extra flags parameter. It contains one or more of the following flags: TASKQUEUE_FAIL_IF_PENDING - taskqueue_enqueue_flags() fails if the task is already scheduled to execution. EEXIST is returned and the ta_pending counter value remains unchanged. TASKQUEUE_FAIL_IF_CANCELING - taskqueue_enqueue_flags() fails if the task is in the canceling state and ECANCELED is returned. Required by: drm-kmod 5.10 MFC after: 1 week Reviewed by: hselasky, Pau Amma (docs) Differential Revision: https://reviews.freebsd.org/D35051	2022-05-17 15:10:20 +03:00
Rick Macklem	373511338d	uipc_socket.c: Modify MSG_TLSAPPDATA to only do Alert Records Without this patch, the MSG_TLSAPPDATA flag would cause soreceive_generic() to return ENXIO for any non-application data record in a TLS receive stream. This works ok for TLS1.2, since Alert records appear to be the only non-application data records received. However, for TLS1.3, there can be post-handshake handshake records, such as NewSessionKey sent to the client from the server. These handshake records cannot be handled by the upcall which does an SSL_read() with length == 0. It appears that the client can simply throw away these NewSessionKey records, but to do so, it needs to receive them within the kernel. This patch modifies the semantics of MSG_TLSAPPDATA slightly, so that it only applies to Alert records and not Handshake records. It is needed to allow the krpc to work with KTLS1.3. Reviewed by: hselasky MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D35170	2022-05-14 12:56:50 -07:00
Dmitry Chagin	cb2ae61631	sysvsem: Fix a typo Per jamie@ rpr can be NULL if the jail is created with sysvsem=disable. But at least it doesn't appear to be fatal, since rpr is never dereferenced but is only compared to other prison pointers. Reviewed by: jamie Differential revision: https://reviews.freebsd.org/D35198 MFC after: 2 weeks	2022-05-14 14:07:20 +03:00
Dmitry Chagin	b6c8f461f0	sysvsem: Style(9) MFC after: 2 weeks	2022-05-14 14:06:58 +03:00
Dmitry Chagin	f0b0fdf15e	sysvsem: Trim trailing whitespace MFC after: 2 weeks	2022-05-14 14:06:40 +03:00
Mitchell Horne	db71383b88	kerneldump: remove physical from dump routines It is unused, especially now that the underlying d_dumper methods do not accept the argument. Reviewed by: markj MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D35174	2022-05-13 10:43:19 -03:00
Mitchell Horne	489ba22236	kerneldump: remove physical argument from d_dumper The physical address argument is essentially ignored by every dumper method. In addition, the dump routines don't actually pass a real address; every call to dump_append() passes a value of zero for physical. Reviewed by: markj MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D35173	2022-05-13 10:42:48 -03:00
Mitchell Horne	0f50da2e09	Drop d_dump from struct cdevsw It appears to be unused. These days struct disk has a d_dump member, which is what gets passed to the kernel dump framework. Reviewed by: markj MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D35172	2022-05-13 10:42:17 -03:00
Gleb Smirnoff	bb35a4e11d	unix: microoptimize unp_connectat() - one less lock on success This change is also a preparation for further optimization to allow locked return on success. Reviewed by: markj Differential revision: https://reviews.freebsd.org/D35182	2022-05-12 13:22:39 -07:00
Gleb Smirnoff	08f17d1432	unix: make unp_connect2() void Assert that sockets are of the same type. unp_connectat() already did this check. Add the check to uipc_connect2(). Reviewed by: markj Differential revision: https://reviews.freebsd.org/D35181	2022-05-12 13:22:39 -07:00
Gleb Smirnoff	4328318445	sockets: use socket buffer mutexes in struct socket directly Since `c67f3b8b78` the sockbuf mutexes belong to the containing socket, and socket buffers just point to it. In `74a68313b5` macros that access this mutex directly were added. Go over the core socket code and eliminate code that reaches the mutex by dereferencing the sockbuf compatibility pointer. This change requires a KPI change, as some functions were given the sockbuf pointer only without any hint if it is a receive or send buffer. This change doesn't cover the whole kernel, many protocols still use compatibility pointers internally. However, it allows operation of a protocol that doesn't use them. Reviewed by: markj Differential revision: https://reviews.freebsd.org/D35152	2022-05-12 13:22:12 -07:00
Gleb Smirnoff	01235012e5	unix/dgram: uipc_listen() is specific for SOCK_STREAM and SOCK_SEQPACKET Rely on pr_usrreqs_init() to init SOCK_DGRAM to pru_listen_notsupp().	2022-05-12 11:04:40 -07:00
Gleb Smirnoff	3c87ba3c3b	unix/dgram: pru_rcvd never called since PR_WANTRCVD not set	2022-05-12 11:04:40 -07:00
Gleb Smirnoff	2e4e5ee23f	sockets: delete stale comment from sofree() First paragraph refers to old past "we used to" and is no longer important today. Second paragraph has just a wrong statement that socket buffer is destroyed before pru_detach.	2022-05-12 11:02:50 -07:00
Gleb Smirnoff	1f32cef471	unix: don't call sbrelease() in uipc_detach() Since `a982ce0442` the socket buffer is already cleared and released in unp_dispose() that is called just before uipc_detach().	2022-05-12 11:02:50 -07:00
Dmitry Chagin	586ed32106	kdump: Decode cpuset_t. Reviewed by: jhb Differential revision: https://reviews.freebsd.org/D34982 MFC after: 2 weeks	2022-05-11 10:40:39 +03:00
Dmitry Chagin	f35093f8d6	Use Linux semantics for the thread affinity syscalls. Linux has more tolerant checks of the user supplied cpuset_t's. Minimum cpuset_t size that the Linux kernel permits in case of getaffinity() is the maximum CPU id, present in the system / NBBY, the maximum size is not limited. For setaffinity(), Linux does not limit the size of the user-provided cpuset_t, internally using only the meaningful part of the set, where the upper bound is the maximum CPU id, present in the system, no larger than the size of the kernel cpuset_t. Unlike FreeBSD, Linux ignores high bits if set in the setaffinity(), so clear it in the sched_setaffinity() and Linuxulator itself. Reviewed by: Pau Amma (man pages) In collaboration with: jhb Differential revision: https://reviews.freebsd.org/D34849 MFC after: 2 weeks	2022-05-11 10:36:01 +03:00
Gleb Smirnoff	7db54446c6	sockbufs: make sbrelease_internal() private	2022-05-09 10:43:01 -07:00
Gleb Smirnoff	a982ce0442	sockets: remove the socket-on-stack hack from sorflush() The hack can be tracked down to 4.4BSD, where copy was performed under splimp() and then after splx() dom_dispose was called. Stevens has a chapter on this function, but he doesn't answer why this trick is necessary. Why can't we call into dom_dispose under splimp()? Anyway, with multithreaded kernel the hack seems to be necessary to avoid LORs between socket buffer lock and different filesystem locks, especially network file systems. The new socket buffers KPI sbcut() from `1d2df300e9` allow us to get rid of the hack. Reviewed by: markj Differential revision: https://reviews.freebsd.org/D35125	2022-05-09 10:43:01 -07:00
Gleb Smirnoff	42f2fa9953	sockets: don't call dom_dispose() on a listening socket sorflush() already did the right thing, so only sofree() needed a fix. Turn check into assertion in our only dom_dispose method. Reviewed by: markj Differential revision: https://reviews.freebsd.org/D35124	2022-05-09 10:42:57 -07:00
Gleb Smirnoff	c17418a0ba	sockets: assert that any protocol with PR_RIGHTS has dom_dispose() Through the entire history only PF_UNIX has this feature. Reviewed by: markj Differential revision: https://reviews.freebsd.org/D35123	2022-05-09 10:42:48 -07:00
Gleb Smirnoff	24df85d29a	unix/*: unp_internalize() can sleep, so allocate mbufs with M_WAITOK	2022-05-09 10:42:48 -07:00
Gleb Smirnoff	97f8198e95	sockets: make SO_SND/SO_RCV a enum Not a functional change now. The enum will also be used for other socket buffer related KPIs.	2022-05-09 10:42:47 -07:00
Warner Losh	45ae223ac6	msgbuf: Allow microsecond granularity timestamps Today, kern.msgbuf_show_timestamp=1 will give 1 second granularity timestamps on dmesg lines. When kern.msgbuf_show_timestamp=2, we'll produce microsecond level granularity. For example: old (== 1): [13] Dual Console: Video Primary, Serial Secondary [14] lo0: link state changed to UP [15] bxe0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit [15] bxe0: link state changed to UP new (== 2): [13.807015] Dual Console: Video Primary, Serial Secondary [14.544150] lo0: link state changed to UP [15.272044] bxe0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit [15.272052] bxe0: link state changed to UP Sponsored by: Netflix	2022-05-07 09:32:22 -06:00
Alan Somers	1d2421ad8b	Correctly measure system load averages > 1024 The old fixed-point arithmetic used for calculating load averages had an overflow at 1024. So on systems with extremely high load, the observed load average would actually fall back to 0 and shoot up again, creating a kind of sawtooth graph. Fix this by using 64-bit math internally, while still reporting the load average to userspace as a 32-bit number. Sponsored by: Axcient Reviewed by: imp Differential Revision: https://reviews.freebsd.org/D35134	2022-05-06 17:25:43 -06:00
John Baldwin	2fdcc2ef6f	cpufreq: Remove unused devclass argument to DRIVER_MODULE.	2022-05-06 15:46:58 -07:00
Dmitry Chagin	f04534f5c8	sysvsem: Add a timeout argument to the semop. For future use in the Linux emulation layer for the semtimedop syscall split the sys_semop syscall into two counterparts and add struct timespec *timeout argument to the last one. Reviewed by: jhb, kib Differential revision: https://reviews.freebsd.org/D35121 MFC after: 2 weeks	2022-05-06 19:51:48 +03:00
Kristof Provost	613acc6483	mbuf: do not restore dying interfaces When we remove an interface it is first removed from the interface list V_ifnet (by if_unlink_ifnet()) and marked as IFF_DYING. We then wait for any possible references to stop being used (i.e. epoch_wait/epoch_drain_callbacks) before we tear it fully down. However, the index in ifindex_table is not removed, so m_rcvif_restore() can still find the (now dying) interface. This results in panics, for example when dummynet restores the rcvif pointer and passes a packet to ip6_input() we can panic because the AF_INET6 domain has already been removed (so we end up dereferencing a NULL pointer there). Check that the interface is not dying before we restore it, which is equivalent to checking its presence in V_ifnet, and thus ensures that future accesses (while in NET_EPOCH) are safe. Reviewed by: glebius Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D34076 (cherry picked from commit `703e533da5`)	2022-05-05 14:38:08 -04:00
Gleb Smirnoff	4d7a1361ef	ifnet/mbuf: provide KPI to serialize/restore m->m_pkthdr.rcvif Supplement ifindex table with generation count and use it to serialize & restore an ifnet pointer. Reviewed by: kp Differential revision: https://reviews.freebsd.org/D33266 Fun note: git show `e6abef0918` (cherry picked from commit `e1882428dc`)	2022-05-05 14:38:07 -04:00
Marko Zec	6c741ffbfa	Revert "mbuf: do not restore dying interfaces" This reverts commit `703e533da5`. Revert "ifnet/mbuf: provide KPI to serialize/restore m->m_pkthdr.rcvif" This reverts commit `e1882428dc`. Obtained from: github.com/glebius/FreeBSD/commits/backout-ifindex	2022-05-03 19:11:40 +02:00
Konstantin Belousov	6fe78ad434	subr_unit.c: make userspace tests buildable by defining a placeholder for UNR_NO_MTX Sponsored by: The FreeBSD Foundation MFC after: 1 week	2022-04-28 03:00:14 +03:00
Konstantin Belousov	709783373e	Fix another race between fork(2) and PROC_REAP_KILL subtree where we might not yet see a new child when signalling a process. Ensure that this cannot happen by stopping all reaping subtree, which ensures that the child is not inside a syscall, in particular fork(2). Reported and tested by: pho Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D35014	2022-04-28 02:27:35 +03:00
Konstantin Belousov	39794d80ad	Fix a race between fork(2) and PROC_REAP_KILL subtree by repeating iteration over the subtree until there are no new processes to signal. Reported and tested by: pho Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D35014	2022-04-28 02:27:35 +03:00
Konstantin Belousov	d1df347368	kern_procctl: add possibility to take stop_all_proc_block() around exec stop_all_proc_block() must be taken before proctree_lock. Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D35014	2022-04-28 02:27:35 +03:00
Konstantin Belousov	2e7595ef2f	Add stop_all_proc_block(9) It allows to have more than one consumer of thread_single(SINGLE_ALLPROC) by serializing them. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D35014	2022-04-28 02:27:35 +03:00
Konstantin Belousov	54a11adbd9	reap_kill(): split children and subtree killers into helpers Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D35014	2022-04-28 02:27:34 +03:00
Konstantin Belousov	134529b11b	reap_kill(): rename the reap variable to reaper Suggested and reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D35014	2022-04-28 02:27:34 +03:00
Konstantin Belousov	e4ce431e2a	reap_kill(): de-inline LIST_FOREACH(), twice Suggested and reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D35014	2022-04-28 02:27:34 +03:00
Konstantin Belousov	b9294a3e15	reaper_abandon_children(): upgrade proctree_lock assert to exclusive p_reapsibling linkage is protected by proctree_lock, and it is modified there. Suggested and reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D35014	2022-04-28 02:27:34 +03:00
Konstantin Belousov	e59b940dcb	unr(9): allow to avoid internal locking Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D35014	2022-04-28 02:27:34 +03:00
Konstantin Belousov	c4be460e84	init_unrhdr(): make it usable by initializing everything Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D35014	2022-04-28 02:27:34 +03:00
John Baldwin	1431239494	Add a __witness_used for variables only used under #ifdef WITNESS. __diagused is now solely used for variables only used under INVARIANTS. Reviewed by: mjg Differential Revision: https://reviews.freebsd.org/D35085	2022-04-27 11:46:16 -07:00
Dmitry Chagin	4a700f3c32	sigtimedwait: Prevent timeout math overflows. Our kern_sigtimedwait() calculates absolute sleep timo value as 'uptime+timeout'. So, when the user specifies a big timeout value (LONG_MAX), the calculated timo can be less than the current uptime value. In that case kern_sigtimedwait() returns EAGAIN instead of EINTR, if unblocked signal was caught. While here switch to a high-precision sleep method. Reviewed by: mav, kib In collaboration with: mav Differential revision: https://reviews.freebsd.org/D34981 MFC after: 2 weeks	2022-04-25 10:23:15 +03:00
Dmitry Chagin	91e7bdcdcf	Add timespecvalid_interval macro and use it. Reviewed by: jhb, imp (early rev) Differential revision: https://reviews.freebsd.org/D34848 MFC after: 2 weeks	2022-04-25 10:20:54 +03:00
John Baldwin	a4c5d490f6	KTLS: Move OCF function pointers out of ktls_session. Instead, create a switch structure private to ktls_ocf.c and store a pointer to the switch in the ocf_session. This will permit adding an additional function pointer needed for NIC TLS RX without further bloating ktls_session. Reviewed by: hselasky Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D35011	2022-04-22 15:52:12 -07:00
John Baldwin	92e40a9b92	busdma_bounce: Batch bounce page free operations when possible. Reviewed by: imp Differential Revision: https://reviews.freebsd.org/D34968	2022-04-21 12:01:55 -07:00
John Baldwin	d4ab3a8d4f	busdma_bounce: Add free_bounce_pages helper function. Deduplicate code to iterate over the bpages list in a bus_dmamap_t freeing bounce pages during bus_dmamap_unload. Reviewed by: imp Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D34967	2022-04-21 10:42:14 -07:00
John Baldwin	10fe9a1fb4	busdma_bounce: Make the map waiting list per-bounce-zone. When pages are freed to a bounce zone, only maps waiting for pages for that zone can make forward progress. If a map for a different bounce zone is at the head of the global list, then requests that could otherwise make forward progress will be stalled waiting on the other bounce zone. If bounce zones shared bounce pages then a global list would still make sense to prevent "later" requests from starving an earlier request but that is not a concern with per-zone bounce page pools. Reviewed by: imp Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D34966	2022-04-21 10:41:09 -07:00
John Baldwin	d11f5d4762	busdma_bounce: Use a simple kproc to invoke deferred requests. Rather than using a software interrupt with a single handler, just create a dedicated kernel process woken up with a simple wakeup(). Reviewed by: imp Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D34965	2022-04-21 10:40:35 -07:00
John Baldwin	c7aa0304d5	Run softclock threads at a hardware ithread priority. Add a new PI_SOFTCLOCK for use by softclock threads. Currently this maps to PI_AV which is the second-highest ithread priority. Reviewed by: mav, kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D33693	2022-04-21 10:40:01 -07:00
John Baldwin	3d7e90fc20	cpufreq_curr_sysctl: Use devclass_find to lookup cpufreq devclass. Reviewed by: imp Differential Revision: https://reviews.freebsd.org/D35002	2022-04-21 10:29:14 -07:00
Kristof Provost	a879e40ca2	callout: fix using shared rmlocks `15b1eb142c` changed the callout code to store the CALLOUT_SHAREDLOCK flag in c_iflags (where it used to be c_flags), but failed to update the check in softclock_call_cc(). This resulted in the callout code always taking the write lock, even if a read lock had been requested (with the CALLOUT_SHAREDLOCK flag in callout_init_rm()). Reviewed by: markj MFC after: 1 week Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D34959	2022-04-20 13:06:50 +02:00
John Baldwin	5bdea8826b	devclass_add_driver: Permit NULL to be passed in dcp. This permits a driver module structure that doesn't want to store a pointer to the new driver's devclass. Reviewed by: imp MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D34962	2022-04-19 10:43:50 -07:00
Mateusz Guzik	c5c981d443	signals: plug a set-but-not-used var Sponsored by: Rubicon Communications, LLC ("Netgate")	2022-04-19 12:45:57 +00:00
John Baldwin	d139909d6e	destroy_dev_sched*: Don't hold Giant for all deferred destroy_dev. Rather than using taskqueue_swi_giant which holds Giant for all deferred destroy_dev calls, create a separate queue for destroyed devices with D_NEEDGIANT set in the corresponding cdevsw. The task for this queue holds Giant while destroying deferred devices while the task for the default queue does not hold Giant. In addition, switch to taskqueue_thread for destroy_dev_sched. Deferred destroy_dev requests don't need to run at an SWI priority. Reviewed by: imp, markj MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D34915	2022-04-18 12:04:30 -07:00
Konstantin Belousov	362ff9867e	Revert rest of `a5970a529c`: use vrefact() when working on fp->f_vnode Now, since O_PATH-opened file descriptors use use references instead of the hold references, vrefact() changes from that revision can be reverted. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D34906	2022-04-15 16:56:20 +03:00
Ed Maste	f99cc5a389	sysent: regen after `52a1d90c8b`, posix_fadvise in capmode	2022-04-14 15:17:36 -04:00
Ed Maste	52a1d90c8b	Allow posix_fadvise in capability mode posix_fadvise operates only on a provided fd. Noted by Mathieu <sigsys@gmail.com> in review D34761. No new CAP_ rights are added for posix_fadvise(), as 'advice' in general only influences when I/O happens; the fd must have existing CAP_ rights for actual data access. Reviewed by: markj MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D34903	2022-04-14 15:11:21 -04:00
Konstantin Belousov	bf13db086b	Mostly revert `a5970a529c`: Make files opened with O_PATH to not block non-forced unmount Problem is that open(O_PATH) on nullfs -o nocache is broken then, because there is no reference on the vnode after the open syscall exits. Reported and tested by: ambrisko Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week	2022-04-14 02:47:04 +03:00
John Baldwin	36fb372264	kern: Move variables only used for MAC under #ifdef MAC.	2022-04-13 16:08:23 -07:00
John Baldwin	4aec198420	sched_ule: Inline value of ts in sched_thread_priority. This avoids a set but unused warning in kernels without SMP where TDQ_CPU() doesn't use its argument.	2022-04-13 16:08:23 -07:00
John Baldwin	8758ac757f	sched_4bsd: ts is only used in sched_bind for SMP.	2022-04-13 16:08:22 -07:00
John Baldwin	72ff256c51	sched_4bsd: Remove unused variables.	2022-04-12 14:58:59 -07:00
John Baldwin	dbd51c416a	realloc(9): Move slab and zone under #ifndef DEBUG_REDZONE.	2022-04-12 14:58:59 -07:00
Mark Johnston	d769609620	tty: Remove an incorrect assertion from ttyinq_line_iterate() We may legitimately have tib == NULL if we're at the very end of the queue. PR: 215373 Reported by: pho MFC after: 1 week Sponsored by: The FreeBSD Foundation	2022-04-12 17:30:04 -04:00
Tom Jones	1ea833a572	kdb: set kdb_why when entered via reboot and panic Reviewed by: jhb Sponsored by: NetApp, Inc. Sponsored by: Klara, Inc. X-NetApp-PR: #74 Differential Revision: https://reviews.freebsd.org/D34551	2022-04-12 10:34:40 +01:00
Dmitry Chagin	c6487446d7	getdirentries: return ENOENT for unlinked but still open directory. To be more compatible to IEEE Std 1003.1-2008 (“POSIX.1”). Reviewed by: mjg, Pau Amma (doc) Differential revision: https://reviews.freebsd.org/D34680 MFC after: 2 weeks	2022-04-11 23:30:16 +03:00
Konstantin Belousov	eca39864f7	Add sysctl KERN_LOCKF reporting the snapshot of the active advisory locks. A new VFS ops method vfs_report_lockf is provided in the mount point op table. If it is NULL, as it is currently for all existing filesystems, vfs_report_lockf() function is used, which gathers information from the standard implementation inside kern/kern_lockf.c. Filesystems implementing their own locking (NFSv4 as example) can provide a custom implementation. Reviewed by: markj, rmacklem Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D34756	2022-04-10 00:43:53 +03:00
Konstantin Belousov	147e4fe3f1	kern_lockf.c: remove no longer needed UFS headers Reviewed by: markj, rmacklem Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D34756	2022-04-10 00:43:53 +03:00
Konstantin Belousov	59e85819be	lockf: remove lf_inode from struct lockf_entry The UFS-specific struct inode cannot be used in generic advisory lock code. It was probably used as a shortcut for the debugging, as the remnants of the code around it indicates. Use somewhat more verbose and less concentrated, but universal, VOP_PRINT(), where needed. Reviewed by: markj, rmacklem Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D34756	2022-04-10 00:43:53 +03:00
Gordon Bergling	f171938cd6	jail: Remove a double word in a source code comment - s/a a/a/ MFC after: 3 days	2022-04-09 14:19:17 +02:00
Gordon Bergling	c3721292e3	kern: Remove a double word in a source code comment - s/for for/for/ MFC after: 3 days	2022-04-09 10:50:04 +02:00
Gordon Bergling	768f9b8b8b	kern: Fix a typo in a source code comment - s/is is/is/ MFC after: 3 days	2022-04-09 09:14:14 +02:00
Andrew Turner	41e6d2091c	Enable subr_physmem_test on supported architectures Only build where it's supported. While here add support for amd64 to help with testing. Sponsored by: The FreeBSD Foundation	2022-04-07 14:31:51 +01:00
Andrew Turner	d8bff5b67c	Handle non-page aligned/sized memory in physmem In some configurations the firmware may pass memory regions that are not page sized or aligned, e.g. when using 16k pages on arm64. If this is the case we will calculate many small regions because the alignment is applied before being inserted. As we round the start up and end down this will leave a 1 page hole between what should have been a single region. Fix by keeping the original alignment until we are just about to insert the region into the avail array. Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D34694	2022-04-06 14:13:29 +01:00
Andrew Turner	8c99dfed54	Port subr_physmem to userspace and add tests These give us some confidence we haven't broken anything in early boot code that may be running before the console. Reviewed by: emaste Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D34691	2022-04-06 14:13:05 +01:00
Mitchell Horne	eb9d205fa6	livedump: add event handler hooks Add three hooks to the livedump process: before, after, and for each block of dumped data. This allows, for example, quiescing the system before the dump begins or protecting data of interest to ensure its consistency in the final output. Reviewed by: markj, kib (previous version) Reviewed by: debdrup (manpages) Reviewed by: Pau Amma <pauamma@gundo.com> (manpages) MFC after: 3 weeks Sponsored by: Juniper Networks, Inc. Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D34067	2022-04-05 15:35:05 -03:00
Mitchell Horne	c9114f9f86	Add new vnode dumper to support live minidumps This dumper can instantiate and write the dump's contents to a file-backed vnode. Unlike existing disk or network dumpers, the vnode dumper should not be invoked during a system panic, and therefore is not added to the global dumper_configs list. Instead, the vnode dumper is constructed ad-hoc when a live dump is requested using the new ioctl on /dev/mem. This is similar in spirit to a kgdb session against the live system via /dev/mem. As described briefly in the mem(4) man page, live dumps are not guaranteed to result in a usable output file, but offer some debugging value where forcefully panicking a system to dump its memory is not desirable/feasible. A future change to savecore(8) will add an option to save a live dump. Reviewed by: markj, Pau Amma <pauamma@gundo.com> (manpages) Discussed with: kib MFC after: 3 weeks Sponsored by: Juniper Networks, Inc. Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D33813	2022-04-05 15:35:05 -03:00
Mitchell Horne	59c27ea18c	Split out dumper allocation from list insertion Add a new function, dumper_create(), to allocate a dumper. dumper_insert() will call this function and retains the existing behaviour. This is desirable for performing live dumps of the system. Here, there is a need to allocate and configure a dumper structure that is invoked outside of the typical debugger context. Therefore, it should be excluded from the list of panic-time dumpers. free_single_dumper() is made public and renamed to dumper_destroy(). Reviewed by: kib, markj MFC after: 1 week Sponsored by: Juniper Networks, Inc. Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D34068	2022-04-05 15:35:05 -03:00
Mateusz Guzik	b7262756e2	vfs: fixup WANTIOCTLCAPS on open In some cases vn_open_cred overwrites cn_flags, effectively nullifying initialisation done in NDINIT. This will have to be fixed. In the meantime make sure the flag is passed. Reported by: jenkins Noted by: Mathieu <sigsys@gmail.com>	2022-04-02 20:49:01 +02:00
Gordon Bergling	c9b04ee4f8	kern: Fix two typos in source code comments - s/accomodate/accommodate/ MFC after: 3 days	2022-04-02 14:52:49 +02:00
Gordon Bergling	7181887e82	kern: Fix two typos in source code comments - s/measurment/measurement/ MFC after: 3 days	2022-04-02 14:15:27 +02:00
Mateusz Guzik	0c805718cb	vfs: fix memory leak on lookup with fds with ioctl caps Reviewed by: markj PR: 262515 Noted by: firk@cantconnect.ru Differential Revision: https://reviews.freebsd.org/D34667	2022-04-02 12:09:07 +00:00
Gordon Bergling	669d5ea4e3	kern: Fix a typo in a source code comment - s/paniced/panicked/ MFC after: 3 days	2022-04-02 10:15:02 +02:00
Ed Maste	e5821a2156	syscalls.master: remove obsolete comment about compatibility tables Compatibility ABIs no longer use a separate syscalls.master. Fixes: `be67ea40c5` ("freebsd32: generate from ...") Sponsored by: The FreeBSD Foundation	2022-03-30 11:07:00 -04:00
Brooks Davis	8601fca789	sysent: regen for syscallarg_t	2022-03-28 19:43:03 +01:00
Brooks Davis	b1ad6a9000	syscallarg_t: Add a type for system call arguments This more clearly differentiates system call arguments from integer registers and return values. On current architectures it has no effect, but on architectures where pointers are not integers (CHERI) and may not even share registers (CHERI-MIPS) it is necessary to differentiate between system call arguments (syscallarg_t) and integer register values (register_t). Obtained from: CheriBSD Reviewed by: imp, kib Differential Revision: https://reviews.freebsd.org/D33780	2022-03-28 19:43:03 +01:00
Andrew Turner	f461b95561	Fix a sign mismatch warning in the physmem code Make sure both sides of a comparison are unsigned. As the values being compared are size_t make the value in the for loop size_t too. Sponsored by: The FreeBSD Foundation	2022-03-28 11:51:09 +01:00
Mateusz Guzik	2533b5dc82	vfs: add missing bits to vdropl_impl This completes the patch which was originally meant to go in. Spotted by: mhorne Fixes: `c35ec1efdc` ("vfs: [1/2] fix stalls in vnode reclaim by not requeieing from vnlru")	2022-03-27 14:35:37 +00:00
Mateusz Guzik	a4032e2a69	vfs: assorted tidy ups to lookup No functional changes.	2022-03-26 17:06:09 +00:00
Alexander Leidinger	aeb91e95cf	Log euid, rgid and jail on listen queue overflow If you have numerous jails with multiple similar services running, this helps to narrow down which services this log is referring to.	2022-03-26 11:17:55 +01:00
Eric van Gyzen	aca2a7faca	stack_zero is not needed before stack_save The man page was recently clarified to commit to this contract. MFC after: 1 week Sponsored by: Dell EMC Isilon	2022-03-25 20:10:38 -05:00
Eric van Gyzen	863070bbf6	ksiginfo_alloc: pass M_WAITOK or M_NOWAIT to uma_zalloc It expects exactly one of those flags. A future commit will assert this. Reviewed by: rstone MFC after: 1 month Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D34451	2022-03-25 20:10:37 -05:00
Mateusz Guzik	0f60088399	vfs: set cn_namelen when handling degenerate lookups Turns out execve looks at it to store binary name, but in order to trigger the problem one has to be trying to exec '/'. As is the value would be left uninitialized (or rather set to -1 on debug kernels). Fixes: `56244d3574` ("vfs: hoist degenerate path lookups out of the loop")	2022-03-25 18:19:36 +00:00
Mateusz Guzik	4ef6e56ae8	vfs: hoist trailing slash handling out of the loop	2022-03-24 14:36:31 +00:00
Mateusz Guzik	3b6792d28a	vfs: factor symlink traversal out of namei The intent down the road is to eliminate the loop to begin with, pushing traversal down to vfs_lookup, all while not allocating the extra buffer.	2022-03-24 13:11:22 +00:00
Mateusz Guzik	d9ea7e2b1e	vfs: factor FAILIFEXISTS handling out of vfs_lookup	2022-03-24 11:22:20 +00:00
Mateusz Guzik	56244d3574	vfs: hoist degenerate path lookups out of the loop	2022-03-24 11:22:12 +00:00
Mateusz Guzik	bb92cd7bcd	vfs: NDFREE(&nd, NDF_ONLY_PNBUF) -> NDFREE_PNBUF(&nd)	2022-03-24 10:20:51 +00:00
Mark Johnston	1babcad6bc	elf: Avoid dumping uninitialized bytes in PRSTATUS core dump notes elf_prstatus_t contains pad space. Reported by: KMSAN MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D34606	2022-03-23 12:53:49 -04:00
Mark Johnston	7524994da0	callout: Remove the CS_EXECUTING flag It is now unused. MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D34626	2022-03-23 12:37:02 -04:00
Mark Johnston	b319171861	setitimer: Fix exit race We use the p_itcallout callout, interlocked by the proc lock, to schedule timeouts for the setitimer(2) system call. When a process exits, the callout must be stopped before the process struct is recycled. Currently we attempt to stop the callout in exit1() with the call _callout_stop_safe(&p->p_itcallout, CS_EXECUTING). If this call returns 0, then we sleep in order to drain the callout. However, this happens only if the callout is not scheduled at all. If the callout thread is blocked on the proc lock, then exit1() will not block and the callout may execute after the process has fully exited, typically resulting in a panic. I cannot see a reason to use the CS_EXECUTING flag here. Instead, use the regular callout_stop()/callout_drain() dance to halt the callout. Reported by: ler Tested by: ler, pho MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D34625	2022-03-23 12:36:12 -04:00
Alexander Motin	fd6ca665d2	Fix umtxq_sleep() regression caused by `56070dd2e4`. umtxq_requeue() moves the queue to a different hash chain and different lock, so we can't rely on msleep_sbt() reacquiring the same old lock. We have to use PDROP and update the queue chain and so lock pointer. PR: 262587 MFC after: 2 weeks	2022-03-21 19:55:55 -04:00
firk	bb53dd56c3	kern_tc.c/cputick2usec() (which is used to calculate cputime from cpu ticks) has some imprecision and, worse, huge timestep (about 20 minutes on 4GHz CPU) near 53.4 days of elapsed time. kern_time.c/cputick2timespec() (it is used for clock_gettime() for querying process or thread consumed cpu time) uses cputick2usec() and then needlessly converts usec to nsec, obviously losing precision even with fixed cputick2usec(). kern_time.c/kern_clock_getres() uses some weird (anyway wrong) formula for getting cputick resolution. PR: 262215 Reviewed by: gnn Differential Revision: https://reviews.freebsd.org/D34558	2022-03-21 09:33:46 -04:00
Andrew Turner	cab496e16c	Make SHMMAXPGS an unsigned long This is used to calculate sizes that are then stored in unsigned long fields. Make this unsigned long so the calculations use this type and not an int that can lead to an integer overflow with a large PAGE_SIZE. This allows building this on arm64 with PAGE_SIZE of 16k. Further work will be needed if a 32-bit architecture tries to use a similar sized page. Sponsored by: The FreeBSD Foundation	2022-03-21 10:27:35 +00:00
Colin Percival	2406867f5b	tslog: Add CTLFLAG_SKIP to sysctls The timestamp logs are quite large (often much larger than all the other sysctls combined) so it's unlikely anyone will want to have them displayed by `sysctl -a`. MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D34616	2022-03-20 11:31:16 -07:00
Mateusz Guzik	6ff3e8a316	cache: add a comment about a realpath bug	2022-03-19 15:11:25 +00:00
Mateusz Guzik	eb574ba0b6	vfs: replace VFS_NOTIFY_UPPER_* macros with an enum	2022-03-19 13:15:55 +00:00
Mateusz Guzik	cceb91b025	vfs: add missing flags to db show mount	2022-03-19 12:04:44 +00:00
Mateusz Guzik	93a0ba8f49	vfs: retire the no longer used MNTK_LOOKUP_EXCL_DOTDOT flag Reviewed by: markj Tested by: pho (previous version) Differential Revision: https://reviews.freebsd.org/D34466	2022-03-19 10:47:29 +00:00
Mateusz Guzik	1cb0045c97	vfs: add MNTK_UNLOCKED_INSMNTQUE Can be used when the fs at hand can synchronize insmntque with other means than the vnode lock. Reviewed by: markj Tested by: pho (previous version) Differential Revision: https://reviews.freebsd.org/D34466	2022-03-19 10:46:40 +00:00
firk	28d08dc7d0	clock_gettime: Fix CLOCK_THREAD_CPUTIME_ID race Use a spinlock section instead of a critical section to synchronize with statclock(). Otherwise the CLOCK_THREAD_CPUTIME_ID clock can appear to go backwards. PR: 262273 Reviewed by: markj MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D34568	2022-03-17 15:39:00 -04:00
Mark Johnston	fc7e121d88	file: Move FILEDESC_FOREACH macros to kern_descrip.c They are only used in kern_descrip.c, so make them private. No functional change intended. Discussed with: mjg Sponsored by: The FreeBSD Foundation	2022-03-17 15:39:00 -04:00
Mark Johnston	c702242292	file: Avoid a read-after-free of fd tables in sysctl handlers Some loops access the fd table of a different process, and drop the filedesc lock while iterating, so they check the table's refcount. However, we access the table before the first iteration, in order to get the number of table entries, and this access can be a use-after-free. Fix the problem by checking the refcount before we start iterating. Reported by: pho Reviewed by: mjg MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D34575	2022-03-17 15:39:00 -04:00
Mateusz Guzik	0134bbe56f	vfs: prefix lookup and relookup with vfs_ Reviewed by: imp, mckusick Differential Revision: https://reviews.freebsd.org/D34530	2022-03-13 14:44:39 +00:00
Mateusz Guzik	02fc4e319c	cache: use flexible array member ... instead of 0-sizing the array	2022-03-13 14:43:35 +00:00
John Baldwin	6b71405bfe	Store core dump notes for all valid register sets for FreeBSD processes. In particular, use a generic wrapper around struct regset rather than requiring per-regset helpers. This helper replaces the MI __elfN(note_prstatus) and __elfN(note_fpregset) helpers. It also removes the need to explicitly dump NT_ARM_ADDR_MASK in the arm64 __elfN(dump_thread). Reviewed by: markj, emaste Sponsored by: University of Cambridge, Google, Inc. Differential Revision: https://reviews.freebsd.org/D34446	2022-03-10 15:40:19 -08:00
Kornel Duleba	b344de4d0d	Extend device_get_property API In order to support various types of data stored in device tree properties or ACPI _DSD packages, create a new enum so the caller can specify the expected type of a property they want to read, according to the binding. The bus logic will use that information to process the underlying data. For example in DT all integer properties are stored in BE format. In order to get constant results across different platforms we need to convert its endianness to match the host. Another example are ACPI_TYPE_INTEGER properties stored as uint64_t. Before this patch the ACPI logic would refuse to read them if the provided buffer was smaller than 8 bytes. Now this can be handled by using DEVICE_PROP_UINT32 type. Modify the existing consumers of this API to reflect the changes and update the man pages accordingly. Reviewed by: mw Obtained from: Semihalf MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D33457	2022-03-10 12:11:32 +01:00
Kornel Duleba	206dc82bc3	bus_if: Add a default implementation of get_property There are multiple buses that pretend to be ofw compatible, e.g ofw_pci, mii_fdt. We now need to provide an implementation of BUS_GET_PROPERTY for every one of them. Instead of modifying them one by one it's better to just provide a default implementation that simply traverses up the device tree. Remove the now unneeded BUS_GET_PROPERTY implementation in mii_fdt. Reviewed by: andrew, bz Obtained from: Semihalf MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D34031	2022-03-10 12:11:32 +01:00
Mateusz Guzik	3a4c5dab92	vfs: [2/2] fix stalls in vnode reclaim by only counting attempts ... and ignoring if they succeeded, which matches historical behavior. Reported by: pho	2022-03-10 09:41:50 +00:00
Mateusz Guzik	c35ec1efdc	vfs: [1/2] fix stalls in vnode reclaim by not requeieing from vnlru Reported by: pho	2022-03-10 09:41:50 +00:00
Ed Maste	080b4e8a0c	kcov: use __func__ in KASSERT instead of old function name MFC after: 1 week Sponsored by: The FreeBSD Foundation	2022-03-07 10:47:27 -05:00
Mark Johnston	afb44cb010	rmlock: Temporarily revert commit `c84bb8cd77` It appears to have introduced a regression on arm64, possibly due to the fact that the pcpu pointer is reloaded outside of the critical section in _rm_rlock(). Until this is resolved one way or another, let's revert. Reported by: Ronald Klop <ronald-lists@klop.ws> Sponsored by: The FreeBSD Foundation	2022-03-07 10:43:19 -05:00
Mark Johnston	8dbae4ce32	linker: Permit CTFv3 containers Reviewed by: Domagoj Stolfa MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D34362	2022-03-07 10:43:19 -05:00
Mark Johnston	cab9382a2c	linker: Simplify CTF container handling Use sys/ctf.h to provide various definitions required to parse the CTF header. No functional change intended. Reviewed by: Domagoj Stolfa, emaste MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D34359	2022-03-07 10:43:18 -05:00
Konstantin Belousov	1fb00c8f10	buf_alloc(): Stop using LK_NOWAIT, use LK_NOWITNESS Despite the buffer taken from cache or free list, it still can be locked, due to 'lockless lookup' in getblkx() potentially operating on the freed buffers. The lock is transient, but prevents the use of LK_NOWAIT there for the goal of neutralizing WITNESS. Just use LK_NOWITNESS. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 3 days	2022-03-06 10:29:31 -05:00
Alexander Motin	56070dd2e4	Improve timeout precision of pthread_cond_timedwait(). This code was not touched when all other user-space sleep functions were switched to sbintime_t and decoupled from hardclock. When it is possible, convert supplied times into sbinuptime to supply directly to msleep_sbt() with C_ABSOLUTE. This provides the timeout resolution of few microseconds instead of 2 milliseconds, plus avoids few clock reads and conversions. Reviewed by: vangyzen MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D34163	2022-03-03 22:03:09 -05:00
John Baldwin	0b25cbc79d	Fix the size returned for NT_FPREGSET. Sponsored by: University of Cambridge, Google, Inc.	2022-03-03 17:53:06 -08:00
John Baldwin	9af41803cb	Use vnsz2log directly in assertion on its relation to sizeof(struct vnode). This reduces the size of diffs required to support different values of vnsz2log. In CheriBSD, kernels for CHERI architectures have vnodes larger than 512 bytes and require a value of 9. Reviewed by: mjg Obtained from: CheriBSD Sponsored by: University of Cambridge, Google, Inc. Differential Revision: https://reviews.freebsd.org/D34418	2022-03-03 17:52:07 -08:00
Mateusz Guzik	afb08a6d07	cache: hide hash stats behind DEBUG_CACHE They take a long time to dump and hinder sysctl -a when used with DIAGNOSTIC.	2022-03-03 17:21:58 +00:00
Mateusz Guzik	f3f3e3c44d	fd: add close_range(..., CLOSE_RANGE_CLOEXEC) For compatibility with Linux. MFC after: 3 days Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D34424	2022-03-03 17:21:58 +00:00
Mark Johnston	879b0604a8	proc: Remove assertion that P_WEXIT is not set in proc_rwmem() exit1() sets P_WEXIT before waiting for holding threads to finish, rather than after, so this assertion is racy. Fixes: `12fb39ec3e` ("proc: Relax proc_rwmem()'s assertion on the process hold count") Reported by: Jenkins	2022-03-01 15:09:45 -05:00
Mark Johnston	12fb39ec3e	proc: Relax proc_rwmem()'s assertion on the process hold count This reference ensures that the process and its associated vmspace will not be destroyed while proc_rwmem() is executing. If, however, the calling thread belongs to the target process, then it is unnecessary to hold the process. In particular, fasttrap - a module which enables userspace dtrace - may frequently call proc_rwmem(), and we'd prefer to avoid the overhead of locking and bumping the hold count when possible. Thus, make the assertion conditional on "p != curproc". Also assert that the process is not already exiting. No functional change intended. MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2022-03-01 12:40:35 -05:00
Warner Losh	b36bd3a906	bus: Create dev_wired_cache A simple cache to cache different locators to the same device. Sponsored by: Netflix Changes Suggested by: jhb Differential Revision: https://reviews.freebsd.org/D32783	2022-03-01 08:06:41 -07:00
Warner Losh	cae7d9ec83	bus: Add ACPI locator support Add support for printing ACPI paths. This is a bit of a degenerate case for this interface since it's always just the device handle if the device has one. But it is illustrative of how to do this for a few nodes in the tree. Sponsored by: Netflix Reviewed by: jhb Differential Revision: https://reviews.freebsd.org/D32748	2022-03-01 08:06:41 -07:00
Warner Losh	38e942a345	devctl: Add DEV_GET_PATH DEV_GET_PATH will get the path to a device based on different locators. Sponsored by: Netflix Reviewed by: jhb Differential Revision: https://reviews.freebsd.org/D32745	2022-03-01 08:06:41 -07:00
Warner Losh	e19db70769	bus: Introduce the bus interface get_device_path This returns the full path of the child device requested. Since there are different ways to reckon the entire path, include a 'locator' method. The default 'FreeBSD' method uses a filesystem-like path name with each device to the root node separated by /. Other locators will be UEFI, ACPI and fdt, though others are possible in the future. Make the locator a string to allow maximum flexibility. Sponsored by: Netflix Reviewed by: jhb Differential Revision: https://reviews.freebsd.org/D32744	2022-03-01 08:06:40 -07:00
Warner Losh	78408171bd	devctl2: Change to 644 protections We make sure that we check for device privs (usually meaning root or better) for everything. To allow other functions that don't require this, default to 644 protection. Sponsored by: Netflix Reviewed by: jhb Differential Revision: https://reviews.freebsd.org/D32863	2022-03-01 08:06:40 -07:00
Mark Johnston	89ae8eb74e	rmlock: Add required compiler barriers to _rm_runlock() Also remove excessive whitespace in _rm_rlock(). Reviewed by: jah, mjg MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D34381	2022-03-01 09:38:45 -05:00
Warner Losh	9891cb1e76	Eliminate curlen, it's set but never used Sponsored by: Netflix	2022-02-27 09:02:45 -07:00
Mark Johnston	c84bb8cd77	rmlock: Micro-optimize read locking Use get_pcpu() instead of an open-coded pcpu_find(td->td_oncpu). This eliminates some memory accesses and results in a shorter instruction sequence. Note that get_pcpu() didn't exist when rmlocks were added. Reviewed by: jah, mjg MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D34377	2022-02-25 13:55:24 -05:00
Marvin Ma	1517b8d5a7	vfs_unregister: fix error handling Due to misplaced braces, an error from vfs_uninit() in the VFCF_SBDRY case was ignored. Reported by: Anton Rang <rang@acm.org> Reviewed by: Anton Rang <rang@acm.org>, markj MFC after: 1 week Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D34375	2022-02-25 12:19:14 -06:00
Warner Losh	2bfdc1ee9b	cons: Use bool for boolean variables MFC After: 3 days Sponsored by: Netflix	2022-02-24 10:58:07 -07:00
Jamie Gritton	d7c4ea7d72	posixshm: Allow jails to use kern.ipc.posix_shm_list PR: 257554 Reported by: grembo@	2022-02-24 09:30:49 -08:00
Gleb Smirnoff	d2b3a0ed31	sendto: don't clear transient errors for atomic protocols The changeset `65572cade3` uncovered the fact that top layer of sendto(2) would clear a transient error code if some data was copied out of uio. The clearing of the error makes sense for non-atomic protocols, since they have sent some data. The atomic protocols send all or nothing. The current implementation of unix/dgram uses sosend_generic(), which would always copyout and only then it may fail to deliver a message. The sosend_dgram(), currently used by UDP only, also has same behavior. Reported by: pho Reviewed by: pho, markj Differential revision: https://reviews.freebsd.org/D34309	2022-02-23 10:24:14 -08:00
Andrew Turner	d58b79d1ce	Build all KCSAN atomic interceptors on arm64 These atomic functions are now supported. Add them to KCSAN. Sponsored by: The FreeBSD Foundation	2022-02-23 14:45:47 +00:00
Mateusz Guzik	f17ef28674	fd: rename fget_locked to fget_noref This gets rid of the error prone naming where fget_unlocked returns with a ref held, while fget_locked requires a lock but provides nothing in terms of making sure the file lives past unlock. No functional changes.	2022-02-22 18:53:43 +00:00
Robert Wing	0a2f498234	tty: fix a panic with INVARIANTS watch'ing a tty triggers a refcount wraparound panic, take a reference on fp after fget_cap_locked() to fix. Reported by: Michael Jung <mikej_at_paymentallianceintl.com> Reviewed by: hselasky, mjg Fixes: `f40dd6c803` ("tty: switch ttyhook_register to use fget_cap_locked") Differential Revision: https://reviews.freebsd.org/D34335	2022-02-22 09:37:13 -09:00
Mitchell Horne	5a8fceb3bd	boottrace: trace annotations for startup and shutdown Add trace events for execution of SYSINITs (both static and dynamically loaded), and to the various steps in the shutdown/panic/reboot paths. Sponsored by: NetApp, Inc. Sponsored by: Klara, Inc. X-NetApp-PR: #23 Differential Revision: https://reviews.freebsd.org/D30187	2022-02-21 20:15:57 -04:00
Mitchell Horne	0aa9ffcd9c	init_main.c: sort includes This is preferred by style(9). Do this ahead of adding another include. Reviewed by: imp, kevans, allanjude MFC after: 3 days Sponsored by: NetApp, Inc. Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D30186	2022-02-21 20:15:51 -04:00
Mitchell Horne	877eea429b	kern_linker.c: sort includes This is preferred by style(9). Do this ahead of adding another include. Reviewed by: imp, kevans MFC after: 3 days Sponsored by: NetApp, Inc. Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D30185	2022-02-21 20:15:51 -04:00
Mitchell Horne	da5b7e90e7	boottrace: a simple boot and shutdown-time tracing facility Boottrace is a facility for capturing trace events during boot and shutdown. This includes kernel initialization, as well as rc. It has been used by NetApp internally for several years, for catching and diagnosing slow devices or subsystems. It is driven from userspace by sysctl interface, and the output is a human-readable log of events (kern.boottrace.log). This commit adds the core boottrace functionality implementing these interfaces. Adding the trace annotations themselves to kernel and userland will happen in follow-up commits. A future commit will also add a boottrace(4) man page. For now, boottrace is unconditionally compiled into the kernel but disabled by default. It can be enabled by setting the kern.boottrace.enabled tunable to 1 in loader.conf(5). There is an existing boot-time event tracing facility, which can be compiled into the kernel with 'options TSLOG'. While there is some functional overlap between this and boottrace, they are distinct. TSLOG is suitable for generating detailed timing information and flamegraphs, and has been used to great success by cperciva@ to diagnose and reduce the overall system boot time. Boottrace aims to more quickly provide an overview of timing and resource usage of the boot (and shutdown) process to a sysadmin who requires this knowledge. Sponsored by: NetApp, Inc. Sponsored by: Klara, Inc. X-NetApp-PR: #23 Differential Revision: https://reviews.freebsd.org/D30184	2022-02-21 20:15:45 -04:00
Mateusz Guzik	e68a5225e8	fd: add fde_copy To dedup handrolled memcpy. This will be used later to make fd code atomic-clean.	2022-02-15 17:51:08 +00:00
Mateusz Guzik	ec12b4f4ff	fd: add missing seqc to dupfdopen	2022-02-15 17:51:08 +00:00
Mateusz Guzik	c9a995994b	seqc: rename seqc_consistent_nomb to seqc_consistent_no_fence For more consistency with other primitives.	2022-02-15 17:51:07 +00:00
John Baldwin	becaf6433b	Use vmspace->vm_stacktop in place of sv_usrstack in more places. Reviewed by: markj Obtained from: CheriBSD Differential Revision: https://reviews.freebsd.org/D34174	2022-02-14 10:57:30 -08:00
Gleb Smirnoff	65572cade3	unix/dgram: return EAGAIN instead of ENOBUFS when O_NONBLOCK set This is behavior what some programs expect and what Linux does. For example nginx expects EAGAIN when sending messages to /var/run/log, which it connects to with O_NONBLOCK. Particularly with nginx the problem is magnified by the fact that a ENOBUFS on send(2) is also logged, so situation creates a log-bomb - a failed log message triggers another log message. Reviewed by: markj Differential revision: https://reviews.freebsd.org/D34187	2022-02-14 09:21:55 -08:00
Mark Johnston	852ff943b9	sleepqueue: Annotate sleepq_max_depth as static MFC after: 1 week Sponsored by: The FreeBSD Foundation	2022-02-14 10:06:47 -05:00
Mark Johnston	893be9d8ac	sleepqueue: Address a lock order reversal After commit `74cf7cae4d` ("softclock: Use dedicated ithreads for running callouts."), there is a lock order reversal between the per-CPU callout lock and the scheduler lock. softclock_thread() locks callout lock then the scheduler lock, when preparing to switch off-CPU, and sleepq_remove_thread() stops the timed sleep callout while potentially holding a scheduler lock. In the latter case, it's the thread itself that's locked, and if the thread is sleeping then its lock will be a sleepqueue lock, but if it's still in the process of going to sleep it'll be a scheduler lock. We could perhaps change softclock_thread() to try to acquire locks in the opposite order, but that'd require dropping and re-acquiring the callout lock, which seems expensive for an operation that will happen quite frequently. We can instead perhaps avoid stopping the td_slpcallout callout if the thread is still going to sleep, which is what this patch does. This will result in a spurious call to sleepq_timeout(), but some counters suggest that this is very rare. PR: 261198 Fixes: `74cf7cae4d` ("softclock: Use dedicated ithreads for running callouts.") Reported and tested by: thj Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D34204	2022-02-14 10:06:47 -05:00
Mateusz Guzik	6aa246e605	vfs: convert vnsz2log to a macro	2022-02-13 13:07:08 +00:00
Mateusz Guzik	5c31025060	fd: use FILEDESC_FOREACH_{FDE,FP} where appropriate	2022-02-13 13:07:08 +00:00
Mateusz Guzik	809f3121be	fd: assign fd_freefile early when copying This is to simplify an upcoming change.	2022-02-13 13:07:08 +00:00
Mateusz Guzik	893d20c95a	fd: move fd table sizing out of fdinit now it is placed with the rest of actual initialisation	2022-02-13 13:07:08 +00:00
Mateusz Guzik	4103c3cd5b	fd: drop volatile keyword from refcounts While here move a comment where it belongs and do small whitespace clean up.	2022-02-13 13:07:08 +00:00
Mateusz Guzik	b53133a778	proc: load/store p_cowgen using atomic primitives	2022-02-13 13:07:08 +00:00
Mateusz Guzik	29ee49f66b	thread: remove dead store from thread_cow_update	2022-02-13 13:07:08 +00:00
John Baldwin	cd0525f615	ktls: Write-lock the INP when changing a transmit TLS session. The TCP rate pacing code relies on being able to read this pointer safely while holding an INP lock. The initial TLS session pointer is set while holding the write lock already. Reviewed by: gallatin, hselasky Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D34086	2022-02-11 15:16:25 -08:00
Mateusz Guzik	1d65a9b47e	cache: improve vnode vs name assertion in cache_enter_time	2022-02-11 12:29:26 +00:00
Mateusz Guzik	611470a515	cache: remove NOCACHE handling from cache_fplookup_noentry It was copy-pasted from locked lookup. As LOOKUP operation cannot have the flag set it was always ending up setting MAKEENTRY.	2022-02-11 12:29:26 +00:00
Mateusz Guzik	513c7a6e0c	fd: make fget_unlocked take a thread argument Just like other fget routines. This enables embedding fd table pointer in struct thread, avoiding taking a trip through proc.	2022-02-11 12:29:26 +00:00
Mateusz Guzik	45bb8beacc	fd: elide one acquire fence in fget_unlocked_seq Still validate we got the stable state before returning an error though.	2022-02-11 12:29:26 +00:00
Mateusz Guzik	62849eef5b	fd: split fget_unlocked_seq depending on CAPABILITIES This will simplify an upcoming change.	2022-02-11 12:27:22 +00:00
Mateusz Guzik	b937908e41	fd: split fget_cap depending on CAPABILITIES This will simplify an upcoming change.	2022-02-11 12:13:27 +00:00
Mateusz Guzik	f40dd6c803	tty: switch ttyhook_register to use fget_cap_locked It is still wrong-ish as fget* funcs don't expect to operate on arbitrary file descriptor tables, but this at least moves it out of the way of an upcoming change while being bug-compatible.	2022-02-11 12:13:27 +00:00
Mateusz Guzik	93288e2445	Employ thread_cow_synced in setrlimit In order to avoid proc lock/unlock on next kernel entry.	2022-02-11 11:44:07 +00:00
Mateusz Guzik	32114b639f	Add PROC_COW_CHANGECOUNT and thread_cow_synced Combined they can be used to avoid a proc lock/unlock cycle in the syscall handler for curthread, see upcoming examples.	2022-02-11 11:44:07 +00:00
Mateusz Guzik	8a0cb04df4	Add lim_cowsync, similar to crcowsync	2022-02-11 11:44:07 +00:00
Justin Hibbits	6db44b0158	Fix gzip compressed core dumps on big endian architectures The gzip trailer words (size and CRC) are both little-endian per the spec. MFC after: 3 days Sponsored by: Juniper Networks, Inc.	2022-02-10 09:34:37 -06:00
Dimitry Andric	7d8a4eb943	tty_info: Avoid warning by using logical instead of bitwise operators Since TD_IS_RUNNING() and TD_ON_RUNQ() are defined as logical expressions involving '==', clang 14 warns about them being checked with a bitwise operator instead of a logical one: ``` sys/kern/tty_info.c:124:9: error: use of bitwise '|' with boolean operands [-Werror,-Wbitwise-instead-of-logical] runa = TD_IS_RUNNING(td) | TD_ON_RUNQ(td); ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ || sys/sys/proc.h:562:27: note: expanded from macro 'TD_IS_RUNNING' ^ sys/kern/tty_info.c:124:9: note: cast one or both operands to int to silence this warning sys/sys/proc.h:562:27: note: expanded from macro 'TD_IS_RUNNING' ^ sys/kern/tty_info.c:129:9: error: use of bitwise '|' with boolean operands [-Werror,-Wbitwise-instead-of-logical] runb = TD_IS_RUNNING(td2) | TD_ON_RUNQ(td2); ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ || sys/sys/proc.h:562:27: note: expanded from macro 'TD_IS_RUNNING' ^ sys/kern/tty_info.c:129:9: note: cast one or both operands to int to silence this warning sys/sys/proc.h:562:27: note: expanded from macro 'TD_IS_RUNNING' ^ ``` Fix this by using logical operators instead. No functional change intended. Reviewed by: cem, emaste, kevans, markj MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D34186	2022-02-08 21:21:04 +01:00
Mark Johnston	5de79eeddb	ktls: Disallow transmitting empty frames outside of TLS 1.0/CBC mode There was nothing preventing one from sending an empty fragment on an arbitrary KTLS TX-enabled socket, but ktls_frame() asserts that this could not happen. Though the transmit path handles this case for TLS 1.0 with AES-CBC, we should be strict and allow empty fragments only in modes where it is explicitly allowed. Modify sosend_generic() to reject writes to a KTLS-enabled socket if the number of data bytes is zero, so that userspace cannot trigger the aforementioned assertion. Add regression tests to exercise this case. Reported by: syzkaller Reviewed by: gallatin, jhb MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D34195	2022-02-08 12:40:41 -05:00
Mark Johnston	300cfb96fc	file: Make fget() and getvnode() consistent about initializing fpp Most fget() functions initialize the output parameter to NULL. Make the externally visible interface behave consistently, and make fget_unlocked_seq() private to kern_descrip.c. This fixes at least one bug in a consumer, _filemon_wrapper_openat(), which assumes that getvnode() sets the output file pointer to NULL upon an error. Reported by: syzbot+01c0459408f896a5933a@syzkaller.appspotmail.com Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D34190	2022-02-08 12:40:41 -05:00
Sebastian Huber	3ec0dc367b	kern_ntptime.c: Remove ntp_init() The ntp_init() function did set a couple of global objects to zero. These objects are in the .bss section and already initialized to zero during kernel or module loading.	2022-02-07 14:16:16 -07:00
John Baldwin	949e395966	Trim duplicate code for copying in iovecs for PT_[GS]ETREGSET. Reviewed by: andrew, emaste Differential Revision: https://reviews.freebsd.org/D34177	2022-02-07 11:49:29 -08:00
Gordon Bergling	5a78ec9e7c	kern_fflock: Fix a typo in a source code comment - s/foward/forward/ MFC after: 3 days	2022-02-06 17:29:43 +01:00
Gordon Bergling	a9bee9c77a	kern_racct: Fix a typo in a source code comment - s/maxumum/maximum/ MFC after: 3 days	2022-02-06 17:28:27 +01:00
Gleb Smirnoff	c999e3481d	dmesg: detect wrapped msgbuf on the kernel side and if so, skip first line Since `59f256ec35` dmesg(8) will always skip first line of the message buffer, cause it might be incomplete. The problem is that in most cases it is complete, valid and contains the "---<<BOOT>>---" marker. This skip can be disabled with '-a', but that would also unhide all non-kernel messages. Move this functionality from dmesg(8) to kernel, since kernel actually knows if wrap has happened or not. The main motivation for the change is not actually the value of the "---<<BOOT>>---" marker. The problem breaks unit tests, that clear message buffer, perform a test and then check the message buffer for a result. Example of such test is sys/kern/sonewconn_overflow.	2022-02-05 13:35:31 -08:00
Kyle Evans	642701abc8	kern: harvest entropy from callouts `74cf7cae4d` ("softclock: Use dedicated ithreads for running callouts.") switched callouts away from the swi infrastructure. It turns out that this was a major source of entropy in early boot, which we've now lost. As a result, first boot on hardware without a 'fast' entropy source would block waiting for fortuna to be seeded with little hope of progressing without manual intervention. Let's resolve it by explicitly harvesting entropy in callout_process() if we've handled any callouts. cc/curthread/now seem to be reasonable sources of entropy, so use those. Discussed with: jhb (also proposed initial patch) Reported by: many Reviewed by: cem, markm (both csprng) Differential Revision: https://reviews.freebsd.org/D34150	2022-02-03 10:05:06 -06:00
Konstantin Belousov	c02780b78c	Add GB_NOWITNESS flag It prevents WITNESS from recording the lock order for the buffer lock acquired by getblkx(). Reviewed by: mckusick Discussed with: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D34073	2022-02-01 06:54:50 +02:00
John Baldwin	d958bc7963	ktls: Try to enable TOE TLS after marking existing data not ready. At the moment this is mostly a no-op but in the future there will be in-flight encrypted data which requires software decryption. This same setup is also needed for NIC TLS RX. Note that this does break TOE TLS RX for AES-CBC ciphers since there is no software fallback for AES-CBC receive. This will be resolved one way or another before 14.0 is released. Reviewed by: hselasky Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D34082	2022-01-31 16:39:21 -08:00
Konstantin Belousov	66c5fbca77	insmntque1(): remove useless arguments Also remove once-used functions to clean up after failed insmntque1(), which were destructor callbacks in previous life. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D34071	2022-01-31 16:49:08 +02:00
Konstantin Belousov	3d68c4e175	syncer VOP_FSYNC(): unlock syncer vnode around call to VFS_SYNC() The lock is unnecessary since the mount point is busied, which prevents unmount and syncer vnode deallocation. Having the vnode locked causes innocent LoRs and complicates debugging. Also stop starting write accounting around it. Any caller of VOP_FSYNC() must do it already, and sync_vnode() does. Reported and tested by: pho Reviewed by: markj, mckusick Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D34072	2022-01-31 04:46:21 +02:00
Konstantin Belousov	5875b94c74	buf_alloc(): lock the buffer with LK_NOWAIT The buffer must not be accessed by any other thread, it is freshly allocated. As such, LK_NOWAIT should be nop but also it prevents recording the order between the buffer lock and any other locks we might own in the call to getnewbuf(). In particular, if we own FFS snap lock, it should avoid triggering false positive warning. Reviewed by: markj, mckusick Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D34072	2022-01-31 04:46:21 +02:00
Konstantin Belousov	531f8cfea0	Use dedicated lock name for pbufs Also remove a pointer to array variable, use array address directly. Reviewed by: markj, mckusick Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D34072	2022-01-31 04:46:14 +02:00
Kristof Provost	703e533da5	mbuf: do not restore dying interfaces When we remove an interface it is first removed from the interface list V_ifnet (by if_unlink_ifnet()) and marked as IFF_DYING. We then wait for any possible references to stop being used (i.e. epoch_wait/epoch_drain_callbacks) before we tear it fully down. However, the index in ifindex_table is not removed, so m_rcvif_restore() can still find the (now dying) interface. This results in panics, for example when dummynet restores the rcvif pointer and passes a packet to ip6_input() we can panic because the AF_INET6 domain has already been removed (so we end up dereferencing a NULL pointer there). Check that the interface is not dying before we restore it, which is equivalent to checking its presence in V_ifnet, and thus ensures that future accesses (while in NET_EPOCH) are safe. Reviewed by: glebius Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D34076	2022-01-28 23:09:08 +01:00
Mateusz Guzik	2a7e4cf843	Revert `b58ca5df0b` ("vfs: remove the now unused insmntque1") I was somehow convinced that insmntque calls insmntque1 with a NULL destructor. Unfortunately this worked well enough to not immediately blow up in simple testing. Keep not using the destructor in previously patched filesystems though as it avoids unnecessary casts. Noted by: kib Reported by: pho	2022-01-27 16:32:22 +00:00
Andrew Turner	548a2ec49b	Add PT_GETREGSET This adds the PT_GETREGSET and PT_SETREGSET ptrace types. These can be used to access all the registers from a specified core dump note type. The NT_PRSTATUS and NT_FPREGSET notes are initially supported. Other machine-dependant types are expected to be added in the future. The ptrace addr points to a struct iovec pointing at memory to hold the registers along with its length. On success the length in the iovec is updated to tell userspace the actual length the kernel wrote or, if the base address is NULL, the length the kernel would have written. Because the data field is an int the arguments are backwards when compared to the Linux PTRACE_GETREGSET call. Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D19831	2022-01-27 11:40:34 +00:00
Gleb Smirnoff	e1882428dc	ifnet/mbuf: provide KPI to serialize/restore m->m_pkthdr.rcvif Supplement ifindex table with generation count and use it to serialize & restore an ifnet pointer. Reviewed by: kp Differential revision: https://reviews.freebsd.org/D33266 Fun note: git show `e6abef0918`	2022-01-26 21:58:50 -08:00
Mateusz Guzik	b58ca5df0b	vfs: remove the now unused insmntque1 Bump __FreeBSD_version to 1400052.	2022-01-27 01:00:24 +01:00
Kyle Evans	773fa8cd13	execve: disallow argc == 0 The manpage has contained the following verbiage on the matter for just under 31 years: "At least one argument must be present in the array" Previous to this version, it had been prefaced with the weakening phrase "By convention." Carry through and document it the rest of the way. Allowing argc == 0 has been a source of security issues in the past, and it's hard to imagine a valid use-case for allowing it. Toss back EINVAL if we ended up not copying in any args for *execve(). The manpage change can be considered "Obtained from: OpenBSD" Reviewed by: emaste, kib, markj (all previous version) Differential Revision: https://reviews.freebsd.org/D34045	2022-01-26 13:40:27 -06:00
Hans Petter Selasky	9e2cce7e6a	Implement a function to get the next TCP- and TLS- receive sequence number. This function will be used by coming TLS hardware receive offload support. Differential Revision: https://reviews.freebsd.org/D32356 Discussed with: jhb@ MFC after: 1 week Sponsored by: NVIDIA Networking	2022-01-26 12:55:00 +01:00
Hans Petter Selasky	17cbcf33c3	mbuf(9): Assert receive mbufs don't carry a send tag. Else we would start leaking reference counts. Discussed with: jhb@ MFC after: 1 week Sponsored by: NVIDIA Networking	2022-01-26 12:55:00 +01:00
John Baldwin	308fc7e5b1	user_getpeername: Use 'bool' for the compat argument. This matches user_getsockname. Reviewed by: brooks, kib Sponsored by: The University of Cambridge, Google Inc. Differential Revision: https://reviews.freebsd.org/D33987	2022-01-24 09:51:35 -08:00
Konstantin Belousov	fe6db72708	Add security.bsd.allow_ptrace sysctl that disables any access to ptrace(2) for all processes. Reviewed by: emaste Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D33986	2022-01-22 19:36:56 +02:00
Konstantin Belousov	55a0aa2162	p_candebug(), p_cansee(): always allow for curproc Privilege checks in both functions should allow the current process to infer information about itself, as well as use the interfaces that are proclaimed 'debugging', for instance, procctl(2). Note that in p_cansee() case, explicit comparison of curproc and p avoids a race where the process might change credentials and cause thread to compare its cached stale credentials against updated process creds, effectively disallowing the process to observe itself. Reviewed by: emaste Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D33986	2022-01-22 19:36:56 +02:00
Mark Johnston	6be8944d96	ktls: Zero out TLS_GET_RECORD control messages Otherwise we end up copying one uninitialized byte into the socket buffer. Reported by: KMSAN Reviewed by: jhb MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D33953	2022-01-20 15:42:46 -05:00
Mark Johnston	c3196306f0	clockcalib: Fix an overflow bug tc_counter_mask is an unsigned int and in the TSC timecounter is equal to UINT_MAX, so the addition tc->tc_counter_mask + 1 can overflow to 0, resulting in a hang during boot. Fixes: `c2705ceaeb` ("x86: Speed up clock calibration") Reviewed by: cperciva Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D33956	2022-01-20 08:23:38 -05:00
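The wraparound behind `c3196306f0` can be shown in isolation. This is a sketch under stated assumptions, not the committed fix: the function names are hypothetical, `unsigned int` is assumed to be narrower than 64 bits, and widening before the addition is just one way to avoid the overflow.

```c
#include <limits.h>
#include <stdint.h>

/*
 * The addition is performed in unsigned int, so a mask of UINT_MAX wraps
 * to 0 before the result is widened to 64 bits.
 */
static uint64_t
counter_period_narrow(unsigned int mask)
{
	return (mask + 1);
}

/*
 * Widening first makes the addition happen in 64 bits, so the period is
 * 2^N for an N-bit unsigned int (assuming N < 64).
 */
static uint64_t
counter_period_wide(unsigned int mask)
{
	return ((uint64_t)mask + 1);
}
```

A period of 0 is the kind of value that can keep a loop waiting on counter progress from ever terminating, which matches the boot hang mentioned in the commit message.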
Alexander Motin	b7ff445ffa	Reduce bufdaemon/bufspacedaemon shutdown time. Before this change bufdaemon and bufspacedaemon threads used kthread_shutdown() to stop activity on system shutdown. The problem is that kthread_shutdown() has no idea about the wait channel and lock used by specific thread to wake them up reliably. As result, up to 9 threads could consume up to 9 seconds to shutdown for no good reason. This change introduces specific shutdown functions, knowing how to properly wake up specific threads, reducing wait for those threads on shutdown/reboot from average 4 seconds to effectively zero. MFC after: 2 weeks Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D33936	2022-01-18 19:26:16 -05:00
Mark Johnston	3ce04aca49	proc: Add a sysctl to fetch virtual address space layout info This provides information about fixed regions of the target process' user memory map. Reviewed by: kib MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D33708	2022-01-17 16:12:43 -05:00
Mark Johnston	1811c1e957	exec: Reimplement stack address randomization The approach taken by the stack gap implementation was to insert a random gap between the top of the fixed stack mapping and the true top of the main process stack. This approach was chosen so as to avoid randomizing the previously fixed address of certain process metadata stored at the top of the stack, but had some shortcomings. In particular, mlockall(2) calls would wire the gap, bloating the process' memory usage, and RLIMIT_STACK included the size of the gap so small (< several MB) limits could not be used. There is little value in storing each process' ps_strings at a fixed location, as only very old programs hard-code this address; consumers were converted decades ago to use a sysctl-based interface for this purpose. Thus, this change re-implements stack address randomization by simply breaking the convention of storing ps_strings at a fixed location, and randomizing the location of the entire stack mapping. This implementation is simpler and avoids the problems mentioned above, while being unlikely to break compatibility anywhere the default ASLR settings are used. The kern.elfN.aslr.stack_gap sysctl is renamed to kern.elfN.aslr.stack, and is re-enabled by default. PR: 260303 Reviewed by: kib Discussed with: emaste, mw MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D33704	2022-01-17 16:12:36 -05:00
Mark Johnston	758d98debe	exec: Remove the stack gap implementation ASLR stack randomization will reappear in a forthcoming commit. Rather than inserting a random gap into the stack mapping, the entire stack mapping itself will be randomized in the same way that other mappings are when ASLR is enabled. No functional change intended, as the stack gap implementation is currently disabled by default. Reviewed by: kib MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D33704	2022-01-17 16:11:54 -05:00
Mark Johnston	706f4a81a8	exec: Introduce the PROC_PS_STRINGS() macro Rather than fetching the ps_strings address directly from a process' sysentvec, use this macro. With stack address randomization the ps_strings address is no longer fixed. Reviewed by: kib MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D33704	2022-01-17 16:11:54 -05:00
Mark Johnston	5a8413e779	setrlimit: Remove special handling for RLIMIT_STACK with a stack gap This will not be required with a forthcoming reimplementation of ASLR stack randomization. Moreover, this change was not sufficient to enable the use of a stack size limit smaller than the stack gap itself. PR: 260303 Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D33704	2022-01-17 11:42:13 -05:00
Mark Johnston	3fc21fdd5f	sysent: Add a sv_psstringssz field to struct sysentvec The size of the ps_strings structure varies between ABIs, so this is useful for computing the address of the ps_strings structure relative to the top of the stack when stack address randomization is enabled. Reviewed by: kib MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D33704	2022-01-17 11:42:07 -05:00
Mark Johnston	1544f5add8	Revert "kern_exec: Add kern.stacktop sysctl." The current ASLR stack gap feature will be removed, and with that the need for the kern.stacktop sysctl is gone. All consumers have been removed. This reverts commit `a97d697122`. Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D33704	2022-01-17 11:41:58 -05:00
Mark Johnston	dc7526170d	posixshm: Report output buffer truncation from kern.ipc.posix_shm_list PR: 240573 Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D33912	2022-01-17 08:35:19 -05:00
Alexander Motin	e76c010899	Fix inverse sleep logic in buf_daemon(). Before commit `3cec5c77d6` buf_daemon() went to longer 1s sleep if numdirtybuffers <= lodirtybuffers. After that commit new condition !BIT_EMPTY(BUF_DOMAINS, &bdlodirty) got opposite -- true when one or more domains is above lodirtybuffers. As result, on freshly booted system with no dirty buffers buf_daemon() wakes up 10 times per second and probably only 1 time per second when there is actual work to do. MFC after: 1 week Reviewed by: kib, markj Tested by: pho Differential revision: https://reviews.freebsd.org/D33890	2022-01-15 19:32:36 -05:00
Simon J. Gerraty	bacb140f31	Ignore calcru: runtime went backwards for vm_guest VM's have little control over CPU speed, don't make matters worse by constantly spamming console. Reviewed by: jhb Differential Revision: https://reviews.freebsd.org/D33902	2022-01-14 16:07:43 -08:00
Brooks Davis	0910a41ef3	Revert "syscallarg_t: Add a type for system call arguments" Missed issues in truss on at least armv7 and powerpcspe need to be resolved before recommit. This reverts commit `3889fb8af0`. This reverts commit `1544e0f5d1`.	2022-01-12 23:29:20 +00:00
Brooks Davis	3889fb8af0	sysent: regen for syscallarg_t	2022-01-12 22:51:25 +00:00
Brooks Davis	1544e0f5d1	syscallarg_t: Add a type for system call arguments This more clearly differentiates system call arguments from integer registers and return values. On current architectures it has no effect, but on architectures where pointers are not integers (CHERI) and may not even share registers (CHERI-MIPS) it is necessary to differentiate between system call arguments (syscallarg_t) and integer register values (register_t). Obtained from: CheriBSD Reviewed by: imp, kib Differential Revision: https://reviews.freebsd.org/D33780	2022-01-12 22:51:25 +00:00
Colin Percival	c2705ceaeb	x86: Speed up clock calibration Prior to this commit, the TSC and local APIC frequencies were calibrated at boot time by measuring the clocks before and after a one-second sleep. This was simple and effective, but had the disadvantage of requiring a one-second sleep. Rather than making two clock measurements (before and after sleeping) we now perform many measurements; and rather than simply subtracting the starting count from the ending count, we calculate a best-fit regression between the target clock and the reference clock (for which the current best available timecounter is used). While we do this, we keep track of an estimate of the uncertainty in the regression slope (aka. the ratio of clock speeds), and stop measuring when we believe the uncertainty is less than 1 PPM. In order to avoid the risk of aliasing resulting from the data-gathering loop synchronizing with (a multiple of) the frequency of the reference clock, we add some additional spinning depending upon the iteration number. For numerical stability and simplicity of implementation, we make use of floating-point arithmetic for the statistical calculations. On the author's Dell laptop, this reduces the time spent in calibration from 2000 ms to 29 ms; on an EC2 c5.xlarge instance, it is reduced from 2000 ms to 2.5 ms. Reviewed by: bde (previous version), kib MFC after: 1 month Sponsored by: https://www.patreon.com/cperciva Differential Revision: https://reviews.freebsd.org/D33802	2022-01-12 12:34:07 -08:00
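The "best-fit regression" idea in `c2705ceaeb` can be sketched as an ordinary least-squares slope over paired clock readings. This is an illustrative assumption about the general technique, not the kernel's code: `regression_slope` is a hypothetical name, and the sketch omits the running uncertainty estimate, the 1 PPM stopping rule, and the anti-aliasing spin that the commit message describes.

```c
#include <stddef.h>

/*
 * Least-squares slope of y against x. With x holding reference-clock
 * readings and y holding target-clock readings, the slope estimates the
 * ratio of the two clock frequencies. Returns 0.0 if the slope is
 * undefined (fewer than two samples, or no spread in x).
 */
static double
regression_slope(const double *x, const double *y, size_t n)
{
	double mean_x = 0.0, mean_y = 0.0, sxx = 0.0, sxy = 0.0;
	size_t i;

	if (n < 2)
		return (0.0);
	for (i = 0; i < n; i++) {
		mean_x += x[i];
		mean_y += y[i];
	}
	mean_x /= (double)n;
	mean_y /= (double)n;
	/* Accumulate centered sums to limit cancellation error. */
	for (i = 0; i < n; i++) {
		double dx = x[i] - mean_x;

		sxx += dx * dx;
		sxy += dx * (y[i] - mean_y);
	}
	return (sxx > 0.0 ? sxy / sxx : 0.0);
}
```

Fitting a line through many samples, rather than subtracting one pair of endpoint readings, lets measurement noise average out, which is why the commit can stop sampling long before a full second has elapsed.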
Konstantin Belousov	a24afbb4e6	Ignore debugger-injected signals left after detaching PR: 261010 Reported by: Martin Simmons <martin@lispworks.com> Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D33787	2022-01-12 07:33:30 +02:00
Alexander Motin	cb1f5d1136	Reduce minimum idle hardclock rate from 2Hz to 1Hz. On idle 80-thread system it allows to improve package-level idle state residency and so power consumption by several percent. MFC after: 2 weeks	2022-01-09 19:25:56 -05:00
Konstantin Belousov	4a4b059a97	Add vfs_remount_ro() a helper to remount filesystem from rw to ro. Tested by: pho Reviewed by: markj, mckusick Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D33721	2022-01-08 05:41:44 +02:00
John Baldwin	7def1e10b3	bus_dma: Deduplicate locking helper functions. - Move busdma_lock_mutex to subr_bus_dma.c. - Move _busdma_lock_dflt to subr_bus_dma.c. This function was named a couple of different things previously. It is not a public API but an internal helper used in place of a NULL pointer. The prototype is in <sys/bus_dma.h> as not all backends include <sys/bus_dma_internal.h>. Reviewed by: kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D33694	2022-01-05 13:50:40 -08:00
John Baldwin	85b4607324	Deduplicate bus_dma bounce code. Move mostly duplicated code in various MD bus_dma backends to support bounce pages into sys/kern/subr_busdma_bounce.c. This file is currently #include'd into the backends rather than compiled standalone since it requires access to internal members of opaque bus_dma structures such as bus_dmamap_t and bus_dma_tag_t. Reviewed by: kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D33684	2022-01-05 13:50:40 -08:00
John Baldwin	753c851387	sigev_findtd: Fix whitespace nit in argument list. Obtained from: CheriBSD	2022-01-04 13:37:39 -08:00
Gleb Smirnoff	644ca0846d	domains: make domain_init() initialize only global state Now that each module handles its global and VNET initialization itself, there is no VNET related stuff left to do in domain_init(). Differential revision: https://reviews.freebsd.org/D33541	2022-01-03 10:15:22 -08:00
Gleb Smirnoff	24e1c6ae7d	domains: init with standard SYSINIT(9) or VNET_SYSINIT() There left only three modules that used dom_init(). And netipsec was the last one to use dom_destroy(). Differential revision: https://reviews.freebsd.org/D33540	2022-01-03 10:15:22 -08:00
Gleb Smirnoff	340c7343f4	protocols: don't execute protosw_init() for every VNET The function now modifies pr_usrreqs only, which are always global. Rename it to pr_usrreqs_init(). Differential revision: https://reviews.freebsd.org/D33538	2022-01-03 10:15:21 -08:00
Gleb Smirnoff	89128ff3e4	protocols: init with standard SYSINIT(9) or VNET_SYSINIT The historical BSD network stack loop that rolls over domains and over protocols has no advantages over more modern SYSINIT(9). While doing the sweep, split global and per-VNET initializers. Getting rid of pr_init allows to achieve several things: o Get rid of ifdef's that protect against double foo_init() when both INET and INET6 are compiled in. o Isolate initializers statically to the module they init. o Makes code easier to understand and maintain. Reviewed by: melifaro Differential revision: https://reviews.freebsd.org/D33537	2022-01-03 10:15:21 -08:00
Jessica Clarke	a3e828c91d	intrng: Use less confusing return value for intr_pic_add_handler Currently intr_pic_add_handler either returns the PIC you gave it (which is useless and risks causing confusion about whether it's creating another PIC) or, on error, NULL. Instead, convert it to return an int error code as one would expect. Note that the only consumer of this API, arm64's gicv3_its, does not use the return value, so no uses need updating to work with the revised API. Reviewed by: markj, mmel MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D33341	2022-01-03 17:08:44 +00:00
Stefan Eßer	ec3af9d0ca	sys/kern/sched_4bsd.c: fix typo introduced in previous commit	2022-01-01 15:33:38 +01:00


19325 Commits