Commit Graph

419 Commits

Mitchell Horne
1029dab634 mi_switch(): clean up switch types and their usage
Overall, this is a non-functional change, except for kernels built with
SCHED_STATS. However, the switch types are useful for communicating the
intent of the caller.

1. Ensure that every caller provides a type. In most cases, we upgrade
   the basic yield to sched_relinquish() aka SWT_RELINQUISH.
2. The case of sched_bind() is distinct, so add a new switch type SWT_BIND.
3. Remove the two unused types, SWT_PREEMPT and SWT_SLEEPQTIMO.
4. Remove SWT_NONE altogether and assert that callers always provide
   a type flag.
5. Reference the mi_switch(9) man page in the comments, as these flags
   will be documented there.
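
For illustration, a formerly bare yield now looks something like this
(a sketch; exact call sites vary):

    /* Voluntarily yield the CPU, stating why. */
    thread_lock(td);
    mi_switch(SW_VOL | SWT_RELINQUISH);   /* returns with td unlocked */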

Reviewed by:	kib, markj
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D38184
2023-02-09 12:01:32 -04:00
Konstantin Belousov
c6d31b8306 AST: rework
Make most AST handlers dynamically registered.  This allows
subsystem-specific handler code to live in the subsystem's own files,
instead of making subr_trap.c aware of it.  For instance, the signal
delivery code on return to userspace is now moved to kern_sig.c.

It also allows some handlers to be designated as the cleanup (kclear)
type, which are called both at AST and on thread/process exit.  For
instance, ast(), exit1(), and the NFS server no longer need to be aware
of UFS softdep processing.

The dynamic registration also allows third-party modules to register
AST handlers if needed.  There is one caveat with loadable modules: the
code makes no effort to ensure that a module is not unloaded before all
threads have passed through its AST handler.  In fact, this is already
the existing behavior for hwpmc.ko and ufs.ko.  I do not think it is
worth the effort and the runtime overhead to try to fix it.
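
As a sketch of the idea (the handler name, the TDA_MYSUBSYS constant,
and the exact ast_register() signature here are illustrative
assumptions):

    /* A subsystem-local AST handler, registered dynamically. */
    static void
    mysubsys_ast(struct thread *td, int tda)
    {
        /* Work deferred to the return-to-userspace path. */
    }

    ast_register(TDA_MYSUBSYS, ASTR_ASTF_REQUIRED, 0, mysubsys_ast);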

Reviewed by:	markj
Tested by:	emaste (arm64), pho
Discussed with:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35888
2022-08-02 21:11:09 +03:00
Mark Johnston
bd980ca847 sched_ule: Ensure we hold the thread lock when modifying td_flags
The load balancer may force a running thread to reschedule and pick a
new CPU.  To do this it sets some flags in the thread running on a
loaded CPU.  But the code assumed that a running thread's lock is the
same as that of the corresponding runqueue, and there are small windows
where this is not true.  In this case, we can end up with non-atomic
modifications to td_flags.

Since this load balancing is best-effort, simply give up if the thread's
lock doesn't match; in this case the thread is about to enter the
scheduler anyway.
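
The guard amounts to something like this (a sketch, not the committed
diff):

    /* Best-effort balancing: leave the thread alone unless it is still
     * locked by this runqueue's lock, keeping td_flags updates
     * serialized. */
    if (td->td_lock != TDQ_LOCKPTR(tdq))
        return;
    td->td_flags |= TDF_NEEDRESCHED;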

Reviewed by:	kib
Reported by:	glebius
Fixes:		e745d729be ("sched_ule(4): Improve long-term load balancer.")
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D35821
2022-07-18 15:52:27 -04:00
Mateusz Guzik
6eeba7dbd6 ule: unbreak UP builds
Sponsored by:	Rubicon Communications, LLC ("Netgate")
2022-07-16 12:45:09 +00:00
John Baldwin
954cffe95d ule: Simplistic time-sharing for interrupt threads.
If an interrupt thread runs for a full quantum without yielding the
CPU, demote its priority and schedule a preemption to give other
ithreads a turn.
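
Roughly, in sketch form (field and macro names follow sched_ule
conventions but are assumptions here):

    /* Per tick: an ithread that consumed its whole slice is demoted
     * one queue and a preemption is requested. */
    if (PRI_BASE(td->td_pri_class) == PRI_ITHD && --ts->ts_slice <= 0) {
        sched_prio(td, td->td_base_pri + RQ_PPQ);
        td->td_owepreempt = 1;
    }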

Reviewed by:	kib, markj
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D35644
2022-07-14 13:13:57 -07:00
John Baldwin
fea89a2804 Add sched_ithread_prio to set the base priority of an interrupt thread.
Use it instead of sched_prio when setting the priority of an interrupt
thread.
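
Call sites change along these lines (sketch):

    /* Before: sched_prio(td, pri);
     * After: a dedicated entry point for ithread base priority. */
    sched_ithread_prio(td, pri);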

Reviewed by:	kib, markj
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D35642
2022-07-14 13:13:10 -07:00
Mark Johnston
6cbc4ceb7a sched_ule: Use the correct atomic_load variant for tdq_lowpri
Reported by:	tuexen
Fixes:	11484ad8a2 ("sched_ule: Use explicit atomic accesses for tdq fields")
2022-07-14 15:34:02 -04:00
Mark Johnston
11484ad8a2 sched_ule: Use explicit atomic accesses for tdq fields
Different fields in the tdq have different synchronization protocols.
Some are constant, some are accessed only while holding the tdq lock,
some are modified with the lock held but accessed without the lock, some
are accessed only on the tdq's CPU, and some are not synchronized by the
lock at all.

Convert ULE to stop using volatile and instead use atomic_load_* and
atomic_store_* to provide the desired semantics for lockless accesses.
This makes the intent of the code more explicit, gives more freedom to
the compiler when accesses do not need to be qualified, and lets KCSAN
intercept unlocked accesses.

Thus:
- Introduce macros to provide unlocked accessors for certain fields.
- Use atomic_load/store for all accesses of tdq_cpu_idle, which is not
  synchronized by the mutex.
- Use atomic_load/store for accesses of the switch count, which is
  updated by sched_clock().
- Add some comments to fields of struct tdq describing how accesses are
  synchronized.

No functional change intended.
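
For instance, an unlocked accessor can be wrapped so the relaxed access
is explicit (the macro name is an assumption):

    /* tdq_cpu_idle is read without the tdq lock held. */
    #define TDQ_CPU_IDLE(tdq)   atomic_load_int(&(tdq)->tdq_cpu_idle)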

Reviewed by:	mav, kib
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D35737
2022-07-14 10:45:33 -04:00
Mark Johnston
0927ff7814 sched_ule: Enable preemption of curthread in the load balancer
The load balancer executes from statclock and periodically tries to move
threads among CPUs in order to balance load.  It may move a thread to
the current CPU (the load balancer always runs on CPU 0).  When it
does so, it may need to schedule preemption of the interrupted thread.
Use sched_setpreempt() to do so, same as sched_add().
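
In sketch form (the exact signature is an assumption):

    /* After moving td to the current CPU, request preemption the same
     * way sched_add() would. */
    sched_setpreempt(td->td_priority);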

PR:		264867
Reviewed by:	mav, kib, jhb
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D35744
2022-07-14 10:27:58 -04:00
Mark Johnston
6d3f74a14a sched_ule: Fix racy loads of pc_curthread
Thread switching used to be atomic with respect to the current CPU's
tdq lock.  Since commit 686bcb5c14 that is no longer the case.  Now
sched_switch() does this:

1.  lock tdq (might already be locked)
2.  maybe put the current thread in the tdq, choose a new thread to run
2a. update tdq_lowpri
3.  unlock tdq
4.  switch CPU context, update curthread

Some code paths in ULE will load pc_curthread from a remote CPU with
that CPU's tdq lock held, usually to inspect its priority.  But, as of
the aforementioned commit this is racy.

The problem I noticed is in tdq_notify(), which optionally sends an IPI
to a remote CPU when a new thread is added to its runqueue.  If the new
thread's priority is higher (lower) than the currently running thread's
priority, then we deliver an IPI.  But inspecting
pc_curthread->td_priority doesn't work, since pc_curthread might be
between steps 3 and 4 above.  If pc_curthread's priority is higher than
that of the newly added thread, but pc_curthread is switching to a
lower-priority thread, then tdq_notify() might fail to deliver an IPI,
leaving a high priority thread stuck on the runqueue for longer than it
should.  This can cause multi-millisecond stalls in
interactive/ithread/realtime threads.

Fix this problem by modifying tdq_add() and tdq_move() to return the
value of tdq_lowpri before the addition of the new thread.  This ensures
that tdq_notify() has the correct priority value to compare against.
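
The shape of the fix, as a sketch:

    lowpri = tdq->tdq_lowpri;      /* snapshot before the addition */
    tdq_runq_add(tdq, td, flags);  /* may update tdq_lowpri */
    tdq_notify(tdq, lowpri);       /* IPI decision uses the snapshot */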

The other two uses of pc_curthread are susceptible to the same race.  To
fix the one in sched_rem()->tdq_setlowpri() we need to have an exact
value for curthread.  Thus, introduce a new tdq_curthread field to the
tdq which gets updated any time a new thread is selected to run on the
CPU.  Because this field is synchronized by the thread lock, its
priority reflects the correct lowpri value for the tdq.

PR:		264867
Fixes:		686bcb5c14 ("schedlock 4/4")
Reviewed by:	mav, kib, jhb
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D35736
2022-07-14 10:27:51 -04:00
Mark Johnston
ba71333f60 sched_ule: Fix a typo in a comment
PR:		226107
MFC after:	1 week
2022-07-11 15:58:43 -04:00
Mark Johnston
ef80894c9d sched_ule: Purge an obsolete comment
The referenced bitmask was removed in commit 62fa74d95a.

MFC after:	 1 week
Sponsored by:	The FreeBSD Foundation
2022-07-11 15:58:43 -04:00
Mark Johnston
35dd6d6cb5 sched_ule: Eliminate a superfluous local variable in tdq_move()
No functional change intended.

MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2022-07-11 15:58:43 -04:00
John Baldwin
4aec198420 sched_ule: Inline value of ts in sched_thread_priority.
This avoids a set-but-unused warning in kernels without SMP, where
TDQ_CPU() doesn't use its argument.
2022-04-13 16:08:23 -07:00
Gordon Bergling
15b5c347f1 sched_ule(4): Fix two typos in source code comments
- s/conditons/conditions/
- s/unconditonally/unconditionally/

MFC after:	3 days
2021-11-19 19:13:28 +01:00
Kyle Evans
6a8ea6d174 sched: split sched_ap_entry() out of sched_throw()
sched_throw() can no longer take a NULL thread; APs enter through
sched_ap_entry() instead.  This completely removes branching in the
common case and cleans up both paths.  No functional change intended.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D32829
2021-11-05 15:45:51 -05:00
Kyle Evans
589aed00e3 sched: separate out schedinit_ap()
schedinit_ap() sets up an AP for a later call to sched_throw(NULL).

Currently, ULE sets up some pcpu bits and fixes the idlethread lock with
a call to sched_throw(NULL); this results in a window where curthread is
set up in platforms' init_secondary(), but it has the wrong td_lock.
Typical platform AP startup procedure looks something like:

- Setup curthread
- ... other stuff, including cpu_initclocks_ap()
- Signal smp_started
- sched_throw(NULL) to enter the scheduler

cpu_initclocks_ap() may have callouts to process (e.g., nvme) and
attempt to sched_add() for this AP, but this attempt fails because of
the violated assumption noted above, leading to locking heartburn in
sched_setpreempt().

Interrupts are still disabled until cpu_throw() so we're not really at
risk of being preempted -- just let the scheduler in on it a little
earlier as part of setting up curthread.
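
The adjusted bring-up then looks something like (a sketch of
init_secondary() on a typical platform):

    schedinit_ap();         /* curthread's td_lock is now correct */
    cpu_initclocks_ap();    /* callouts may now sched_add() safely */
    /* ... signal smp_started ... */
    sched_throw(NULL);      /* enter the scheduler; does not return */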

Reviewed by:	alfredo, kib, markj
Triage help from:	andrew, markj
Smoke-tested by:	alfredo (ppc), kevans (arm64, x86), mhorne (arm)
Differential Revision:	https://reviews.freebsd.org/D32797
2021-11-03 15:54:59 -05:00
Alexander Motin
1c119e173d sched_ule(4): Fix possible significance loss.
Before this change, a kern.sched.interact sysctl setting above 32 gave
all interactive threads the identical priority PRI_MIN_INTERACT, due to
((PRI_MAX_INTERACT - PRI_MIN_INTERACT + 1) / sched_interact) turning
zero.  Setting the sysctl lower instead reduced the range of used
priority levels by up to half, which is not great either.

Changing the order of operations fixes the issue, always using the full
range of priorities, while overflow is impossible there since both the
score and priority values are small.  While there, make the variables
unsigned, as they really are.
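
A sketch of the reordered integer math:

    /* Old order: score * ((PRI_MAX_INTERACT - PRI_MIN_INTERACT + 1) /
     *     sched_interact) -- the divide truncates to 0 whenever
     *     sched_interact exceeds the range size, collapsing all scores.
     * New order: multiply first, divide last. */
    pri = PRI_MIN_INTERACT +
        score * (PRI_MAX_INTERACT - PRI_MIN_INTERACT + 1) / sched_interact;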

MFC after:	1 month
2021-10-02 00:09:45 -04:00
Alexander Motin
08063e9f98 sched_ule(4): Fix hang with steal_thresh < 2.
e745d729be caused an infinite loop with interrupts disabled in the load
stealing code if steal_thresh was set below 2.  Such a configuration
should not generally be used, but it appears some people use it to work
around other problems.

To fix the problem, explicitly pass to sched_highest() the minimum
number of transferable threads supported by the caller, instead of
guessing.

MFC after:	25 days
2021-09-26 12:03:05 -04:00
Alexander Motin
ef50d5fbc3 x86: Add NUMA nodes into CPU topology.
Depending on hardware, NUMA nodes may match last-level caches, or they
may be above them (AMD Zen 2/3) or below (Intel Xeon w/ SNC).  This
information is provided by ACPI instead of CPUID, and it is provided
for each CPU individually rather than as mask widths, but this code
should be able to properly handle all of the above cases.

This change should immediately allow idle stealing in sched_ule(4)
to prefer load from NUMA-local CPUs over remote ones when the node
does not match the LLC.  Later we may think about how to handle it
better on the sched_pickcpu() side.

MFC after:	1 month
2021-09-23 14:31:38 -04:00
Alexander Motin
8db1669959 Fix build without SMP.
MFC after:	1 month
2021-09-21 22:13:33 -04:00
Alexander Motin
e745d729be sched_ule(4): Improve long-term load balancer.
Before this change the long-term load balancer was unable to migrate
running threads, only ones waiting on run queues.  But with the growing
number of CPU cores it is now quite typical for a system to not have
many waiting threads.  At the same time, if by some coincidence two
long-running CPU-bound threads end up sharing the same physical CPU
core, they can suffer the SMT penalty indefinitely, and the load
balancer cannot help.

Improve that by teaching the load balancer to hint running threads to
migrate, marking them with TDF_NEEDRESCHED and the new TDF_PICKCPU
flag and making sched_pickcpu() search for a better CPU later, when it
is convenient.
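
In sketch form (with the thread lock held; the exact marking site is an
assumption):

    /* Hint a running thread to pick a new CPU at its next reschedule. */
    td->td_flags |= TDF_NEEDRESCHED | TDF_PICKCPU;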

Also fix the CPU search logic when balancing, to limit round-robin
migrations in the case of almost equal load across a group of physical
cores.  The previous code bounced threads across the whole system,
which should be pretty bad for caches and NUMA affinity, while the
additional fairness was almost invisible, diminishing with the number
of cores in the group.

MFC after:	1 month
2021-09-21 18:19:20 -04:00
Alexander Motin
bd84094a51 sched_ule(4): Fix interactive threads stealing.
In scenarios where the first thread in the queue can migrate to the
specified CPU but later ones cannot, runq_steal_from() incorrectly
returned NULL.

MFC after:	2 weeks
2021-09-21 16:03:32 -04:00
Alexander Motin
ca34553b6f sched_ule(4): Pre-seed sched_random().
I don't think it changes anything, but why not.

While there, make cpu_search_highest() use all 8 lower load bits for
noise, since it does not use cs_prefer and the code is not shared
with cpu_search_lowest() any more.

MFC after:	1 month
2021-08-02 10:55:28 -04:00
Alexander Motin
8bb173fb5b sched_ule(4): Use trylock when stealing load.
On some load patterns it is possible for several CPUs to try to steal a
thread from the same CPU despite the randomization introduced.  That
can cause significant lock contention when an idle thread holding one
queue lock spins to acquire another.  Using trylock on the remote queue
both reduces the contention and makes the lock ordering easier to
handle.  If we can't get the lock inside tdq_trysteal() we just return,
allowing tdq_idled() to handle it.  If it happens in tdq_idled(), we
repeat the search for load, skipping this CPU.
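
The pattern is roughly (a sketch; lock-macro names assumed):

    /* Don't spin on the victim queue's lock from the idle path; back
     * off and let the caller search again. */
    if (mtx_trylock_spin(TDQ_LOCKPTR(steal)) == 0)
        return;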

On a 2-socket 80-thread Xeon system I am observing a dramatic reduction
of lock spinning time when doing random uncached 4KB reads from 12
ZVOLs, while IOPS increase from 327K to 403K.

MFC after:	1 month
2021-08-01 22:42:01 -04:00
Alexander Motin
2668bb2add sched_ule(4): Reduce duplicate search for load.
When sched_highest() called for some CPU group returns nothing, the
idle thread calls it for the parent CPU group.  But the parent CPU
group also includes the group we've just searched, so unless there is a
race going on, it is unlikely we will find anything new this time.

Avoid the double search when the parent group has only two subgroups
(the most prominent case).  Instead of escalating to the parent group,
run the next search over the sibling subgroup, and escalate two levels
up only if that fails too.  With more than two siblings the difference
is less significant, while searching the parent group can result in a
better decision if we find several candidate CPUs.

On a 2-socket 40-core Xeon system I am measuring a ~25% reduction of
CPU time spent inside cpu_search_highest() in both SMT (2x20x2) and
non-SMT (2x20) cases.

MFC after:	1 month
2021-08-01 22:07:51 -04:00
Dmitry Chagin
af29f39958 umtx: Split umtx.h on two counterparts.
To prevent umtx.h from being polluted by future changes, split it into
two headers:
umtx.h - the ABI header for userspace;
umtxvar.h - the kernel stuff.

While here, fix umtx_key_match style.

Reviewed by:		kib
Differential Revision:	https://reviews.freebsd.org/D31248
MFC after:		2 weeks
2021-07-29 12:41:29 +03:00
Alexander Motin
aefe0a8c32 Refactor/optimize cpu_search_*().
Remove cpu_search_both(), unused for many years.  Without it there is
little sense in the trick of compiling the common cpu_search() into
separate cpu_search_lowest() and cpu_search_highest(), so split them
completely, making the code more readable.  While there, separate the
iteration over child groups from the iteration over CPUs; sharing it
complicated the code for very little deduplication.

Stop passing cpuset_t arguments by value and avoid some manipulations.
Since the MAXCPU bump from 64 to 256, what used to be a single register
has turned into a 32-byte memory array, requiring memory allocation and
accesses.  Splitting struct cpu_search into parameter and result parts
reduces stack usage even further, since the former can be passed
through unchanged on recursion.

Remove CPU_FFS() from the hot paths by precalculating the first and
last CPU of each CPU group in advance during initialization.  Again,
this was not a problem with 64 CPUs, but with 256 FFS needs much more
code.
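
For example, the per-group bounds can be precomputed once at topology
initialization (field names and CPU_FLS() availability are assumptions):

    /* Remember the group's first and last CPU so hot paths need no
     * CPU_FFS() calls. */
    cg->cg_first = CPU_FFS(&cg->cg_mask) - 1;
    cg->cg_last = CPU_FLS(&cg->cg_mask) - 1;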

With these changes, on an 80-thread system doing ~260K uncached ZFS
reads per second, I observe a ~30% reduction of time spent in
cpu_search_*().

MFC after:	1 month
2021-07-28 22:00:29 -04:00
wiklam
43521b46fc Correcting comment about "sched_interact_score".
Reviewed by:	jrtc@, imp@
Pull Request:	https://github.com/freebsd/freebsd-src/pull/431
Sponsored by:	Netflix
2021-06-02 21:50:57 -06:00
Mateusz Guzik
b77594bbbf sched: fix an incorrect comparison in sched_lend_user_prio_cond
Compare with sched_lend_user_prio.
2020-11-15 01:54:44 +00:00
Pawel Biernacki
b05ca4290c sys/: Document a few more sysctls.
Submitted by:	Antranig Vartanian <antranigv@freebsd.am>
Reviewed by:	kaktus
Commented by:	jhb
Approved by:	kib (mentor)
Sponsored by:	illuria security
Differential Revision:	https://reviews.freebsd.org/D23759
2020-03-02 15:30:52 +00:00
Pawel Biernacki
7029da5c36 Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that
are still not MPSAFE (or already are but aren't properly marked).
Use it in preparation for a general review of all nodes.

This is a non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE.  All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT.

Approved by:	kib (mentor, blanket)
Commented by:	kib, gallatin, melifaro
Differential Revision:	https://reviews.freebsd.org/D23718
2020-02-26 14:26:36 +00:00
Mark Johnston
e489450589 Fix the !SMP case in sched_add() after r355779.
If the thread's lock is already that of the runqueue, don't recurse on
the queue lock.

Reviewed by:	jeff, kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D23492
2020-02-03 22:49:05 +00:00
Mateusz Guzik
3ff65f71cb Remove duplicated empty lines from kern/*.c
No functional changes.
2020-01-30 20:05:05 +00:00
Mark Johnston
a89c2c8c34 Revert r357050.
It seems to have introduced a couple of regressions.

Reported by:	cy, pho
2020-01-24 14:58:02 +00:00
Mark Johnston
1bfca40c57 Set td_oncpu before dropping the thread lock during a switch.
After r355784 we no longer hold a thread's thread lock when switching it
out.  Preserve the previous synchronization protocol for td_oncpu by
setting it together with td_state, before dropping the thread lock
during a switch.

Reported and tested by:	pho
Reviewed by:	kib
Discussed with:	jeff
Differential Revision:	https://reviews.freebsd.org/D23270
2020-01-23 16:24:51 +00:00
Jeff Roberson
1eb13fce84 Block the thread lock in sched_throw() and use cpu_switch() to unblock
it.  The introduction of lockless switch in r355784 created a race to
re-use the exiting thread that was only possible to hit on a hypervisor.

Reported/Tested by:	rlibby
Discussed with:	rlibby, jhb
2020-01-23 03:36:50 +00:00
Mateusz Guzik
879e0604ee Add KERNEL_PANICKED macro for use in place of direct panicstr tests
2020-01-12 06:07:54 +00:00
Jeff Roberson
d8d5f03610 Fix a bug in r355784. I missed a sched_add() call that needed to reacquire
the thread lock.

Reported by:	mjg
2019-12-19 18:22:11 +00:00
Jeff Roberson
686bcb5c14 schedlock 4/4
Don't hold the scheduler lock while doing context switches.  Instead we
unlock after selecting the new thread and switch within a spinlock
section, leaving interrupts and preemption disabled to prevent local
concurrency.  This means that mi_switch() is entered with the thread
locked but returns without it.  This dramatically simplifies scheduler
locking because we will not hold the schedlock while spinning on a
blocked lock in switch.
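
The resulting contract, as a sketch:

    thread_lock(td);
    mi_switch(SW_VOL | SWT_RELINQUISH);  /* consumes the thread lock */
    /* Resumed here later; the thread lock is no longer held. */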

This change has not been made to 4BSD, but in principle it would be
more straightforward there.

Discussed with:	markj
Reviewed by:	kib
Tested by:	pho
Differential Revision: https://reviews.freebsd.org/D22778
2019-12-15 21:26:50 +00:00
Jeff Roberson
61a74c5ccd schedlock 1/4
Eliminate recursion from most thread_lock consumers.  Return from
sched_add() without the thread_lock held.  This eliminates unnecessary
atomics and lock word loads and reduces the hold time for
scheduler locks.  This will eventually allow for lockless remote adds.

Discussed with:	kib
Reviewed by:	jhb
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D22626
2019-12-15 21:11:15 +00:00
Ryan Libby
9825eadf2c bitset: rename confusing macro NAND to ANDNOT
s/BIT_NAND/BIT_ANDNOT/, and for CPU and DOMAINSET too.  The actual
implementation is "and not" (or "but not"), i.e. A but not B.
Fortunately this does appear to be what all existing callers want.

Don't supply a NAND (not (A and B)) operation at this time.
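
Semantics in sketch form (two-argument form of that era assumed):

    /* dst = a & ~b: "A but not B". */
    CPU_COPY(&a, &dst);
    CPU_ANDNOT(&dst, &b);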

Discussed with:	jeff
Reviewed by:	cem
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D22791
2019-12-13 09:32:16 +00:00
Mark Johnston
7789ab32b3 Rename tdq_ipipending and clear it in sched_switch().
This fixes a regression after r355311.  Specifically, sched_preempt()
may trigger a context switch by calling thread_lock(), since
thread_lock() calls critical_exit() in its slow path and the interrupted
thread may have already been marked for preemption.  This would happen
before tdq_ipipending is cleared, blocking further preemption IPIs.  The
CPU can be left in this state indefinitely if the interrupted thread
migrates.

Rename tdq_ipipending to tdq_owepreempt.  Any switch satisfies a remote
preemption request, so clear tdq_owepreempt in sched_switch() instead of
sched_preempt() to avoid subtle problems of the sort described above.

Reviewed by:	jeff, kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22758
2019-12-12 02:43:24 +00:00
Jeff Roberson
c3cccf95bf Handle multiple clock interrupts simultaneously in sched_clock().
Reviewed by:	kib, markj, mav
Differential Revision:	https://reviews.freebsd.org/D22625
2019-12-08 01:17:38 +00:00
Alexander Motin
61322a0a8a Mark some more hot global variables with __read_mostly.
MFC after:	1 week
2019-12-04 21:26:03 +00:00
Jeff Roberson
e15046952d Initialize the idle thread's lock sooner so it's not evaluated on every fork
exit and we can rely on it elsewhere.

Reviewed by:	mav, kib, jhb, markj
Differential Revision:	https://reviews.freebsd.org/D22624
2019-12-02 22:35:45 +00:00
Alexander Motin
176dd236dc Microoptimize sched_pickcpu() CPU affinity on SMT.
Use of CPU_FFS() to implement CPUSET_FOREACH() saves up to ~0.5% of CPU
time on a 72-thread SMT system doing 80K IOPS to NVMe from one thread.

MFC after:	1 month
Sponsored by:	iXsystems, Inc.
2019-09-26 00:35:06 +00:00
Alexander Motin
c55dc51c37 Microoptimize sched_pickcpu() after r352658.
I noticed that I missed an intr check at one more SCHED_AFFINITY() use,
so instead of adding one more branch I prefer to remove a few.

The profiler shows the function's CPU time reduction from 0.24% to 0.16%.

MFC after:	1 month
Sponsored by:	iXsystems, Inc.
2019-09-25 19:29:09 +00:00
Alexander Motin
bb3dfc6ae9 Fix wrong assertion in r352658.
MFC after:	1 month
2019-09-25 11:58:54 +00:00
Alexander Motin
c9205e3500 Fix/improve interrupt threads scheduling.
Doing some tests with very high interrupt rates I've noticed that one
of the conditions I added in r232207 to make interrupt threads in most
cases run on the local CPU never worked as expected (it worked only if
the thread was previously executed on some other CPU, which is quite
the opposite).  It caused additional CPU usage for a full CPU search
and could schedule interrupt threads to some other CPU.

This patch removes that code and instead reuses the existing
non-interrupt code path with some tweaks for the interrupt case:
 - On SMT systems, if the current thread is idle, don't look at other
hardware threads.  Even if they are busy, it may take more time to do a
full search and bounce the interrupt thread to another core than to
execute it locally, even while sharing CPU resources.  It is the other
threads that should migrate, not the bound interrupts.
 - Try hard to keep interrupt threads within the LLC of their original
CPU.  This reduces scheduling cost and supposedly improves cache and
memory locality.

On a test system with 72 threads doing 2.2M IOPS to NVMe this saves a
few percent of CPU time while adding a few percent to IOPS.

MFC after:	1 month
Sponsored by:	iXsystems, Inc.
2019-09-24 20:01:20 +00:00