Commit Graph

198 Commits

Author SHA1 Message Date
rlibby
dbf795e374 bitset: rename confusing macro NAND to ANDNOT
s/BIT_NAND/BIT_ANDNOT/, and for CPU and DOMAINSET too.  The actual
implementation is "and not" (or "but not"), i.e. A but not B.
Fortunately this does appear to be what all existing callers want.

Don't supply a NAND (not (A and B)) operation at this time.

Discussed with:	jeff
Reviewed by:	cem
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D22791
2019-12-13 09:32:16 +00:00
jeff
389afb1898 Handle multiple clock interrupts simultaneously in sched_clock().
Reviewed by:	kib, markj, mav
Differential Revision:	https://reviews.freebsd.org/D22625
2019-12-08 01:17:38 +00:00
mav
0e1fa50f0d Mark some more hot global variables with __read_mostly.
MFC after:	1 week
2019-12-04 21:26:03 +00:00
mjg
c1523e6e58 Reduce umtx-related work on exec and exit
- there is no need to take the process lock to iterate the thread
  list after single-threading is enforced
- typically there are no mutexes to clean up (testable without taking
  the global umtx lock)
- typically there is no need to adjust the priority (testable without
  taking thread lock)

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20160
2019-05-08 16:30:38 +00:00
andrew
ae591a440e Create a new macro for static DPCPU data.
On arm64 (and possible other architectures) we are unable to use static
DPCPU data in kernel modules. This is because the compiler will generate
PC-relative accesses, however the runtime-linker expects to be able to
relocate these.

In preparation to fix this create two macros depending on if the data is
global or static.

Reviewed by:	bz, emaste, markj
Sponsored by:	ABT Systems Ltd
Differential Revision:	https://reviews.freebsd.org/D16140
2018-07-05 17:13:37 +00:00
mjg
3f10402174 Inlined sched_userret.
The tested condition is rarely true and it induces a function call
on each return to userspace.

Bumps getuid rate by about 1% on Broadwell.
2018-05-07 23:36:16 +00:00
jeff
94c7af8ca2 Implement 'domainset', a cpuset based NUMA policy mechanism. This allows
userspace to control NUMA policy administratively and programmatically.

Implement domainset based iterators in the page layer.

Remove the now legacy numa_* syscalls.

Cleanup some header polution created by having seq.h in proc.h.

Reviewed by:	markj, kib
Discussed with:	alc
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D13403
2018-01-12 22:48:23 +00:00
pfg
4736ccfd9c sys: further adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
2017-11-20 19:43:44 +00:00
emaste
1901c3e1f2 Remove register keyword from sys/ and ANSIfy prototypes
A long long time ago the register keyword told the compiler to store
the corresponding variable in a CPU register, but it is not relevant
for any compiler used in the FreeBSD world today.

ANSIfy related prototypes while here.

Reviewed by:	cem, jhb
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D10193
2017-05-17 00:34:34 +00:00
avg
83025f4a68 move thread switch tracing from mi_switch to sched_switch
This is done so that the thread state changes during the switch
are not confused with the thread state changes reported when the thread
spins on a lock.

Here is an example, three consecutive entries for the same thread (from top to
bottom):

  KTRGRAPH group:"thread", id:"zio_write_intr_3 tid 100260", state:"sleep", attributes: prio:84, wmesg:"-", lockname:"(null)"
  KTRGRAPH group:"thread", id:"zio_write_intr_3 tid 100260", state:"spinning", attributes: lockname:"sched lock 1"
  KTRGRAPH group:"thread", id:"zio_write_intr_3 tid 100260", state:"running", attributes: none

The above trace could leave an impression that the final state of
the thread was "running".
After this change the sleep state will be reported after the "spinning"
and "running" states reported for the sched lock.

Reviewed by:	jhb, markj
MFC after:	1 week
Sponsored by:	Panzura
Differential Revision: https://reviews.freebsd.org/D9961
2017-03-23 08:57:04 +00:00
avg
2758f60873 trace thread running state when a thread is run for the first time
This applies to both KTR_SCHED and DTrace sched:::on-cpu tracing.

MFC after:	10 days
2017-03-11 15:57:41 +00:00
rstone
c310497b05 Revert r313814 and r313816
Something evidently got mangled in my git tree in between testing and
review, as an old and broken version of the patch was apparently submitted
to svn.  Revert this while I work out what went wrong.

Reported by:	tuexen
Pointy hat to:	rstone
2017-02-16 21:18:31 +00:00
rstone
943126a914 Check for preemption after lowering a thread's priority
When a high-priority thread is waiting for a mutex held by a
low-priority thread, it temporarily lends its priority to the
low-priority thread to prevent priority inversion.  When the mutex
is released, the lent priority is revoked and the low-priority
thread goes back to its original priority.

When the priority of that thread is lowered (through a call to
sched_priority()), the schedule was not checking whether
there is now a high-priority thread in the run queue.  This can
cause threads with real-time priority to be starved in the run
queue while the low-priority thread finishes its quantum.

Fix this by explicitly checking whether preemption is necessary
when a thread's priority is lowered.

Sponsored by: Dell EMC Isilon
Obtained from: Sandvine Inc
Differential Revision:	https://reviews.freebsd.org/D9518
Reviewed by: Jeff Roberson (ule)
MFC after: 1 month
2017-02-16 19:41:13 +00:00
avg
96691e46d1 fix a thread preemption regression in schedulers introduced in r270423
Commit r270423 fixed a regression in sched_yield() that was introduced
in earlier changes.  Unfortunately, at the same time it introduced an
new regression.  The problem is that SWT_RELINQUISH (6), like all other
SWT_* constants and unlike SW_* flags, is not a bit flag.  So, (flags &
SWT_RELINQUISH) is true in cases where that was not really indended,
for example, with SWT_OWEPREEMPT (2) and SWT_REMOTEPREEMPT (11).

A straight forward fix would be to use (flags & SW_TYPE_MASK) ==
SWT_RELINQUISH, but my impression is that the switch types are designed
mostly for gathering statistics, not for influencing scheduling
decisions.

So, I decided that it would be better to check for SW_PREEMPT flag
instead.  That's also the same flag that was checked before r239157.
I double-checked how that flag is used and I am confident that the flag
is set only in the places where we really have the preemption:
- critical_exit + td_owepreempt
- sched_preempt in the ULE scheduler
- sched_preempt in the 4BSD scheduler

Reviewed by:	kib, mav
MFC after:	4 days
Sponsored by:	Panzura
Differential Revision: https://reviews.freebsd.org/D9230
2017-01-19 18:46:41 +00:00
jhb
31bad36604 Allow scheduling during early boot.
- Send IPI wakeups once SMP is started even if cold is true.
- Permit preemptions when cold is true.

These changes are needed for EARLY_AP_STARTUP.

MFC after:	2 weeks
Sponsored by:	Netflix
2016-11-12 00:23:09 +00:00
jhb
2fd562f38d Don't place threads on the run queue after waking up other CPUs.
The other CPU might resume and see a still-empty runq and go back to
sleep before sched_add() adds the thread to the runq.  This results
in a lost wakeup and a potential hang if the system is otherwise
completely idle.

The race originated due to a micro-optimization (my fault) in 4BSD in
that it avoided putting a thread on the run queue if the scheduler was
going to preempt to the new thread.  To avoid complexity while fixing
this race, just drop this optimization.  4BSD now always sets the
"owepreempt" flag when a preemption is warranted and defers the actual
preemption to the thread_unlock of the caller the same as ULE.

MFC after:	2 weeks
Sponsored by:	Netflix
2016-11-12 00:14:13 +00:00
emaste
00b67b15b9 Renumber license clauses in sys/kern to avoid skipping #3 2016-09-15 13:16:20 +00:00
jhb
040924b411 Don't treat NOCPU as a valid CPU to CPU_ISSET.
If a thread is created bound to a cpuset it might already be bound before
it's very first timeslice, and td_lastcpu will be NOCPU in that case.

MFC after:	1 week
2016-07-29 20:19:14 +00:00
kib
ef5f88c357 Get rid of struct proc p_sched and struct thread td_sched pointers.
p_sched is unused.

The struct td_sched is always co-allocated with the struct thread,
except for the thread0.  Avoid useless indirection, instead calculate
td_sched location using simple pointer arithmetic in td_get_sched(9).
For thread0, which is statically allocated, create a structure to
emulate layout of the dynamic allocation.

Reviewed by:	jhb (previous version)
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D6711
2016-06-05 17:04:03 +00:00
pfg
28823d0656 sys/kern: spelling fixes in comments.
No functional change.
2016-04-29 22:15:33 +00:00
kib
f16910a47e The struct thread td_estcpu member is only used by the 4BSD scheduler.
Move it to the struct td_sched for 4BSD, removing always present
field, otherwise unused for ULE.

New scheduler method sched_estcpu() returns the estimation for
kinfo_proc consumption.  As before, it always returns 0 for ULE.

Remove sched_tick() scheduler method, unused both by 4BSD and ULE.

Update locking comment for the 4BSD struct td_sched, copying it from
the same comment for ULE.

Spell MAXPRI as PRI_MAX_TIMESHARE in the 4BSD comment.

Based on some notes from, and reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
2016-04-17 11:04:27 +00:00
jhb
4678a09484 kgdb uses td_oncpu to determine if a thread is running and should use
a pcb from stoppcbs[] rather than the thread's PCB.  However, exited threads
retained td_oncpu from the last time they ran, and newborn threads had their
CPU fields cleared to zero during fork and thread creation since they are
in the set of fields zeroed when threads are setup.  To fix, explicitly
update the CPU fields for exiting threads in sched_throw() to reflect the
switch out and reset the CPU fields for new threads in sched_fork_thread()
to NOCPU.

Reviewed by:	kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D3193
2015-08-03 20:43:36 +00:00
trasz
802017a04b Add kern.racct.enable tunable and RACCT_DISABLED config option.
The point of this is to be able to add RACCT (with RACCT_DISABLED)
to GENERIC, to avoid having to rebuild the kernel to use rctl(8).

Differential Revision:	https://reviews.freebsd.org/D2369
Reviewed by:	kib@
MFC after:	1 month
Relnotes:	yes
Sponsored by:	The FreeBSD Foundation
2015-04-29 10:23:02 +00:00
mav
f7d8522961 Restore pre-r239157 handling of sched_yield(), when thread time slice was
aborted, allowing other threads to run.  Without this change thread is just
rescheduled again, that was illustrated by provided test tool.

PR:		192926
Submitted by:	eric@vangyzen.net
MFC after:	2 weeks
2014-08-23 17:31:56 +00:00
marius
6678ece656 Given that as of r258002 the last external user is gone, make sched_lock
static.
2014-04-29 20:51:57 +00:00
markj
85492dd71d The arguments to sched:::off-cpu are the thread and associated process of
the thread selected to run, not the currently running thread. This fix has
already been made for ULE in r252070.

PR:		177706
MFC after:	1 week
2013-12-29 17:08:30 +00:00
avg
71889a5eff dtrace sdt: remove the ugly sname parameter of SDT_PROBE_DEFINE
In its stead use the Solaris / illumos approach of emulating '-' (dash)
in probe names with '__' (two consecutive underscores).

Reviewed by:	markj
MFC after:	3 weeks
2013-11-26 08:46:27 +00:00
attilio
7ee4e910ce - For kernel compiled only with KDTRACE_HOOKS and not any lock debugging
option, unbreak the lock tracing release semantic by embedding
  calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined
  version of the releasing functions for mutex, rwlock and sxlock.
  Failing to do so skips the lockstat_probe_func invokation for
  unlocking.
- As part of the LOCKSTAT support is inlined in mutex operation, for
  kernel compiled without lock debugging options, potentially every
  consumer must be compiled including opt_kdtrace.h.
  Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the
  dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
  is linked there and it is only used as a compile-time stub [0].

[0] immediately shows some new bug as DTRACE-derived support for debug
in sfxge is broken and it was never really tested.  As it was not
including correctly opt_kdtrace.h before it was never enabled so it
was kept broken for a while.  Fix this by using a protection stub,
leaving sfxge driver authors the responsibility for fixing it
appropriately [1].

Sponsored by:	EMC / Isilon storage division
Discussed with:	rstone
[0] Reported by:	rstone
[1] Discussed with:	philip
2013-11-25 07:38:45 +00:00
avg
9e6374b6a9 rename scheduler->swapper and SI_SUB_RUN_SCHEDULER->SI_SUB_LAST
Also directly call swapper() at the end of mi_startup instead of
relying on swapper being the last thing in sysinits order.

Rationale:

- "RUN_SCHEDULER" was misleading, scheduling already takes place at that stage
- "scheduler" was misleading, the function swaps in the swapped out processes
- another SYSINIT(SI_SUB_RUN_SCHEDULER, SI_ORDER_ANY) could never be
  invoked depending on its relative order with scheduler; this was not obvious
  and the bug actually used to exist

Reviewed by:	kib (ealier version)
MFC after:	14 days
2013-07-24 09:45:31 +00:00
trasz
d97338334a Add CPU percentage limit enforcement to RCTL. The resouce name is "pcpu".
It was implemented by Rudolf Tomori during Google Summer of Code 2012.
2012-10-26 16:01:08 +00:00
jhb
ca55558465 Mark the idle threads as non-sleepable and also assert that an idle
thread never blocks on a turnstile.
2012-08-22 20:01:38 +00:00
mav
60e552193c Some more minor tunings inspired by bde@. 2012-08-11 20:24:39 +00:00
mav
5b837de0b3 Some minor tunings/cleanups inspired by bde@ after previous commits:
- remove extra dynamic variable initializations;
 - restore (4BSD) and implement (ULE) hogticks variable setting;
 - make sched_rr_interval() more tolerant to options;
 - restore (4BSD) and implement (ULE) kern.sched.quantum sysctl, a more
user-friendly wrapper for sched_slice;
 - tune some sysctl descriptions;
 - make some style fixes.
2012-08-10 19:02:49 +00:00
mav
1f0ad947a6 Rework r220198 change (by fabient). I believe it solves the problem from
the wrong direction. Before it, if preemption and end of time slice happen
same time, thread was put to the head of the queue as for only preemption.
It could cause single thread to run for indefinitely long time. r220198
handles it by not clearing TDF_NEEDRESCHED in case of preemption. But that
causes delayed context switch every time preemption happens, even when not
needed.

Solve problem by introducing scheduler-specifoc thread flag TDF_SLICEEND,
set when thread's time slice is over and it should be put to the tail of
queue. Using SW_PREEMPT flag for that purpose as it was before just not
enough informative to work correctly.

On my tests this by 2-3 times reduces run time deviation (improves fairness)
in cases when several threads share one CPU.

Reviewed by:	fabient
MFC after:	2 months
Sponsored by:	iXsystems, Inc.
2012-08-09 19:26:13 +00:00
mav
f0fe2cf739 SCHED_4BSD scheduling quantum mechanism appears to be broken for some time.
With switchticks variable being reset each time thread preempted (that is
done regularly by interrupt threads) scheduling quantum may never expire.
It was not noticed in time because several other factors still regularly
trigger context switches.

Handle the problem by replacing that mechanism with its equivalent from
SCHED_ULE called time slice. It is effectively the same, just measured in
context of stathz instead of hz. Some unification is probably not bad.
2012-08-09 18:09:59 +00:00
pluknet
7aab7d56be Fix typo in function name SDT_PROBE4 and unbreak 4BSD UP. 2012-05-15 10:58:17 +00:00
rstone
a059a0e086 Implement the DTrace sched provider. This implementation aims to be
compatible with the sched provider implemented by Solaris and its open-
source derivatives.  Full documentation of the sched provider can be found
on Oracle's DTrace wiki pages.

Note that for compatibility with scripts originally written for Solaris,
serveral probes are defined that will never fire.  These probes are defined
to fire when Solaris-specific features perform certain actions.  As these
features are not present in FreeBSD, the probes can never fire.

Also, I have added a two probes that are not defined in Solaris, lend-pri
and load-change.  These probes have been added to make it possible to
collect schedgraph data with DTrace.

Finally, a few probes are defined in Solaris to take a cpuinfo_t *
argument.  As it was not immediately clear to me how to translate that to
FreeBSD, currently those probes are passed NULL in place of a cpuinfo_t *.

Sponsored by: Sandvine Incorporated
MFC after:	2 weeks
2012-05-15 01:30:25 +00:00
jhb
4fea355eb2 Add a new sched_clear_name() method to the scheduler interface to clear
the cached name used for KTR_SCHED traces when a thread's name changes.
This way KTR_SCHED traces (and thus schedgraph) will notice when a thread's
name changes, most commonly via execve().

MFC after:	2 weeks
2012-03-08 19:41:05 +00:00
jhb
b759911211 Some small fixes to CPU accounting for threads:
- Only initialize the per-cpu switchticks and switchtime in sched_throw()
  for the very first context switch on APs during boot.  This avoids a
  small gap between the middle of thread_exit() and sched_throw() where
  time is not accounted to any thread.
- In thread_exit(), update the timestamp bookkeeping to track the changes
  to mi_switch() introduced by td_rux so that the code once again matches
  the comment claiming it is mimicing mi_switch().  Specifically, only
  update the per-thread stats directly and depend on ruxagg() to update
  p_rux rather than adjusting p_rux directly.  While here, move the
  timestamp bookkeeping as late in the function as possible.

Reviewed by:	bde, kib
MFC after:	1 week
2012-01-03 21:03:28 +00:00
ed
0c56cf839d Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs.
The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.
2011-11-07 15:43:11 +00:00
delphij
f687a7bf6f Fix format strings for KTR_STATE in 4BSD ad ULE schedulers.
Submitted by:	Ivan Klymenko <fidaj@ukr.net>
PR:		kern/159904, kern/159905
MFC after:	2 weeks
Approved by:	re (kib)
2011-08-26 18:00:07 +00:00
attilio
364d0522f7 With retirement of cpumask_t and usage of cpuset_t for representing a
mask of CPUs, pc_other_cpus and pc_cpumask become highly inefficient.

Remove them and replace their usage with custom pc_cpuid magic (as,
atm, pc_cpumask can be easilly represented by (1 << pc_cpuid) and
pc_other_cpus by (all_cpus & ~(1 << pc_cpuid))).

This change is not targeted for MFC because of struct pcpu members
removal and dependency by cpumask_t retirement.

MD review by:	marcel, marius, alc
Tested by:	pluknet
MD testing by:	marcel, marius, gonzo, andreast
2011-07-04 12:04:52 +00:00
attilio
bc4d32e80b MFC 2011-05-31 21:22:44 +00:00
attilio
fe4de567b5 Commit the support for removing cpumask_t and replacing it directly with
cpuset_t objects.
That is going to offer the underlying support for a simple bump of
MAXCPU and then support for number of cpus > 32 (as it is today).

Right now, cpumask_t is an int, 32 bits on all our supported architecture.
cpumask_t on the other side is implemented as an array of longs, and
easilly extendible by definition.

The architectures touched by this commit are the following:
- amd64
- i386
- pc98
- arm
- ia64
- XEN

while the others are still missing.
Userland is believed to be fully converted with the changes contained
here.

Some technical notes:
- This commit may be considered an ABI nop for all the architectures
  different from amd64 and ia64 (and sparc64 in the future)
- per-cpu members, which are now converted to cpuset_t, needs to be
  accessed avoiding migration, because the size of cpuset_t should be
  considered unknown
- size of cpuset_t objects is different from kernel and userland (this is
  primirally done in order to leave some more space in userland to cope
  with KBI extensions). If you need to access kernel cpuset_t from the
  userland please refer to example in this patch on how to do that
  correctly (kgdb may be a good source, for example).
- Support for other architectures is going to be added soon
- Only MAXCPU for amd64 is bumped now

The patch has been tested by sbruno and Nicholas Esborn on opteron
4 x 12 pack CPUs. More testing on big SMP is expected to came soon.
pluknet tested the patch with his 8-ways on both amd64 and i386.

Tested by:	pluknet, sbruno, gianni, Nicholas Esborn
Reviewed by:	jeff, jhb, sbruno
2011-05-05 14:39:14 +00:00
attilio
7ac8b4739c - Remove the following sysctl:
kern.sched.ipiwakeup.onecpu
  kern.sched.ipiwakeup.htt2

  Because they are absolutely obsolete. Probabilly the whole wakeup
  forward mechanism should be revisited for a better fitting in modern
  hw.
- As map2 variable is no longer used rename map3 to map2
- Fix a string by making more informative the msg and removing the
  arguments passing

Approved by:	julian
2011-04-30 23:28:07 +00:00
attilio
d0e06d02bc idle_cpus_mask is just used in the SMP case and within sched_4BSD.
Declare appropriately.
2011-04-30 22:30:18 +00:00
rstone
8a5b424a2c If the 4BSD scheduler tries to schedule a thread that has been pinned or
bound to an AP before SMP has started, the system will panic when we try
to touch per-CPU state for that AP because that state has not been
initialized yet.  Fix this in the same way as ULE: place all threads in
the global run queue before SMP has started.

Reviewed by:	jhb
MFC after:	1 month
2011-04-26 20:34:30 +00:00
jhb
96b1d8b6d7 Fix several places to ignore processes that are not yet fully constructed.
MFC after:	1 week
2011-04-06 17:47:22 +00:00
fabient
e0588db8d2 Clearing the flag when preempting will let the preempted thread run
too much time. This can finish in a scheduler deadlock with ping-pong
between two threads.

One sample of this is:
- device lapic (to have a preemption point on critical_exit())
- options DEVICE_POLLING with HZ>1499 (to have lapic freq = hardclock freq)
- running a cpu intensive task (that does not enter the kernel)
- only one CPU on SMP or no SMP.

As requested by jhb@ 4BSD have received the same type of fix instead of
propagating the flag to the new thread.

Reviewed by:	jhb, jeff
MFC after:	1 month
2011-03-31 13:59:47 +00:00
jhb
b92da6d9e2 Rework realtime priority support:
- Move the realtime priority range up above kernel sleep priorities and
  just below interrupt thread priorities.
- Contract the interrupt and kernel sleep priority ranges a bit so that
  the timesharing priority band can be increased.  The new timeshare range
  is now slightly larger than the old realtime + timeshare ranges.
- Change the ULE scheduler to no longer use realtime priorities for
  interactive threads.  Instead, the larger timeshare range is now split
  into separate subranges for interactive and non-interactive ("batch")
  threads.  The end result is that interactive threads and non-interactive
  threads still use the same priority ranges as before, but realtime
  threads now have a separate, dedicated priority range.
- Do not modify the priority of non-timeshare threads in sched_sleep()
  or via cv_broadcastpri().  Realtime and idle priority threads will
  no longer have their priorities affected by sleeping in the kernel.

Reviewed by:	jeff
2011-01-14 17:06:54 +00:00