Having a reserved vnode count does not guarantee that getnewvnodes wont
block later. Said blocking partially defeats the purpose of reserving in
the first place.
Preallocate instaed. The only consumer was always passing "1" as count
and never nesting reservations.
and make it usable outside of kern_umtx.c. To be used in several
future changes.
Discussed with: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Don't hold the scheduler lock while doing context switches. Instead we
unlock after selecting the new thread and switch within a spinlock
section leaving interrupts and preemption disabled to prevent local
concurrency. This means that mi_switch() is entered with the thread
locked but returns without. This dramatically simplifies scheduler
locking because we will not hold the schedlock while spinning on
blocked lock in switch.
This change has not been made to 4BSD but in principle it would be
more straightforward.
Discussed with: markj
Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22778
Eliminate recursion from most thread_lock consumers. Return from
sched_add() without the thread_lock held. This eliminates unnecessary
atomics and lock word loads as well as reducing the hold time for
scheduler locks. This will eventually allow for lockless remote adds.
Discussed with: kib
Reviewed by: jhb
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22626
This saves 320 bytes of the precious stack space.
The only negative aspect of the change I can think of is that the
struct thread increased by 320 bytes obviously, and that 320 bytes are
not swapped out anymore. I believe the freed stack space is much more
important than that. Also, current struct thread size is 1392 bytes
on amd64, so UMA will allocate two thread structures per (4KB) slab,
which leaves a space for pcb without increasing zone memory use.
Reviewed by: alc, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D22138
properly nested and warns about recursive entrances. Unlike with locks,
there is nothing fundamentally wrong with such use, the intent of tracer
is to help to review complex epoch-protected code paths, and we mean the
network stack here.
Reviewed by: hselasky
Sponsored by: Netflix
Pull Request: https://reviews.freebsd.org/D21610
Early errors prior to a system call did not set td_errno. This commit
sets td_errno for all errors during syscallenter(). As a result,
syscallret() can now always use td_errno without checking TDP_NERRNO.
Reviewed by: kib
MFC after: 1 month
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D20898
For machines having cmpxcgh16b instruction, i.e. everything but very
early Athlons, provide lockless implementation of delayed
invalidation.
The implementation maintains lock-less single-linked list with the
trick from the T.L. Harris article about volatile mark of the elements
being removed. Double-CAS is used to atomically update both link and
generation. New thread starting DI appends itself to the end of the
queue, setting the generation to the generation of the last element
+1. On DI finish, thread donates its generation to the previous
element. The generation of the fake head of the list is the last
passed DI generation. Basically, the implementation is a queued
spinlock but without spinlock.
Many thanks both to Peter Holm and Mark Johnson for keeping with me
while I produced intermediate versions of the patch.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
MFC note: td_md.md_invl_gen should go to the end of struct thread
Differential revision: https://reviews.freebsd.org/D19630
Contrary to the comments, it was never used by core dumps or
debuggers. Instead, it used to hold the signal code of a pending
signal, but that was replaced by the 'ksi_code' member of ksiginfo_t
when signal information was reworked in 7.0.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D20047
PTI mode for the process pmap on exec is activated iff P_MD_PTI is set.
On exec, the existing vmspace can be reused only if pti mode of the
pmap matches the P_MD_PTI flag of the process. Add MD
cpu_exec_vmspace_reuse() callback for exec_new_vmspace() which can
vetoed reuse of the existing vmspace.
MFC note: md_flags change struct proc KBI.
Reviewed by: jhb, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D19514
When building with KCOV enabled the compiler will insert function calls
to probes allowing us to trace the execution of the kernel from userspace.
These probes are on function entry (trace-pc) and on comparison operations
(trace-cmp).
Userspace can enable the use of these probes on a single kernel thread with
an ioctl interface. It can allocate space for the probe with KIOSETBUFSIZE,
then mmap the allocated buffer and enable tracing with KIOENABLE, with the
trace mode being passed in as the int argument. When complete KIODISABLE
is used to disable tracing.
The first item in the buffer is the number of trace event that have
happened. Userspace can write 0 to this to reset the tracing, and is
expected to do so on first use.
The format of the buffer depends on the trace mode. When in PC tracing just
the return address of the probe is stored. Under comparison tracing the
comparison type, the two arguments, and the return address are traced. The
former method uses on entry per trace event, while the later uses 4. As
such they are incompatible so only a single mode may be enabled.
KCOV is expected to help fuzzing the kernel, and while in development has
already found a number of issues. It is required for the syzkaller system
call fuzzer [1]. Other kernel fuzzers could also make use of it, either
with the current interface, or by extending it with new modes.
A man page is currently being worked on and is expected to be committed
soon, however having the code in the kernel now is useful for other
developers to use.
[1] https://github.com/google/syzkaller
Submitted by: Mitchell Horne <mhorne063@gmail.com> (Earlier version)
Reviewed by: kib
Testing by: tuexen
Sponsored by: DARPA, AFRL
Sponsored by: The FreeBSD Foundation (Mitchell Horne)
Differential Revision: https://reviews.freebsd.org/D14599
Bootstacks are unused after APs executed sched_throw() in
init_secondary_tail() and started executing on proper idle thread
stack. Add sysinit that detects that the idle thread for each CPU was
scheduled at least once, and free corresponding bootstack.
Slight addition of the code (~200 bytes) is compensated by the saving,
because even on typical small modern desktop CPU we leak 128K of
memory otherwise (4 pages x 8 threads).
Reviewed by: jhb
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D18486
mutexes but now are converted to epoch(9) use thread-private epoch_tracker.
Embedding tracker into ifnet(9) or ifnet derived structures creates a non
reentrable function, that will fail miserably if called simultaneously from
two different contexts.
A thread private tracker will provide a single tracker that would allow to
call these functions safely. It doesn't allow nested call, but this is not
expected from compatibility KPIs.
Reviewed by: markj
handling.
This is split into a separate commit from the main change to make it
easier to handle possible revert after upcoming KBI freeze.
Reviewed by: kevans
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (rgrimes)
Differential revision: https://reviews.freebsd.org/D16972
When the "pc" audit class is enabled and auditd is running, witness will
panic during thread exit because au_event_class tries to lock an rwlock
while holding a spinlock acquired upstack by thread_exit.
To fix this, move AUDIT_SYSCALL_EXIT futher upstack, before the spinlock is
acquired. Of thread_exit's 16 callers, it's only necessary to call
AUDIT_SYSCALL_EXIT from two, exit1 (for exiting processes) and kern_thr_exit
(for exiting threads). The other callers are all kernel threads, which
needen't call AUDIT_SYSCALL_EXIT because since they can't make syscalls
there will be nothing to audit. And exit1 already does call
AUDIT_SYSCALL_EXIT, making the second call in thread_exit redundant for that
case.
PR: 228444
Reported by: aniketp
Reviewed by: aniketp, kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D16210
By logging all threads and processes 'pmc filter'
can now filter on process or thread name, relieving
the user of the burden of determining which tid or
pid was which when the sample was taken.
% pmc filter -T if_io_tqg -P nginx pmc.log pmc-iflib.log
% pmc filter -x -T idle pmc.log pmc-noidle.log
This implements per-thread counters for PMC sampling. The thread
descriptors are stored in a list attached to the process descriptor.
These thread descriptors can store any per-thread information necessary
for current or future features. For the moment, they just store the counters
for sampling.
The thread descriptors are created when the process descriptor is created.
Additionally, thread descriptors are created or freed when threads
are started or stopped. Because the thread exit function is called in a
critical section, we can't directly free the thread descriptors. Hence,
they are freed to a cache, which is also used as a source of allocations
when needed for new threads.
Approved by: sbruno
Obtained from: jtl
Sponsored by: Juniper Networks, Limelight Networks
Differential Revision: https://reviews.freebsd.org/D15335
Allow processes to request the delivery of a signal upon death of
their parent process. Supposed consumer of the feature is PostgreSQL.
Submitted by: Thomas Munro
Reviewed by: jilles, mjg
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D15106
Assert that all such memory is unwired on return to usermode.
The count of the wired memory will be used to detect the copyout mode.
Tested by: pho (as part of the larger patch)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
userspace to control NUMA policy administratively and programmatically.
Implement domainset based iterators in the page layer.
Remove the now legacy numa_* syscalls.
Cleanup some header polution created by having seq.h in proc.h.
Reviewed by: markj, kib
Discussed with: alc
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13403
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
This introduces a facility to EVENTHANDLER(9) for explicitly defining a
reference to an event handler list. This is useful since previously all
invokers of events had to do a locked traversal of the global list of
event handler lists in order to find the appropriate event handler list.
By keeping a pointer to the appropriate list an invoker can avoid this
traversal completely. The pointer is initialized with SYSINIT(9) during
the eventhandler stage. Users registering interest in events do not need
to know if the event is backed by such a list, since the list is added
to the global list of lists. As with lists that are not pre-defined it
is safe to register for the events before the list has been created.
This converts the process_* and thread_* events to using the new
facility, as these are events whose locked traversals end up showing up
significantly in ports build workflows (and presumably other workflows
with many short lived threads/procs). It may be advantageous to convert
other events to using the new facility.
The el_flags field is now unused, but leave it be so that this revision
can be MFC'd.
Reviewed by: bdrewery, markj, mjg
Approved by: rstone (mentor)
In collaboration with: ian
MFC after: 4 weeks
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D12814
struct thread.
For all architectures, the syscall trap handlers have to allocate the
structure on the stack. The structure takes 88 bytes on 64bit arches
which is not negligible. Also, it cannot be easily found by other
code, which e.g. caused duplication of some members of the structure
to struct thread already. The change removes td_dbg_sc_code and
td_dbg_sc_nargs which were directly copied from syscall_args.
The structure is put into the copied on fork part of the struct thread
to make the syscall arguments information correct in the child after
fork.
This move will also allow several more uses shortly.
Reviewed by: jhb (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
X-Differential revision: https://reviews.freebsd.org/D11080
Some notes:
- Only i386 and amd64 layouts are checked, other Tier-1 (or close to
it) architectures would benefit from the same check.
- Unconditional enabling of the asserts depend on the stability of locks
memory layout. If locks are optimized to avoid bloat when some debugging
or profiling features turned off, it makes sense to only assert layout
for production configs.
Reviewed by: badger, emaste, jhb, vangyzen
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D10526
in place. To do per-cpu stats, convert all fields that previously were
maintained in the vmmeters that sit in pcpus to counter(9).
- Since some vmmeter stats may be touched at very early stages of boot,
before we have set up UMA and we can do counter_u64_alloc(), provide an
early counter mechanism:
o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter.
o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter,
so that at early stages of boot, before counters are allocated we already
point to a counter that can be safely written to.
o For sparc64 that required a whole dummy pcpu[MAXCPU] array.
Further related changes:
- Don't include vmmeter.h into pcpu.h.
- vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit,
to match kernel representation.
- struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exclusion.
This is based on benno@'s 4-year old patch:
https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html
Reviewed by: kib, gallatin, marius, lidl
Differential Revision: https://reviews.freebsd.org/D10156
Thread might create a condition for delayed SU cleanup, which creates
a reference to the mount point in td_su, but exit without returning
through userret(), e.g. when terminating due to single-threading or
process exit. In this case, td_su reference is not dropped and mount
point cannot be freed.
Handle the situation by clearing td_su also in the thread destructor
and in exit1(). softdep_ast_cleanup() has to receive the thread as
argument, since e.g. thread destructor is executed in different
context.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
specifics of callout KPI. Esp., do not depend on the exact interface
of callout_stop(9) return values.
The main change is that instead of requiring precise callouts, code
maintains absolute time to wake up. Callouts now should ensure that a
wake occurs at the requested moment, but we can tolerate both run-away
callout, and callout_stop(9) lying about running callout either way.
As consequence, it removes the constant source of the bugs where
sleepq_check_timeout() causes uninterruptible thread state where the
thread is detached from CPU, see e.g. r234952 and r296320.
Patch also removes dual meaning of the TDF_TIMEOUT flag, making code
(IMO much) simpler to reason about.
Tested by: pho
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D7137
calculate appropriate return value for stops. Simplify the code by
using them.
Fix typo in sig_suspend_threads(). The thread which sleep must be
aborted is td2. (*)
In issignal(), when handling stopping signal for thread in
TD_SBDRY_INTR state, do not stop, this is wrong and fires assert.
This is yet another place where execution should be forced out of
SBDRY-protected region. For such case, return -1 from issignal() and
translate it to corresponding error code in sleepq_catch_signals().
Assert that other consumers of cursig() are not affected by the new
return value. (*)
Micro-optimize, mostly VFS and VOP methods, by avoiding calling the
functions when SIGDEFERSTOP_NOP non-change is requested. (**)
Reported and tested by: pho (*)
Requested by: bde (**)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Approved by: re (gjb)
framework allowing to set the suspension policy for the dynamic block.
Extend the currently possible policies of stopping on interruptible
sleeps and ignoring such sleeps by two more: do not suspend at
interruptible sleeps, but interrupt them with either EINTR or ERESTART.
Reviewed by: jilles
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Approved by: re (gjb)
reason for it in modern times. In the other case, expand the comment
stating instead of doubting.
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (hrs)
X-Differential revision: https://reviews.freebsd.org/D6731
p_sched is unused.
The struct td_sched is always co-allocated with the struct thread,
except for the thread0. Avoid useless indirection, instead calculate
td_sched location using simple pointer arithmetic in td_get_sched(9).
For thread0, which is statically allocated, create a structure to
emulate layout of the dynamic allocation.
Reviewed by: jhb (previous version)
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D6711
intention of the POSIX IEEE Std 1003.1TM-2008/Cor 1-2013.
A robust mutex is guaranteed to be cleared by the system upon either
thread or process owner termination while the mutex is held. The next
mutex locker is then notified about inconsistent mutex state and can
execute (or abandon) corrective actions.
The patch mostly consists of small changes here and there, adding
neccessary checks for the inconsistent and abandoned conditions into
existing paths. Additionally, the thread exit handler was extended to
iterate over the userspace-maintained list of owned robust mutexes,
unlocking and marking as terminated each of them.
The list of owned robust mutexes cannot be maintained atomically
synchronous with the mutex lock state (it is possible in kernel, but
is too expensive). Instead, for the duration of lock or unlock
operation, the current mutex is remembered in a special slot that is
also checked by the kernel at thread termination.
Kernel must be aware about the per-thread location of the heads of
robust mutex lists and the current active mutex slot. When a thread
touches a robust mutex for the first time, a new umtx op syscall is
issued which informs about location of lists heads.
The umtx sleep queues for PP and PI mutexes are split between
non-robust and robust.
Somewhat unrelated changes in the patch:
1. Style.
2. The fix for proper tdfind() call use in umtxq_sleep_pi() for shared
pi mutexes.
3. Removal of the userspace struct pthread_mutex m_owner field.
4. The sysctl kern.ipc.umtx_vnode_persistent is added, which controls
the lifetime of the shared mutex associated with a vnode' page.
Reviewed by: jilles (previous version, supposedly the objection was fixed)
Discussed with: brooks, Martin Simmons <martin@lispworks.com> (some aspects)
Tested by: pho
Sponsored by: The FreeBSD Foundation
This code is missing the racct_subr() call from kern_thr_exit() and would
require further code duplication in future changes.
Reviewed by: kib
MFC after: 1 week
When establishing the locking state for several lock types (including
blockable mutexes and sx) failed, locking primitives try to spin while
the owner thread is running. The spinning loop performs the test for
running condition by dereferencing the owner->td_state field of the
owner thread. If the owner thread exited while spinner was put off
the processor, it is harmless to access reused struct thread owner,
since in some near future the current processor would notice the owner
change and make appropriate progress. But it could be that the page
which carried the freed struct thread was unmapped, then we fault
(this cannot happen on amd64).
For now, disallowing free of the struct thread seems to be good
enough, and tests which create a lot of threads once, did not
demonstrated regressions.
Reviewed by: jhb, pho
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D3908
Their primary use was in thread_cow_update to free up old resources.
Freeing had to be done with proc lock held and _cow_ funcs already knew
how to free old structs.
This is based on work done by jeff@ and jhb@, as well as the numa.diff
patch that has been circulating when someone asks for first-touch NUMA
on -10 or -11.
* Introduce a simple set of VM policy and iterator types.
* tie the policy types into the vm_phys path for now, mirroring how
the initial first-touch allocation work was enabled.
* add syscalls to control changing thread and process defaults.
* add a global NUMA VM domain policy.
* implement a simple cascade policy order - if a thread policy exists, use it;
if a process policy exists, use it; use the default policy.
* processes inherit policies from their parent processes, threads inherit
policies from their parent threads.
* add a simple tool (numactl) to query and modify default thread/process
policities.
* add documentation for the new syscalls, for numa and for numactl.
* re-enable first touch NUMA again by default, as now policies can be
set in a variety of methods.
This is only relevant for very specific workloads.
This doesn't pretend to be a final NUMA solution.
The previous defaults in -HEAD (with MAXMEMDOM set) can be achieved by
'sysctl vm.default_policy=rr'.
This is only relevant if MAXMEMDOM is set to something other than 1.
Ie, if you're using GENERIC or a modified kernel with non-NUMA, then
this is a glorified no-op for you.
Thank you to Norse Corp for giving me access to rather large
(for FreeBSD!) NUMA machines in order to develop and verify this.
Thank you to Dell for providing me with dual socket sandybridge
and westmere v3 hardware to do NUMA development with.
Thank you to Scott Long at Netflix for providing me with access
to the two-socket, four-domain haswell v3 hardware.
Thank you to Peter Holm for running the stress testing suite
against the NUMA branch during various stages of development!
Tested:
* MIPS (regression testing; non-NUMA)
* i386 (regression testing; non-NUMA GENERIC)
* amd64 (regression testing; non-NUMA GENERIC)
* westmere, 2 socket (thankyou norse!)
* sandy bridge, 2 socket (thankyou dell!)
* ivy bridge, 2 socket (thankyou norse!)
* westmere-EX, 4 socket / 1TB RAM (thankyou norse!)
* haswell, 2 socket (thankyou norse!)
* haswell v3, 2 socket (thankyou dell)
* haswell v3, 2x18 core (thankyou scott long / netflix!)
* Peter Holm ran a stress test suite on this work and found one
issue, but has not been able to verify it (it doesn't look NUMA
related, and he only saw it once over many testing runs.)
* I've tested bhyve instances running in fixed NUMA domains and cpusets;
all seems to work correctly.
Verified:
* intel-pcm - pcm-numa.x and pcm-memory.x, whilst selecting different
NUMA policies for processes under test.
Review:
This was reviewed through phabricator (https://reviews.freebsd.org/D2559)
as well as privately and via emails to freebsd-arch@. The git history
with specific attributes is available at https://github.com/erikarn/freebsd/
in the NUMA branch (https://github.com/erikarn/freebsd/compare/local/adrian_numa_policy).
This has been reviewed by a number of people (stas, rpaulo, kib, ngie,
wblock) but not achieved a clear consensus. My hope is that with further
exposure and testing more functionality can be implemented and evaluated.
Notes:
* The VM doesn't handle unbalanced domains very well, and if you have an overly
unbalanced memory setup whilst under high memory pressure, VM page allocation
may fail leading to a kernel panic. This was a problem in the past, but it's
much more easily triggered now with these tools.
* This work only controls the path through vm_phys; it doesn't yet strongly/predictably
affect contigmalloc, KVA placement, UMA, etc. So, driver placement of memory
isn't really guaranteed in any way. That's next on my plate.
Sponsored by: Norse Corp, Inc.; Dell
Use the same scheme implemented to manage credentials.
Code needing to look at process's credentials (as opposed to thred's) is
provided with *_proc variants of relevant functions.
Places which possibly had to take the proc lock anyway still use the proc
pointer to access limits.
Thread credentials are maintained as follows: each thread has a pointer to
creds and a reference on them. The pointer is compared with proc's creds on
userspace<->kernel boundary and updated if needed.
This patch introduces a counter which can be compared instead, so that more
structures can use this scheme without adding more comparisons on the boundary.
stopping signals should obey, but also all forms of single-threading.
Otherwise, thread might sleep interruptible while owning some
resources, and single-threading thread could try to access them.
An example is owning vnode lock while dumping core.
Submitted by: Conrad Meyer
Review: https://reviews.freebsd.org/D2612
Tested by: pho
MFC after: 1 week
years for head. However, it is continuously misused as the mpsafe argument
for callout_init(9). Deprecate the flag and clean up callout_init() calls
to make them more consistent.
Differential Revision: https://reviews.freebsd.org/D2613
Reviewed by: jhb
MFC after: 2 weeks