particular, the Giant is supposed to protect against parallel
ntp_adjtime(2) invocations. But, for instance, sys_ntp_adjtime() does
copyout(9) under Giant and then examines time_status to return syscall
result. Since copyout(9) could sleep, the syscall result might be
inconsistent.
Another and more important issue is that if PPS is configured,
hardpps(9) is executed without any protection against the parallel
top-level code invocation. Potentially, this may result in the
inconsistent state of the ntptime state variables, but I cannot say
how serious such distortion is. The non-functional splclock() call in
sys_ntp_adjtime() protected against clock interrupts calling hardpps()
in the pre-SMP era.
Modernize the locking. A mutex protects ntptime data. Due to the
hardpps() KPI legitimately serving from the interrupt filters (and
e.g. uart(4) does call it from filter), the lock cannot be sleepable
mutex if PPS_SYNC is defined. Otherwise, use normal sleepable mutex
to reduce interrupt latency.
Reviewed by: imp, jhb
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
Differential revision: https://reviews.freebsd.org/D6825
private mtx in resettodr(), no implementation of CLOCK_SETTIME() is
allowed to sleep.
Reviewed by: imp, jhb
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
X-Differential revision: https://reviews.freebsd.org/D6825
TDF_SEINTR flags values, unlike TDF_SBDRY, must be treated almost as
if TDF_SBDRY is not set for STOP signal delivery. The only difference
is that sig_suspend_threads() should abort the sleep instead of doing
immediate suspension.
Reported by: ngie
Sponsored by: The FreeBSD Foundation
MFC after: 12 days
Approved by: re (gjb)
structure, change it to int.
The real fix is to sanitize user-visible definitions in sys/event.h,
e.g. the affected struct knlist is of no use for userspace programs.
Reported and tested by: jkim
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Approved by: re (gjb)
exiting (NOTE_EXIT->knlist_remove_inevent()), two things happen:
- knote kn_knlist pointer is reset
- INFLUX knote is removed from the process knlist.
And, there are two consequences:
- KN_LIST_UNLOCK() on such knote is nop
- there is nothing which would block exit1() from processing past the
knlist_destroy() (and knlist_destroy() resets knlist lock pointers).
Both consequences result either in leaked process lock, or
dereferencing NULL function pointers for locking.
Handle this by stopping embedding the process knlist into struct proc.
Instead, the knlist is allocated together with struct proc, but marked
as autodestroy on the zombie reap, by knlist_detach() function. The
knlist is freed when last kevent is removed from the list, in
particular, at the zombie reap time if the list is empty. As result,
the knlist_remove_inevent() is no longer needed and removed.
Other changes:
In filt_procattach(), clear NOTE_EXEC and NOTE_FORK desired events
from kn_sfflags for knote registered by kernel to only get NOTE_CHILD
notifications. The flags leak resulted in excessive
NOTE_EXEC/NOTE_FORK reports.
Fix immediate note activation in filt_procattach(). Condition should
be either the immediate CHILD_NOTE activation, or immediate NOTE_EXIT
report for the exiting process.
In knote_fork(), do not perform racy check for KN_INFLUX before kq
lock is taken. Besides being racy, it did not accounted for notes
just added by scan (KN_SCAN).
Some minor and incomplete style fixes.
Analyzed and tested by: Eric Badger <eric@badgerio.us>
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Approved by: re (gjb)
Differential revision: https://reviews.freebsd.org/D6859
interrupt sleeps with the ERESTART on the suspension attempts.
Otherwise, single-threading requests are deferred until the locks are
granted for NFS files, which causes hangs.
When retrying local registration of the remotely-granted adv lock,
allow full suspension and check for suspension, for usual reasons.
Reported by: markj, pho
Reviewed by: jilles
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Approved by: re (gjb)
framework allowing to set the suspension policy for the dynamic block.
Extend the currently possible policies of stopping on interruptible
sleeps and ignoring such sleeps by two more: do not suspend at
interruptible sleeps, but interrupt them with either EINTR or ERESTART.
Reviewed by: jilles
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Approved by: re (gjb)
lists must be functional.
Reported by: Daniel Engberg <daniel.engberg.lists@pyret.net>,
Guy Yur <guyyur@gmail.com>
Tested by: Guy Yur <guyyur@gmail.com>
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb), including the KBI change
While reading the code, I noticed that shm_read() returns without unlocking
foffset and rangelock if mac_posixshm_check_read() rejects the read.
Reviewed by: kib, jhb, rwatson
Approved by: re (gjb)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D6927
Mark the pipe() system call as COMPAT10.
As of r302092 libc uses pipe2() with a zero flags value instead of pipe().
Approved by: re (gjb)
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D6816
As of r302092 libc uses pipe2() with a zero flags value instead of pipe().
Commit with regenerated files and implementation to follow.
Approved by: re (gjb)
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D6816
File and disk-backed I/O requests store counts of read/written disk
blocks in each AIO job so that they can be charged to the thread that
completes an AIO request via aio_return() or aio_waitcomplete(). This
change extends AIO jobs to store counts of received/sent messages and
updates socket backends to set these counts accordingly. Note that
the socket backends are careful to only charge a single messages for
each AIO request even though a single request on a blocking socket might
invoke sosend or soreceive multiple times. This is to mimic the
resource accounting of synchronous read/write.
Adjust the UNIX socketpair AIO test to verify that the message resource
usage counts update accordingly for aio_read and aio_write.
Approved by: re (hrs)
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D6911
than removing the network interfaces first. This change is rather larger
and convoluted as the ordering requirements cannot be separated.
Move the pfil(9) framework to SI_SUB_PROTO_PFIL, move Firewalls and
related modules to their own SI_SUB_PROTO_FIREWALL.
Move initialization of "physical" interfaces to SI_SUB_DRIVERS,
move virtual (cloned) interfaces to SI_SUB_PSEUDO.
Move Multicast to SI_SUB_PROTO_MC.
Re-work parts of multicast initialisation and teardown, not taking the
huge amount of memory into account if used as a module yet.
For interface teardown we try to do as many of them as we can on
SI_SUB_INIT_IF, but for some this makes no sense, e.g., when tunnelling
over a higher layer protocol such as IP. In that case the interface
has to go along (or before) the higher layer protocol is shutdown.
Kernel hhooks need to go last on teardown as they may be used at various
higher layers and we cannot remove them before we cleaned up the higher
layers.
For interface teardown there are multiple paths:
(a) a cloned interface is destroyed (inside a VIMAGE or in the base system),
(b) any interface is moved from a virtual network stack to a different
network stack ("vmove"), or (c) a virtual network stack is being shut down.
All code paths go through if_detach_internal() where we, depending on the
vmove flag or the vnet state, make a decision on how much to shut down;
in case we are destroying a VNET the individual protocol layers will
cleanup their own parts thus we cannot do so again for each interface as
we end up with, e.g., double-frees, destroying locks twice or acquiring
already destroyed locks.
When calling into protocol cleanups we equally have to tell them
whether they need to detach upper layer protocols ("ulp") or not
(e.g., in6_ifdetach()).
Provide or enahnce helper functions to do proper cleanup at a protocol
rather than at an interface level.
Approved by: re (hrs)
Obtained from: projects/vnet
Reviewed by: gnn, jhb
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D6747
to mount points with the given filesystem type, specified by mount
vfs_ops pointer.
Based on patch by: mckusick
Reviewed by: avg, mckusick
Tested by: allanjude, madpilot
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
threads, to make it less confusing and using modern kernel terms.
Rename the functions to reflect current use of the functions, instead
of the historic KSE conventions:
cpu_set_fork_handler -> cpu_fork_kthread_handler (for kthreads)
cpu_set_upcall -> cpu_copy_thread (for forks)
cpu_set_upcall_kse -> cpu_set_upcall (for new threads creation)
Reviewed by: jhb (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (hrs)
Differential revision: https://reviews.freebsd.org/D6731
reason for it in modern times. In the other case, expand the comment
stating instead of doubting.
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (hrs)
X-Differential revision: https://reviews.freebsd.org/D6731
This reduces the size of kaiocb slightly. I've also added some generic
fields that other backends can use in place of the BIO-specific fields.
Change the socket and Chelsio DDP backends to use 'backend3' instead of
abusing _aiocb_private.status directly. This confines the use of
_aiocb_private to the AIO internals in vfs_aio.c.
Reviewed by: kib (earlier version)
Approved by: re (gjb)
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D6547
we set MNTK_UNMOUNT flag on the mp. Otherwise parallel unmount which
wins race with us could dereference the covered vnode, and we are
left with the locked freed memory.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
MFC after: 1 week
The allproc_lock lock used in the sysctl_kern_corefile function is initialized
in the procinit function which is called after setting sysctl values at boot.
That means if we set kern.corefile at boot we will be trying to use
lock with is uninitialized and machine will crash.
If we define kern.corefile as tunable instead of using CTFLAG_RWTUN we will
not call the sysctl_kern_corefile function and we will not use an uninitialized
lock. When machine will boot then we will start using function depending on
the lock.
Reviewed by: pjd
by holding allprison_lock exclusively (even if only for a moment before
downgrading) on all paths that call PR_METHOD_REMOVE. Since they may run
on a downgraded lock, it's still possible for them to run concurrently
with PR_METHOD_GET, which will need to use the prison lock.
when the process credentials were not changed. This can happen if an
error occured trying to activate the setuid binary. And on error, if
new credentials were not yet assigned, they must be freed to not
create the leak.
Use oldcred == NULL as the predicate to detect credential
reassignment.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
panic string again if set, in case it scrolled out of the active
window. This avoids having to remember the symbol name.
Also add a show callout <addr> command to DDB in order to inspect
some struct callout fields in case of panics in the callout code.
This may help to see if there was memory corruption or to further
ease debugging problems.
Obtained from: projects/vnet
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Reviewed by: jhb (comment only on the show panic initally)
Differential Revision: https://reviews.freebsd.org/D4527
p_sched is unused.
The struct td_sched is always co-allocated with the struct thread,
except for the thread0. Avoid useless indirection, instead calculate
td_sched location using simple pointer arithmetic in td_get_sched(9).
For thread0, which is statically allocated, create a structure to
emulate layout of the dynamic allocation.
Reviewed by: jhb (previous version)
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D6711
BUS_MAP_INTR() is used to get an interrupt mapping data according
to provided hints. The hints could be modified afterwards, but only
if mapping data was allocated. This method is intended to be called
before BUS_ALLOC_RESOURCE().
An interrupt mapping data describes an interrupt - hardware number,
type, configuration, cpu binding, and whatever is needed to setup it.
(2) Introduce a method which allows storing of an additional data
in struct resource to be available for bus drivers. This method is
convenient in two ways:
- there is no need to rework existing bus drivers as they can simply
be extended to provide an additional data,
- there is no need to modify any existing bus methods as struct
resource is already passed to them as argument and thus stored data
is simply accessible by other bus drivers.
For now, implement this method only for INTRNG.
This is motivated by needs of modern SOCs where hardware initialization
is not straightforward and resources descriptions are complex, opaque
for everyone but provider, and may vary from SOC to SOC. Typical
situation is that one bus driver can fetch a resource description for
its child device, but it's opaque for this driver. Another bus driver
knows a provider for this kind of resource and can pass this resource
description to it. In fact, something like device IVARS would be
perfect for that if implemented generally enough. Unfortunatelly, IVARS
are usable only by their owners now. Only owner knows its IVARS layout,
thus other bus drivers are not able to use them.
Differential Revision: https://reviews.freebsd.org/D6632
range of interrupts they pass to a second controller driver to handle.
The parent driver is expected to detect when one of these interrupts has
been triggered and call intr_child_irq_handler to pass the interrupt to
a child. The children controllers are then expected to manage the range
by allocating interrupts as needed.
This will initially be used by the ARM GICv3 driver, but is is expected to
be useful for other driver where this type of allocation applies.
Obtained from: ABT Systems Ltd
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D6436
Inline version of primitives do an atomic op and if it fails they fallback to
actual primitives, which immediately retry the atomic op.
The obvious optimisation is to check if the lock is free and only then proceed
to do an atomic op.
Reviewed by: jhb, vangyzen
specific order. VNET_SYSUNINITs however are doing exactly that.
Thus remove the VIMAGE conditional field from the domain(9) protosw
structure and replace it with VNET_SYSUNINITs.
This also allows us to change some order and to make the teardown functions
file local static.
Also convert divert(4) as it uses the same mechanism ip(4) and ip6(4) use
internally.
Slightly reshuffle the SI_SUB_* fields in kernel.h and add a new ones, e.g.,
for pfil consumers (firewalls), partially for this commit and for others
to come.
Reviewed by: gnn, tuexen (sctp), jhb (kernel.h)
Obtained from: projects/vnet
MFC after: 2 weeks
X-MFC: do not remove pr_destroy
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D6652
The lock was temporarily dropped for vrele calls, but they can be
postponed to a point where the lock is not held in the first place.
While here shuffle other code not needing the lock.
This was previously set after the hook and only if auxargs were present.
Now always provide it if possible.
MFC after: 2 weeks
Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D6546