- Send IPI wakeups once SMP is started even if cold is true.
- Permit preemptions when cold is true.
These changes are needed for EARLY_AP_STARTUP.
MFC after: 2 weeks
Sponsored by: Netflix
The other CPU might resume and see a still-empty runq and go back to
sleep before sched_add() adds the thread to the runq. This results
in a lost wakeup and a potential hang if the system is otherwise
completely idle.
The race originated due to a micro-optimization (my fault) in 4BSD in
that it avoided putting a thread on the run queue if the scheduler was
going to preempt to the new thread. To avoid complexity while fixing
this race, just drop this optimization. 4BSD now always sets the
"owepreempt" flag when a preemption is warranted and defers the actual
preemption to the thread_unlock of the caller the same as ULE.
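The deferred-preemption scheme described above can be sketched in user-space C. This is only an illustration under stated assumptions: struct toy_thread, struct toy_runq, toy_sched_add() and toy_thread_unlock() are hypothetical names and do not reproduce the 4BSD scheduler code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_RUNQ_MAX 8

/* Hypothetical stand-ins for scheduler state; not kernel structures. */
struct toy_thread {
	int  prio;        /* lower value means higher priority */
	bool owepreempt;  /* a preemption is owed at unlock time */
};

struct toy_runq {
	struct toy_thread *slots[TOY_RUNQ_MAX];
	size_t             count;
};

/*
 * Always put the new thread on the run queue first, so that a CPU woken
 * by an IPI cannot observe a still-empty queue.  If the new thread should
 * preempt the current one, only record that fact.
 * Returns false if the toy queue is full.
 */
static bool
toy_sched_add(struct toy_runq *rq, struct toy_thread *curthread,
    struct toy_thread *td)
{
	assert(rq != NULL && curthread != NULL && td != NULL);
	if (rq->count == TOY_RUNQ_MAX)
		return (false);
	rq->slots[rq->count++] = td;
	if (td->prio < curthread->prio)
		curthread->owepreempt = true;
	return (true);
}

/*
 * Called when the caller drops its lock: the deferred preemption is
 * acted upon here.  Returns true if a context switch is due.
 */
static bool
toy_thread_unlock(struct toy_thread *curthread)
{
	bool doswitch;

	assert(curthread != NULL);
	doswitch = curthread->owepreempt;
	curthread->owepreempt = false;
	return (doswitch);
}
```

The point of the sketch is the ordering: the enqueue happens unconditionally before any decision about preemption is recorded.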
MFC after: 2 weeks
Sponsored by: Netflix
Pass current thread credentials instead of NOCRED.
Only allow unmapped buffers for filesystems which declared support for them.
For all filesystems which currently use the buffer pager (UFS, msdosfs and
cd9660), the changes are effectively a no-op.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
vp->v_mount->mnt_vnodecovered unlocked. This allowed unmount to race.
Lock the vnode after we have noticed the VV_ROOT flag. See the comments for
an explanation of why the unlocked check for the flag is considered safe.
Reported and tested by: avg
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
we never try to sleep while the thread is on a sleepqueue.
Reviewed by: jhb
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D8422
If dotdot lookup does not escape from the file descriptor passed as
the lookup root, we can allow the component traversal. Track the
directories traversed, and check the result of dotdot lookup against
the recorded list of the directory vnodes.
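The tracking idea above could look roughly like the following. It is a minimal sketch, not the namei() implementation: integer ids stand in for directory vnodes, and struct toy_dotdot_tracker, toy_track_dir() and toy_dotdot_allowed() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_TRACK_MAX 32

/* Hypothetical tracker; integer ids stand in for directory vnodes. */
struct toy_dotdot_tracker {
	int    dirs[TOY_TRACK_MAX];
	size_t ndirs;
};

/* Record a directory entered during the walk, starting with the root. */
static bool
toy_track_dir(struct toy_dotdot_tracker *t, int dir)
{
	assert(t != NULL);
	if (t->ndirs == TOY_TRACK_MAX)
		return (false);
	t->dirs[t->ndirs++] = dir;
	return (true);
}

/*
 * A ".." step is acceptable only if its result is a directory that was
 * already traversed, i.e. one at or below the lookup root.
 */
static bool
toy_dotdot_allowed(const struct toy_dotdot_tracker *t, int result)
{
	assert(t != NULL);
	for (size_t i = 0; i < t->ndirs; i++) {
		if (t->dirs[i] == result)
			return (true);
	}
	return (false);
}
```

A ".." that resolves to a directory never recorded during the walk is treated as an escape attempt and refused.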
Dotdot lookups are enabled by sysctl vfs.lookup_cap_dotdot, currently
disabled by default until more verification of the approach is done.
Disallow non-local filesystems for dotdot, since remote server might
conspire with the local process to allow it to escape the namespace.
This might be too cautious, provide the knob
vfs.lookup_cap_dotdot_nonlocal to override as well.
Idea by: rwatson
Discussed with: emaste, jonathan, rwatson
Reviewed by: mjg (previous version)
Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D8110
process of being unmounted. Previously it would skip them, even if the
unmount eventually failed, e.g. due to the filesystem being busy.
This behaviour broke autounmountd(8) - if you tried to manually unmount
a mounted filesystem, using 'automount -u', and the autounmountd attempted
to refresh the filesystem list in that very moment, it would conclude that
the filesystem got unmounted and not try to unmount it afterwards.
Reviewed by: kib@
Tested by: pho@
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D8030
Upstream the BUF_TRACKING and FULL_BUF_TRACKING buffer debugging code.
This can be handy in tracking down what code touched hung bios and bufs
last. The full history is especially useful, but adds enough bloat that
it shouldn't be enabled in release builds.
Function names (or arbitrary string constants) are tracked in a
fixed-size ring in bufs. Bios gain a pointer to the upper buf for
tracking. SCSI CCBs gain a pointer to the upper bio for tracking.
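A fixed-size ring of string constants, as described above, can be sketched as follows. The names struct toy_buf, toy_buf_track() and toy_buf_track_get() and the ring size are assumptions made for illustration and are not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_TRACK_SIZE 4	/* hypothetical ring size */

struct toy_buf {
	const char  *track[TOY_TRACK_SIZE];
	unsigned int track_cnt;	/* total number of records ever made */
};

/* Record a function name (or any string constant), overwriting the oldest. */
static void
toy_buf_track(struct toy_buf *bp, const char *location)
{
	assert(bp != NULL);
	bp->track[bp->track_cnt % TOY_TRACK_SIZE] = location;
	bp->track_cnt++;
}

/* Return the n-th most recent record (0 is the latest), or NULL if none. */
static const char *
toy_buf_track_get(const struct toy_buf *bp, unsigned int n)
{
	assert(bp != NULL);
	if (n >= TOY_TRACK_SIZE || n >= bp->track_cnt)
		return (NULL);
	return (bp->track[(bp->track_cnt - 1 - n) % TOY_TRACK_SIZE]);
}
```

A caller would typically record its own name, for example toy_buf_track(bp, __func__), so that the last few entries show which code touched the buffer most recently.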
Reviewed by: markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8366
not a bitfield. For the intended usage - being passed either MNT_WAIT,
or MNT_NOWAIT - this shouldn't introduce any changes in behaviour.
Reviewed by: jhb@
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D8373
which also use buffer cache.
The most important addition to the code is the handling of filesystems
where the block size is less than the machine page size, which might
require reading several buffers to validate a single page.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
On several Intel chipsets, diagnostic NMIs sent from BMC or NMIs
reporting hardware errors are broadcast to all CPUs.
When the kernel is configured to enter kdb on NMI, the outcome is
problematic, because each CPU tries to enter kdb. All CPUs are
executing NMI handlers, which set the latches disabling the nested NMI
delivery; this means that stop_cpus_hard(), used by kdb_enter() to
stop other cpus by broadcasting IPI_STOP_HARD NMI, cannot work. One
indication of this is the harmless but annoying diagnostic "timeout
stopping cpus".
A much more harmful behaviour is that, because all CPUs try to enter kdb,
if ddb is used as the debugger, all CPUs issue a prompt on the console and
race for the input, not to mention the simultaneous use of the ddb
shared state.
Try to fix this by introducing a pseudo-lock for simultaneous attempts
to handle NMIs. If one core happens to enter the NMI trap handler, other
cores see it and simulate reception of the IPI_STOP_HARD. Moreover,
generic_stop_cpus() avoids sending IPI_STOP_HARD and avoids waiting
for the acknowledgement, relying on the NMI handler on other cores
suspending and then restarting the CPU.
Since it is impossible to detect at runtime whether some stray NMI is
broadcast or unicast, add a knob for the administrator (really developer)
to configure the debugging NMI handling mode.
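The pseudo-lock idea can be sketched with C11 atomics. This is not the code from the patch: toy_nmi_owner, toy_nmi_enter() and toy_nmi_exit() are hypothetical names, and the sketch only shows how one CPU wins while the others learn that they should act as if they had been stopped.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_NO_OWNER (-1)

/* Hypothetical pseudo-lock: id of the CPU handling the NMI, or -1. */
static atomic_int toy_nmi_owner = TOY_NO_OWNER;

/*
 * Called on NMI entry by each CPU.  Returns true if this CPU won and
 * should enter the debugger; false means it should behave as if it had
 * received the stop request and park itself.
 */
static bool
toy_nmi_enter(int cpu)
{
	int expected = TOY_NO_OWNER;

	assert(cpu >= 0);
	return (atomic_compare_exchange_strong(&toy_nmi_owner, &expected,
	    cpu));
}

/* Called by the winning CPU when it leaves the debugger. */
static void
toy_nmi_exit(int cpu)
{
	assert(atomic_load(&toy_nmi_owner) == cpu);
	atomic_store(&toy_nmi_owner, TOY_NO_OWNER);
}
```

Because every CPU already runs its NMI handler in the broadcast case, the winner has no need to send a further stop request, which is the situation the commit message describes.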
The updated patch was debugged with the help from Andrey Gapon (avg)
and discussed with him.
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D8249
native fueword64(9) still, use proper type for local where fuword64()
result is stored.
Note that fueword64() is unused in the tree.
Submitted by: Chunhui He <hchunhui@mail.ustc.edu.cn>
PR: 212520
MFC after: 1 week
In sendit(), if mp->msg_control is present, sockargs() allocates an
mbuf to store mp->msg_control. Later, in kern_sendit(), the call to
getsock_cap() checks the validity of the file pointer passed; if this
fails, EBADF is returned but the mbuf allocated in sockargs() is not freed.
Change the code to free it.
Since freeing the control mbuf in sendit() after checking (control != NULL)
may lead to a double free of the control mbuf in sendit(), free the
control mbuf in kern_sendit() if there are any errors in that routine.
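The ownership rule chosen here (the inner routine frees the control buffer on its own error paths, so the outer routine never frees it again) can be sketched in plain C. The toy_* names and the allocation counter are hypothetical and exist only for the demonstration; malloc() stands in for mbuf allocation.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

static int toy_live_bufs;	/* allocation counter for the demonstration */

static void *
toy_alloc_control(void)
{
	void *p = malloc(16);

	if (p != NULL)
		toy_live_bufs++;
	return (p);
}

static void
toy_free_control(void *p)
{
	assert(p != NULL);
	free(p);
	toy_live_bufs--;
}

/*
 * Stand-in for the inner routine: it takes ownership of `control` and
 * frees it on every error path, so the caller must not free it again.
 */
static int
toy_kern_send(int fd_valid, void *control)
{
	if (!fd_valid) {
		if (control != NULL)
			toy_free_control(control);
		return (EBADF);
	}
	/* Success: the (pretend) protocol layer consumes the buffer. */
	if (control != NULL)
		toy_free_control(control);
	return (0);
}

/* Stand-in for the outer routine: allocates, then hands off ownership. */
static int
toy_send(int fd_valid)
{
	void *control = toy_alloc_control();

	if (control == NULL)
		return (ENOMEM);
	return (toy_kern_send(fd_valid, control));
}
```

Keeping the free in exactly one place avoids both the leak on EBADF and the double free that a second free in the caller would risk.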
Submitted by: Lohith Bellad <lohith.bellad@me.com>
Reviewed by: glebius
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D8152
If no negative entry is found on the last list, the ncp pointer will be
left uninitialized and a non-null value will make the function assume an
entry was found.
Fix the problem by initializing to NULL on entry.
Reported by: glebius
This splits the ncneg_mtx lock while preserving the hit ratio at least
during buildworld.
Create N dedicated lists for new negative entries.
Entries with at least one hit get promoted to the hot list, where they
get requeued every M hits.
Shrinking demotes one hot entry and performs a round-robin shrinking of
regular lists.
Reviewed by: kib
ref: 535865d02c
Fix cpu assignment by assuring stride is non-zero, assert that all tasks
have a valid taskqueue.
ref: db39817623
Start cpu assignment from zero.
ref: d99d39b6b6
Submitted by: mmacy@nextbsd.org
In r10905 and r10906 makesyscalls was modified to avoid emitting a
literal $Id$ string in the generated file, with:
gsub("[$]Id: ", "", $0)
gsub(" [$]", "", $0)
Then r11294 added some functionality and also tried to address the $Id$
problem in a different way, by removing every $:
sed -e 's/\$//g ...
This rendered the gsub ineffective. The gsub was later updated to
track the $Id$ -> $FreeBSD$ switch, even though it did not do anything.
Revert the addition of the s/\$//g, and update the gsub to keep the
resulting format the same.
Discussed with: bde
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
When detaching device trees, parent devices must be detached prior to
detaching their children. This is because parent devices can have
pointers to the child devices in their softcs which are not
invalidated by device_delete_child(). This can cause use-after-free
issues and panic().
Device drivers implementing trees must ensure that their detach function
detaches or deletes all of its children before returning.
While at it, remove the now-redundant device_detach() calls before
device_delete_child() and device_delete_children(), mostly in
the USB controller drivers.
Tested by: Jan Henrik Sylvester <me@janh.de>
Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D8070
MFC after: 2 weeks
Suppose that we have an exclusively busy page, and a thread which can
accept a shared-busy page. In this case, typical code waiting for the
page xbusy state to pass is
again:
	VM_OBJECT_WLOCK(object);
	...
	if (vm_page_xbusied(m)) {
		vm_page_lock(m);
		VM_OBJECT_WUNLOCK(object);	<---1
		vm_page_busy_sleep(m, "vmopax");
		goto again;
	}
Suppose that the xbusy state owner locked the object, unbusied the
page and unlocked the object after we are at the line [1], but before we
executed the load of the busy_lock word in vm_page_busy_sleep(). If it
happens that there are still no waiters recorded for the busy state,
the xbusy owner did not acquire the page lock, so it proceeded.
Moreover, suppose that some other thread happens to share-busy the page
after the xbusy state was relinquished but before m->busy_lock is read
in vm_page_busy_sleep(). Again, that thread only needs the vm_object lock
to proceed. Then, vm_page_busy_sleep() reads a busy_lock value equal to
VPB_SHARERS_WORD(1).
In this case, all tests in vm_page_busy_sleep(9) pass and we are going
to sleep, despite the page being share-busied.
Update check for m->busy_lock == VPB_UNBUSIED in vm_page_busy_sleep(9)
to also accept shared-busy state if we only wait for the xbusy state to
pass.
Merge sequential if()s with the same 'then' clause in
vm_page_busy_sleep().
Note that the current code does not share-busy pages from parallel
threads; the only way to have more than one sbusy owner right now
is to recurse.
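The updated early-return check can be sketched as a pure predicate. The bit layout below is an assumption modelled on the busy_lock encoding and may not match the header exactly; the TOY_* macros and toy_busy_sleep_should_return() are hypothetical, and the waiters-bit handling of the real function is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed encoding for illustration only; not copied from the kernel. */
#define TOY_VPB_BIT_SHARED	0x01u
#define TOY_VPB_BIT_EXCLUSIVE	0x02u
#define TOY_VPB_BIT_WAITERS	0x04u
#define TOY_VPB_SHARERS_SHIFT	3
#define TOY_VPB_SHARERS_WORD(n)	\
	(((n) << TOY_VPB_SHARERS_SHIFT) | TOY_VPB_BIT_SHARED)
#define TOY_VPB_UNBUSIED	TOY_VPB_SHARERS_WORD(0u)

/*
 * Decide whether a would-be sleeper should return immediately for the
 * busy_lock value x.  wait_xbusy_only is true when the caller can accept
 * a shared-busy page and only waits for the xbusy state to pass.
 */
static bool
toy_busy_sleep_should_return(unsigned int x, bool wait_xbusy_only)
{
	/* In this encoding a word is never both shared and exclusive. */
	assert((x & TOY_VPB_BIT_SHARED) == 0 ||
	    (x & TOY_VPB_BIT_EXCLUSIVE) == 0);
	/* One merged condition: both cases share the same 'then' clause. */
	return (x == TOY_VPB_UNBUSIED ||
	    (wait_xbusy_only && (x & TOY_VPB_BIT_SHARED) != 0));
}
```

With this predicate, a value equal to TOY_VPB_SHARERS_WORD(1) no longer sends an xbusy-only waiter to sleep, which is the case the race analysis above ends in.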
Reported and tested by: pho (previous version)
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D8196
The runtime kernel loader, linker_load_file, unloads kernel files that
failed to load all of their modules. For consistency, treat preloaded
(loader.conf loaded) kernel files in the same way.
Reviewed by: kib
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8200
Use the isrc in the attached MSI data structure instead of using the map's
isrc directly. The map's isrc is set to NULL on IRQ deactivation,
which happens prior to pci_release_msi, so MSI_RELEASE_MSI
receives an array of NULLs.
Reviewed by: mmel
Differential Revision: https://reviews.freebsd.org/D8206
Other uses of cache_purgevfs() do rely on the cache purge for correct
operations, when paths are invalidated without unmount.
Reported and tested by: jkim
Discussed with: mjg
Sponsored by: The FreeBSD Foundation
purgevfs is purely optional and induces lock contention in workloads
which frequently mount and unmount filesystems.
In particular, poudriere will do this for filesystems with 4 vnodes or
fewer. A full cache scan is clearly wasteful.
Since there is no explicit counter for namecache entries, the number of
vnodes used by the target fs is checked.
The default limit is the number of bucket locks.
Reviewed by: kib
Previously free vnodes would always be directly returned to the global
LRU list. With this change up to mnt_free_list_batch vnodes are collected
first.
syncer runs always return the batch regardless of its size.
While vnodes on per-mnt lists are not counted as free, they can be
returned in case of vnode shortage.
Reviewed by: kib
Tested by: pho
function from restarting the timer.
Commonly taskqueue_enqueue_timeout() is called from within the task
function itself without any checks for teardown. The timer can then
stay active after taskqueue_drain_timeout() returns,
because the timeout and the task are drained separately.
This patch factors out the teardown flag into the timeout task itself,
allowing existing code to stay as-is instead of applying a teardown
flag to each and every timeout task consumer.
Add an assert to taskqueue_drain_timeout() which prevents parallel
execution on the same timeout task.
Update the manual page to document the return value of
taskqueue_enqueue_timeout().
Differential Revision: https://reviews.freebsd.org/D8012
Reviewed by: kib, trasz
MFC after: 1 week
mbuf to store mp->msg_control. Later, in kern_sendit(), the call to
getsock_cap() checks the validity of the file pointer passed; if this fails,
EBADF is returned but the mbuf allocated in sockargs() is not freed. Fix this
possible leak.
Submitted by: Lohith Bellad <lohith.bellad@me.com>
Reviewed by: adrian
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D7910
If the kernel is not compiled with the CAPABILITIES kernel option,
fget_unlocked doesn't return the sequence number, so fd_modify will
always report modification; in that case we get an infinite loop.
Reported by: br
Reviewed by: mjg
Tested by: br, def
fget_cap_locked returns a referenced file, but the fgetvp_rights does
not need it. Instead, due to the filedesc lock being held, it can
ref the vnode after the file was looked up.
Fix up fget_cap_locked to be consistent with other _locked helpers and not
ref the file.
This plugs a leak introduced in r306184.
Pointy hat to: mjg, oshogbo
Add a table of vnode locks and use them along with bucketlocks to provide
concurrent modification support. The approach taken is to preserve the
current behaviour of the namecache and just lock all relevant parts before
any changes are made.
Lookups still require the relevant bucket to be locked.
Discussed with: kib
Tested by: pho
buffer and put a small optimization for low socket buffer case:
- Do not hack uio_resid, and let m_uiotombuf() properly take care of it. This
fixes truncation of headers at low buffer.
- If headers ate all the space, jump right to the end of the cycle, to
avoid doing single page I/O and allocating zero length mbuf.
- Clear hdr_uio only if space is positive, which indicates that all uio
was copied in.
Reviewed by: pluknet, jtl, emax, rrs, lstewart, emax, gallatin, scottl
sooptcopyin() checks that the size of the data provided by the user is no
larger than what we can accept; otherwise it trims the size. On big-endian
platforms we have to move the pointer as well so that we copy the actual data.
Reviewed by: gnn
Sponsored by: DARPA, AFRL
Sponsored by: HEIF5
Differential Revision: https://reviews.freebsd.org/D7980
The descriptor returned by accept(2) should inherit capability rights from
the listening socket.
PR: 201052
Reviewed by: emaste, jonathan
Discussed with: many
Differential Revision: https://reviews.freebsd.org/D7724
This causes dtrace to automatically copyin arguments from userland, so
one no longer has to explicitly use the copyin() action to do so. Moreover,
copyin() on userland addresses is a no-op, so existing scripts should be
unaffected by this change.
Discussed with: rstone
MFC after: 2 weeks
CLOCK_GETTIME() with the lock.
Now all time-related accesses to the CMOS for the RTC should be under the
lock. This is needed to allow the upcoming EFI Runtime Services support
to provide the required execution environment for the firmware calls.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Both can be used to cause processes in capability mode to receive
SIGTRAP when ENOTCAPABLE or ECAPMODE errors are returned from
syscalls.
Idea by: emaste
Reviewed by: oshogbo (previous version), emaste
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D7965
In particular, reset the DF_QUIET flag when detaching from a device so
that a driver that marks a device quiet doesn't dictate policy for a
different driver that may claim the device in the future.
Reviewed by: rpokala, wblock
MFC after: 2 weeks
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D7803
An array of bucket locks is added.
All modifications still require the global cache_lock to be held for
writing. However, most readers only need the relevant bucket lock and in
effect can run concurrently to the writer as long as they use a
different lock. See the added comment for more details.
This is an intermediate step towards removal of the global lock.
Reviewed by: kib
Tested by: pho
If wait4() or wait6() return 0 because of WNOHANG, the status, rusage and
wrusage information should not be returned.
PR: 212048
Reported by: Casey Lucas
MFC after: 2 weeks
Use C99 designators to set the value of each slot and the nitems macro to
check for valid entries. In the process, switch to indexing by signal
number rather than signal-1 for improved clarity.
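The table style described above looks roughly like this in user-space C. The property flags and the names toy_sigproptbl and toy_sigprop() are invented for the illustration; only the C99 designated initializers, the nitems bound check and indexing by signal number reflect the commit message.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>

#ifndef nitems
#define nitems(x)	(sizeof((x)) / sizeof((x)[0]))
#endif

/* Hypothetical property flags for the illustration. */
#define TOY_SIGPROP_KILL	0x01
#define TOY_SIGPROP_IGNORE	0x02
#define TOY_SIGPROP_STOP	0x04

/* Indexed directly by signal number; slots not named default to 0. */
static const int toy_sigproptbl[] = {
	[SIGHUP]  = TOY_SIGPROP_KILL,
	[SIGINT]  = TOY_SIGPROP_KILL,
	[SIGKILL] = TOY_SIGPROP_KILL,
	[SIGSTOP] = TOY_SIGPROP_STOP,
	[SIGCHLD] = TOY_SIGPROP_IGNORE,
};

/* The array extends to the largest designated index. */
static_assert(nitems(toy_sigproptbl) > SIGKILL, "table covers SIGKILL");

/* Look up a signal's properties; out-of-range numbers yield 0. */
static int
toy_sigprop(int sig)
{
	if (sig > 0 && (size_t)sig < nitems(toy_sigproptbl))
		return (toy_sigproptbl[sig]);
	return (0);
}
```

Indexing by the signal number itself removes the off-by-one bookkeeping of a signal-1 scheme, at the cost of one unused slot at index 0.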
Obtained from: CheriBSD (a6053c5abf)
Sponsored by: DARPA, AFRL
Reviewed by: kib
Since negative entries are managed with a LRU list, a hit requires a
modification.
Currently the code tries to upgrade the global lock if needed and is
forced to retry the lookup if it fails.
Provide a dedicated lock for use when the cache is only shared-locked.
Reviewed by: kib
MFC after: 1 week
getdtablesize is "trivial global state" and is similar to
getrlimit(RLIMIT_NOFILE), so should be permitted in capability mode.
Reviewed by: oshogbo
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D7719
Calling cap_rights_contains() several times with the same inputs is not
going to produce a different output. The variable being iterated, i, is
never used inside the for loop.
The loop is actually done inside cap_rights_contains().
Submitted by: Ryan Moeller <ryan@freqlabs.com>
Reviewed by: oshogbo, ed
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D7369
Add a new 'clear driver' command for devctl along with the accompanying
ioctl and devctl_clear_driver() library routine to reset a device to
use a wildcard devclass instead of a fixed devclass. This can be used
to undo a previous 'set driver' command. After the device's name has
been reset to permit wildcard names, it is reprobed so that it can
attach to newly-available (to it) device drivers.
MFC after: 1 month
Sponsored by: Chelsio Communications