Replace archaic "busses" with modern form "buses."
Intentionally excluded:
* Old/random drivers I didn't recognize
* Old hardware in general
* Use of "busses" in code as identifiers
No functional change.
http://grammarist.com/spelling/buses-busses/
PR: 216099
Reported by: bltsrc at mail.ru
Sponsored by: Dell EMC Isilon
between exp(3) and `exp` var.
The approach taken previously was not ideal for multiple
functional and stylistic reasons.
Add to existing sed call in Makefile to replace `exp` with
`exponent` instead.
MFC after: 13 days
Requested by: bde
sys/net/iflib.c:
Add ctx to filter_info and don't skpi interrupt early on unless we're on an
SMP system
sys/kern/subr_gtaskqueue.c:
Skip smp check if we're running UP
Submitted by: Matt Macy <mmacy@nextbsd.org>
Reported by: emaste bde
This is needed for kernel dumps to work, as the panicking thread will call
into code that makes use of kernel locks.
Reported and tested by: Eugene Grosbein
MFC after: 1 week
This helps fix a -Wshadow issue with exp(3) with tests/sys/acct/acct_test,
which include math.h, which in turn defines exp(3)
MFC after: 2 weeks
Tested with: clang, gcc 4.2.1, gcc 4.9
Sponsored by: Dell EMC Isilon
dropped then reacquired due to using M_WAITOK, which opens a window in
which the tty device can disappear. Check for this and return ENXIO
back up the call chain so that callers can cope.
This closes a race where TF_GONE would get set while buffers were being
allocated as part of ttydev_open(), causing a subsequent call to
ttydevsw_modem() later in ttydev_open() to assert.
Reported by: pho
Reviewed by: kib
after tty_timedwait() returns an error only if the error is EWOULDBLOCK;
other errors cause an immediate return. This fixes the case of the tty
disappearing while in tty_drain().
Reported by: pho
m_pulldown()
m_pulldown() only needs to determine if a mbuf is writable if it is going to
copy data into the data region of an existing mbuf. It does this to create a
contiguous data region in a single mbuf from multiple mbufs in the chain. If
the requested memory region is already contiguous and nothing needs to
change, the mbuf does not need to be writeable.
Submitted by: Brian Mueller <bmueller@panasas.com>
Reviewed by: bz
MFC after: 1 week
Sponsored by: Panasas
Differential Revision: https://reviews.freebsd.org/D9053
drain timeout handling to historical freebsd behavior.
The primary reason for these changes is the need to have tty_drain() call
ttydevsw_busy() at some reasonable sub-second rate, to poll hardware that
doesn't signal an interrupt when the transmit shift register becomes empty
(which includes virtually all USB serial hardware). Such hardware hangs
in a ttyout wait, because it never gets an opportunity to trigger a wakeup
from the sleep in tty_drain() by calling ttydisc_getc() again, after
handing the last of the buffered data to the hardware.
While researching the history of changes to tty_drain() I stumbled across
some email describing the historical BSD behavior of tcdrain() and close()
on serial ports, and the ability of comcontrol(1) to control timeout
behavior. Using that and some advice from Bruce Evans as a guide, I've
put together these changes to implement the hardware polling and restore
the historical timeout behaviors...
- tty_drain() now calls ttydevsw_busy() in a loop at 10 Hz to accomodate
hardware that requires polling for busy state.
- The "new historical" behavior for draining during close(2) is retained:
the drain timeout is "1 second without making any progress". When the
1-second timeout expires, if the count of bytes remaining in the tty
layer buffer is smaller than last time, the timeout is extended for
another second. Unfortunately, the same logic cannot be extended all
the way down to the hardware, because the interface to that layer is a
simple busy/not-busy indication.
- Due to the previous point, an application that needs a guarantee that
all data has been transmitted must use TIOCDRAIN/tcdrain(3) before
calling close(2).
- The historical behavior of honoring the drainwait setting for TIOCDRAIN
(used by tcdrain(3)) is restored.
- The historical kern.drainwait sysctl to control the global default
drainwait time is restored, but is now named kern.tty_drainwait.
- The historical default drainwait timeout of 300 seconds is restored.
- Handling of TIOCGDRAINWAIT and TIOCSDRAINWAIT ioctls is restored
(this also makes the comcontrol(1) drainwait verb work again).
- Manpages are updated to document these behaviors.
Reviewed by: bde (prior version)
biowait() will otherwise race with completions of such BIOs. In-tree code
only calls biowait() on BIOs that do not specify a handler, so this change
should not have any functional impact.
Reviewed by: mav
MFC after: 1 month
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D9070
Add a MSG_MOREOTOCOME message flag. When this flag is set, sosend*
set PRUS_MOREOTOCOME when invoking the protocol send method. The aio
worker tasks for sending on a socket set this flag when there are
additional write jobs waiting on the socket buffer.
Reviewed by: adrian
MFC after: 1 month
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D8955
sys/ptrace.h includes sys/signal.h, which includes sys/_sigset.h.
Note that sys/_sigset.h only defines osigset_t if COMPAT_43 was defined.
Two lines later, sys/ptrace.h includes machine/reg.h, which in case of
powerpc, includes opt_compat.h.
After the include headers reordering in r311345, we have sys/ptrace.h
included before sys/sysproto.h.
If COMPAT_43 was requested in the kernel config, the result is that
sys/_sigset.h does not define osigset_t, but sys/sysproto.h sees
COMPAT_43 and uses osigset_t.
Fix this by explicitely including opt_compat.h to cover the whole
kern/kern_exec.c scope.
Sponsored by: The FreeBSD Foundation
Right now size of the structure is 472 bytes on amd64, which is
already large and stack allocations are indesirable. With the ino64
work, MNAMELEN is increased to 1024, which will make it impossible to have
struct statfs on the stack.
Extracted from: ino64 work by gleb
Discussed with: mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Upon each execve, we allocate a KVA range for use in copying data to the
new image. Pages must be faulted into the range, and when the range is
freed, the backing pages are freed and their mappings are destroyed. This
is a lot of needless overhead, and the exec_map management becomes a
bottleneck when many CPUs are executing execve concurrently. Moreover, the
number of available ranges is fixed at 16, which is insufficient on large
systems and potentially excessive on 32-bit systems.
The new allocator reduces overhead by making exec_map allocations
persistent. When a range is freed, pages backing the range are marked clean
and made easy to reclaim. With this change, the exec_map is sized based on
the number of CPUs.
Reviewed by: kib
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D8921
returns memory which must be freed, regardless of the error. Assign
NULL to *buf in case we are not going to allocate any memory due to
invalid mode.
Reported and tested by: pho
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks (together with r310638)
Differential revision: https://reviews.freebsd.org/D9042
a symlink and an autofs mount request. The crash was caused by namei()
calling bcopy() with a negative length, caused by numeric underflow:
in lookup(), in the relookup path, the ni_pathlen was decremented too
many times. The bug was introduced in r296715.
Big thanks to Alex Deiter for his help with debugging this.
Reviewed by: kib@
Tested by: Alex Deiter <alex.deiter at gmail.com>
MFC after: 1 month
Instead of spuriously re-reading the lock value, read it once.
This change also has a side effect of fixing a performance bug:
on failed _mtx_obtain_lock, it was possible that re-read would find
the lock is unowned, but in this case the primitive would make a trip
through turnstile code.
This is diff reduction to a variant which uses atomic_fcmpset.
Discussed with: jhb (previous version)
Tested by: pho (previous version)
and prison enforcement. Do it on the caller buffer directly.
Besides eliminating memory copies, this change also removes large
structure from the kernel stack.
Extracted from: ino64 work by gleb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
- iflib - add checksum in place support (mmacy)
- iflib - initialize IP for TSO (going to be needed for e1000) (mmacy)
- iflib - move isc_txrx from shared context to softc context (mmacy)
- iflib - Normalize checks in TXQ drainage. (shurd)
- iflib - Fix queue capping checks (mmacy)
- iflib - Fix invalid assert, em can need 2 sentinels (mmacy)
- iflib - let the driver determine what capabilities are set and what
tx csum flags are used (mmacy)
- add INVARIANTS debugging hooks to gtaskqueue enqueue (mmacy)
- update bnxt(4) to support the changes to iflib (shurd)
Some other various, sundry updates. Slightly more verbose changelog:
Submitted by: mmacy@nextbsd.org
Reviewed by: shurd
mFC after:
Sponsored by: LimeLight Networks and Dell EMC Isilon
All hash sizes are power-of-2, but the compiler does not know that for sure
and 'foo % size' forces doing a division.
Store the size - 1 and use 'foo & hash' instead which allows mere shift.
This argument is not a bitmask of flags, but only accepts a single value.
Fail with EINVAL if an invalid value is passed to 'flag'. Rename the
'flags' argument to getmntinfo(3) to 'mode' as well to match.
This is a followup to r308088.
Reviewed by: kib
MFC after: 1 month
closed by r310302 for knote().
If KN_INFLUX | KN_SCAN flags are set for the note passed to knote() or
knote_fork(), i.e. the knote is scanned, we might erronously clear
INFLUX when finishing notification. For normal knote() it was fixed
in r310302 simply by remembering the fact that we do not own
KN_INFLUX, since there we own knlist lock and scan thread cannot clear
KN_INFLUX until we drop the lock. For knote_fork(), the situation is
more complicated, e must drop knlist lock AKA the process lock, since
we need to register new knotes.
Change KN_INFLUX into counter and allow shared ownership of the
in-flux state between scan and knote_fork() or knote(). Both in-flux
setters need to ensure that knote is not dropped in parallel. Added
assert about kn_influx == 1 in knote_drop() verifies that in-flux state
is not shared when knote is destroyed.
Since KBI of the struct knote is changed by addition of the int
kn_influx field, reorder kn_hook and kn_hookid to fill pad on LP64
arches [1]. This keeps sizeof(struct knote) to same 128 bytes as it
was before addition of kn_influx, on amd64.
Reviewed by: markj
Suggested by: markj [1]
Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D8898
accepting the wrong state and printing warning. Do not obliterate
kl_lock and kl_unlock pointers, they are often useful for post-mortem
analysis.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
X-Differential revision: https://reviews.freebsd.org/D8898
There is no need to do two allocations per kqueue timer. Gather all
data needed by the timer callout into the structure and allocate it at
once.
Use the structure to preserve the result of timer2sbintime(), to not
perform repeated 64bit calculations in callout.
Remove tautological casts.
Remove now unused p_nexttime [1].
Noted by: markj [1]
Reviewed by: markj (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
X-MFC note: do not remove p_nexttime
Differential revision: https://reviews.freebsd.org/D8901
The removal of TAILQ_FOREACH_SAFE introduced a small race: when the last
thread on a sleepqueue is awoken, it reclaims the sleepqueue and may begin
executing on a different CPU before sleepq_resume_thread() returns. This
leaves a window during which it may go back to sleep and incorrectly be
awoken again by the caller of sleepq_broadcast().
Reported and tested by: pho
MFC after: 3 days
Sponsored by: Dell EMC Isilon
pause() uses a spin loop to simulate a sleep during early boot. However,
we only need this for thread0 to get far enough in the boot process to
enable timers (at which point pause() can sleep). For other kthreads,
sleeping in pause() is ok as the callout will be scheduled and will
eventually fire once thread0 initializes timers.
Tested by: Steven Kargl
Sleuthing by: markj
MFC after: 1 week
Sponsored by: Netflix
For notes in KN_INFLUX|KN_SCAN state, the influx bit is set by a
parallel scan. When knote() reports event for the vnode filters,
which require kqueue unlocked, it unconditionally sets and then clears
influx to keep note around kqueue unlock. There, do not clear influx
flag if a scan set it, since we do not own it, instead we prevent scan
from executing by holding knlist lock.
The knote_fork() function has somewhat similar problem, it might set
KN_INFLUX for scanned note, drop kqueue and list locks, and then clear
the flag after relock. A solution there would be different enough, as
well as the test program, so close the reported issue first.
Reported and test case provided by: yjh0502@gmail.com
PR: 214923
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Apparently stdatomic.h implementation for gcc 4.2 on sparc64 does not
work properly. This effectively reverts r251803.
Reported and tested by: lidl
Discussed with: ed
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
This way it becomes possible to graph a property for all instances of a
single driver. For example, graphing the number of packets across all
USB controllers, the amount of dropped packets on all NICs, etc.
Reviewed by: cem
Differential Revision: https://reviews.freebsd.org/D8775
Sysctls like kern.eventtimer.et.*.quality currently embed the name of
the clock device. This is problematic for the Prometheus metrics
exporter for two reasons:
- Some of those clocks have dashes in their names, which Prometheus
doesn't allow to be used in metric names.
- It doesn't allow for extracting the same property of all clocks on the
system from within a single query.
Attach these nodes to have a label, so that the Prometheus metrics
exporter gives these metric a uniform name with the name of the clock
attached as a label.
Reviewed by: cem
Differential Revision: https://reviews.freebsd.org/D8775
I'm currently working on writing a metrics exporter for the Prometheus
monitoring system to provide access to sysctl metrics. Prometheus and
sysctl have some structural differences:
- sysctl is a tree of string component names.
- Prometheus uses a flat namespace for its metrics, but allows you to
attach labels with values to them, so that you can do aggregation.
An initial version of my exporter simply translated
hw.acpi.thermal.tz1.temperature
to
sysctl_hw_acpi_thermal_tz1_temperature_celcius
while we should ideally have
sysctl_hw_acpi_thermal_temperature_celcius{thermal_zone="tz1"}
allowing you to graph all thermal zones on a system in one go.
The change presented in this commit adds support for accomplishing this,
by providing the ability to attach labels to nodes. In the example I
gave above, the label "thermal_zone" would be attached to "tz1". As this
is a feature that will only be used very rarely, I decided to not change
the KPI too aggressively.
Discussed on: hackers@
Reviewed by: cem
Differential Revision: https://reviews.freebsd.org/D8775
This allows blind increment of relevant counters which under contention
is cheaper than inc-not-zero loops at least on amd64.
Use it in some of the places which are guaranteed to see already active
vnodes.
Reviewed by: kib (previous version)
the reaper.
The traditional reaper init(8) is aware of zombies silently reparented
to it after the parents exit, it loops around waitpid(2) to collect
them. For other reapers, the silent reparenting is surprising and
collecting zombies requires a thread blocking in waitpid(2) just for
that purpose. It seems that sending second SIGCHLD is a better
workaround than forcing all reapers to obey the setup.
Reported by: Michael Zuo <muh.muhten@gmail.com>, jilles
PR: 213928
Reviewed by: jilles (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
truncation, immediately queue the page for asynchronous laundering rather
than making the page pass through inactive queue first.
Reviewed by: kib, markj
Changes include modifications in kernel crash dump routines, dumpon(8) and
savecore(8). A new tool called decryptcore(8) was added.
A new DIOCSKERNELDUMP I/O control was added to send a kernel crash dump
configuration in the diocskerneldump_arg structure to the kernel.
The old DIOCSKERNELDUMP I/O control was renamed to DIOCSKERNELDUMP_FREEBSD11 for
backward ABI compatibility.
dumpon(8) generates an one-time random symmetric key and encrypts it using
an RSA public key in capability mode. Currently only AES-256-CBC is supported
but EKCD was designed to implement support for other algorithms in the future.
The public key is chosen using the -k flag. The dumpon rc(8) script can do this
automatically during startup using the dumppubkey rc.conf(5) variable. Once the
keys are calculated dumpon sends them to the kernel via DIOCSKERNELDUMP I/O
control.
When the kernel receives the DIOCSKERNELDUMP I/O control it generates a random
IV and sets up the key schedule for the specified algorithm. Each time the
kernel tries to write a crash dump to the dump device, the IV is replaced by
a SHA-256 hash of the previous value. This is intended to make a possible
differential cryptanalysis harder since it is possible to write multiple crash
dumps without reboot by repeating the following commands:
# sysctl debug.kdb.enter=1
db> call doadump(0)
db> continue
# savecore
A kernel dump key consists of an algorithm identifier, an IV and an encrypted
symmetric key. The kernel dump key size is included in a kernel dump header.
The size is an unsigned 32-bit integer and it is aligned to a block size.
The header structure has 512 bytes to match the block size so it was required to
make a panic string 4 bytes shorter to add a new field to the header structure.
If the kernel dump key size in the header is nonzero it is assumed that the
kernel dump key is placed after the first header on the dump device and the core
dump is encrypted.
Separate functions were implemented to write the kernel dump header and the
kernel dump key as they need to be unencrypted. The dump_write function encrypts
data if the kernel was compiled with the EKCD option. Encrypted kernel textdumps
are not supported due to the way they are constructed which makes it impossible
to use the CBC mode for encryption. It should be also noted that textdumps don't
contain sensitive data by design as a user decides what information should be
dumped.
savecore(8) writes the kernel dump key to a key.# file if its size in the header
is nonzero. # is the number of the current core dump.
decryptcore(8) decrypts the core dump using a private RSA key and the kernel
dump key. This is performed by a child process in capability mode.
If the decryption was not successful the parent process removes a partially
decrypted core dump.
Description on how to encrypt crash dumps was added to the decryptcore(8),
dumpon(8), rc.conf(5) and savecore(8) manual pages.
EKCD was tested on amd64 using bhyve and i386, mipsel and sparc64 using QEMU.
The feature still has to be tested on arm and arm64 as it wasn't possible to run
FreeBSD due to the problems with QEMU emulation and lack of hardware.
Designed by: def, pjd
Reviewed by: cem, oshogbo, pjd
Partial review: delphij, emaste, jhb, kib
Approved by: pjd (mentor)
Differential Revision: https://reviews.freebsd.org/D4712
kinfo_proc::ki_tdname is three characters shorter than
thread::td_name. Add a ki_moretdname field for these three
extra characters. Add the new field to kinfo_proc32, as well.
Update all in-tree consumers to read the new field and assemble
the full name, except for lldb's HostThreadFreeBSD.cpp, which
I will handle separately. Bump __FreeBSD_version.
Reviewed by: kib
MFC after: 1 week
Relnotes: yes
Sponsored by: Dell EMC
Differential Revision: https://reviews.freebsd.org/D8722
wait(2).
- Do not acquire the process spinlock if neither WTRAPPED nor WUNTRACED
options were passed [1].
- Extract the code to report alive process into a new helper
report_alive_proc() and use it for trapped, stopped and continued
childrens.
Note that the process spinlock is required around the WTRAPPED and
WUNTRACED tests, because P_STOPPED_TRACE and P_STOPPED_SIG flags are
set before other threads are stopped at the suspension point, and that
threads increment p_suspcount while owning only the process spinlock,
the process lock is dropped by them. If the spinlock is not taken for
tests, the syscall thread might miss both p_suspcount increment and
wakeup in wakeup in thread_suspend_switch().
Based on the submission by: mjg [1]
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Instead of failing with ENAMETOOLONG, which is swallowed by
pthread_set_name_np() anyway, truncate the given name to MAXCOMLEN+1
bytes. This is more likely what the user wants, and saves the
caller from truncating it before the call (which was the only
recourse).
Polish pthread_set_name_np(3) and add a .Xr to thr_set_name(2)
so the user might find the documentation for this behavior.
Reviewed by: jilles
MFC after: 3 days
Sponsored by: Dell EMC
Since the vnode is only expected to be shared locked, we can save a
little overhead by only pretending we are locking in the first place.
Reviewed by: kib
Tested by: pho
As of r234483, vnode deactivation causes non-VPO_NOSYNC pages to be
laundered. This behaviour has two problems:
1. Dirty VPO_NOSYNC pages must be laundered before the vnode can be
reclaimed, and this work may be unfairly deferred to the vnlru process
or an unrelated application when the system is under vnode pressure.
2. Deactivation of a vnode with dirty VPO_NOSYNC pages requires a scan of
the corresponding VM object's memq for non-VPO_NOSYNC dirty pages; if
the laundry thread needs to launder pages from an unreferenced such
vnode, it will reactivate and deactivate the vnode with each laundering,
potentially resulting in a large number of expensive scans.
Therefore, ensure that all dirty pages are laundered upon deactivation,
i.e., when all maps of the vnode are removed and all references are
released.
Reviewed by: alc, kib
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D8641
The callout subsystem already handles early callouts and schedules
the first clock interrupt appropriately based on the currently pending
callouts. The one nit to fix was that callouts scheduled via C_HARDCLOCK
during early boot could fire too early once timers were enabled as the
per-CPU base time is always zero until timers are initialized. The change
in callout_when() handles this case by using the current uptime as the
base time of the callout during bootup if the per-CPU base time is zero.
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: Netflix
The size can be changed by side effect of modifying kern.maxvnodes.
Since numbucketlocks was not modified, setting a sufficiently low value
would give more locks than actual buckets, which would then lead to
corruption.
Force the number of buckets to be not smaller.
Note this should not matter for real world cases.
Reported and tested by: pho
The swap pager enqueues laundered pages near the head of the inactive queue
to avoid another trip through LRU before reclamation. This change adds
support for this behaviour to the vnode pager and makes use of it in UFS and
ext2fs. Some ioflag handling is consolidated into a common subroutine so
that this support can be easily extended to other filesystems which make use
of the buffer cache. No changes are needed for ZFS since its putpages
routine always undirties the pages before returning, and the laundry
thread requeues the pages appropriately in this case.
Reviewed by: alc, kib
Differential Revision: https://reviews.freebsd.org/D8589
This avoids the time-warp after kthreads have started running and the
required fixup to td_slptick and td_blktick in the EARLY_AP_STARTUP
case. Now, 'ticks' is initialized before any kthreads are created or
any context switches are performed.
Tested by: gavin
MFC after: 2 weeks
Sponsored by: Netflix
always audit the file-descriptor number and vnode information for all
fnctl(2) commands, not just locking-related ones. This was likely an
oversight in the original adaptation of this code from XNU.
MFC after: 3 days
Sponsored by: DARPA, AFRL
do any speculations about readahead, and use exactly the amount of readahead
specified by user. E.g. setting SF_FLAGS(0, SF_USER_READAHEAD) will guarantee
that no readahead at all will be performed.
sendfile_swapin() loop works this way:
- Find first invalid page in the request.
- Do vm_pager_has_page() and get count of pages, that can be taken in
single I/O.
- Trim valid pages from the end of the request.
- Cycle through the request and substitute to bogus_page all valid
pages that are in the middle of the request.
- After I/O launched (pager copies array of pages into buf(9), it
is important to restore proper page pointers with help vm_page_lookup().
Count bogus pages used and report them in sendfile stats.
not remove user-space visible fields from vm_cnt or all of the references to
cached pages from comments. Those changes will come later.)
Reviewed by: kib, markj
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8497
The negative list shrinker can demote an entry with only hotlist + neglist
locks held. On the other hand entry removal possibly sets the NCF_DVDROP
without aformentioned locks held prior to detaching it from the respective
netlist., which can lose the update made by the shrinker.
Reported and tested by: truckman
Although the higher end MIPS hardware handles cache aliasing issues in
hardware, the older cores (r4k, etc) and some compile versions of the
newer cores (mips24k, mips34k, mips74k) don't have this feature.
This means we end up with some very unfortunate behaviour that was
made very obvious by some recent changes to the FFS pager by kib.
So, flip this off until we get our MIPS pmap/cache code upgraded to
handle aliased pages in software.
Discussed with: kib, bsdimp, juli
The default (512) wastes quite a bit of space which doesn't really buy
us much on highly embedded systems which don't take a lot of locks in
parallel.
This makes it at least build time configurable so people can experiment.
Currently mount update keeps vfs_busy(9) reference on the mount point
during MNT_UPDATE VFS_MOUNT() vfsops call. This already provides the
exclusion, but is problematic for filesystems which need to perform
namei(9) during VFS_MOUNT(MNT_UPDATE) operations, e.g. to refresh
mnt_from path, because namei(9) must not be called while the
vfs_busy(9) reference is owned.
Check for MNT_UPDATE flag before setting MNTK_UNMOUNT, and for
MNTK_UNMOUNT before entering innards of vfs_domount_update(), failing
syscalls with EBUSY if conflict is detected. Keep vfs_busy(9)
reference around VFS_MOUNT(MNT_UPDATE) calls still to not change VFS
KPI.
In the update path in ffs_mount(), drop vfs_busy() reference around
namei(), which is now safe due to unmount never executing in parallel
with VFS_MOUNT(MNT_UPDATE), and which avoids the deadlock.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
- Send IPI wakeups once SMP is started even if cold is true.
- Permit preemptions when cold is true.
These changes are needed for EARLY_AP_STARTUP.
MFC after: 2 weeks
Sponsored by: Netflix
The other CPU might resume and see a still-empty runq and go back to
sleep before sched_add() adds the thread to the runq. This results
in a lost wakeup and a potential hang if the system is otherwise
completely idle.
The race originated due to a micro-optimization (my fault) in 4BSD in
that it avoided putting a thread on the run queue if the scheduler was
going to preempt to the new thread. To avoid complexity while fixing
this race, just drop this optimization. 4BSD now always sets the
"owepreempt" flag when a preemption is warranted and defers the actual
preemption to the thread_unlock of the caller the same as ULE.
MFC after: 2 weeks
Sponsored by: Netflix
Pass current thread credentials instead of NOCRED.
Only allow unmapped buffers for filesystem which proclaimed the support.
For all filesystems which currently use buffer pager (UFS, msdosfs and
cd9660), the changes are effectively nop.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
vp->v_mount->mnt_vnodecovered unlocked. This allowed unmount to race.
Lock vnode after we noticed the VV_ROOT flag. See comments for
explanation why unlocked check for the flag is considered safe.
Reported and tested by: avg
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
we never try to sleep while the thread is on a sleepqueue.
Reviewed by: jhb
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D8422
If dotdot lookup does not escape from the file descriptor passed as
the lookup root, we can allow the component traversal. Track the
directories traversed, and check the result of dotdot lookup against
the recorded list of the directory vnodes.
Dotdot lookups are enabled by sysctl vfs.lookup_cap_dotdot, currently
disabled by default until more verification of the approach is done.
Disallow non-local filesystems for dotdot, since remote server might
conspire with the local process to allow it to escape the namespace.
This might be too cautious, provide the knob
vfs.lookup_cap_dotdot_nonlocal to override as well.
Idea by: rwatson
Discussed with: emaste, jonathan, rwatson
Reviewed by: mjg (previous version)
Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 2 week
Differential revision: https://reviews.freebsd.org/D8110
process of being unmounted. Previously it would skip them, even if the
unmount eventually failed eg due to the filesystem being busy.
This behaviour broke autounmountd(8) - if you tried to manually unmount
a mounted filesystem, using 'automount -u', and the autounmountd attempted
to refresh the filesystem list in that very moment, it would conclude that
the filesystem got unmounted and not try to unmount it afterwards.
Reviewed by: kib@
Tested by: pho@
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D8030
Upstream the BUF_TRACKING and FULL_BUF_TRACKING buffer debugging code.
This can be handy in tracking down what code touched hung bios and bufs
last. The full history is especially useful, but adds enough bloat that
it shouldn't be enabled in release builds.
Function names (or arbitrary string constants) are tracked in a
fixed-size ring in bufs. Bios gain a pointer to the upper buf for
tracking. SCSI CCBs gain a pointer to the upper bio for tracking.
Reviewed by: markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8366
not a bitfield. For the intended usage - being passed either MNT_WAIT,
or MNT_NOWAIT - this shouldn't introduce any changes in behaviour.
Reviewed by: jhb@
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D8373
which also use buffer cache.
Most important addition to the code is the handling of filesystems
where the block size is less than the machine page size, which might
require reading several buffers to validate single page.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
On several Intel chipsets, diagnostic NMIs sent from BMC or NMIs
reporting hardware errors are broadcasted to all CPUs.
When kernel is configured to enter kdb on NMI, the outcome is
problematic, because each CPU tries to enter kdb. All CPUs are
executing NMI handlers, which set the latches disabling the nested NMI
delivery; this means that stop_cpus_hard(), used by kdb_enter() to
stop other cpus by broadcasting IPI_STOP_HARD NMI, cannot work. One
indication of this is the harmless but annoying diagnostic "timeout
stopping cpus".
Much more harming behaviour is that because all CPUs try to enter kdb,
and if ddb is used as debugger, all CPUs issue prompt on console and
race for the input, not to mention the simultaneous use of the ddb
shared state.
Try to fix this by introducing a pseudo-lock for simultaneous attempts
to handle NMIs. If one core happens to enter NMI trap handler, other
cores see it and simulate reception of the IPI_STOP_HARD. More,
generic_stop_cpus() avoids sending IPI_STOP_HARD and avoids waiting
for the acknowledgement, relying on the nmi handler on other cores
suspending and then restarting the CPU.
Since it is impossible to detect at runtime whether some stray NMI is
broadcast or unicast, add a knob for administrator (really developer)
to configure debugging NMI handling mode.
The updated patch was debugged with the help from Andrey Gapon (avg)
and discussed with him.
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D8249
native fueword64(9) still, use proper type for local where fuword64()
result is stored.
Note that fueword64() is unused in the tree.
Submitted by: Chunhui He <hchunhui@mail.ustc.edu.cn>
PR: 212520
MFC after: 1 week
In sendit(), if mp->msg_control is present, then in sockargs() we are
allocating mbuf to store mp->msg_control. Later in kern_sendit(), call
to getsock_cap(), will check validity of file pointer passed, if this
fails EBADF is returned but mbuf allocated in sockargs() is not freed.
Made code changes to free the same.
Since freeing control mbuf in sendit() after checking (control != NULL)
may lead to double freeing of control mbuf in sendit(), we can free
control mbuf in kern_sendit() if there are any errors in the routine.
Submitted by: Lohith Bellad <lohith.bellad@me.com>
Reviewed by: glebius
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D8152
If no negative entry is found on the last list, the ncp pointer will be
left uninitialized and a non-null value will make the function assume an
entry was found.
Fix the problem by initializing to NULL on entry.
Reported by: glebius
This splits the ncneg_mtx lock while preserving the hit ratio at least
during buildworld.
Create N dedicated lists for new negative entries.
Entries with at least one hit get promoted to the hot list, where they
get requeued every M hits.
Shrinking demotes one hot entry and performs a round-robin shrinking of
regular lists.
Reviewed by: kib
ref: 535865d02c
Fix cpu assignment by assuring stride is non-zero, assert that all tasks
have a valid taskqueue.
ref: db39817623
Start cpu assignment from zero.
ref: d99d39b6b6
Submitted by: mmacy@nextbsd.org
In r10905 and r10906 makesyscalls was modified to avoid emitting a
literal $Id$ string in the generated file, with:
gsub("[$]Id: ", "", $0)
gsub(" [$]", "", $0)
Then r11294 added some functionality and also tried to address the $Id$
problem in a different way, by removing every $:
sed -e 's/\$//g ...
This rendered the gsub infeffective. The gsub was later updated to
track the $Id$ -> $FreeBSD$ switch, even though it did not do anything.
Revert the addition of the s/\$//g, and update the gsub to keep the
resulting format the same.
Discussed with: bde
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
When detaching device trees parent devices must be detached prior to
detaching its children. This is because parent devices can have
pointers to the child devices in their softcs which are not
invalidated by device_delete_child(). This can cause use after free
issues and panic().
Device drivers implementing trees, must ensure its detach function
detaches or deletes all its children before returning.
While at it remove now redundant device_detach() calls before
device_delete_child() and device_delete_children(), mostly in
the USB controller drivers.
Tested by: Jan Henrik Sylvester <me@janh.de>
Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D8070
MFC after: 2 weeks
Suppose that we have an exclusively busy page, and a thread which can
accept shared-busy page. In this case, typical code waiting for the
page xbusy state to pass is
again:
VM_OBJECT_WLOCK(object);
...
if (vm_page_xbusied(m)) {
vm_page_lock(m);
VM_OBJECT_WUNLOCK(object); <---1
vm_page_busy_sleep(p, "vmopax");
goto again;
}
Suppose that the xbusy state owner locked the object, unbusied the
page and unlocked the object after we are at the line [1], but before we
executed the load of the busy_lock word in vm_page_busy_sleep(). If it
happens that there is still no waiters recorded for the busy state,
the xbusy owner did not acquired the page lock, so it proceeded.
More, suppose that some other thread happen to share-busy the page
after xbusy state was relinquished but before the m->busy_lock is read
in vm_page_busy_sleep(). Again, that thread only needs vm_object lock
to proceed. Then, vm_page_busy_sleep() reads busy_lock value equal to
the VPB_SHARERS_WORD(1).
In this case, all tests in vm_page_busy_sleep(9) pass and we are going
to sleep, despite the page being share-busied.
Update check for m->busy_lock == VPB_UNBUSIED in vm_page_busy_sleep(9)
to also accept shared-busy state if we only wait for the xbusy state to
pass.
Merge sequential if()s with the same 'then' clause in
vm_page_busy_sleep().
Note that the current code does not share-busy pages from parallel
threads, the only way to have more that one sbusy owner is right now
is to recurse.
Reported and tested by: pho (previous version)
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D8196
The runtime kernel loader, linker_load_file, unloads kernel files that
failed to load all of their modules. For consistency, treat preloaded
(loader.conf loaded) kernel files in the same way.
Reviewed by: kib
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8200
Use isrc in attached MSI data structure instead of using map's
isrc directly. map's isrc is set to NULL on IRQ deactivation
which happens prior to pci_release_msi so MSI_RELEASE_MSI
receives array of NULLs
Reviewed by: mmel
Differential Revision: https://reviews.freebsd.org/D8206
Other uses of cache_purgevfs() do rely on the cache purge for correct
operations, when paths are invalidated without unmount.
Reported and tested by: jkim
Discussed with: mjg
Sponsored by: The FreeBSD Foundation
purgevfs is purely optional and induces lock contention in workloads
which frequently mount and unmount filesystems.
In particular, poudriere will do this for filesystems with 4 vnodes or
less. Full cache scan is clearly wasteful.
Since there is no explicit counter for namecache entries, the number of
vnodes used by the target fs is checked.
The default limit is the number of bucket locks.
Reviewed by: kib
Previously free vnodes would always by directly returned to the global
LRU list. With this change up to mnt_free_list_batch vnodes are collected
first.
syncer runs always return the batch regardless of its size.
While vnodes on per-mnt lists are not counted as free, they can be
returned in case of vnode shortage.
Reviewed by: kib
Tested by: pho
function from restarting the timer.
Commonly taskqueue_enqueue_timeout() is called from within the task
function itself without any checks for teardown. Then it can happen
the timer stays active after the return of taskqueue_drain_timeout(),
because the timeout and task is drained separately.
This patch factors out the teardown flag into the timeout task itself,
allowing existing code to stay as-is instead of applying a teardown
flag to each and every of the timeout task consumers.
Add assert to taskqueue_drain_timeout() which prevents parallel
execution on the same timeout task.
Update manual page documenting the return value of
taskqueue_enqueue_timeout().
Differential Revision: https://reviews.freebsd.org/D8012
Reviewed by: kib, trasz
MFC after: 1 week
mbuf to store mp->msg_control. Later in kern_sendit(), call to getsock_cap(),
will check validity of file pointer passed, if this fails EBADF is returned but
mbuf allocated in sockargs() is not freed. Fix this possible leak.
Submitted by: Lohith Bellad <lohith.bellad@me.com>
Reviewed by: adrian
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D7910
If the kernel is not compiled with the CAPABILITIES kernel options
fget_unlocked doesn't return the sequence number so fd_modify will
always report modification, in that case we got infinity loop.
Reported by: br
Reviewed by: mjg
Tested by: br, def
fget_cap_locked returns a referenced file, but the fgetvp_rights does
not need it. Instead, due to the filedesc lock being held, it can
ref the vnode after the file was looked up.
Fix up fget_cap_locked to be consistent with other _locked helpers and not
ref the file.
This plugs a leak introduced in r306184.
Pointy hat to: mjg, oshogbo
Add a table of vnode locks and use them along with bucketlocks to provide
concurrent modification support. The approach taken is to preserve the
current behaviour of the namecache and just lock all relevant parts before
any changes are made.
Lookups still require the relevant bucket to be locked.
Discussed with: kib
Tested by: pho
buffer and put a small optimization for low socket buffer case:
- Do not hack uio_resid, and let m_uiotombuf() properly take care of it. This
fixes truncation of headers at low buffer.
- If headers ate all the space, jump right to the end of the cycle, to
avoid doing single page I/O and allocating zero length mbuf.
- Clear hdr_uio only if space is positive, which indicates that all uio
was copied in.
Reviewed by: pluknet, jtl, emax, rrs, lstewart, emax, gallatin, scottl
sooptcopyin() checks if size of data provided by user is <= than we can
accept, else it strips down the size. On bigendian platforms we have to
move pointer as well so we copy the actual data.
Reviewed by: gnn
Sponsored by: DARPA, AFRL
Sponsored by: HEIF5
Differential Revision: https://reviews.freebsd.org/D7980
Descriptor returned by accept(2) should inherits capabilities rights from
the listening socket.
PR: 201052
Reviewed by: emaste, jonathan
Discussed with: many
Differential Revision: https://reviews.freebsd.org/D7724
This causes dtrace to automatically copyin arguments from userland, so
one no longer has to explicitly use the copyin() action to do so. Moreover,
copyin() on userland addresses is a no-op, so existing scripts should be
unaffected by this change.
Discussed with: rstone
MFC after: 2 weeks
CLOCK_GETTIME() with the lock.
Now all time-related accesses to the CMOS for RTC should be under the
lock. This is needed to allow upcoming EFI Runtime Services support
to provide required execution environment for the firmware calls.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Both can be used to cause processes in capability mode to receive
SIGTRAP when ENOTCAPABLE or ECAPMODE errors are returned from
syscalls.
Idea by: emaste
Reviewed by: oshogbo (previous version), emaste
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D7965
In particular, reset the DF_QUIET flag when detaching from a device so
that a driver that marks a device quiet doesn't dictate policy for a
different driver that may claim the device in the future.
Reviewed by: rpokala, wblock
MFC after: 2 weeks
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D7803
An array of bucket locks is added.
All modifications still require the global cache_lock to be held for
writing. However, most readers only need the relevant bucket lock and in
effect can run concurrently to the writer as long as they use a
different lock. See the added comment for more details.
This is an intermediate step towards removal of the global lock.
Reviewed by: kib
Tested by: pho
If wait4() or wait6() return 0 because of WNOHANG, the status, rusage and
wrusage information should not be returned.
PR: 212048
Reported by: Casey Lucas
MFC after: 2 weeks
Use C99 designators to set the value of each slot and the nitems macro to
check for valid entries. In the process, switch to indexing by signal
number rather than signal-1 for improved clarity.
Obtained from: CheriBSD (a6053c5abf03a5f53bbfcdd3a26429383f67e09f)
Sponsored by: DARPA, AFRL
Reviewed by: kib
Since negative entries are managed with a LRU list, a hit requires a
modificaton.
Currently the code tries to upgrade the global lock if needed and is
forced to retry the lookup if it fails.
Provide a dedicated lock for use when the cache is only shared-locked.
Reviewed by: kib
MFC after: 1 week
getdtablesize is "trivial global state" and is similar to
getrlimit(RLIMIT_NOFILE), so should be permitted in capability mode.
Reviewed by: oshogbo
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D7719
Calling cap_rights_contains() several times with the same inputs is not
going to produce a different output. The variable being iterated, i, is
never used inside the for loop.
The loop is actually done in cap_rights_contains()
Submitted by: Ryan Moeller <ryan@freqlabs.com>
Reviewed by: oshogbo, ed
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D7369
Add a new 'clear driver' command for devctl along with the accompanying
ioctl and devctl_clear_driver() library routine to reset a device to
use a wildcard devclass instead of a fixed devclass. This can be used
to undo a previous 'set driver' command. After the device's name has
been reset to permit wildcard names, it is reprobed so that it can
attach to newly-available (to it) device drivers.
MFC after: 1 month
Sponsored by: Chelsio Communications
obliterate possible error from sleep with errors from
umtxq_check_susp(), when looping to clear URWLOCK_{READ,WRITE}_WAITERS.
Noted and reviewed by: vangyzen
Sponsored by: The FreeBSD Foundation
MFC after: 1 week