As was reported on http://seclists.org/oss-sec/2016/q3/68, tmpfs code
contains assertion that rdev != VNOVAL. On FreeBSD, there is no other
consequences except triggering the assert. To be compatible with
systems where device nodes have some significance, reject mknod(2)
call with dev == VNOVAL at the syscall level.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Fill in pr_psargs in the NT_PRSINFO ELF core dump note with command
line arguments.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D7116
A buf's b_pages and b_npages fields may be inconsistent after a panic.
For instance, vfs_vmio_invalidate() sets b_npages to zero only after all
pages are unwired and their page array entries are cleared.
MFC after: 1 week
Sponsored by: EMC / Isilon Storage Division
Set vm_guest to a new enum value (VM_GUEST_KVM) when kvm is detected and use
vm_guest in conditionals testing for KVM.
Also, fix a conditional checking if we're running in a VM which caught only
the generic VM case, but not more specific VMs (KVM, VMWare, etc.). (Spotted
by: vangyzen).
Differential revision: https://reviews.freebsd.org/D7172
Sponsored by: Dell Inc.
Approved by: kib (mentor), vangyzen (mentor)
Reviewed by: alc
MFC after: 4 weeks
Despite the implication (process has pending signals -> the current
thread marked for AST and has TDF_NEEDSIGCHK set) is not true due to
other thread might manipulate its signal blocking mask, it should still
hold for the single-threaded processes. Enable check for the condition
for single-threaded case, and replicate it from userret() to ast() as
well, where we check that ast indeed has no signal to deliver.
Note that the check is under DIAGNOSTIC, it is not enabled for INVARIANTS
but !DIAGNOSTIC since it imposes too heavy-weight locking for day-to-day
used debugging kernel.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
AST must not execute with TDF_SBDRY or TDF_SEINTR/TDF_SERESTART thread
flags set, which is asserted in userret(). As the consequence, -1 return
from cursig() must not be possible.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
if vnode is VMIO. For VMIO vnodes, set BO_DEAD in vm_object_terminate().
The vnode_destroy_object(), when calling into vm_object_terminate(),
must be able to flush buffers. BO_DEAD purpose is to quickly destroy
buffers on write when the underlying vnode is not operable any more
(one example is the devfs node after geom is gone). Setting BO_DEAD
for reclaiming vnode before object is terminated is premature, and
results in unability to flush buffers with live SU dependencies from
vinvalbuf() in vm_object_terminate().
Reported by: David Cross <dcrosstech@gmail.com>
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
those system calls audit event identifiers AUE_READ and AUE_WRITE.
While auditing file-descriptor I/O is not required by the Common
Criteria, in practice this proves useful for both live and forensic
analysis.
NB: freebsd32 already assigns AUE_READ and AUE_WRITE to read(2) and
write(2).
MFC after: 3 days
Sponsored by: DARPA, AFRL
about the desirability of auditing the number, as it was in fact in the
wrong place (in the common path for open(2) and openat(2), and only the
latter accepts a file-descriptor argument). Where other ABIs support
openat(2), it may be necessary to do additional argument auditing as it is
not performed in kern_openat(9).
MFC after: 3 days
Sponsored by: DARPA, AFRL
read(2), write(2), dup(2), and mmap(2). This auditing is not
required by the Common Criteria (and hence was not being
performed), but is valuable in both contemporary live analysis
and forensic use cases.
MFC after: 3 days
Sponsored by: DARPA, AFRL
any open vnodes before proceeding. Make autounmound(8) use this flag.
Without it, even an unsuccessfull unmount causes filesystem flush,
which interferes with normal operation.
Reviewed by: kib@
Approved by: re (gjb@)
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D7047
mp_maxid or CPU_FOREACH() as appropriate. This fixes a number of places in
the kernel that assumed CPU IDs are dense in [0, mp_ncpus) and would try,
for example, to run tasks on CPUs that did not exist or to allocate too
few buffers on systems with sparse CPU IDs in which there are holes in the
range and mp_maxid > mp_ncpus. Such circumstances generally occur on
systems with SMT, but on which SMT is disabled. This patch restores system
operation at least on POWER8 systems configured in this way.
There are a number of other places in the kernel with potential problems
in these situations, but where sparse CPU IDs are not currently known
to occur, mostly in the ARM machine-dependent code. These will be fixed
in a follow-up commit after the stable/11 branch.
PR: kern/210106
Reviewed by: jhb
Approved by: re (glebius)
not scheduled -> scheduled -> running -> not scheduled. The API and the
manual page assume that, some comments in the code assume that, and looks
like some contributors to the code also did. The problem is that this
paradigm isn't true. A callout can be scheduled and running at the same
time, which makes API description ambigouous. In such case callout_stop()
family of functions/macros should return 1 and 0 at the same time, since it
successfully unscheduled future callout but the current one is running.
Before this change we returned 1 in such a case, with an exception that
if running callout was migrating we returned 0, unless CS_MIGRBLOCK was
specified.
With this change, we now return 0 in case if future callout was unscheduled,
but another one is still in action, indicating to API users that resources
are not yet safe to be freed.
However, the sleepqueue code relies on getting 1 return code in that case,
and there already was CS_MIGRBLOCK flag, that covered one of the edge cases.
In the new return path we will also use this flag, to keep sleepqueue safe.
Since the flag CS_MIGRBLOCK doesn't block migration and now isn't limited to
migration edge case, rename it to CS_EXECUTING.
This change fixes panics on a high loaded TCP server.
Reviewed by: jch, hselasky, rrs, kib
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D7042
option, not INVARIANTS. The function is required if we want
to load in a module that is compiled with INVARIANTS.
Reviewed by: jhb
Approved by: re (gjb)
vpanic() uses spinlock_enter() to disable interrupts before dumping core.
However, when the scheduler is stopped and INVARIANTS is not configured,
thread_lock() does not acquire a spinlock section, while thread_unlock()
releases one. This can result in interrupts staying enabled while the
kernel dumps core, complicating post-mortem analysis of the crash.
Approved by: re (gjb)
MFC after: 1 week
Sponsored by: EMC / Isilon Storage Division
method implementations: fstat(2), close(2), and poll(2). This change
synchronises auditing here with similar auditing for VFS-specific system
calls such as stat(2) that audit more complete vnode information.
Sponsored by: DARPA, AFRL
Approved by: re (kib)
MFC after: 1 week
calculate appropriate return value for stops. Simplify the code by
using them.
Fix typo in sig_suspend_threads(). The thread which sleep must be
aborted is td2. (*)
In issignal(), when handling stopping signal for thread in
TD_SBDRY_INTR state, do not stop, this is wrong and fires assert.
This is yet another place where execution should be forced out of
SBDRY-protected region. For such case, return -1 from issignal() and
translate it to corresponding error code in sleepq_catch_signals().
Assert that other consumers of cursig() are not affected by the new
return value. (*)
Micro-optimize, mostly VFS and VOP methods, by avoiding calling the
functions when SIGDEFERSTOP_NOP non-change is requested. (**)
Reported and tested by: pho (*)
Requested by: bde (**)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Approved by: re (gjb)
does it under the vnode interlock, but the interlock is not owned by the
asserting thread. As result, we might read increased use counter but also
still see VI_OWEINACT.
In collaboration with: nwhitehorn
Hardware donated by: IBM LTC
Sponsored by: The FreeBSD Foundation (kib)
Approved by: re (gjb)
the knote is activated immediately. If the exit1() later activates
knotes, such knote is attempted to be activated second time. Detect
the condition by zeroed kn_ptr.p_proc pointer, and avoid excessive
activation.
Before r302235, such knotes were removed from the knlist immediately
upon activation.
Reported by: truckman
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
particular, the Giant is supposed to protect against parallel
ntp_adjtime(2) invocations. But, for instance, sys_ntp_adjtime() does
copyout(9) under Giant and then examines time_status to return syscall
result. Since copyout(9) could sleep, the syscall result might be
inconsistent.
Another and more important issue is that if PPS is configured,
hardpps(9) is executed without any protection against the parallel
top-level code invocation. Potentially, this may result in the
inconsistent state of the ntptime state variables, but I cannot say
how serious such distortion is. The non-functional splclock() call in
sys_ntp_adjtime() protected against clock interrupts calling hardpps()
in the pre-SMP era.
Modernize the locking. A mutex protects ntptime data. Due to the
hardpps() KPI legitimately serving from the interrupt filters (and
e.g. uart(4) does call it from filter), the lock cannot be sleepable
mutex if PPS_SYNC is defined. Otherwise, use normal sleepable mutex
to reduce interrupt latency.
Reviewed by: imp, jhb
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
Differential revision: https://reviews.freebsd.org/D6825
private mtx in resettodr(), no implementation of CLOCK_SETTIME() is
allowed to sleep.
Reviewed by: imp, jhb
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
X-Differential revision: https://reviews.freebsd.org/D6825
TDF_SEINTR flags values, unlike TDF_SBDRY, must be treated almost as
if TDF_SBDRY is not set for STOP signal delivery. The only difference
is that sig_suspend_threads() should abort the sleep instead of doing
immediate suspension.
Reported by: ngie
Sponsored by: The FreeBSD Foundation
MFC after: 12 days
Approved by: re (gjb)
structure, change it to int.
The real fix is to sanitize user-visible definitions in sys/event.h,
e.g. the affected struct knlist is of no use for userspace programs.
Reported and tested by: jkim
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Approved by: re (gjb)
exiting (NOTE_EXIT->knlist_remove_inevent()), two things happen:
- knote kn_knlist pointer is reset
- INFLUX knote is removed from the process knlist.
And, there are two consequences:
- KN_LIST_UNLOCK() on such knote is nop
- there is nothing which would block exit1() from processing past the
knlist_destroy() (and knlist_destroy() resets knlist lock pointers).
Both consequences result either in leaked process lock, or
dereferencing NULL function pointers for locking.
Handle this by stopping embedding the process knlist into struct proc.
Instead, the knlist is allocated together with struct proc, but marked
as autodestroy on the zombie reap, by knlist_detach() function. The
knlist is freed when last kevent is removed from the list, in
particular, at the zombie reap time if the list is empty. As result,
the knlist_remove_inevent() is no longer needed and removed.
Other changes:
In filt_procattach(), clear NOTE_EXEC and NOTE_FORK desired events
from kn_sfflags for knote registered by kernel to only get NOTE_CHILD
notifications. The flags leak resulted in excessive
NOTE_EXEC/NOTE_FORK reports.
Fix immediate note activation in filt_procattach(). Condition should
be either the immediate CHILD_NOTE activation, or immediate NOTE_EXIT
report for the exiting process.
In knote_fork(), do not perform racy check for KN_INFLUX before kq
lock is taken. Besides being racy, it did not accounted for notes
just added by scan (KN_SCAN).
Some minor and incomplete style fixes.
Analyzed and tested by: Eric Badger <eric@badgerio.us>
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Approved by: re (gjb)
Differential revision: https://reviews.freebsd.org/D6859
interrupt sleeps with the ERESTART on the suspension attempts.
Otherwise, single-threading requests are deferred until the locks are
granted for NFS files, which causes hangs.
When retrying local registration of the remotely-granted adv lock,
allow full suspension and check for suspension, for usual reasons.
Reported by: markj, pho
Reviewed by: jilles
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Approved by: re (gjb)
framework allowing to set the suspension policy for the dynamic block.
Extend the currently possible policies of stopping on interruptible
sleeps and ignoring such sleeps by two more: do not suspend at
interruptible sleeps, but interrupt them with either EINTR or ERESTART.
Reviewed by: jilles
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Approved by: re (gjb)
lists must be functional.
Reported by: Daniel Engberg <daniel.engberg.lists@pyret.net>,
Guy Yur <guyyur@gmail.com>
Tested by: Guy Yur <guyyur@gmail.com>
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb), including the KBI change
While reading the code, I noticed that shm_read() returns without unlocking
foffset and rangelock if mac_posixshm_check_read() rejects the read.
Reviewed by: kib, jhb, rwatson
Approved by: re (gjb)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D6927
Mark the pipe() system call as COMPAT10.
As of r302092 libc uses pipe2() with a zero flags value instead of pipe().
Approved by: re (gjb)
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D6816
As of r302092 libc uses pipe2() with a zero flags value instead of pipe().
Commit with regenerated files and implementation to follow.
Approved by: re (gjb)
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D6816
File and disk-backed I/O requests store counts of read/written disk
blocks in each AIO job so that they can be charged to the thread that
completes an AIO request via aio_return() or aio_waitcomplete(). This
change extends AIO jobs to store counts of received/sent messages and
updates socket backends to set these counts accordingly. Note that
the socket backends are careful to only charge a single messages for
each AIO request even though a single request on a blocking socket might
invoke sosend or soreceive multiple times. This is to mimic the
resource accounting of synchronous read/write.
Adjust the UNIX socketpair AIO test to verify that the message resource
usage counts update accordingly for aio_read and aio_write.
Approved by: re (hrs)
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D6911
than removing the network interfaces first. This change is rather larger
and convoluted as the ordering requirements cannot be separated.
Move the pfil(9) framework to SI_SUB_PROTO_PFIL, move Firewalls and
related modules to their own SI_SUB_PROTO_FIREWALL.
Move initialization of "physical" interfaces to SI_SUB_DRIVERS,
move virtual (cloned) interfaces to SI_SUB_PSEUDO.
Move Multicast to SI_SUB_PROTO_MC.
Re-work parts of multicast initialisation and teardown, not taking the
huge amount of memory into account if used as a module yet.
For interface teardown we try to do as many of them as we can on
SI_SUB_INIT_IF, but for some this makes no sense, e.g., when tunnelling
over a higher layer protocol such as IP. In that case the interface
has to go along (or before) the higher layer protocol is shutdown.
Kernel hhooks need to go last on teardown as they may be used at various
higher layers and we cannot remove them before we cleaned up the higher
layers.
For interface teardown there are multiple paths:
(a) a cloned interface is destroyed (inside a VIMAGE or in the base system),
(b) any interface is moved from a virtual network stack to a different
network stack ("vmove"), or (c) a virtual network stack is being shut down.
All code paths go through if_detach_internal() where we, depending on the
vmove flag or the vnet state, make a decision on how much to shut down;
in case we are destroying a VNET the individual protocol layers will
cleanup their own parts thus we cannot do so again for each interface as
we end up with, e.g., double-frees, destroying locks twice or acquiring
already destroyed locks.
When calling into protocol cleanups we equally have to tell them
whether they need to detach upper layer protocols ("ulp") or not
(e.g., in6_ifdetach()).
Provide or enahnce helper functions to do proper cleanup at a protocol
rather than at an interface level.
Approved by: re (hrs)
Obtained from: projects/vnet
Reviewed by: gnn, jhb
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D6747
to mount points with the given filesystem type, specified by mount
vfs_ops pointer.
Based on patch by: mckusick
Reviewed by: avg, mckusick
Tested by: allanjude, madpilot
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
threads, to make it less confusing and using modern kernel terms.
Rename the functions to reflect current use of the functions, instead
of the historic KSE conventions:
cpu_set_fork_handler -> cpu_fork_kthread_handler (for kthreads)
cpu_set_upcall -> cpu_copy_thread (for forks)
cpu_set_upcall_kse -> cpu_set_upcall (for new threads creation)
Reviewed by: jhb (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (hrs)
Differential revision: https://reviews.freebsd.org/D6731
reason for it in modern times. In the other case, expand the comment
stating instead of doubting.
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (hrs)
X-Differential revision: https://reviews.freebsd.org/D6731
This reduces the size of kaiocb slightly. I've also added some generic
fields that other backends can use in place of the BIO-specific fields.
Change the socket and Chelsio DDP backends to use 'backend3' instead of
abusing _aiocb_private.status directly. This confines the use of
_aiocb_private to the AIO internals in vfs_aio.c.
Reviewed by: kib (earlier version)
Approved by: re (gjb)
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D6547
we set MNTK_UNMOUNT flag on the mp. Otherwise parallel unmount which
wins race with us could dereference the covered vnode, and we are
left with the locked freed memory.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
MFC after: 1 week