Commit Graph

14955 Commits

Author SHA1 Message Date
Konstantin Belousov
90b581f2cc Implement mtx_trylock_spin(9).
Discussed with:	bde
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D7192
2016-07-23 05:30:55 +00:00
John Baldwin
9c20dc9963 Add more documentation regarding unsafe AIO requests.
The asynchronous I/O changes made previously result in different
behavior out of the box. Previously all AIO requests failed with
ENOSYS / SIGSYS unless aio.ko was explicitly loaded. Now, some AIO
requests complete and others ("unsafe" requests) fail with EOPNOTSUPP.

Reword the introductory paragraph in aio(4) to add a general
description of AIO before describing the vfs.aio.enable_unsafe sysctl.

Remove the ENOSYS error description from aio_fsync(2), aio_read(2),
and aio_write(2) and replace it with a description of EOPNOTSUPP.

Remove the ENOSYS error description from aio_mlock(2).

Log a message to the system log the first time a process requests an
"unsafe" AIO request that fails with EOPNOTSUPP. This is modeled on
the log message used for processes using the legacy pty devices.

Reviewed by:	kib (earlier version)
MFC after:	1 week
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D7151
2016-07-21 22:49:47 +00:00
Konstantin Belousov
492fe1b774 Hide counted_warning(9) under #ifdef _KERNEL braces, to allow building
subr_prf.c in userspace for libsbuf.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-07-21 17:59:30 +00:00
Konstantin Belousov
9fe297bbdc Declare aio requests on files from local filesystems safe.
Two notes:
- I allow AIO on reclaimed vnodes, since it is deterministically terminated
  fast.
- devfs mounts are marked as MNT_LOCAL, but device vnodes have type
  VCHR, so the slow device io is not allowed.

Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D7273
2016-07-21 17:07:06 +00:00
Konstantin Belousov
9837947b07 Provide counter_warning(9) KPI which allows to issue limited number of
warnings for some kernel events, mostly intended for the use of
obsoleted or otherwise undersired interfaces.

This is an abstracted and race-expelled code from compat pty driver.

Requested and reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D7270
2016-07-21 16:34:56 +00:00
Conrad Meyer
1005d8aff4 imgact_elf: Rename the segment iterator to match reality
The each_writable_segment routine evaluates segments on a slightly little more
nuanced metric than simply "writable" or not.  Rename the function to more
closely match its behavior (each_dumpable_segment).

Suggested by:	jhb
Sponsored by:	EMC / Isilon Storage Division
2016-07-20 22:51:33 +00:00
Conrad Meyer
f3325003d9 ANSI-fy imgact_elf.c
Sponsored by:	EMC / Isilon Storage Division
2016-07-20 22:46:56 +00:00
Conrad Meyer
07f825e871 Fix DEBUG build on 64-bit arch after r303099
Reported by:	Larry Rosenman <ler at lerctr.org>
2016-07-20 18:11:22 +00:00
Conrad Meyer
c17b0bd2a6 Extend ELF coredump to support more than 65535 segments
The ELF e_phnum field is only 16 bits wide. To support more than 65535 segments
(program headers), Sun's "Linker and Libraries Guide" table 7-7 (or 12-7,
depending on document version) prescribes a special first section header where
sh_info represents the real number of program headers.

Test code to follow, when it is ready.

Reference:	http://docs.oracle.com/cd/E18752_01/pdf/817-1984.pdf

Reviewed by:	emaste, markj
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D7255
2016-07-20 16:59:36 +00:00
Gleb Smirnoff
9f3391243b Redo the r302894: the very new value for a non-scheduled callout is -1.
This was recently added in r290664.

Noticed by:	hselasky
Tested by:	Larry Rosenman <ler lerctr.org>
PR:		210884
2016-07-20 16:48:25 +00:00
Gleb Smirnoff
47e4280922 Revert r303037. It re-introduces the panic with TCP timers.
Agreed by:	rrs, re (gjb)
2016-07-20 16:44:22 +00:00
Randall Stewart
3d84a18803 This reverts out Gleb's changes and adds three small
fixes that I think closes up the races Gleb was
looking for. This is running quite nicely in Netflix and
now no longer causes TCP-tcb leaks.

Differential Revision:	7135
2016-07-19 18:31:19 +00:00
John Baldwin
ccb83afd81 Include process IDs in core dumps.
When threads were added to the kernel, the pr_pid member of the
NT_PRSTATUS note was repurposed to store LWP IDs instead of process
IDs.  However, the process ID was no longer recorded in core dumps.
This change adds a pr_pid field to prpsinfo (NT_PRSINFO).  Rather than
bumping the prpsinfo version number, note parsers can use the note's
payload size to determine if pr_pid is present.

Reviewed by:	kib, emaste (older version)
MFC after:	2 months
Differential Revision:	https://reviews.freebsd.org/D7117
2016-07-18 15:14:23 +00:00
John Baldwin
fc4f075a1a Add PTRACE_VFORK to trace vfork events.
First, PL_FLAG_FORKED events now also set a PL_FLAG_VFORKED flag when
the new child was created via vfork() rather than fork().  Second, a
new PL_FLAG_VFORK_DONE event can now be enabled via the PTRACE_VFORK
event mask.  This new stop is reported after the vfork parent resumes
due to the child calling exit or exec.  Debuggers can use this stop to
reinsert breakpoints in the vfork parent process before it resumes.

Reviewed by:	kib
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D7045
2016-07-18 14:53:55 +00:00
Konstantin Belousov
77d6809483 The assertion re-added in r302614 was triggered when stopping signal
is delivered to vforked child.  Issue is that we avoid stopping such
children in issignal() to not block parents.  But executed AST, which
ignored stops, leaves the child with the signal pending but no AST
pending.

On first exec after vfork(), call signotify() to handle pending
reenabled signals.  Adjust the assert to not check vfork children
until exec.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-07-18 10:53:47 +00:00
Gleb Smirnoff
809a9d1353 Revert the last commit. It must get more review and testing first. 2016-07-18 09:29:08 +00:00
Gleb Smirnoff
ef58c6a7a3 Redo the r302894: the very new value for a non-scheduled callout is -1.
This was recently added in r290664.

Noticed by:	hselasky
PR:		210884
2016-07-18 09:26:06 +00:00
Konstantin Belousov
56d48d6dcb Fix another bug after r302350.
Reported and tested by:	pho
PR:	210884
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2016-07-18 04:30:34 +00:00
Konstantin Belousov
86f1146329 Another issue reported on http://seclists.org/oss-sec/2016/q3/68 is
that struct kevent member ident has uintptr_t type, which is silently
truncated to int in the call to fget().  Explicitely check for the
valid range.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-07-16 13:24:58 +00:00
Konstantin Belousov
f470cca578 In ptrace_vm_entry(), do not call vmspace_free() while owning a vm
object lock.

The vmspace_free() operations might need to lock map, object etc on
last dereference.  Postpone the free until object's inspection is
done.

Reported and tested by:	will
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-07-15 23:26:33 +00:00
John Baldwin
8d570f64aa Add a mask of optional ptrace() events.
ptrace() now stores a mask of optional events in p_ptevents.  Currently
this mask is a single integer, but it can be expanded into an array of
integers in the future.

Two new ptrace requests can be used to manipulate the event mask:
PT_GET_EVENT_MASK fetches the current event mask and PT_SET_EVENT_MASK
sets the current event mask.

The current set of events include:
- PTRACE_EXEC: trace calls to execve().
- PTRACE_SCE: trace system call entries.
- PTRACE_SCX: trace syscam call exits.
- PTRACE_FORK: trace forks and auto-attach to new child processes.
- PTRACE_LWP: trace LWP events.

The S_PT_SCX and S_PT_SCE events in the procfs p_stops flags have
been replaced by PTRACE_SCE and PTRACE_SCX.  PTRACE_FORK replaces
P_FOLLOW_FORK and PTRACE_LWP replaces P2_LWP_EVENTS.

The PT_FOLLOW_FORK and PT_LWP_EVENTS ptrace requests remain for
compatibility but now simply toggle corresponding flags in the
event mask.

While here, document that PT_SYSCALL, PT_TO_SCE, and PT_TO_SCX both
modify the event mask and continue the traced process.

Reviewed by:	kib
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D7044
2016-07-15 15:32:09 +00:00
Gleb Smirnoff
2138e263cb Fix regression introduced by r302350. The change of return value for a
callout that wasn't scheduled at all was unintentional and yielded in
several panics.

PR:		210884
2016-07-15 09:28:32 +00:00
Konstantin Belousov
d7e8cfd63d Do not allow creation of char or block special nodes with VNOVAL dev_t.
As was reported on http://seclists.org/oss-sec/2016/q3/68, tmpfs code
contains assertion that rdev != VNOVAL.  On FreeBSD, there is no other
consequences except triggering the assert.  To be compatible with
systems where device nodes have some significance, reject mknod(2)
call with dev == VNOVAL at the syscall level.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-07-15 09:23:18 +00:00
John Baldwin
c77547d2f9 Include command line arguments in core dump process info.
Fill in pr_psargs in the NT_PRSINFO ELF core dump note with command
line arguments.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D7116
2016-07-14 23:20:05 +00:00
Mark Johnston
0649caae32 Let DDB's buf printer handle NULL pointers in the buf page array.
A buf's b_pages and b_npages fields may be inconsistent after a panic.
For instance, vfs_vmio_invalidate() sets b_npages to zero only after all
pages are unwired and their page array entries are cleared.

MFC after:	1 week
Sponsored by:	EMC / Isilon Storage Division
2016-07-14 18:49:05 +00:00
Eric Badger
fdb6320d45 Add explicit detection of KVM hypervisor
Set vm_guest to a new enum value (VM_GUEST_KVM) when kvm is detected and use
vm_guest in conditionals testing for KVM.

Also, fix a conditional checking if we're running in a VM which caught only
the generic VM case, but not more specific VMs (KVM, VMWare, etc.).  (Spotted
by: vangyzen).

Differential revision:	https://reviews.freebsd.org/D7172
Sponsored by:	Dell Inc.
Approved by:	kib (mentor), vangyzen (mentor)
Reviewed by:	alc
MFC after:	4 weeks
2016-07-13 19:19:18 +00:00
Konstantin Belousov
de56aee0bf Trace timeval parameters to the getitimer(2) and setitimer(2) syscalls.
Reviewed by:	jhb
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D7158
2016-07-13 14:37:58 +00:00
Konstantin Belousov
8f01bee46b Revive the check, disabled in r197963.
Despite the implication (process has pending signals -> the current
thread marked for AST and has TDF_NEEDSIGCHK set) is not true due to
other thread might manipulate its signal blocking mask, it should still
hold for the single-threaded processes.  Enable check for the condition
for single-threaded case, and replicate it from userret() to ast() as
well, where we check that ast indeed has no signal to deliver.

Note that the check is under DIAGNOSTIC, it is not enabled for INVARIANTS
but !DIAGNOSTIC since it imposes too heavy-weight locking for day-to-day
used debugging kernel.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-07-12 03:53:15 +00:00
Konstantin Belousov
4f4c35bb38 Add assert to complement r302328.
AST must not execute with TDF_SBDRY or TDF_SEINTR/TDF_SERESTART thread
flags set, which is asserted in userret(). As the consequence, -1 return
from cursig() must not be possible.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-07-12 03:52:05 +00:00
Nathan Whitehorn
8c636a11dc Remove assumptions in MI code that the BSP is CPU 0.
MFC after:	2 weeks
2016-07-11 21:25:28 +00:00
Konstantin Belousov
725496ced5 Fix grammar.
Submitted by:	alc
MFC after:	2 weeks
2016-07-11 17:04:22 +00:00
Konstantin Belousov
19efd8a5a8 In vgonel(), postpone setting BO_DEAD until VOP_RECLAIM() is called,
if vnode is VMIO.  For VMIO vnodes, set BO_DEAD in vm_object_terminate().

The vnode_destroy_object(), when calling into vm_object_terminate(),
must be able to flush buffers.  BO_DEAD purpose is to quickly destroy
buffers on write when the underlying vnode is not operable any more
(one example is the devfs node after geom is gone).  Setting BO_DEAD
for reclaiming vnode before object is terminated is premature, and
results in unability to flush buffers with live SU dependencies from
vinvalbuf() in vm_object_terminate().

Reported by:	David Cross <dcrosstech@gmail.com>
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-07-11 14:19:09 +00:00
Robert Watson
5fa69ff015 In process-descriptor close(2) and fstat(2), audit target process
information.  pgkill(2) already audits target process ID.

MFC after:	3 days
Sponsored by:	DARPA, AFRL
2016-07-10 14:17:36 +00:00
Robert Watson
e5ec733909 Do allow auditing of read(2) and write(2) system calls, by assigning
those system calls audit event identifiers AUE_READ and AUE_WRITE.
While auditing file-descriptor I/O is not required by the Common
Criteria, in practice this proves useful for both live and forensic
analysis.

NB: freebsd32 already assigns AUE_READ and AUE_WRITE to read(2) and
write(2).

MFC after:	3 days
Sponsored by:	DARPA, AFRL
2016-07-10 13:42:33 +00:00
Robert Watson
8ec75c0fc3 Audit the file-descriptor number argument for openat(2). Remove a comment
about the desirability of auditing the number, as it was in fact in the
wrong place (in the common path for open(2) and openat(2), and only the
latter accepts a file-descriptor argument).  Where other ABIs support
openat(2), it may be necessary to do additional argument auditing as it is
not performed in kern_openat(9).

MFC after:	3 days
Sponsored by:	DARPA, AFRL
2016-07-10 09:50:21 +00:00
Robert Watson
51d1f69069 Audit file-descriptor arguments to I/O system calls such as
read(2), write(2), dup(2), and mmap(2).  This auditing is not
required by the Common Criteria (and hence was not being
performed), but is valuable in both contemporary live analysis
and forensic use cases.

MFC after:	3 days
Sponsored by:	DARPA, AFRL
2016-07-10 08:04:02 +00:00
Edward Tomasz Napierala
debc480e03 Add new unmount(2) flag, MNT_NONBUSY, to check whether there are
any open vnodes before proceeding. Make autounmound(8) use this flag.
Without it, even an unsuccessfull unmount causes filesystem flush,
which interferes with normal operation.

Reviewed by:	kib@
Approved by:	re (gjb@)
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D7047
2016-07-07 09:03:57 +00:00
Nathan Whitehorn
96c85efb4b Replace a number of conflations of mp_ncpus and mp_maxid with either
mp_maxid or CPU_FOREACH() as appropriate. This fixes a number of places in
the kernel that assumed CPU IDs are dense in [0, mp_ncpus) and would try,
for example, to run tasks on CPUs that did not exist or to allocate too
few buffers on systems with sparse CPU IDs in which there are holes in the
range and mp_maxid > mp_ncpus. Such circumstances generally occur on
systems with SMT, but on which SMT is disabled. This patch restores system
operation at least on POWER8 systems configured in this way.

There are a number of other places in the kernel with potential problems
in these situations, but where sparse CPU IDs are not currently known
to occur, mostly in the ARM machine-dependent code. These will be fixed
in a follow-up commit after the stable/11 branch.

PR:		kern/210106
Reviewed by:	jhb
Approved by:	re (glebius)
2016-07-06 14:09:49 +00:00
Gleb Smirnoff
d153eeee97 The paradigm of a callout is that it has three consequent states:
not scheduled -> scheduled -> running -> not scheduled. The API and the
manual page assume that, some comments in the code assume that, and looks
like some contributors to the code also did. The problem is that this
paradigm isn't true. A callout can be scheduled and running at the same
time, which makes API description ambigouous. In such case callout_stop()
family of functions/macros should return 1 and 0 at the same time, since it
successfully unscheduled future callout but the current one is running.
Before this change we returned 1 in such a case, with an exception that
if running callout was migrating we returned 0, unless CS_MIGRBLOCK was
specified.

With this change, we now return 0 in case if future callout was unscheduled,
but another one is still in action, indicating to API users that resources
are not yet safe to be freed.

However, the sleepqueue code relies on getting 1 return code in that case,
and there already was CS_MIGRBLOCK flag, that covered one of the edge cases.
In the new return path we will also use this flag, to keep sleepqueue safe.

Since the flag CS_MIGRBLOCK doesn't block migration and now isn't limited to
migration edge case, rename it to CS_EXECUTING.

This change fixes panics on a high loaded TCP server.

Reviewed by:	jch, hselasky, rrs, kib
Approved by:	re (gjb)
Differential Revision:	https://reviews.freebsd.org/D7042
2016-07-05 18:47:17 +00:00
Gleb Smirnoff
a0d20ecb94 Compile in the kassert_panic() function with INVARIANT_SUPPORT
option, not INVARIANTS.  The function is required if we want
to load in a module that is compiled with INVARIANTS.

Reviewed by:	jhb
Approved by:	re (gjb)
2016-07-05 18:34:34 +00:00
Mark Johnston
f61d6c5a5b Ensure that spinlock sections are balanced even after a panic.
vpanic() uses spinlock_enter() to disable interrupts before dumping core.
However, when the scheduler is stopped and INVARIANTS is not configured,
thread_lock() does not acquire a spinlock section, while thread_unlock()
releases one. This can result in interrupts staying enabled while the
kernel dumps core, complicating post-mortem analysis of the crash.

Approved by:	re (gjb)
MFC after:	1 week
Sponsored by:	EMC / Isilon Storage Division
2016-07-05 17:59:04 +00:00
Robert Watson
971711fb7c Call audit hooks to capture vnode attributes for three file-descriptor
method implementations: fstat(2), close(2), and poll(2).  This change
synchronises auditing here with similar auditing for VFS-specific system
calls such as stat(2) that audit more complete vnode information.

Sponsored by:	DARPA, AFRL
Approved by:	re (kib)
MFC after:	1 week
2016-07-05 16:37:01 +00:00
Ed Maste
1cbb879df8 add description for debug.elf{32,64}_legacy_coredump sysctl
Approved by:	re (kib)
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2016-07-05 14:46:06 +00:00
Konstantin Belousov
46e47c4f8d Provide helper macros to detect 'non-silent SBDRY' state and to
calculate appropriate return value for stops.  Simplify the code by
using them.

Fix typo in sig_suspend_threads().  The thread which sleep must be
aborted is td2. (*)

In issignal(), when handling stopping signal for thread in
TD_SBDRY_INTR state, do not stop, this is wrong and fires assert.
This is yet another place where execution should be forced out of
SBDRY-protected region.  For such case, return -1 from issignal() and
translate it to corresponding error code in sleepq_catch_signals().
Assert that other consumers of cursig() are not affected by the new
return value. (*)

Micro-optimize, mostly VFS and VOP methods, by avoiding calling the
functions when SIGDEFERSTOP_NOP non-change is requested. (**)

Reported and tested by:	pho (*)
Requested by:	bde (**)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Approved by:	re (gjb)
2016-07-03 18:19:48 +00:00
Konstantin Belousov
d12d6a8476 Remove racy assert. The thread which changes vnode usecount from 0 to 1
does it under the vnode interlock, but the interlock is not owned by the
asserting thread.  As result, we might read increased use counter but also
still see VI_OWEINACT.

In collaboration with: nwhitehorn
Hardware donated by: IBM LTC
Sponsored by:	The FreeBSD Foundation (kib)
Approved by:	re (gjb)
2016-07-03 01:56:48 +00:00
Konstantin Belousov
e18ee4957d When a process knote was attached to the process which is already exiting,
the knote is activated immediately.  If the exit1() later activates
knotes, such knote is attempted to be activated second time.  Detect
the condition by zeroed kn_ptr.p_proc pointer, and avoid excessive
activation.

Before r302235, such knotes were removed from the knlist immediately
upon activation.

Reported by:	truckman
Sponsored by:	The FreeBSD Foundation
Approved by:	re (gjb)
2016-07-01 20:11:28 +00:00
Konstantin Belousov
364c516cff Currently the ntptime code and resettodr() are Giant-locked. In
particular, the Giant is supposed to protect against parallel
ntp_adjtime(2) invocations.  But, for instance, sys_ntp_adjtime() does
copyout(9) under Giant and then examines time_status to return syscall
result.  Since copyout(9) could sleep, the syscall result might be
inconsistent.

Another and more important issue is that if PPS is configured,
hardpps(9) is executed without any protection against the parallel
top-level code invocation. Potentially, this may result in the
inconsistent state of the ntptime state variables, but I cannot say
how serious such distortion is. The non-functional splclock() call in
sys_ntp_adjtime() protected against clock interrupts calling hardpps()
in the pre-SMP era.

Modernize the locking. A mutex protects ntptime data.  Due to the
hardpps() KPI legitimately serving from the interrupt filters (and
e.g. uart(4) does call it from filter), the lock cannot be sleepable
mutex if PPS_SYNC is defined.  Otherwise, use normal sleepable mutex
to reduce interrupt latency.

Reviewed by:	  imp, jhb
Sponsored by:	  The FreeBSD Foundation
Approved by:	  re (gjb)
Differential revision:	https://reviews.freebsd.org/D6825
2016-06-28 16:43:23 +00:00
Konstantin Belousov
3a0e6f920a Do not use Giant to prevent parallel calls to CLOCK_SETTIME(). Use
private mtx in resettodr(), no implementation of CLOCK_SETTIME() is
allowed to sleep.

Reviewed by:	  imp, jhb
Sponsored by:	  The FreeBSD Foundation
Approved by:	  re (gjb)
X-Differential revision:	https://reviews.freebsd.org/D6825
2016-06-28 16:42:40 +00:00
Konstantin Belousov
6f56cb8dbf Complete r302215. TDF_SBDRY | TDF_SERESTART and TDF_SBDRY |
TDF_SEINTR flags values, unlike TDF_SBDRY, must be treated almost as
if TDF_SBDRY is not set for STOP signal delivery.  The only difference
is that sig_suspend_threads() should abort the sleep instead of doing
immediate suspension.

Reported by:	ngie
Sponsored by:	The FreeBSD Foundation
MFC after:	12 days
Approved by:	re (gjb)
2016-06-28 16:41:50 +00:00
Konstantin Belousov
9eb3f14324 Fix userspace build after r302235: do not expose bool field of the
structure, change it to int.

The real fix is to sanitize user-visible definitions in sys/event.h,
e.g. the affected struct knlist is of no use for userspace programs.

Reported and tested by:	jkim
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Approved by:	re (gjb)
2016-06-27 23:34:53 +00:00