Commit Graph

15299 Commits

Author SHA1 Message Date
Mateusz Guzik
7c34b35b57 ktrace: do a lockless check on fork to see if tracing is enabled
This saves 2 lock acquisitions in the common case.
2016-08-10 15:25:44 +00:00
Mateusz Guzik
382172be68 sigio: do a lockless check in funsetownlist
There is no need to grab the lock first to see if sigio is used, and it
typically is not.
2016-08-10 15:24:15 +00:00
Konstantin Belousov
3a77833e87 Fix indentation.
Reported by:	hselasky
MFC after:	17 days
2016-08-10 14:41:53 +00:00
Konstantin Belousov
49c394a970 Re-schedule signals after kthread exits, since apparently there are
processes which combine kernel and non-kernel threads, e.g. nfsd.  For
such processes, termination of a kthread must recheck signal delivery
among other threads according to masks.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-08-10 13:47:12 +00:00
Jean-Sébastien Pédron
bd937497ea Consistently use device_t
Several files use the internal name of `struct device` instead of
`device_t` which is part of the public API. This patch changes all
`struct device *` to `device_t`.

The remaining occurrences of `struct device` are those referring to the
Linux or OpenBSD version of the structure, or the code is not built on
FreeBSD and it's unclear what to do.

Submitted by:	Matthew Macy <mmacy@nextbsd.org> (previous version)
Approved by:	emaste, jhibbits, sbruno
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D7447
2016-08-09 19:32:06 +00:00
Stephen J. Kiernan
0ce1624d0e Move IPv4-specific jail functions to new file netinet/in_jail.c
_prison_check_ip4 renamed to prison_check_ip4_locked

Move IPv6-specific jail functions to new file netinet6/in6_jail.c
_prison_check_ip6 renamed to prison_check_ip6_locked

Add appropriate prototypes to sys/sys/jail.h

Adjust kern_jail.c to call prison_check_ip4_locked and
prison_check_ip6_locked accordingly.

Add netinet/in_jail.c and netinet6/in6_jail.c to the list of files that
need to be built when INET and INET6, respectively, are configured in the
kernel configuration file.

Reviewed by:	jtl
Approved by:	sjg (mentor)
Sponsored by:	Juniper Networks, Inc.
Differential Revision:	https://reviews.freebsd.org/D6799
2016-08-09 02:16:21 +00:00
Mark Johnston
434ac8b6b7 Handle races with listening socket close when connecting a unix socket.
If the listening socket is closed while sonewconn() is executing, the
nascent child socket is aborted, which results in recursion on the
unp_link lock when the child's pru_detach method is invoked. Fix this
by using a flag to mark such sockets, and skip a part of the socket's
teardown during detach.

Reported by:	Raviprakash Darbha <rdarbha@juniper.net>
Tested by:	pho
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D7398
2016-08-08 20:25:04 +00:00
Bryan Drewery
c1fa440409 Regenerate after r303755.
MFC after:	3 days
X-MFC-With:	r303755
Sponsored by:	EMC / Isilon Storage Division
2016-08-04 19:15:51 +00:00
Bryan Drewery
417d5dec39 Still provide freebsd10_* symbols from libc for COMPAT10.
r296773 was done to only remove libc symbols for <7.  We want to provide
the syscall symbols going forward for 7+.

Discussed with:	jhb
MFC after:	3 days
Sponsored by:	EMC / Isilon Storage Division
2016-08-04 19:14:18 +00:00
Edward Tomasz Napierala
7b255097eb Remove unused - never actually implemented - vnode lock types
from vnode_if.src.

MFC after:	1 month
2016-08-04 13:45:18 +00:00
Bryan Drewery
78be18ae6e Correct some comments.
Sponsored by:	EMC / Isilon Storage Division
MFC after:	3 days
2016-08-03 18:48:56 +00:00
Mateusz Guzik
0453ade508 locks: fix sx compilation on mips after r303643
The kernel.h header is required for the SYSINIT macro, which apparently
was present on amd64 by accident.

Reported by:	kib
2016-08-03 09:15:10 +00:00
Konstantin Belousov
e69ba32f88 Remove mention of the Giant from the fork_return() description.
Making emphasis on this lock in the core function comment is confusing
for the modern kernel.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2016-08-03 07:10:09 +00:00
Ed Schouten
e938ebbc0c Regenerate system call tables for r303699 and r303700. 2016-08-03 06:36:45 +00:00
Ed Schouten
1b42875d4b Re-add traling slash that was removed in r303699.
I must have accidentally pressed some random key in vim.
2016-08-03 06:35:58 +00:00
Ed Schouten
a813fdc6c3 mprotect(): Change prototype to comply to POSIX.
Our mprotect() function seems to take a "const void *" address to the
pages whose permissions need to be adjusted. POSIX uses "void *". Simply
stick to the POSIX one to prevent us from writing unportable code.

PR:		211423 (exp-run)
Tested by:	antoine@ (Thanks!)
2016-08-03 06:33:04 +00:00
Mateusz Guzik
fa5000a4f3 locks: fix compilation for KDTRACE_HOOKS && !ADAPTIVE_* case
Reported by:	Michael Butler <imb protected-networks.net>
2016-08-02 03:05:59 +00:00
Mateusz Guzik
0412689595 locks: fix up ifdef guards introduced in r303643
Both sx and rwlocks had copy-pasted ADAPTIVE_MUTEXES instead of the correct
define.

MFC after:	1 week
2016-08-02 00:15:08 +00:00
Mateusz Guzik
1ada904147 Implement trivial backoff for locking primitives.
All current spinning loops retry an atomic op the first chance they get,
which leads to performance degradation under load.

One classic solution to the problem consists of delaying the test to an
extent. This implementation has a trivial linear increment and a random
factor for each attempt.

For simplicity, this first thouch implementation only modifies spinning
loops where the lock owner is running. spin mutexes and thread lock were
not modified.

Current parameters are autotuned on boot based on mp_cpus.

Autotune factors are very conservative and are subject to change later.

Reviewed by:	kib, jhb
Tested by:	pho
MFC after:	1 week
2016-08-01 21:48:37 +00:00
Mateusz Guzik
61852185ba locks: change sleep_cnt and spin_cnt types to u_int
Both variables are uint64_t, but they only count spins or sleeps.
All reasonable values which we can get here comfortably hit in 32-bit range.

Suggested by: kib
MFC after:	1 week
2016-07-31 12:11:55 +00:00
Mateusz Guzik
e0c45af904 sx: increment spin_cnt before cpu_spinwait in xlock
The change is a no-op only done for consistency with the rest of the file.
2016-07-30 22:23:31 +00:00
Mateusz Guzik
7a54be1870 rwlock: s/READER/WRITER/ in wlock lockstat annotation 2016-07-30 22:21:48 +00:00
Konstantin Belousov
50c22263bb Cache getbintime(9) answer in timehands, similarly to getnanotime(9)
and getmicrotime(9).

Suggested and reviewed by:	bde (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
2016-07-30 09:25:57 +00:00
John Baldwin
e2325d8285 Don't treat NOCPU as a valid CPU to CPU_ISSET.
If a thread is created bound to a cpuset it might already be bound before
it's very first timeslice, and td_lastcpu will be NOCPU in that case.

MFC after:	1 week
2016-07-29 20:19:14 +00:00
John Baldwin
005ce8e4e6 Fix locking issues with aio_fsync().
- Use correct lock in aio_cancel_sync when dequeueing job.
- Add _locked variants of aio_set/clear_cancel_function and use those
  to avoid lock recursion when adding and removing fsync jobs to the
  per-process sync queue.
- While here, add a basic test for aio_fsync().

PR:		211390
Reported by:	Randy Westlund <rwestlun@gmail.com>
MFC after:	1 week
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D7339
2016-07-29 18:26:15 +00:00
Brooks Davis
40018b91dd Don't create pointless backups of generated files in "make sysent".
Any sensible workflow will include a revision control system from which
to restore the old files if required.  In normal usage, developers just
have to clean up the mess.

Reviewed by:	jhb
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D7353
2016-07-28 21:29:04 +00:00
Ed Schouten
5590eb985e Regenerate system call table for r303435. 2016-07-28 12:22:34 +00:00
Ed Schouten
d9c4cd2fbc Change the return type of msgrcv() to ssize_t as required by POSIX.
It looks like the msgrcv() system call is already written in such a way
that the size is internally computed as a size_t and written into all of
td_retval[0]. This means that it is effectively already returning
ssize_t. It's just that the userspace prototype doesn't match up.
2016-07-28 12:22:01 +00:00
Konstantin Belousov
2d19b736ed Rewrite subr_sleepqueue.c use of callouts to not depend on the
specifics of callout KPI.  Esp., do not depend on the exact interface
of callout_stop(9) return values.

The main change is that instead of requiring precise callouts, code
maintains absolute time to wake up.  Callouts now should ensure that a
wake occurs at the requested moment, but we can tolerate both run-away
callout, and callout_stop(9) lying about running callout either way.

As consequence, it removes the constant source of the bugs where
sleepq_check_timeout() causes uninterruptible thread state where the
thread is detached from CPU, see e.g. r234952 and r296320.

Patch also removes dual meaning of the TDF_TIMEOUT flag, making code
(IMO much) simpler to reason about.

Tested by:	pho
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
Differential revision:	https://reviews.freebsd.org/D7137
2016-07-28 09:09:55 +00:00
Konstantin Belousov
a9e182e895 Extract the calculation of the callout fire time into the new function
callout_when(9).  See the man page update for the description of the
intended use.

Tested by:	pho
Reviewed by:	jhb, bjk (man page updates)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
X-Differential revision:	https://reviews.freebsd.org/D7137
2016-07-28 08:57:01 +00:00
Konstantin Belousov
b7a25e63b6 When a debugger attaches to the process, SIGSTOP is sent to the
target.  Due to a way issignal() selects the next signal to deliver
and report, if the simultaneous or already pending another signal
exists, that signal might be reported by the next waitpid(2) call.
This causes minor annoyance for debuggers, which must be prepared to
take any signal as the first event, then filter SIGSTOP later.

More importantly, for tools like gcore(1), which attach and then
detach without processing events, SIGSTOP might leak to be delivered
after PT_DETACH.  This results in the process being unintentionally
stopped after detach, which is fatal for automatic tools.

The solution is to force SIGSTOP to be the first signal reported after
the attach.  Attach code is modified to set P2_PTRACE_FSTP to indicate
that the attaching ritual was not yet finished, and issignal() prefers
SIGSTOP in that condition.  Also, the thread which handles
P2_PTRACE_FSTP is made to guarantee to own p_xthread during the first
waitpid(2).  All that ensures that SIGSTOP is consumed first.

Additionally, if P2_PTRACE_FSTP is still set on detach, which means
that waitpid(2) was not called at all, SIGSTOP is removed from the
queue, ensuring that the process is resumed on detach.

In issignal(), when acting on STOPing signals, remove the signal from
queue before suspending.  Otherwise parallel attach could result in
ptracestop() acting on that STOP as if it was the STOP signal from the
attach.  Then SIGSTOP from attach leaks again.

As a minor refactoring, some bits of the common attach code is moved
to new helper proc_set_traced().

Reported by:	markj
Reviewed by:	jhb, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D7256
2016-07-28 08:41:13 +00:00
Stephen J. Kiernan
4ac21b4f09 Prepare for network stack as a module
- Move cr_canseeinpcb to sys/netinet/in_prot.c in order to separate the
   INET and INET6-specific code from the rest of the prot code (It is only
   used by the network stack, so it makes sense for it to live with the
   other network stack code.)
 - Move cr_canseeinpcb prototype from sys/systm.h to netinet/in_systm.h
 - Rename cr_seeotheruids to cr_canseeotheruids and cr_seeothergids to
   cr_canseeothergids, make them non-static, and add prototypes (so they
   can be seen/called by in_prot.c functions.)
 - Remove sw_csum variable from ip6_forward in ip6_forward.c, as it is an
   unused variable.

Reviewed by:	gnn, jtl
Approved by:	sjg (mentor)
Sponsored by:	Juniper Networks, Inc.
Differential Revision:	https://reviews.freebsd.org/D2901
2016-07-27 20:34:09 +00:00
John Baldwin
b9a53e161b Adjust tests in fsync job scheduling loop to reduce indentation. 2016-07-27 19:31:25 +00:00
Ed Maste
18c23dffac ANSIfy kern_proc.c and delete register keyword
Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D6478
2016-07-27 14:27:08 +00:00
Konstantin Belousov
1822421c0b Remove Giant from settime(), tc_setclock_mtx guards tc_windup() calls,
and there is no other issues with parallel settime().  Remove spl()
vestiges there as well.

Tested by:	pho (as part of the whole patch)
Reviewed by:	jhb (same)
Discussed wit:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
Differential revision:	https://reviews.freebsd.org/D7302
2016-07-27 11:54:24 +00:00
Konstantin Belousov
5760b029ee Prevent parallel tc_windup() calls, both parallel top-level calls from
setclock() and from simultaneous top-level and interrupt.  For this,
tc_windup() is protected with a tc_setclock_mtx spinlock, in the try
mode when called from hardclock interrupt.  If spinlock cannot be
obtained without spinning from the interrupt context, this means that
top-level executes tc_windup() on other core and our try may be
avoided.

The boottimebin and boottime variables should be adjusted from
tc_windup().  To be correct, they must be part of the timehands and
read using lockless protocol.  Remove the globals and reimplement the
getboottime(9)/getboottimebin(9) KPI using the timehands read
protocol.

Tested by:	pho (as part of the whole patch)
Reviewed by:	jhb (same)
Discussed wit:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
X-Differential revision:	https://reviews.freebsd.org/D7302
2016-07-27 11:49:41 +00:00
Konstantin Belousov
4493f659e5 Fix a bug in r302252.
Change ntpadj_lock to spinlock always, and rename stuff removing
ADJ/adj from the names. ntp_update_second() requires ntp_lock and is
called from the tc_windup(), so ntp_lock must be a spinlock.  Add
missed lock to ntp_update_second().

Tested by:	pho (as part of the whole patch)
Reviewed by:	jhb (same)
Noted by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
X-Differential revision:	https://reviews.freebsd.org/D7302
2016-07-27 11:40:06 +00:00
Konstantin Belousov
21547fc7ca Reduce the resettodr_lock scope to only CLOCK_SETTIME() call.
Tested by:	pho (as part of the whole patch)
Reviewed by:	jhb (same)
Discussed with:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
X-Differential revision:	https://reviews.freebsd.org/D7302
2016-07-27 11:34:25 +00:00
Konstantin Belousov
4d29106e55 Style.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
X-Differential revision:	https://reviews.freebsd.org/D7302
2016-07-27 11:33:33 +00:00
Konstantin Belousov
a83c016f00 Reduce number of timehands to just two. This is useful because
consumers can now be only one tc_windup() call late.

Use C99 initialization.

Tested by:	pho (as part of the whole patch)
Reviewed by:	jhb (same)
Discussed with:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
X-Differential revision:	https://reviews.freebsd.org/D7302
2016-07-27 11:27:52 +00:00
Konstantin Belousov
584b675ed6 Hide the boottime and bootimebin globals, provide the getboottime(9)
and getboottimebin(9) KPI. Change consumers of boottime to use the
KPI.  The variables were renamed to avoid shadowing issues with local
variables of the same name.

Issue is that boottime* should be adjusted from tc_windup(), which
requires them to be members of the timehands structure.  As a
preparation, this commit only introduces the interface.

Some uses of boottime were found doubtful, e.g. NLM uses boottime to
identify the system boot instance.  Arguably the identity should not
change on the leap second adjustment, but the commit is about the
timekeeping code and the consumers were kept bug-to-bug compatible.

Tested by:	pho (as part of the bigger patch)
Reviewed by:	jhb (same)
Discussed with:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
X-Differential revision:	https://reviews.freebsd.org/D7302
2016-07-27 11:08:59 +00:00
Stephen J. Kiernan
cc37baea09 Add the NUM_CORE_FILES kernel config option which specifies the limit for the
number of core files allowed by a particular process when using the %I core
file name pattern.

Sanity check at compile time to ensure the value is within the valid range of
0-10.

Reviewed by:	jtl, sjg
Approved by:	sjg (mentor)
Sponsored by:	Juniper Networks, Inc.
Differential Revision:	https://reviews.freebsd.org/D6812
2016-07-27 03:21:02 +00:00
Ed Schouten
f63cd251b2 Add shmatt_t.
It looks like our "struct shmid_ds::shm_nattch" deviates from the
standard in the sense that it is a signed integer, whereas POSIX
requires that it is unsigned, having a special type shmatt_t.

Patch up our native and 32-bit copies to use a new shmatt_t that is an
unsigned integer. As it's unsigned, we can relax the comparisons that
are performed on it. Leave the Linux, iBCS2, etc. copies of the
structure alone.

Reviewed by:	ngie
Differential Revision:	https://reviews.freebsd.org/D6655
2016-07-26 17:23:49 +00:00
Conrad Meyer
af326ace9d devfs: Move most ioctl logic down to vnode layer
Devfs' file layer ioctl is now just a thin shim around the vnode layer.

Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D7286
2016-07-25 16:28:02 +00:00
Konstantin Belousov
90b581f2cc Implement mtx_trylock_spin(9).
Discussed with:	bde
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D7192
2016-07-23 05:30:55 +00:00
John Baldwin
9c20dc9963 Add more documentation regarding unsafe AIO requests.
The asynchronous I/O changes made previously result in different
behavior out of the box. Previously all AIO requests failed with
ENOSYS / SIGSYS unless aio.ko was explicitly loaded. Now, some AIO
requests complete and others ("unsafe" requests) fail with EOPNOTSUPP.

Reword the introductory paragraph in aio(4) to add a general
description of AIO before describing the vfs.aio.enable_unsafe sysctl.

Remove the ENOSYS error description from aio_fsync(2), aio_read(2),
and aio_write(2) and replace it with a description of EOPNOTSUPP.

Remove the ENOSYS error description from aio_mlock(2).

Log a message to the system log the first time a process requests an
"unsafe" AIO request that fails with EOPNOTSUPP. This is modeled on
the log message used for processes using the legacy pty devices.

Reviewed by:	kib (earlier version)
MFC after:	1 week
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D7151
2016-07-21 22:49:47 +00:00
Konstantin Belousov
492fe1b774 Hide counted_warning(9) under #ifdef _KERNEL braces, to allow building
subr_prf.c in userspace for libsbuf.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-07-21 17:59:30 +00:00
Konstantin Belousov
9fe297bbdc Declare aio requests on files from local filesystems safe.
Two notes:
- I allow AIO on reclaimed vnodes, since it is deterministically terminated
  fast.
- devfs mounts are marked as MNT_LOCAL, but device vnodes have type
  VCHR, so the slow device io is not allowed.

Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D7273
2016-07-21 17:07:06 +00:00
Konstantin Belousov
9837947b07 Provide counter_warning(9) KPI which allows to issue limited number of
warnings for some kernel events, mostly intended for the use of
obsoleted or otherwise undersired interfaces.

This is an abstracted and race-expelled code from compat pty driver.

Requested and reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D7270
2016-07-21 16:34:56 +00:00
Conrad Meyer
1005d8aff4 imgact_elf: Rename the segment iterator to match reality
The each_writable_segment routine evaluates segments on a slightly little more
nuanced metric than simply "writable" or not.  Rename the function to more
closely match its behavior (each_dumpable_segment).

Suggested by:	jhb
Sponsored by:	EMC / Isilon Storage Division
2016-07-20 22:51:33 +00:00
Conrad Meyer
f3325003d9 ANSI-fy imgact_elf.c
Sponsored by:	EMC / Isilon Storage Division
2016-07-20 22:46:56 +00:00
Conrad Meyer
07f825e871 Fix DEBUG build on 64-bit arch after r303099
Reported by:	Larry Rosenman <ler at lerctr.org>
2016-07-20 18:11:22 +00:00
Conrad Meyer
c17b0bd2a6 Extend ELF coredump to support more than 65535 segments
The ELF e_phnum field is only 16 bits wide. To support more than 65535 segments
(program headers), Sun's "Linker and Libraries Guide" table 7-7 (or 12-7,
depending on document version) prescribes a special first section header where
sh_info represents the real number of program headers.

Test code to follow, when it is ready.

Reference:	http://docs.oracle.com/cd/E18752_01/pdf/817-1984.pdf

Reviewed by:	emaste, markj
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D7255
2016-07-20 16:59:36 +00:00
Gleb Smirnoff
9f3391243b Redo the r302894: the very new value for a non-scheduled callout is -1.
This was recently added in r290664.

Noticed by:	hselasky
Tested by:	Larry Rosenman <ler lerctr.org>
PR:		210884
2016-07-20 16:48:25 +00:00
Gleb Smirnoff
47e4280922 Revert r303037. It re-introduces the panic with TCP timers.
Agreed by:	rrs, re (gjb)
2016-07-20 16:44:22 +00:00
Randall Stewart
3d84a18803 This reverts out Gleb's changes and adds three small
fixes that I think closes up the races Gleb was
looking for. This is running quite nicely in Netflix and
now no longer causes TCP-tcb leaks.

Differential Revision:	7135
2016-07-19 18:31:19 +00:00
John Baldwin
ccb83afd81 Include process IDs in core dumps.
When threads were added to the kernel, the pr_pid member of the
NT_PRSTATUS note was repurposed to store LWP IDs instead of process
IDs.  However, the process ID was no longer recorded in core dumps.
This change adds a pr_pid field to prpsinfo (NT_PRSINFO).  Rather than
bumping the prpsinfo version number, note parsers can use the note's
payload size to determine if pr_pid is present.

Reviewed by:	kib, emaste (older version)
MFC after:	2 months
Differential Revision:	https://reviews.freebsd.org/D7117
2016-07-18 15:14:23 +00:00
John Baldwin
fc4f075a1a Add PTRACE_VFORK to trace vfork events.
First, PL_FLAG_FORKED events now also set a PL_FLAG_VFORKED flag when
the new child was created via vfork() rather than fork().  Second, a
new PL_FLAG_VFORK_DONE event can now be enabled via the PTRACE_VFORK
event mask.  This new stop is reported after the vfork parent resumes
due to the child calling exit or exec.  Debuggers can use this stop to
reinsert breakpoints in the vfork parent process before it resumes.

Reviewed by:	kib
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D7045
2016-07-18 14:53:55 +00:00
Konstantin Belousov
77d6809483 The assertion re-added in r302614 was triggered when stopping signal
is delivered to vforked child.  Issue is that we avoid stopping such
children in issignal() to not block parents.  But executed AST, which
ignored stops, leaves the child with the signal pending but no AST
pending.

On first exec after vfork(), call signotify() to handle pending
reenabled signals.  Adjust the assert to not check vfork children
until exec.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-07-18 10:53:47 +00:00
Gleb Smirnoff
809a9d1353 Revert the last commit. It must get more review and testing first. 2016-07-18 09:29:08 +00:00
Gleb Smirnoff
ef58c6a7a3 Redo the r302894: the very new value for a non-scheduled callout is -1.
This was recently added in r290664.

Noticed by:	hselasky
PR:		210884
2016-07-18 09:26:06 +00:00
Konstantin Belousov
56d48d6dcb Fix another bug after r302350.
Reported and tested by:	pho
PR:	210884
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2016-07-18 04:30:34 +00:00
Konstantin Belousov
86f1146329 Another issue reported on http://seclists.org/oss-sec/2016/q3/68 is
that struct kevent member ident has uintptr_t type, which is silently
truncated to int in the call to fget().  Explicitely check for the
valid range.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-07-16 13:24:58 +00:00
Konstantin Belousov
f470cca578 In ptrace_vm_entry(), do not call vmspace_free() while owning a vm
object lock.

The vmspace_free() operations might need to lock map, object etc on
last dereference.  Postpone the free until object's inspection is
done.

Reported and tested by:	will
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-07-15 23:26:33 +00:00
John Baldwin
8d570f64aa Add a mask of optional ptrace() events.
ptrace() now stores a mask of optional events in p_ptevents.  Currently
this mask is a single integer, but it can be expanded into an array of
integers in the future.

Two new ptrace requests can be used to manipulate the event mask:
PT_GET_EVENT_MASK fetches the current event mask and PT_SET_EVENT_MASK
sets the current event mask.

The current set of events include:
- PTRACE_EXEC: trace calls to execve().
- PTRACE_SCE: trace system call entries.
- PTRACE_SCX: trace syscam call exits.
- PTRACE_FORK: trace forks and auto-attach to new child processes.
- PTRACE_LWP: trace LWP events.

The S_PT_SCX and S_PT_SCE events in the procfs p_stops flags have
been replaced by PTRACE_SCE and PTRACE_SCX.  PTRACE_FORK replaces
P_FOLLOW_FORK and PTRACE_LWP replaces P2_LWP_EVENTS.

The PT_FOLLOW_FORK and PT_LWP_EVENTS ptrace requests remain for
compatibility but now simply toggle corresponding flags in the
event mask.

While here, document that PT_SYSCALL, PT_TO_SCE, and PT_TO_SCX both
modify the event mask and continue the traced process.

Reviewed by:	kib
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D7044
2016-07-15 15:32:09 +00:00
Gleb Smirnoff
2138e263cb Fix regression introduced by r302350. The change of return value for a
callout that wasn't scheduled at all was unintentional and yielded in
several panics.

PR:		210884
2016-07-15 09:28:32 +00:00
Konstantin Belousov
d7e8cfd63d Do not allow creation of char or block special nodes with VNOVAL dev_t.
As was reported on http://seclists.org/oss-sec/2016/q3/68, tmpfs code
contains assertion that rdev != VNOVAL.  On FreeBSD, there is no other
consequences except triggering the assert.  To be compatible with
systems where device nodes have some significance, reject mknod(2)
call with dev == VNOVAL at the syscall level.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-07-15 09:23:18 +00:00
John Baldwin
c77547d2f9 Include command line arguments in core dump process info.
Fill in pr_psargs in the NT_PRSINFO ELF core dump note with command
line arguments.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D7116
2016-07-14 23:20:05 +00:00
Mark Johnston
0649caae32 Let DDB's buf printer handle NULL pointers in the buf page array.
A buf's b_pages and b_npages fields may be inconsistent after a panic.
For instance, vfs_vmio_invalidate() sets b_npages to zero only after all
pages are unwired and their page array entries are cleared.

MFC after:	1 week
Sponsored by:	EMC / Isilon Storage Division
2016-07-14 18:49:05 +00:00
Eric Badger
fdb6320d45 Add explicit detection of KVM hypervisor
Set vm_guest to a new enum value (VM_GUEST_KVM) when kvm is detected and use
vm_guest in conditionals testing for KVM.

Also, fix a conditional checking if we're running in a VM which caught only
the generic VM case, but not more specific VMs (KVM, VMWare, etc.).  (Spotted
by: vangyzen).

Differential revision:	https://reviews.freebsd.org/D7172
Sponsored by:	Dell Inc.
Approved by:	kib (mentor), vangyzen (mentor)
Reviewed by:	alc
MFC after:	4 weeks
2016-07-13 19:19:18 +00:00
Konstantin Belousov
de56aee0bf Trace timeval parameters to the getitimer(2) and setitimer(2) syscalls.
Reviewed by:	jhb
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D7158
2016-07-13 14:37:58 +00:00
Konstantin Belousov
8f01bee46b Revive the check, disabled in r197963.
Despite the implication (process has pending signals -> the current
thread marked for AST and has TDF_NEEDSIGCHK set) is not true due to
other thread might manipulate its signal blocking mask, it should still
hold for the single-threaded processes.  Enable check for the condition
for single-threaded case, and replicate it from userret() to ast() as
well, where we check that ast indeed has no signal to deliver.

Note that the check is under DIAGNOSTIC, it is not enabled for INVARIANTS
but !DIAGNOSTIC since it imposes too heavy-weight locking for day-to-day
used debugging kernel.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-07-12 03:53:15 +00:00
Konstantin Belousov
4f4c35bb38 Add assert to complement r302328.
AST must not execute with TDF_SBDRY or TDF_SEINTR/TDF_SERESTART thread
flags set, which is asserted in userret(). As the consequence, -1 return
from cursig() must not be possible.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-07-12 03:52:05 +00:00
Nathan Whitehorn
8c636a11dc Remove assumptions in MI code that the BSP is CPU 0.
MFC after:	2 weeks
2016-07-11 21:25:28 +00:00
Konstantin Belousov
725496ced5 Fix grammar.
Submitted by:	alc
MFC after:	2 weeks
2016-07-11 17:04:22 +00:00
Konstantin Belousov
19efd8a5a8 In vgonel(), postpone setting BO_DEAD until VOP_RECLAIM() is called,
if vnode is VMIO.  For VMIO vnodes, set BO_DEAD in vm_object_terminate().

The vnode_destroy_object(), when calling into vm_object_terminate(),
must be able to flush buffers.  BO_DEAD purpose is to quickly destroy
buffers on write when the underlying vnode is not operable any more
(one example is the devfs node after geom is gone).  Setting BO_DEAD
for reclaiming vnode before object is terminated is premature, and
results in unability to flush buffers with live SU dependencies from
vinvalbuf() in vm_object_terminate().

Reported by:	David Cross <dcrosstech@gmail.com>
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-07-11 14:19:09 +00:00
Robert Watson
5fa69ff015 In process-descriptor close(2) and fstat(2), audit target process
information.  pgkill(2) already audits target process ID.

MFC after:	3 days
Sponsored by:	DARPA, AFRL
2016-07-10 14:17:36 +00:00
Robert Watson
e5ec733909 Do allow auditing of read(2) and write(2) system calls, by assigning
those system calls audit event identifiers AUE_READ and AUE_WRITE.
While auditing file-descriptor I/O is not required by the Common
Criteria, in practice this proves useful for both live and forensic
analysis.

NB: freebsd32 already assigns AUE_READ and AUE_WRITE to read(2) and
write(2).

MFC after:	3 days
Sponsored by:	DARPA, AFRL
2016-07-10 13:42:33 +00:00
Robert Watson
8ec75c0fc3 Audit the file-descriptor number argument for openat(2). Remove a comment
about the desirability of auditing the number, as it was in fact in the
wrong place (in the common path for open(2) and openat(2), and only the
latter accepts a file-descriptor argument).  Where other ABIs support
openat(2), it may be necessary to do additional argument auditing as it is
not performed in kern_openat(9).

MFC after:	3 days
Sponsored by:	DARPA, AFRL
2016-07-10 09:50:21 +00:00
Robert Watson
51d1f69069 Audit file-descriptor arguments to I/O system calls such as
read(2), write(2), dup(2), and mmap(2).  This auditing is not
required by the Common Criteria (and hence was not being
performed), but is valuable in both contemporary live analysis
and forensic use cases.

MFC after:	3 days
Sponsored by:	DARPA, AFRL
2016-07-10 08:04:02 +00:00
Edward Tomasz Napierala
debc480e03 Add new unmount(2) flag, MNT_NONBUSY, to check whether there are
any open vnodes before proceeding. Make autounmound(8) use this flag.
Without it, even an unsuccessfull unmount causes filesystem flush,
which interferes with normal operation.

Reviewed by:	kib@
Approved by:	re (gjb@)
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D7047
2016-07-07 09:03:57 +00:00
Nathan Whitehorn
96c85efb4b Replace a number of conflations of mp_ncpus and mp_maxid with either
mp_maxid or CPU_FOREACH() as appropriate. This fixes a number of places in
the kernel that assumed CPU IDs are dense in [0, mp_ncpus) and would try,
for example, to run tasks on CPUs that did not exist or to allocate too
few buffers on systems with sparse CPU IDs in which there are holes in the
range and mp_maxid > mp_ncpus. Such circumstances generally occur on
systems with SMT, but on which SMT is disabled. This patch restores system
operation at least on POWER8 systems configured in this way.

There are a number of other places in the kernel with potential problems
in these situations, but where sparse CPU IDs are not currently known
to occur, mostly in the ARM machine-dependent code. These will be fixed
in a follow-up commit after the stable/11 branch.

PR:		kern/210106
Reviewed by:	jhb
Approved by:	re (glebius)
2016-07-06 14:09:49 +00:00
Gleb Smirnoff
d153eeee97 The paradigm of a callout is that it has three consequent states:
not scheduled -> scheduled -> running -> not scheduled. The API and the
manual page assume that, some comments in the code assume that, and looks
like some contributors to the code also did. The problem is that this
paradigm isn't true. A callout can be scheduled and running at the same
time, which makes API description ambigouous. In such case callout_stop()
family of functions/macros should return 1 and 0 at the same time, since it
successfully unscheduled future callout but the current one is running.
Before this change we returned 1 in such a case, with an exception that
if running callout was migrating we returned 0, unless CS_MIGRBLOCK was
specified.

With this change, we now return 0 in case if future callout was unscheduled,
but another one is still in action, indicating to API users that resources
are not yet safe to be freed.

However, the sleepqueue code relies on getting 1 return code in that case,
and there already was CS_MIGRBLOCK flag, that covered one of the edge cases.
In the new return path we will also use this flag, to keep sleepqueue safe.

Since the flag CS_MIGRBLOCK doesn't block migration and now isn't limited to
migration edge case, rename it to CS_EXECUTING.

This change fixes panics on a high loaded TCP server.

Reviewed by:	jch, hselasky, rrs, kib
Approved by:	re (gjb)
Differential Revision:	https://reviews.freebsd.org/D7042
2016-07-05 18:47:17 +00:00
Gleb Smirnoff
a0d20ecb94 Compile in the kassert_panic() function with INVARIANT_SUPPORT
option, not INVARIANTS.  The function is required if we want
to load in a module that is compiled with INVARIANTS.

Reviewed by:	jhb
Approved by:	re (gjb)
2016-07-05 18:34:34 +00:00
Mark Johnston
f61d6c5a5b Ensure that spinlock sections are balanced even after a panic.
vpanic() uses spinlock_enter() to disable interrupts before dumping core.
However, when the scheduler is stopped and INVARIANTS is not configured,
thread_lock() does not acquire a spinlock section, while thread_unlock()
releases one. This can result in interrupts staying enabled while the
kernel dumps core, complicating post-mortem analysis of the crash.

Approved by:	re (gjb)
MFC after:	1 week
Sponsored by:	EMC / Isilon Storage Division
2016-07-05 17:59:04 +00:00
Robert Watson
971711fb7c Call audit hooks to capture vnode attributes for three file-descriptor
method implementations: fstat(2), close(2), and poll(2).  This change
synchronises auditing here with similar auditing for VFS-specific system
calls such as stat(2) that audit more complete vnode information.

Sponsored by:	DARPA, AFRL
Approved by:	re (kib)
MFC after:	1 week
2016-07-05 16:37:01 +00:00
Ed Maste
1cbb879df8 add description for debug.elf{32,64}_legacy_coredump sysctl
Approved by:	re (kib)
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2016-07-05 14:46:06 +00:00
Konstantin Belousov
46e47c4f8d Provide helper macros to detect 'non-silent SBDRY' state and to
calculate appropriate return value for stops.  Simplify the code by
using them.

Fix typo in sig_suspend_threads().  The thread which sleep must be
aborted is td2. (*)

In issignal(), when handling stopping signal for thread in
TD_SBDRY_INTR state, do not stop, this is wrong and fires assert.
This is yet another place where execution should be forced out of
SBDRY-protected region.  For such case, return -1 from issignal() and
translate it to corresponding error code in sleepq_catch_signals().
Assert that other consumers of cursig() are not affected by the new
return value. (*)

Micro-optimize, mostly VFS and VOP methods, by avoiding calling the
functions when SIGDEFERSTOP_NOP non-change is requested. (**)

Reported and tested by:	pho (*)
Requested by:	bde (**)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Approved by:	re (gjb)
2016-07-03 18:19:48 +00:00
Konstantin Belousov
d12d6a8476 Remove racy assert. The thread which changes vnode usecount from 0 to 1
does it under the vnode interlock, but the interlock is not owned by the
asserting thread.  As result, we might read increased use counter but also
still see VI_OWEINACT.

In collaboration with: nwhitehorn
Hardware donated by: IBM LTC
Sponsored by:	The FreeBSD Foundation (kib)
Approved by:	re (gjb)
2016-07-03 01:56:48 +00:00
Konstantin Belousov
e18ee4957d When a process knote was attached to the process which is already exiting,
the knote is activated immediately.  If the exit1() later activates
knotes, such knote is attempted to be activated second time.  Detect
the condition by zeroed kn_ptr.p_proc pointer, and avoid excessive
activation.

Before r302235, such knotes were removed from the knlist immediately
upon activation.

Reported by:	truckman
Sponsored by:	The FreeBSD Foundation
Approved by:	re (gjb)
2016-07-01 20:11:28 +00:00
Konstantin Belousov
364c516cff Currently the ntptime code and resettodr() are Giant-locked. In
particular, the Giant is supposed to protect against parallel
ntp_adjtime(2) invocations.  But, for instance, sys_ntp_adjtime() does
copyout(9) under Giant and then examines time_status to return syscall
result.  Since copyout(9) could sleep, the syscall result might be
inconsistent.

Another and more important issue is that if PPS is configured,
hardpps(9) is executed without any protection against the parallel
top-level code invocation. Potentially, this may result in the
inconsistent state of the ntptime state variables, but I cannot say
how serious such distortion is. The non-functional splclock() call in
sys_ntp_adjtime() protected against clock interrupts calling hardpps()
in the pre-SMP era.

Modernize the locking. A mutex protects ntptime data.  Due to the
hardpps() KPI legitimately serving from the interrupt filters (and
e.g. uart(4) does call it from filter), the lock cannot be sleepable
mutex if PPS_SYNC is defined.  Otherwise, use normal sleepable mutex
to reduce interrupt latency.

Reviewed by:	  imp, jhb
Sponsored by:	  The FreeBSD Foundation
Approved by:	  re (gjb)
Differential revision:	https://reviews.freebsd.org/D6825
2016-06-28 16:43:23 +00:00
Konstantin Belousov
3a0e6f920a Do not use Giant to prevent parallel calls to CLOCK_SETTIME(). Use
private mtx in resettodr(), no implementation of CLOCK_SETTIME() is
allowed to sleep.

Reviewed by:	  imp, jhb
Sponsored by:	  The FreeBSD Foundation
Approved by:	  re (gjb)
X-Differential revision:	https://reviews.freebsd.org/D6825
2016-06-28 16:42:40 +00:00
Konstantin Belousov
6f56cb8dbf Complete r302215. TDF_SBDRY | TDF_SERESTART and TDF_SBDRY |
TDF_SEINTR flags values, unlike TDF_SBDRY, must be treated almost as
if TDF_SBDRY is not set for STOP signal delivery.  The only difference
is that sig_suspend_threads() should abort the sleep instead of doing
immediate suspension.

Reported by:	ngie
Sponsored by:	The FreeBSD Foundation
MFC after:	12 days
Approved by:	re (gjb)
2016-06-28 16:41:50 +00:00
Konstantin Belousov
9eb3f14324 Fix userspace build after r302235: do not expose bool field of the
structure, change it to int.

The real fix is to sanitize user-visible definitions in sys/event.h,
e.g. the affected struct knlist is of no use for userspace programs.

Reported and tested by:	jkim
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Approved by:	re (gjb)
2016-06-27 23:34:53 +00:00
Konstantin Belousov
9e590ff04b When filt_proc() removes event from the knlist due to the process
exiting (NOTE_EXIT->knlist_remove_inevent()), two things happen:
- knote kn_knlist pointer is reset
- INFLUX knote is removed from the process knlist.
And, there are two consequences:
- KN_LIST_UNLOCK() on such knote is nop
- there is nothing which would block exit1() from processing past the
  knlist_destroy() (and knlist_destroy() resets knlist lock pointers).
Both consequences result either in leaked process lock, or
dereferencing NULL function pointers for locking.

Handle this by stopping embedding the process knlist into struct proc.
Instead, the knlist is allocated together with struct proc, but marked
as autodestroy on the zombie reap, by knlist_detach() function.  The
knlist is freed when last kevent is removed from the list, in
particular, at the zombie reap time if the list is empty.  As result,
the knlist_remove_inevent() is no longer needed and removed.

Other changes:

In filt_procattach(), clear NOTE_EXEC and NOTE_FORK desired events
from kn_sfflags for knote registered by kernel to only get NOTE_CHILD
notifications.  The flags leak resulted in excessive
NOTE_EXEC/NOTE_FORK reports.

Fix immediate note activation in filt_procattach().  Condition should
be either the immediate CHILD_NOTE activation, or immediate NOTE_EXIT
report for the exiting process.

In knote_fork(), do not perform racy check for KN_INFLUX before kq
lock is taken.  Besides being racy, it did not accounted for notes
just added by scan (KN_SCAN).

Some minor and incomplete style fixes.

Analyzed and tested by:	Eric Badger <eric@badgerio.us>
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Approved by:	re (gjb)
Differential revision:	https://reviews.freebsd.org/D6859
2016-06-27 21:52:17 +00:00
Konstantin Belousov
883a5a4a6a When sleeping waiting for either local or remote advisory lock,
interrupt sleeps with the ERESTART on the suspension attempts.
Otherwise, single-threading requests are deferred until the locks are
granted for NFS files, which causes hangs.

When retrying local registration of the remotely-granted adv lock,
allow full suspension and check for suspension, for usual reasons.

Reported by:	markj, pho
Reviewed by:	jilles
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Approved by:	re (gjb)
2016-06-26 20:08:42 +00:00
Konstantin Belousov
3a1e5dd8e6 Rewrite sigdeferstop(9) and sigallowstop(9) into more flexible
framework allowing to set the suspension policy for the dynamic block.
Extend the currently possible policies of stopping on interruptible
sleeps and ignoring such sleeps by two more: do not suspend at
interruptible sleeps, but interrupt them with either EINTR or ERESTART.

Reviewed by:	jilles
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Approved by:	re (gjb)
2016-06-26 20:07:24 +00:00
Konstantin Belousov
8a06de9e92 Do not clear robust lists pointers on fork. The forked child thread
lists must be functional.

Reported by:	Daniel Engberg <daniel.engberg.lists@pyret.net>,
	Guy Yur <guyyur@gmail.com>
Tested by:	Guy Yur <guyyur@gmail.com>
Sponsored by:	The FreeBSD Foundation
Approved by:	re (gjb), including the KBI change
2016-06-25 11:31:25 +00:00
Jilles Tjoelker
6ea906eec0 posixshm: Fix lock leak when mac_posixshm_check_read rejects read.
While reading the code, I noticed that shm_read() returns without unlocking
foffset and rangelock if mac_posixshm_check_read() rejects the read.

Reviewed by:	kib, jhb, rwatson
Approved by:	re (gjb)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D6927
2016-06-23 20:59:13 +00:00
Brooks Davis
a72c64b0b6 Generate syscall tables and update pipe() implementation after r302094.
Mark the pipe() system call as COMPAT10.

As of r302092 libc uses pipe2() with a zero flags value instead of pipe().

Approved by:	re (gjb)
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D6816
2016-06-22 21:18:19 +00:00
Brooks Davis
e16e64098c Mark the pipe() system call as COMPAT10.
As of r302092 libc uses pipe2() with a zero flags value instead of pipe().

Commit with regenerated files and implementation to follow.

Approved by:	re (gjb)
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D6816
2016-06-22 21:15:59 +00:00
Brooks Davis
e52e02ba24 Add support for COMPAT10 keywords in syscalls.master.
Approved by:	re (gjb)
Sponsored by:	DARPA, AFRL
2016-06-22 21:12:53 +00:00
John Baldwin
b1012d8036 Account for AIO socket operations in thread/process resource usage.
File and disk-backed I/O requests store counts of read/written disk
blocks in each AIO job so that they can be charged to the thread that
completes an AIO request via aio_return() or aio_waitcomplete().  This
change extends AIO jobs to store counts of received/sent messages and
updates socket backends to set these counts accordingly.  Note that
the socket backends are careful to only charge a single messages for
each AIO request even though a single request on a blocking socket might
invoke sosend or soreceive multiple times.  This is to mimic the
resource accounting of synchronous read/write.

Adjust the UNIX socketpair AIO test to verify that the message resource
usage counts update accordingly for aio_read and aio_write.

Approved by:	re (hrs)
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D6911
2016-06-21 22:19:06 +00:00
Bjoern A. Zeeb
89856f7e2d Get closer to a VIMAGE network stack teardown from top to bottom rather
than removing the network interfaces first. This change is rather larger
and convoluted as the ordering requirements cannot be separated.

Move the pfil(9) framework to SI_SUB_PROTO_PFIL, move Firewalls and
related modules to their own SI_SUB_PROTO_FIREWALL.
Move initialization of "physical" interfaces to SI_SUB_DRIVERS,
move virtual (cloned) interfaces to SI_SUB_PSEUDO.
Move Multicast to SI_SUB_PROTO_MC.

Re-work parts of multicast initialisation and teardown, not taking the
huge amount of memory into account if used as a module yet.

For interface teardown we try to do as many of them as we can on
SI_SUB_INIT_IF, but for some this makes no sense, e.g., when tunnelling
over a higher layer protocol such as IP. In that case the interface
has to go along (or before) the higher layer protocol is shutdown.

Kernel hhooks need to go last on teardown as they may be used at various
higher layers and we cannot remove them before we cleaned up the higher
layers.

For interface teardown there are multiple paths:
(a) a cloned interface is destroyed (inside a VIMAGE or in the base system),
(b) any interface is moved from a virtual network stack to a different
network stack ("vmove"), or (c) a virtual network stack is being shut down.
All code paths go through if_detach_internal() where we, depending on the
vmove flag or the vnet state, make a decision on how much to shut down;
in case we are destroying a VNET the individual protocol layers will
cleanup their own parts thus we cannot do so again for each interface as
we end up with, e.g., double-frees, destroying locks twice or acquiring
already destroyed locks.
When calling into protocol cleanups we equally have to tell them
whether they need to detach upper layer protocols ("ulp") or not
(e.g., in6_ifdetach()).

Provide or enahnce helper functions to do proper cleanup at a protocol
rather than at an interface level.

Approved by:		re (hrs)
Obtained from:		projects/vnet
Reviewed by:		gnn, jhb
Sponsored by:		The FreeBSD Foundation
MFC after:		2 weeks
Differential Revision:	https://reviews.freebsd.org/D6747
2016-06-21 13:48:49 +00:00
Konstantin Belousov
d9a503bec4 Fix typo. Note that atomic is still required even for interlocked case.
Sponsored by:	The FreeBSD Foundation
Approved by:	re (marius)
2016-06-20 15:45:50 +00:00
Mateusz Guzik
e896fb3bae vfs: ifdef out noop vop_* primitives on !DEBUG_VFS_LOCKS kernels
This removes calls to empty functions like vop_lock_{pre/post} from
common vfs routines.

Approved by:	re (gjb)
2016-06-17 19:41:30 +00:00
Konstantin Belousov
f8a75278dc Add VFS interface to flush specified amount of free vnodes belonging
to mount points with the given filesystem type, specified by mount
vfs_ops pointer.

Based on patch by:	mckusick
Reviewed by:	avg, mckusick
Tested by:	allanjude, madpilot
Sponsored by:	The FreeBSD Foundation
Approved by:	re (gjb)
2016-06-17 17:33:25 +00:00
Konstantin Belousov
5c2cf81845 Update comments for the MD functions managing contexts for new
threads, to make it less confusing and using modern kernel terms.

Rename the functions to reflect current use of the functions, instead
of the historic KSE conventions:
  cpu_set_fork_handler -> cpu_fork_kthread_handler (for kthreads)
  cpu_set_upcall -> cpu_copy_thread (for forks)
  cpu_set_upcall_kse -> cpu_set_upcall (for new threads creation)

Reviewed by:	jhb (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (hrs)
Differential revision:	https://reviews.freebsd.org/D6731
2016-06-16 12:05:44 +00:00
Konstantin Belousov
bd07998e0e Remove XXX comments from kern_thread.c. In one case, there is no
reason for it in modern times.  In the other case, expand the comment
stating instead of doubting.

Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (hrs)
X-Differential revision:	https://reviews.freebsd.org/D6731
2016-06-16 12:01:11 +00:00
Konstantin Belousov
13d2cd3b68 Remove code duplication.
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (hrs)
X-Differential revision:	https://reviews.freebsd.org/D6731
2016-06-16 11:58:46 +00:00
John Baldwin
fe0bdd1d2c Move backend-specific fields of kaiocb into a union.
This reduces the size of kaiocb slightly. I've also added some generic
fields that other backends can use in place of the BIO-specific fields.

Change the socket and Chelsio DDP backends to use 'backend3' instead of
abusing _aiocb_private.status directly. This confines the use of
_aiocb_private to the AIO internals in vfs_aio.c.

Reviewed by:	kib (earlier version)
Approved by:	re (gjb)
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D6547
2016-06-15 20:56:45 +00:00
Konstantin Belousov
9fdbfd3b6c Do not assume that we own the use reference on the covered vnode until
we set MNTK_UNMOUNT flag on the mp.  Otherwise parallel unmount which
wins race with us could dereference the covered vnode, and we are
left with the locked freed memory.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
Approved by:	re (gjb)
MFC after:	1 week
2016-06-15 15:56:03 +00:00
Jamie Gritton
932a6e432d Fix a vnode leak when giving a child jail a too-long path when
debug.disablefullpath=1.
2016-06-09 21:59:11 +00:00
Jamie Gritton
cf0313c679 Re-order some jail parameter reading to prevent a vnode leak. 2016-06-09 20:43:14 +00:00
Jamie Gritton
176ff3a066 Clean up some logic in jail error messages, replacing a missing test and
a redundant test with a single correct test.
2016-06-09 20:39:57 +00:00
Mariusz Zaborski
fb4cdc96a8 Define tunable instead of using CTLFLAG_RWTUN flag with kern.corefile.
The allproc_lock lock used in the sysctl_kern_corefile function is initialized
in the procinit function which is called after setting sysctl values at boot.
That means if we set kern.corefile at boot we will be trying to use
lock with is uninitialized and machine will crash.

If we define kern.corefile as tunable instead of using CTFLAG_RWTUN we will
not call the sysctl_kern_corefile function and we will not use an uninitialized
lock. When machine will boot then we will start using function depending on
the lock.

Reviewed by:	pjd
2016-06-09 20:23:30 +00:00
Conrad Meyer
8a3aeac27b Add DDB command "kldstat"
It prints much the same information as kldstat(8) without any arguments.

Suggested by:	jhibbits
Sponsored by:	EMC / Isilon Storage Division
2016-06-09 18:27:41 +00:00
Conrad Meyer
dd6ea7f7bc kvprintf: Pad %*c to width, like %*s
Sponsored by:	EMC / Isilon Storage Division
2016-06-09 18:24:51 +00:00
Jamie Gritton
ef0ddea316 Make sure the OSD methods for jail set and remove can't run concurrently,
by holding allprison_lock exclusively (even if only for a moment before
downgrading) on all paths that call PR_METHOD_REMOVE.  Since they may run
on a downgraded lock, it's still possible for them to run concurrently
with PR_METHOD_GET, which will need to use the prison lock.
2016-06-09 16:41:41 +00:00
Jamie Gritton
5f02f22af1 Remove a comment that was part of copied code, and is misleading in
the new location.
2016-06-09 15:34:33 +00:00
Mark Johnston
508d856999 Fix some cosmetic issues in kern_fail.c omitted from r296927.
Obtained from:	Matthew Bryan <matthew.bryan@isilon.com>
2016-06-09 13:17:08 +00:00
Konstantin Belousov
3fc292d56b Old process credentials for setuid execve must not be dereferenced
when the process credentials were not changed.  This can happen if an
error occured trying to activate the setuid binary.  And on error, if
new credentials were not yet assigned, they must be freed to not
create the leak.

Use oldcred == NULL as the predicate to detect credential
reassignment.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
2016-06-08 04:37:03 +00:00
Mariusz Zaborski
b3a734483e Introduce the PD_CLOEXEC for pdfork(2).
Reviewed by:	mjg
2016-06-08 02:09:14 +00:00
Svatopluk Kraus
c4263292fe Remove temporary solution for storing interrupt mapping data as
it's not needed after r301451 and follow-ups r301453, r301539.

This makes INTRNG clean of all additions related to various buses.
2016-06-07 09:03:27 +00:00
Michal Meloun
949883bd72 INTRNG: As follow up of r301451, implement mapping and configuration
of gpio pin interrupts by new way.

Note: This removes last consumer of intr_ddata machinery and we remove it
in separate commit.
2016-06-07 05:08:24 +00:00
Bjoern A. Zeeb
3af72c1124 Implement a show panic command to DDB which will helpfully print the
panic string again if set, in case it scrolled out of the active
window.  This avoids having to remember the symbol name.

Also add a show callout <addr> command to DDB in order to inspect
some struct callout fields in case of panics in the callout code.
This may help to see if there was memory corruption or to further
ease debugging problems.

Obtained from:	projects/vnet
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Reviewed by:	jhb (comment only on the show panic initally)
Differential Revision:	https://reviews.freebsd.org/D4527
2016-06-06 20:57:24 +00:00
Konstantin Belousov
93ccd6bf87 Get rid of struct proc p_sched and struct thread td_sched pointers.
p_sched is unused.

The struct td_sched is always co-allocated with the struct thread,
except for the thread0.  Avoid useless indirection, instead calculate
td_sched location using simple pointer arithmetic in td_get_sched(9).
For thread0, which is statically allocated, create a structure to
emulate layout of the dynamic allocation.

Reviewed by:	jhb (previous version)
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D6711
2016-06-05 17:04:03 +00:00
Konstantin Belousov
314381b529 Use ANSI function definition.
Sponsored by:	The FreeBSD Foundation
2016-06-05 16:55:55 +00:00
Svatopluk Kraus
ad5244ece1 INTRNG - change the way how an interrupt mapping data are provided
to the framework in OFW (FDT) case.

This is a follow-up to r301451.

Differential Revision:	https://reviews.freebsd.org/D6634
2016-06-05 16:20:12 +00:00
Svatopluk Kraus
0869297dd9 (1) Add a new bus method to get a mapping data for an interrupt.
BUS_MAP_INTR() is used to get an interrupt mapping data according
to provided hints. The hints could be modified afterwards, but only
if mapping data was allocated. This method is intended to be called
before BUS_ALLOC_RESOURCE().

An interrupt mapping data describes an interrupt - hardware number,
type, configuration, cpu binding, and whatever is needed to setup it.

(2) Introduce a method which allows storing of an additional data
in struct resource to be available for bus drivers. This method is
convenient in two ways:
 - there is no need to rework existing bus drivers as they can simply
   be extended to provide an additional data,
 - there is no need to modify any existing bus methods as struct
   resource is already passed to them as argument and thus stored data
   is simply accessible by other bus drivers.
For now, implement this method only for INTRNG.

This is motivated by needs of modern SOCs where hardware initialization
is not straightforward and resources descriptions are complex, opaque
for everyone but provider, and may vary from SOC to SOC. Typical
situation is that one bus driver can fetch a resource description for
its child device, but it's opaque for this driver. Another bus driver
knows a provider for this kind of resource and can pass this resource
description to it. In fact, something like device IVARS would be
perfect for that if implemented generally enough. Unfortunatelly, IVARS
are usable only by their owners now. Only owner knows its IVARS layout,
thus other bus drivers are not able to use them.

Differential Revision:	https://reviews.freebsd.org/D6632
2016-06-05 16:07:57 +00:00
Andrew Turner
d1605cda2b Add an interface to handle interrupt controllers that have a contiguous
range of interrupts they pass to a second controller driver to handle.
The parent driver is expected to detect when one of these interrupts has
been triggered and call intr_child_irq_handler to pass the interrupt to
a child. The children controllers are then expected to manage the range
by allocating interrupts as needed.

This will initially be used by the ARM GICv3 driver, but is is expected to
be useful for other driver where this type of allocation applies.

Obtained from:	ABT Systems Ltd
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D6436
2016-06-03 10:13:18 +00:00
Mateusz Guzik
f3c8e16ea5 taskqueue: plug a leak in _taskqueue_create
While here make some style fixes and postpone the sprintf so that it is
only done when the function can no longer fail.

CID:	1356041
2016-06-02 15:52:34 +00:00
Mateusz Guzik
fc4f686d59 Microoptimize locking primitives by avoiding unnecessary atomic ops.
Inline version of primitives do an atomic op and if it fails they fallback to
actual primitives, which immediately retry the atomic op.

The obvious optimisation is to check if the lock is free and only then proceed
to do an atomic op.

Reviewed by:	jhb, vangyzen
2016-06-01 18:32:20 +00:00
Bjoern A. Zeeb
3f58662dd9 The pr_destroy field does not allow us to run the teardown code in a
specific order.  VNET_SYSUNINITs however are doing exactly that.
Thus remove the VIMAGE conditional field from the domain(9) protosw
structure and replace it with VNET_SYSUNINITs.
This also allows us to change some order and to make the teardown functions
file local static.
Also convert divert(4) as it uses the same mechanism ip(4) and ip6(4) use
internally.

Slightly reshuffle the SI_SUB_* fields in kernel.h and add a new ones, e.g.,
for pfil consumers (firewalls), partially for this commit and for others
to come.

Reviewed by:		gnn, tuexen (sctp), jhb (kernel.h)
Obtained from:		projects/vnet
MFC after:		2 weeks
X-MFC:			do not remove pr_destroy
Sponsored by:		The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D6652
2016-06-01 10:14:04 +00:00
Gleb Smirnoff
34e05ebe72 Fix kernel stack disclosures in the Linux and 4.3BSD compat layers.
Submitted by:	CTurt
Security:	SA-16:20
Security:	SA-16:21
2016-05-31 16:56:30 +00:00
Edward Tomasz Napierala
f7bd221730 Cosmetics - add missing space after ellipses in shutdown messages.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-05-31 15:27:33 +00:00
Jamie Gritton
ee8d6bd352 Mark jail(2), and the sysctls that it (and only it) uses as deprecated.
jail(8) has long used jail_set(2), and those sysctl only cause confusion.
2016-05-30 05:21:24 +00:00
Mateusz Guzik
2dbdf49cf4 fd: provide a common exit point for unlock in kern_dup
While here assert dropped filedesc lock on return from closefp.
2016-05-27 17:00:15 +00:00
Mateusz Guzik
cda688443a exec: get rid of one vnode lock/unlock pair in do_execve
The lock was temporarily dropped for vrele calls, but they can be
postponed to a point where the lock is not held in the first place.

While here shuffle other code not needing the lock.
2016-05-27 15:03:38 +00:00
Bryan Drewery
1afd78b34d exec: Provide execpath in imgp for the process_exec hook.
This was previously set after the hook and only if auxargs were present.
Now always provide it if possible.

MFC after:	2 weeks
Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D6546
2016-05-26 23:19:39 +00:00
Bryan Drewery
881010f05d exec: Add credential change information into imgp for process_exec hook.
This allows an EVENTHANDLER(process_exec) hook to see if the new image
will cause credentials to change whether due to setgid/setuid or because
of POSIX saved-id semantics.

This adds 3 new fields into image_params:
  struct ucred *newcred		Non-null if the credentials will change.
  bool credential_setid		True if the new image is setuid or setgid.

This will pre-determine the new credentials before invoking the image
activators, where the process_exec hook is called.  The new credentials
will be installed into the process in the same place as before, after
image activators are done handling the image.

MFC after:	2 weeks
Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D6544
2016-05-26 23:18:54 +00:00
Conrad Meyer
571ebf7685 crypto routines: Hint minimum buffer sizes to the compiler
Use the C99 'static' keyword to hint to the compiler IVs and output digest
sizes.  The keyword informs the compiler of the minimum valid size for a given
array.  Obviously not every pointer can be validated (i.e., the compiler can
produce false negative but not false positive reports).

No functional change.  No ABI change.

Sponsored by:	EMC / Isilon Storage Division
2016-05-26 19:29:29 +00:00
Hans Petter Selasky
84e717c4cf Add support for boolean sysctl's.
Because the size of bool can be implementation defined, make a bool
sysctl handler which handle bools. Userspace sees the bools like
unsigned 8-bit integers. Values are filtered to either 1 or 0 upon
read and write, similar to what a compiler would do.

Requested by:	kmacy @
Sponsored by:	Mellanox Technologies
2016-05-26 08:41:55 +00:00
Ian Lepore
a66dc0c52b Include machine/acle-compat.h in cdefs.h on arm if the compiler doesn't
have ACLE support built in.  The ACLE (ARM C Language Extensions) defines
a set of standardized symbols which indicate the architecture version and
features available.  ACLE support is built in to modern compilers (both
clang and gcc), but absent from gcc prior to 4.4.

ARM (the company) provides the acle-compat.h header file to define the
right symbols for older versions of gcc.  Basically, acle-compat.h does
for arm about the same thing cdefs.h does for freebsd: defines
standardized macros that work no matter which compiler you use.  If ARM
hadn't provided this file we would have ended up with a big #ifdef __arm__
section in cdefs.h with our own compatibility shims.

Remove #include <machine/acle-compat.h> from the zillion other places (an
ever-growing list) that it appears.  Since style(9) requires sys/types.h
or sys/param.h early in the include list, and both of those lead to
including cdefs.h, only a couple special cases still need to include
acle-compat.h directly.

Loves it:     imp
2016-05-25 19:44:26 +00:00
Konstantin Belousov
c5e44d6cd5 Silence false LOR report due to the taskqueue mutex and kqueue lock
named the same.

Reported by:	Doug Luce <doug@freebsd.con.com>
Sponsored by:	The FreeBSD Foundation
2016-05-24 21:13:33 +00:00
John Baldwin
778ce4f297 Return the correct status when a partially completed request is cancelled.
After the previous changes to fix requests on blocking sockets to complete
across multiple operations, an edge case exists where a request can be
cancelled after it has partially completed.  POSIX doesn't appear to
dictate exactly how to handle this case, but in general I feel that
aio_cancel() should arrange to cancel any request it can, but that any
partially completed requests should return a partial completion rather
than ECANCELED.  To that end, fix the socket AIO cancellation routine to
return a short read/write if a partially completed request is cancelled
rather than ECANCELED.

Sponsored by:	Chelsio Communications
2016-05-24 21:09:05 +00:00
Andrew Turner
974692e3bf Limit calling pmc_hook to when the interrupt comes while running userspace.
We may enable interrupts from within the callback, e.g. in a data abort
during copyin. If we receive an interrupt at that time pmc_hook will be
called again and, as it is handling userspace stack tracing, will hit a
KASSERT as it checks if the trapframe is from userland.

With this I can run hwpmc with intrng on a ThunderX and have it trace all
CPUs.

Obtained from:	ABT Systems Ltd
Sponsored by:	The FreeBSD Foundation
2016-05-24 12:06:56 +00:00
John Baldwin
1717b68af1 Don't prematurely return short completions on blocking sockets.
Always requeue an AIO job at the head of the socket buffer's queue if
sosend() or soreceive() returns EWOULDBLOCK on a blocking socket.
Previously, requests were only requeued if they returned EWOULDBLOCK
and completed no data.  Now after a partial completion on a blocking
socket the request is queued and the remaining request is retried when
the socket is ready.  This allows writes larger than the currently
available space on a blocking socket to fully complete.  Reads on a
blocking socket that satifsy the low watermark can still return a short
read (same as read()).

In order to track previously completed data, the internal 'status'
field of the AIO job is used to store the amount of previously
computed data.

Non-blocking sockets continue to return short completions for both
reads and writes.

Add a test for a "large" AIO write on a blocking socket that writes
twice the socket buffer size to a UNIX domain socket.

Sponsored by:	Chelsio Communications
2016-05-24 03:13:27 +00:00
Alan Somers
37f32e5379 Fix build of kern/subr_unit.c, broken by r300539
Reported by:	peter
Pointyhat to:	asomers
Sponsored by:	Spectra Logic Corp
2016-05-24 00:14:58 +00:00
Alan Somers
1b82e02f4d Add bit_count to the bitstring(3) api
Add a bit_count function, which efficiently counts the number of bits set in
a bitstring.

sys/sys/bitstring.h
tests/sys/sys/bitstring_test.c
share/man/man3/bitstring.3
	Add bit_alloc

sys/kern/subr_unit.c
	Use bit_count instead of a naive counting loop in check_unrhdr, used
	when INVARIANTS are enabled. The userland test runs about 6x faster
	in a generic build, or 8.5x faster when built for Nehalem, which has
	the POPCNT instruction.

sys/sys/param.h
	Bump __FreeBSD_version due to the addition of bit_alloc

UPDATING
	Add a note about the ABI incompatibility of the bitstring(3)
	changes, as suggested by lidl.

Suggested by:	gibbs
Reviewed by:	gibbs, ngie
MFC after:	9 days
X-MFC-With:	299090, 300538
Relnotes:	yes
Sponsored by:	Spectra Logic Corp
Differential Revision:	https://reviews.freebsd.org/D6255
2016-05-23 20:29:18 +00:00
Andrew Turner
df7a2251cc Add the needed hwpmc hooks to subr_intr.c. This is needed for the correct
operation of hwpmc on, for example, arm64 with intrng.

Obtained from:	ABT Systems Ltd
Sponsored by:	The FreeBSD Foundation
2016-05-23 15:26:35 +00:00
Hans Petter Selasky
bc64e52679 Use DELAY() instead of _sleep() when SCHEDULER_STOPPED() is set inside
pause_sbt(). This allows pause() to continue working during a panic()
which is not invoking KDB. This is useful when debugging graphics
drivers using the LinuxKPI.

Obtained from:	kmacy @
MFC after:	1 week
2016-05-23 10:31:54 +00:00
Baptiste Daroussin
306e53bce9 Fix typo introduced by me (not the submitter) when fixing typos 2016-05-22 13:10:48 +00:00
Baptiste Daroussin
2fd642c899 Fix typos in the comments
Submitted by:	cipherwraith666@gmail.com (via github)
2016-05-22 13:04:45 +00:00
Andriy Gapon
7107bed0f0 fix loss of taskqueue wakeups (introduced in r300113)
Submitted by:	kmacy
Tested by:	dchagin
2016-05-21 14:51:49 +00:00
John Baldwin
20fee1093e Add sglist functions for working with arrays of VM pages.
sglist_count_vmpages() determines the number of segments required for
a buffer described by an array of VM pages. sglist_append_vmpages()
adds the segments described by such a buffer to an sglist.  The latter
function is largely pulled from sglist_append_bio(), and
sglist_append_bio() now uses sglist_append_vmpages().

Reviewed by:	kib
Sponsored by:	Chelsio Communications
2016-05-20 23:28:43 +00:00
John Baldwin
f0ec174043 Consistently set status to -1 when completing an AIO request with an error.
Sponsored by:	Chelsio Communications
2016-05-20 19:46:25 +00:00
John Baldwin
cc981af204 Add new bus methods for mapping resources.
Add a pair of bus methods that can be used to "map" resources for direct
CPU access using bus_space(9).  bus_map_resource() creates a mapping and
bus_unmap_resource() releases a previously created mapping.  Mappings are
described by 'struct resource_map' object.  Pointers to these objects can
be passed as the first argument to the bus_space wrapper API used for bus
resources.

Drivers that wish to map all of a resource using default settings
(for example, using uncacheable memory attributes) do not need to change.
However, drivers that wish to use non-default settings can now do so
without jumping through hoops.

First, an RF_UNMAPPED flag is added to request that a resource is not
implicitly mapped with the default settings when it is activated.  This
permits other activation steps (such as enabling I/O or memory decoding
in a device's PCI command register) to be taken without creating a
mapping.  Right now the AGP drivers don't set RF_ACTIVE to avoid using
up a large amount of KVA to map the AGP aperture on 32-bit platforms.
Once RF_UNMAPPED is supported on all platforms that support AGP this
can be changed to using RF_UNMAPPED with RF_ACTIVE instead.

Second, bus_map_resource accepts an optional structure that defines
additional settings for a given mapping.

For example, a driver can now request to map only a subset of a resource
instead of the entire range.  The AGP driver could also use this to only
map the first page of the aperture (IIRC, it calls pmap_mapdev() directly
to map the first page currently).  I will also eventually change the
PCI-PCI bridge driver to request mappings of the subset of the I/O window
resource on its parent side to create mappings for child devices rather
than passing child resources directly up to nexus to be mapped.  This
also permits bridges that do address translation to request suitable
mappings from a resource on the "upper" side of the bus when mapping
resources on the "lower" side of the bus.

Another attribute that can be specified is an alternate memory attribute
for memory-mapped resources.  This can be used to request a
Write-Combining mapping of a PCI BAR in an MI fashion.  (Currently the
drivers that do this call pmap_change_attr() directly for x86 only.)

Note that this commit only adds the MI framework.  Each platform needs
to add support for handling RF_UNMAPPED and thew new
bus_map/unmap_resource methods.  Generally speaking, any drivers that
are calling rman_set_bustag() and rman_set_bushandle() need to be
updated.

Discussed on:	arch
Reviewed by:	cem
Differential Revision:	https://reviews.freebsd.org/D5237
2016-05-20 17:57:47 +00:00
Mark Johnston
5e0a6f31e5 Move IPv6 malloc tag definitions into the IPv6 code. 2016-05-20 04:45:08 +00:00
Scott Long
7e52504fc2 Adjust the creation of tq_name so it can be freed correctly
Reviewed by:	jhb, allanjude
Differential Revision:	D6454
2016-05-19 17:14:24 +00:00
Kenneth D. Merry
9a6844d55f Add support for managing Shingled Magnetic Recording (SMR) drives.
This change includes support for SCSI SMR drives (which conform to the
Zoned Block Commands or ZBC spec) and ATA SMR drives (which conform to
the Zoned ATA Command Set or ZAC spec) behind SAS expanders.

This includes full management support through the GEOM BIO interface, and
through a new userland utility, zonectl(8), and through camcontrol(8).

This is now ready for filesystems to use to detect and manage zoned drives.
(There is no work in progress that I know of to use this for ZFS or UFS, if
anyone is interested, let me know and I may have some suggestions.)

Also, improve ATA command passthrough and dispatch support, both via ATA
and ATA passthrough over SCSI.

Also, add support to camcontrol(8) for the ATA Extended Power Conditions
feature set.  You can now manage ATA device power states, and set various
idle time thresholds for a drive to enter lower power states.

Note that this change cannot be MFCed in full, because it depends on
changes to the struct bio API that break compatilibity.  In order to
avoid breaking the stable API, only changes that don't touch or depend on
the struct bio changes can be merged.  For example, the camcontrol(8)
changes don't depend on the new bio API, but zonectl(8) and the probe
changes to the da(4) and ada(4) drivers do depend on it.

Also note that the SMR changes have not yet been tested with an actual
SCSI ZBC device, or a SCSI to ATA translation layer (SAT) that supports
ZBC to ZAC translation.  I have not yet gotten a suitable drive or SAT
layer, so any testing help would be appreciated.  These changes have been
tested with Seagate Host Aware SATA drives attached to both SAS and SATA
controllers.  Also, I do not have any SATA Host Managed devices, and I
suspect that it may take additional (hopefully minor) changes to support
them.

Thanks to Seagate for supplying the test hardware and answering questions.

sbin/camcontrol/Makefile:
	Add epc.c and zone.c.

sbin/camcontrol/camcontrol.8:
	Document the zone and epc subcommands.

sbin/camcontrol/camcontrol.c:
	Add the zone and epc subcommands.

	Add auxiliary register support to build_ata_cmd().  Make sure to
	set the CAM_ATAIO_NEEDRESULT, CAM_ATAIO_DMA, and CAM_ATAIO_FPDMA
	flags as appropriate for ATA commands.

	Add a new get_ata_status() function to parse ATA result from SCSI
	sense descriptors (for ATA passthrough over SCSI) and ATA I/O
	requests.

sbin/camcontrol/camcontrol.h:
	Update the build_ata_cmd() prototype

	Add get_ata_status(), zone(), and epc().

sbin/camcontrol/epc.c:
	Support for ATA Extended Power Conditions features.  This includes
	support for all features documented in the ACS-4 Revision 12
	specification from t13.org (dated February 18, 2016).

	The EPC feature set allows putting a drive into a power power mode
	immediately, or setting timeouts so that the drive will
	automatically enter progressively lower power states after various
	idle times.

sbin/camcontrol/fwdownload.c:
	Update the firmware download code for the new build_ata_cmd()
	arguments.

sbin/camcontrol/zone.c:
	Implement support for Shingled Magnetic Recording (SMR) drives
	via SCSI Zoned Block Commands (ZBC) and ATA Zoned Device ATA
	Command Set (ZAC).

	These specs were developed in concert, and are functionally
	identical.  The primary differences are due to SCSI and ATA
	differences.  (SCSI is big endian, ATA is little endian, for
	example.)

	This includes support for all commands defined in the ZBC and
	ZAC specs.

sys/cam/ata/ata_all.c:
	Decode a number of additional ATA command names in ata_op_string().

	Add a new CCB building function, ata_read_log().

	Add ata_zac_mgmt_in() and ata_zac_mgmt_out() CCB building
	functions.  These support both DMA and NCQ encapsulation.

sys/cam/ata/ata_all.h:
	Add prototypes for ata_read_log(), ata_zac_mgmt_out(), and
	ata_zac_mgmt_in().

sys/cam/ata/ata_da.c:
	Revamp the ada(4) driver to support zoned devices.

	Add four new probe states to gather information needed for zone
	support.

	Add a new adasetflags() function to avoid duplication of large
	blocks of flag setting between the async handler and register
	functions.

	Add new sysctl variables that describe zone support and paramters.

	Add support for the new BIO_ZONE bio, and all of its subcommands:
	DISK_ZONE_OPEN, DISK_ZONE_CLOSE, DISK_ZONE_FINISH, DISK_ZONE_RWP,
	DISK_ZONE_REPORT_ZONES, and DISK_ZONE_GET_PARAMS.

sys/cam/scsi/scsi_all.c:
	Add command descriptions for the ZBC IN/OUT commands.

	Add descriptions for ZBC Host Managed devices.

	Add a new function, scsi_ata_pass() to do ATA passthrough over
	SCSI.  This will eventually replace scsi_ata_pass_16() -- it
	can create the 12, 16, and 32-byte variants of the ATA
	PASS-THROUGH command, and supports setting all of the
	registers defined as of SAT-4, Revision 5 (March 11, 2016).

	Change scsi_ata_identify() to use scsi_ata_pass() instead of
	scsi_ata_pass_16().

	Add a new scsi_ata_read_log() function to facilitate reading
	ATA logs via SCSI.

sys/cam/scsi/scsi_all.h:
	Add the new ATA PASS-THROUGH(32) command CDB.  Add extended and
	variable CDB opcodes.

	Add Zoned Block Device Characteristics VPD page.

	Add ATA Return SCSI sense descriptor.

	Add prototypes for scsi_ata_read_log() and scsi_ata_pass().

sys/cam/scsi/scsi_da.c:
	Revamp the da(4) driver to support zoned devices.

	Add five new probe states, four of which are needed for ATA
	devices.

	Add five new sysctl variables that describe zone support and
	parameters.

	The da(4) driver supports SCSI ZBC devices, as well as ATA ZAC
	devices when they are attached via a SCSI to ATA Translation (SAT)
	layer.  Since ZBC -> ZAC translation is a new feature in the T10
	SAT-4 spec, most SATA drives will be supported via ATA commands
	sent via the SCSI ATA PASS-THROUGH command.  The da(4) driver will
	prefer the ZBC interface, if it is available, for performance
	reasons, but will use the ATA PASS-THROUGH interface to the ZAC
	command set if the SAT layer doesn't support translation yet.
	As I mentioned above, ZBC command support is untested.

	Add support for the new BIO_ZONE bio, and all of its subcommands:
	DISK_ZONE_OPEN, DISK_ZONE_CLOSE, DISK_ZONE_FINISH, DISK_ZONE_RWP,
	DISK_ZONE_REPORT_ZONES, and DISK_ZONE_GET_PARAMS.

	Add scsi_zbc_in() and scsi_zbc_out() CCB building functions.

	Add scsi_ata_zac_mgmt_out() and scsi_ata_zac_mgmt_in() CCB/CDB
	building functions.  Note that these have return values, unlike
	almost all other CCB building functions in CAM.  The reason is
	that they can fail, depending upon the particular combination
	of input parameters.  The primary failure case is if the user
	wants NCQ, but fails to specify additional CDB storage.  NCQ
	requires using the 32-byte version of the SCSI ATA PASS-THROUGH
	command, and the current CAM CDB size is 16 bytes.

sys/cam/scsi/scsi_da.h:
	Add ZBC IN and ZBC OUT CDBs and opcodes.

	Add SCSI Report Zones data structures.

	Add scsi_zbc_in(), scsi_zbc_out(), scsi_ata_zac_mgmt_out(), and
	scsi_ata_zac_mgmt_in() prototypes.

sys/dev/ahci/ahci.c:
	Fix SEND / RECEIVE FPDMA QUEUED in the ahci(4) driver.

	ahci_setup_fis() previously set the top bits of the sector count
	register in the FIS to 0 for FPDMA commands.  This is okay for
	read and write, because the PRIO field is in the only thing in
	those bits, and we don't implement that further up the stack.

	But, for SEND and RECEIVE FPDMA QUEUED, the subcommand is in that
	byte, so it needs to be transmitted to the drive.

	In ahci_setup_fis(), always set the the top 8 bits of the
	sector count register.  We need it in both the standard
	and NCQ / FPDMA cases.

sys/geom/eli/g_eli.c:
	Pass BIO_ZONE commands through the GELI class.

sys/geom/geom.h:
	Add g_io_zonecmd() prototype.

sys/geom/geom_dev.c:
	Add new DIOCZONECMD ioctl, which allows sending zone commands to
	disks.

sys/geom/geom_disk.c:
	Add support for BIO_ZONE commands.

sys/geom/geom_disk.h:
	Add a new flag, DISKFLAG_CANZONE, that indicates that a given
	GEOM disk client can handle BIO_ZONE commands.

sys/geom/geom_io.c:
	Add a new function, g_io_zonecmd(), that handles execution of
	BIO_ZONE commands.

	Add permissions check for BIO_ZONE commands.

	Add command decoding for BIO_ZONE commands.

sys/geom/geom_subr.c:
	Add DDB command decoding for BIO_ZONE commands.

sys/kern/subr_devstat.c:
	Record statistics for REPORT ZONES commands.  Note that the
	number of bytes transferred for REPORT ZONES won't quite match
	what is received from the harware.  This is because we're
	necessarily counting bytes coming from the da(4) / ada(4) drivers,
	which are using the disk_zone.h interface to communicate up
	the stack.  The structure sizes it uses are slightly different
	than the SCSI and ATA structure sizes.

sys/sys/ata.h:
	Add many bit and structure definitions for ZAC, NCQ, and EPC
	command support.

sys/sys/bio.h:
	Convert the bio_cmd field to a straight enumeration.  This will
	yield more space for additional commands in the future.  After
	change r297955 and other related changes, this is now possible.
	Converting to an enumeration will also prevent use as a bitmask
	in the future.

sys/sys/disk.h:
	Define the DIOCZONECMD ioctl.

sys/sys/disk_zone.h:
	Add a new API for managing zoned disks.  This is very close to
	the SCSI ZBC and ATA ZAC standards, but uses integers in native
	byte order instead of big endian (SCSI) or little endian (ATA)
	byte arrays.

	This is intended to offer to the complete feature set of the ZBC
	and ZAC disk management without requiring the application developer
	to include SCSI or ATA headers.  We also use one set of headers
	for ioctl consumers and kernel bio-level consumers.

sys/sys/param.h:
	Bump __FreeBSD_version for sys/bio.h command changes, and inclusion
	of SMR support.

usr.sbin/Makefile:
	Add the zonectl utility.

usr.sbin/diskinfo/diskinfo.c
	Add disk zoning capability to the 'diskinfo -v' output.

usr.sbin/zonectl/Makefile:
	Add zonectl makefile.

usr.sbin/zonectl/zonectl.8
	zonectl(8) man page.

usr.sbin/zonectl/zonectl.c
	The zonectl(8) utility.  This allows managing SCSI or ATA zoned
	disks via the disk_zone.h API.  You can report zones, reset write
	pointers, get parameters, etc.

Sponsored by:	Spectra Logic
Differential Revision:	https://reviews.freebsd.org/D6147
Reviewed by:	wblock (documentation)
2016-05-19 14:08:36 +00:00
Gleb Smirnoff
e987742995 The SA-16:19 wouldn't have happened if the sockargs() had properly typed
argument for length.  While here make it static and convert to ANSI C.

Reviewed by:	C Turt
2016-05-18 22:05:50 +00:00
Ravi Pokala
08907ec39d Fix misleading comments in bus_if.m
While looking at r300073, I noticed these incorrect comments in the context
of the diff.

Reviewed by:	imp, jhb
Differential Revision:	https://reviews.freebsd.org/D6431
2016-05-18 16:25:34 +00:00
Andrew Turner
9346e9130d Return the struct intr_pic pointer from intr_pic_register. This will be
needed in later changes where we may not be able to lock the pic list lock
to perform a lookup, e.g. from within interrupt context.

Obtained from:	ABT Systems Ltd
Sponsored by:	The FreeBSD Foundation
2016-05-18 15:05:44 +00:00
Konstantin Belousov
3f7ca894de Ensure that ftruncate(2) is performed synchronously when file is
opened in O_SYNC mode, at least for UFS.  This also handles
truncation, done due to the O_SYNC | O_TRUNC flags combination to
open(2), in synchronous way.

Noted by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-05-18 12:03:57 +00:00
Scott Long
4c7070db25 Import the 'iflib' API library for network drivers. From the author:
"iflib is a library to eliminate the need for frequently duplicated device
independent logic propagated (poorly) across many network drivers."

Participation is purely optional.  The IFLIB kernel config option is
provided for drivers that want to transition between legacy and iflib
modes of operation.  ixl and ixgbe driver conversions will be committed
shortly.  We hope to see participation from the Broadcom and maybe
Chelsio drivers in the near future.

Submitted by:   mmacy@nextbsd.org
Reviewed by:    gallatin
Differential Revision:  D5211
2016-05-18 04:35:58 +00:00
Mark Johnston
ef89d843d9 Do not acquire the thread lock in hardclock_cnt() unless needed.
This function only sets thread flags if a SIGPROF or SIGVTALRM timer
has fired, which is almost never the case.

MFC after:	2 weeks
2016-05-18 03:55:54 +00:00
Mark Johnston
d2f5f8db87 Micro-optimize sleepq_broadcast().
- Avoid a conditional branch on the return value of sleepq_resume_thread()
  by ORing its return value into the boolean wakeup_swapper. This is
  consistent with other sleepqueue functions which just pass this return
  value to their caller.
- sleepq_resume_thread() unconditionally removes the thread from its queue,
  so there's no need to maintain a pointer to the next element in the queue.

MFC after:	2 weeks
2016-05-18 03:50:21 +00:00
Mark Johnston
be2dfd58fe Remove the MUTEX_DEBUG kernel option.
It has no counterpart among the other lock primitives and has been a
no-op for years. Mutex consistency checks are generally done whenver
INVARIANTS is enabled.
2016-05-18 03:34:02 +00:00
Mark Johnston
5002e19502 Guard the lockstat:::thread-spin probe with KDTRACE_HOOKS.
X-MFC-With:	r300103
2016-05-18 03:23:07 +00:00
Mark Johnston
156fbc14a0 lockstat:::thread-spin should only fire after spinning for the lock.
MFC after:	1 week
2016-05-18 03:21:21 +00:00
Gleb Smirnoff
17cd649f4a Add a comment and KASSERT that a M_NOFREE mbuf has always EXT_EXTREF ext.
Submitted by:	kmacy
2016-05-17 23:15:16 +00:00
Warner Losh
0ac974ec78 Don't forget to quote \ characters with \. 2016-05-17 22:52:42 +00:00
Gleb Smirnoff
7349ea785c Validate that user supplied control message length is not negative.
Submitted by:	C Turt <cturt hardenedbsd.org>
Security:	SA-16:19
Security:	CVE-2016-1887
2016-05-17 22:28:53 +00:00
John Baldwin
ed7ed7f0ca Document the formatting requirements of location and pnpinfo strings.
devd requires location and pnpinfo strings generated by bus drivers
to be formatted as a list of name=value keypairs.  Non-conforming
bus drivers cause devd to mis-parse device events for these buses.

Note that this documents the desired requirements.  devctl_safe_quote()
doesn't yet escape backslash characters, and devd doesn't handle escaped
characters in quoted values.

Differential Revision:	https://reviews.freebsd.org/D6252
2016-05-17 19:34:07 +00:00
Konstantin Belousov
2a339d9e3d Add implementation of robust mutexes, hopefully close enough to the
intention of the POSIX IEEE Std 1003.1TM-2008/Cor 1-2013.

A robust mutex is guaranteed to be cleared by the system upon either
thread or process owner termination while the mutex is held.  The next
mutex locker is then notified about inconsistent mutex state and can
execute (or abandon) corrective actions.

The patch mostly consists of small changes here and there, adding
neccessary checks for the inconsistent and abandoned conditions into
existing paths.  Additionally, the thread exit handler was extended to
iterate over the userspace-maintained list of owned robust mutexes,
unlocking and marking as terminated each of them.

The list of owned robust mutexes cannot be maintained atomically
synchronous with the mutex lock state (it is possible in kernel, but
is too expensive).  Instead, for the duration of lock or unlock
operation, the current mutex is remembered in a special slot that is
also checked by the kernel at thread termination.

Kernel must be aware about the per-thread location of the heads of
robust mutex lists and the current active mutex slot.  When a thread
touches a robust mutex for the first time, a new umtx op syscall is
issued which informs about location of lists heads.

The umtx sleep queues for PP and PI mutexes are split between
non-robust and robust.

Somewhat unrelated changes in the patch:
1. Style.
2. The fix for proper tdfind() call use in umtxq_sleep_pi() for shared
   pi mutexes.
3. Removal of the userspace struct pthread_mutex m_owner field.
4. The sysctl kern.ipc.umtx_vnode_persistent is added, which controls
   the lifetime of the shared mutex associated with a vnode' page.

Reviewed by:	jilles (previous version, supposedly the objection was fixed)
Discussed with:	brooks, Martin Simmons <martin@lispworks.com> (some aspects)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2016-05-17 09:56:22 +00:00
Andrew Turner
3fc155dc64 Introduce MSI and MSI-X support to intrng. This adds a new msi device
interface with 5 methods to mirror the 5 MSI/MSI-X methods in the pcib
interface. The pcib driver will need to perform a device specific lookup
to find the MSI controller and pass this to intrng as the xref. Intrng
will finally find the controller and have it handle the requested operation.

Obtained from:	ABT Systems Ltd
MFH:		yes
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D5985
2016-05-16 09:11:40 +00:00
Andriy Gapon
27d4b35f6e vfs_read_dirent: increment ncookies after adding a cookie
It seems that at present vfs_read_dirent() is used only with filesystems
that do not support cookies, so the bug never manifested itself.

MFC after:	1 week
2016-05-16 07:31:11 +00:00
Andriy Gapon
8614f45b2d dounmount: do not call mountcheckdirs() for mounts with MNT_IGNORE
This is a bit hackish, but the flag is currently set only for ZFS
snapshots mounted under .zfs.  mountcheckdirs() can change cdir/rdir
references to a covered vnode.  But for the said snapshots the covered
vnode is really ephemeral and it must never be accessed (except
for a few specific cases).

To do:	consider removing mountcheckdirs() entirely

MFC after:	5 days
2016-05-16 07:23:24 +00:00
John Baldwin
fdce57a042 Add an EARLY_AP_STARTUP option to start APs earlier during boot.
Currently, Application Processors (non-boot CPUs) are started by
MD code at SI_SUB_CPU, but they are kept waiting in a "pen" until
SI_SUB_SMP at which point they are released to run kernel threads.
SI_SUB_SMP is one of the last SYSINIT levels, so APs don't enter
the scheduler and start running threads until fairly late in the
boot.

This change moves SI_SUB_SMP up to just before software interrupt
threads are created allowing the APs to start executing kernel
threads much sooner (before any devices are probed).  This allows
several initialization routines that need to perform initialization
on all CPUs to now perform that initialization in one step rather
than having to defer the AP initialization to a second SYSINIT run
at SI_SUB_SMP.  It also permits all CPUs to be available for
handling interrupts before any devices are probed.

This last feature fixes a problem on with interrupt vector exhaustion.
Specifically, in the old model all device interrupts were routed
onto the boot CPU during boot.  Later after the APs were released at
SI_SUB_SMP, interrupts were redistributed across all CPUs.

However, several drivers for multiqueue hardware allocate N interrupts
per CPU in the system.  In a system with many CPUs, just a few drivers
doing this could exhaust the available pool of interrupt vectors on
the boot CPU as each driver was allocating N * mp_ncpu vectors on the
boot CPU.  Now, drivers will allocate interrupts on their desired CPUs
during boot meaning that only N interrupts are allocated from the boot
CPU instead of N * mp_ncpu.

Some other bits of code can also be simplified as smp_started is
now true much earlier and will now always be true for these bits of
code.  This removes the need to treat the single-CPU boot environment
as a special case.

As a transition aid, the new behavior is available under a new kernel
option (EARLY_AP_STARTUP).  This will allow the option to be turned off
if need be during initial testing.  I plan to enable this on x86 by
default in a followup commit in the next few days and to have all
platforms moved over before 11.0.  Once the transition is complete,
the option will be removed along with the !EARLY_AP_STARTUP code.

These changes have only been tested on x86.  Other platform maintainers
are encouraged to port their architectures over as well.  The main
things to check for are any uses of smp_started in MD code that can be
simplified and SI_SUB_SMP SYSINITs in MD code that can be removed in
the EARLY_AP_STARTUP case (e.g. the interrupt shuffling).

PR:		kern/199321
Reviewed by:	markj, gnn, kib
Sponsored by:	Netflix
2016-05-14 18:22:52 +00:00
Edward Tomasz Napierala
ebc2f37754 Stop hiding errors that result in failure to mount /dev. Otherwise,
missing /dev directory makes one end up with a completely deaf (init
without stdout/stderr) system with no hints on the console, unless
you've booted up with bootverbose.

MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-05-12 07:38:10 +00:00
Conrad Meyer
fe4be618c9 subr_vmem: Fix double-free in error case of vmem_create
If vmem_init() fails, 'vm' is already destroyed and freed.  Don't free it
again.

Reported by:	Coverity
CID:		1042110
Sponsored by:	EMC / Isilon Storage Division
2016-05-11 23:16:11 +00:00
Konstantin Belousov
54a33d2f97 Add vfs_hash_ref(9) function, which finds a vnode by the hash value
and returns it referenced.

The function is similar to vfs_hash_get(9), but unlike the later,
returned vnode is not locked.  This operation cannot be requested with
the vget(9) flags.

Reviewed and tested by:	rmacklem
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-05-11 06:32:22 +00:00
Konstantin Belousov
cd85d599d8 Style: wrap long lines.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-05-11 06:27:00 +00:00
John Baldwin
8d791e5af1 Add a new bus method to fetch device-specific CPU sets.
bus_get_cpus() returns a specified set of CPUs for a device.  It accepts
an enum for the second parameter that indicates the type of cpuset to
request.  Currently two valus are supported:

 - LOCAL_CPUS (on x86 this returns all the CPUs in the package closest to
   the device when DEVICE_NUMA is enabled)
 - INTR_CPUS (like LOCAL_CPUS but only returns 1 SMT thread for each core)

For systems that do not support NUMA (or if it is not enabled in the kernel
config), LOCAL_CPUS fails with EINVAL.  INTR_CPUS is mapped to 'all_cpus'
by default.  The idea is that INTR_CPUS should always return a valid set.

Device drivers which want to use per-CPU interrupts should start using
INTR_CPUS instead of simply assigning interrupts to all available CPUs.
In the future we may wish to add tunables to control the policy of
INTR_CPUS (e.g. should it be local-only or global, should it ignore
SMT threads or not).

The x86 nexus driver exposes the internal set of interrupt CPUs from the
the x86 interrupt code via INTR_CPUS.

The ACPI bus driver and PCI bridge drivers use _PXM to return a suitable
LOCAL_CPUS set when _PXM exists and DEVICE_NUMA is enabled.  They also and
the global INTR_CPUS set from the nexus driver with the per-domain set from
_PXM to generate a local INTR_CPUS set for child devices.

Compared to the r298933, this version uses 'struct _cpuset' in
<sys/bus.h> instead of 'cpuset_t' to avoid requiring <sys/param.h>
(<sys/_cpuset.h> still requires <sys/param.h> for MAXCPU even though
<sys/_bitset.h> does not after recent changes).
2016-05-09 20:50:21 +00:00
Andrew Turner
b48c608386 Check malloc succeeded in pic_create, with M_NOWAIT it may return NULL.
Obtained from:	ABT Systems Ltd
Sponsored by:	The FreeBSD Foundation
2016-05-09 12:24:39 +00:00
Mateusz Guzik
0cfe1a1fec fd: assert dropped filedesc lock in fdcloseexec 2016-05-08 03:26:12 +00:00
Svatopluk Kraus
a100280e59 Set correct size to the size member of struct intr_map_data when
initialized. As the size member is not used at the present,
it did not break anything.
2016-05-06 08:54:00 +00:00
Ed Maste
b4f6936599 Add explicit cast to fix mips and powerpc build after r299090
Sponsored by:	The FreeBSD Foundation
2016-05-05 15:21:33 +00:00
Svatopluk Kraus
cd642c88a1 INTRNG - redefine struct intr_map_data to avoid headers pollution. Each
struct associated with some type defined in enum intr_map_data_type
must have struct intr_map_data on the top of its own definition now.
When such structs are used, correct type and size must be filled in.

There are three such structs defined in sys/intr.h now. Their
definitions should be moved to corresponding headers by follow-up
commits.

While this change was propagated to all INTRNG like PICs,
pic_map_intr() method implementations were corrected on some places.
For this specific method, it's ensured by a caller that the 'data'
argument passed to this method is never NULL. Also, the return error
values were standardized there.
2016-05-05 13:31:19 +00:00
Svatopluk Kraus
15adccc687 Remove superfluous check. The pic_dev member of struct pic
is never NULL on PIC found by pic_lookup().
2016-05-05 13:23:38 +00:00
Adrian Chadd
d17f808fab s/struct device */device_t/g
Submitted by:	kmacy
2016-05-04 23:31:52 +00:00
Alan Somers
8907f744ff Improve performance and functionality of the bitstring(3) api
Two new functions are provided, bit_ffs_at() and bit_ffc_at(), which allow
for efficient searching of set or cleared bits starting from any bit offset
within the bit string.

Performance is improved by operating on longs instead of bytes and using
ffsl() for searches within a long. ffsl() is a compiler builtin in both
clang and gcc for most architectures, converting what was a brute force
while loop search into a couple of instructions.

All of the bitstring(3) API continues to be contained in the header file.
Some of the functions are large enough that perhaps they should be uninlined
and moved to a library, but that is beyond the scope of this commit.

sys/sys/bitstring.h:
        Convert the majority of the existing bit string implementation from
        macros to inline functions.

        Properly protect the implementation from inadvertant macro expansion
        when included in a user's program by prefixing all private
        macros/functions and local variables with '_'.

        Add bit_ffs_at() and bit_ffc_at(). Implement bit_ffs() and
        bit_ffc() in terms of their "at" counterparts.

        Provide a kernel implementation of bit_alloc(), making the full API
        usable in the kernel.

        Improve code documenation.

share/man/man3/bitstring.3:
        Add pre-exisiting API bit_ffc() to the synopsis.

        Document new APIs.

        Document the initialization state of the bit strings
        allocated/declared by bit_alloc() and bit_decl().

        Correct documentation for bitstr_size(). The original code comments
        indicate the size is in bytes, not "elements of bitstr_t". The new
        implementation follows this lead. Only hastd assumed "elements"
        rather than bytes and it has been corrected.

etc/mtree/BSD.tests.dist:
tests/sys/Makefile:
tests/sys/sys/Makefile:
tests/sys/sys/bitstring.c:
        Add tests for all existing and new functionality.

include/bitstring.h
	Include all headers needed by sys/bitstring.h

lib/libbluetooth/bluetooth.h:
usr.sbin/bluetooth/hccontrol/le.c:
        Include bitstring.h instead of sys/bitstring.h.

sbin/hastd/activemap.c:
        Correct usage of bitstr_size().

sys/dev/xen/blkback/blkback.c
        Use new bit_alloc.

sys/kern/subr_unit.c:
        Remove hard-coded assumption that sizeof(bitstr_t) is 1.  Get rid of
        unrb.busy, which caches the number of bits set in unrb.map.  When
        INVARIANTS are disabled, nothing needs to know that information.
        callapse_unr can be adapted to use bit_ffs and bit_ffc instead.
        Eliminating unrb.busy saves memory, simplifies the code, and
        provides a slight speedup when INVARIANTS are disabled.

sys/net/flowtable.c:
        Use the new kernel implementation of bit-alloc, instead of hacking
        the old libc-dependent macro.

sys/sys/param.h
        Update __FreeBSD_version to indicate availability of new API

Submitted by:   gibbs, asomers
Reviewed by:    gibbs, ngie
MFC after:      4 weeks
Sponsored by:   Spectra Logic Corp
Differential Revision:  https://reviews.freebsd.org/D6004
2016-05-04 22:34:11 +00:00
Roger Pau Monné
731c90b713 rtc: fix inverted resolution check
The current code in clock_register checks if the newly added clock has a
resolution value higher than the current one in order to make it the
default, which is wrong. Clocks with a lower resolution value should be
better than ones with a higher resolution value, in fact with the current
code FreeBSD is always selecting the worse clock.

Reviewed by:		kib jhb jkim
Sponsored by:		Citrix Systems R&D
MFC after:		2 weeks
Differential revision:	https://reviews.freebsd.org/D6185
2016-05-04 13:48:59 +00:00
Sepherosa Ziehau
d5f0ea7ca2 kern: Factor out function to convert hash flags to malloc(9) flags
Suggested by:	jhb
Reviewed by:	jhb, kib
Sponsored by:	Microsoft OSTC
Differential Revision:	https://reviews.freebsd.org/D6184
2016-05-04 03:07:52 +00:00
Konstantin Belousov
c89e1b8739 Add EVFILT_VNODE open, read and close notifications.
While there, order EVFILT_VNODE notes descriptions alphabetically.

Based on submission, and tested by:	Vladimir Kondratyev <wulf@cicgroup.ru>
MFC after:	2 weeks
2016-05-03 15:17:43 +00:00
Sepherosa Ziehau
f8ce3dfaf1 kern: Add phashinit_flags(), which allows malloc(M_NOWAIT)
It will be used for the upcoming LRO hash table initialization.
And probably will be useful in other cases, when M_WAITOK can't
be used.

Reviewed by:	jhb, kib
Sponsored by:	Microsoft OSTC
Differential Revision:	https://reviews.freebsd.org/D6138
2016-05-03 07:17:13 +00:00
John Baldwin
8a08b7d36b Revert bus_get_cpus() for now.
I really thought I had run this through the tinderbox before committing,
but many places need <sys/types.h> -> <sys/param.h> for <sys/bus.h> now.
2016-05-03 01:17:40 +00:00
John Baldwin
bc153c692f Add a new bus method to fetch device-specific CPU sets.
bus_get_cpus() returns a specified set of CPUs for a device.  It accepts
an enum for the second parameter that indicates the type of cpuset to
request.  Currently two valus are supported:

 - LOCAL_CPUS (on x86 this returns all the CPUs in the package closest to
   the device when DEVICE_NUMA is enabled)
 - INTR_CPUS (like LOCAL_CPUS but only returns 1 SMT thread for each core)

For systems that do not support NUMA (or if it is not enabled in the kernel
config), LOCAL_CPUS fails with EINVAL.  INTR_CPUS is mapped to 'all_cpus'
by default.  The idea is that INTR_CPUS should always return a valid set.

Device drivers which want to use per-CPU interrupts should start using
INTR_CPUS instead of simply assigning interrupts to all available CPUs.
In the future we may wish to add tunables to control the policy of
INTR_CPUS (e.g. should it be local-only or global, should it ignore
SMT threads or not).

The x86 nexus driver exposes the internal set of interrupt CPUs from the
the x86 interrupt code via INTR_CPUS.

The ACPI bus driver and PCI bridge drivers use _PXM to return a suitable
LOCAL_CPUS set when _PXM exists and DEVICE_NUMA is enabled.  They also and
the global INTR_CPUS set from the nexus driver with the per-domain set from
_PXM to generate a local INTR_CPUS set for child devices.

Reviewed by:	wblock (manpage)
Differential Revision:	https://reviews.freebsd.org/D5519
2016-05-02 18:00:38 +00:00
Konstantin Belousov
f7b71c8a5b Issue NOTE_EXTEND when a directory entry is added to or removed from
the monitored directory as the result of rename(2) operation.  The
renames staying in the directory are not reported.

Submitted by:	Vladimir Kondratyev <wulf@cicgroup.ru>
MFC after:	2 weeks
2016-05-02 13:18:17 +00:00
Konstantin Belousov
bd2ead6b2e Fix reporting of NOTE_LINK when directory link count changes due to
rename removing or adding subdirectory entry.

Discussed with and tested by:	Vladimir Kondratyev <wulf@cicgroup.ru>
NetBSD PR:	48958 (http://gnats.netbsd.org/48958)
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2016-05-02 13:13:32 +00:00
Pedro F. Giffuni
e3043798aa sys/kern: spelling fixes in comments.
No functional change.
2016-04-29 22:15:33 +00:00
Pedro F. Giffuni
31b6732008 sys/kern: spelling fixes.
Mostly on comments but affects some debug messages.

MFC after: 2 weeks
2016-04-29 21:54:28 +00:00
Alan Somers
794277da54 Automate the subr_unit test.
Build and install the subr_unit test program originally written by phk, and
run it with the other ATF tests.

tests/sys/kern/Makefile
	* Build and install the subr_unit test as a plain test

sys/kern/subr_unit.c
	* Reduce the default number of repetitions from 100 to 1, and add a
	  command-line parser to override it.
	* Don't be so noisy by default
	* Fix an include problem for the test build

Reviewed by:	ngie
MFC after:	4 weeks
Sponsored by:	Spectra Logic Corp
Differential Revision:	https://reviews.freebsd.org/D6038
2016-04-29 21:11:31 +00:00
John Baldwin
5163d2ec94 Expose soaio_enqueue().
This can be used by protocol-specific AIO handlers to queue work to the
socket AIO daemon pool.

Sponsored by:	Chelsio Communications
2016-04-29 20:12:45 +00:00
John Baldwin
8722384b22 Introduce a new protocol hook pru_aio_queue.
This allows a protocol to claim individual AIO requests instead of using
the default socket AIO handling.

Sponsored by:	Chelsio Communications
2016-04-29 20:11:09 +00:00
Pedro F. Giffuni
f3b827f3ec bufs: make B_DIRTY and B_PERSISTENT flags available
It appears these flags were related to ext2fs but are completely
unused nowadays. Retire them.

Suggested by: mckusick
2016-04-29 16:32:28 +00:00
Michal Meloun
8442087f15 INTRNG: Define 'INTR_IRQ_INVALID' constant and use it consistently
as error indicator.
2016-04-28 12:04:12 +00:00
Michal Meloun
39f6c1bdf4 GPIO: Add support for gpio pin interrupts.
Add new function gpio_alloc_intr_resource(), which allows an allocation
of interrupt resource associated to given gpio pin. It also allows to
specify interrupt configuration.

Note: This functionality is dependent on INTRNG, and must be
implemented in each GPIO controller.
2016-04-28 12:03:22 +00:00
John Baldwin
e240255ffc Add a bus_null_rescan() method that always fails with an error.
Use this in place of kobj_error_method to disable BUS_RESCAN() on
PCI drivers that do not use the "standard" scanning algorithm.
2016-04-27 17:49:42 +00:00
John Baldwin
88eb5c506d Add 'devctl delete' that calls device_delete_child().
'devctl delete' can be used to delete a device that is no longer present.
As an anti-foot-shooting measure, 'delete' will not delete a device
unless it's parent bus says it is no longer present.  This can be
overridden by passing the force ('-f') flag.

Note that this command should be used with care.  If a device is deleted
that is actually present it can't be resurrected unless the parent bus
device's driver supports rescans.

Differential Revision:	https://reviews.freebsd.org/D6019
2016-04-27 16:33:17 +00:00
John Baldwin
a907c6914c Add a new rescan method to the bus interface.
The BUS_RESCAN() method rescans a single bus device checking for devices
that have been added or removed from the bus.  A new 'rescan' command is
added to devctl(8) to trigger a rescan.

Differential Revision:	https://reviews.freebsd.org/D6016
2016-04-27 16:29:03 +00:00
Jamie Gritton
73d9e52d2f Delay revmoing the last jail reference in prison_proc_free, and instead
put it off into the pr_task.  This is similar to prison_free, and in fact
uses the same task even though they do something slightly different.

This resolves a LOR between the process lock and allprison_lock, which
came about in r298565.

PR:		48471
2016-04-27 02:25:21 +00:00
Conrad Meyer
da95a2ae56 posix4_mib: Don't overrun facility_initialized array
The facility_initialized and facility arrays are the same size and were
intended to be indexed the same.  I believe this mismatch was just a
typo/braino in r208731.

Reported by:	Coverity
CID:		1017430
Sponsored by:	EMC / Isilon Storage Division
2016-04-27 00:10:32 +00:00
Conrad Meyer
a286650b08 subr_mbpool: Don't free bogus pointer in error paths
An mbpool is allocated with a contiguous array of mbpages.  Freeing an
individual mbpage has never been valid.  Don't do it.

This bug has been present since this code was introduced in r117624 (2003).

Reported by:	Coverity
CID:		1009687
Sponsored by:	EMC / Isilon Storage Division
2016-04-26 23:58:55 +00:00
Jamie Gritton
1fb6767d27 Use crcopysafe in jail_attach. 2016-04-26 21:19:12 +00:00
Conrad Meyer
aa90aec270 osd(9): Change array pointer to array pointer type from void*
This is a minor follow-up to r297422, prompted by a Coverity warning.  (It's
not a real defect, just a code smell.)  OSD slot array reservations are an
array of pointers (void **) but were cast to void* and back unnecessarily.
Keep the correct type from reservation to use.

osd.9 is updated to match, along with a few trivial igor fixes.

Reported by:	Coverity
CID:		1353811
Sponsored by:	EMC / Isilon Storage Division
2016-04-26 19:57:35 +00:00
Jamie Gritton
5579267b08 Redo the changes to the SYSV IPC sysctl functions from r298585, so they
don't (mis)use sbufs.

PR:		48471
2016-04-26 18:17:44 +00:00
Pedro F. Giffuni
55e0987aea sys: extend use of the howmany() macro when available.
We have a howmany() macro in the <sys/param.h> header that is
convenient to re-use as it makes things easier to read.
2016-04-26 15:38:17 +00:00
Ruslan Bukin
3a32292401 Add support for RISC-V. 2016-04-26 12:29:47 +00:00
Ruslan Bukin
30b72b6871 Move arm's devmap to some generic place, so it can be used
by other architectures.

Reviewed by:	imp
Differential Revision:	https://reviews.freebsd.org/D6091
Sponsored by:	DARPA, AFRL
Sponsored by:	HEIF5
2016-04-26 11:53:37 +00:00
Jamie Gritton
0bfd7a267e Fix the logic in r298585: shm_prison_cansee returns an errno, so is
the opposite of a boolean.

PR:		48471
2016-04-25 22:30:10 +00:00
Jamie Gritton
52a510ace9 Encapsulate SYSV IPC objects in jails. Define per-module parameters
sysvmsg, sysvsem, and sysvshm, with the following bahavior:

inherit: allow full access to the IPC primitives.  This is the same as
the current setup with allow.sysvipc is on.  Jails and the base system
can see (and moduly) each other's objects, which is generally considered
a bad thing (though may be useful in some circumstances).

disable: all no access, same as the current setup with allow.sysvipc off.

new: A jail may see use the IPC objects that it has created.  It also
gets its own IPC key namespace, so different jails may have their own
objects using the same key value.  The parent jail (or base system) can
see the jail's IPC objects, but not its keys.

PR:		48471
Submitted by:	based on work by kikuchan98@gmail.com
MFC after:	5 days
2016-04-25 17:06:50 +00:00
Jamie Gritton
add14c83aa Use the new PR_METHOD_REMOVE to clean up jail handling in POSIX
message queues.
2016-04-25 04:36:54 +00:00
Jamie Gritton
b6f47c231f Pass the current/new jail to PR_METHOD_CHECK, which pushes the call
until after the jail is found or created.  This requires unlocking the
jail for the call and re-locking it afterward, but that works because
nothing in the jail has been changed yet, and other processes won't
change the important fields as long as allprison_lock remains held.

Keep better track of name vs namelc in kern_jail_set.  Name should
always be the hierarchical name (relative to the caller), and namelc
the last component.

PR:		48471
MFC after:	5 days
2016-04-25 04:27:58 +00:00
Jamie Gritton
cc5fd8c748 Add a new jail OSD method, PR_METHOD_REMOVE. It's called when a jail is
removed from the user perspective, i.e. when the last pr_uref goes away,
even though the jail mail still exist in the dying state.  It will also
be called if either PR_METHOD_CREATE or PR_METHOD_SET fail.

PR:		48471
MFC after:	 5 days
2016-04-25 04:24:00 +00:00
Jamie Gritton
2a54950713 Remove the PR_REMOVE flag, which was meant as a temporary marker for
a jail that might be seen mid-removal.  It hasn't been doing the right
thing since at least the ability to resurrect dying jails, and such
resurrection also makes it unnecessary.
2016-04-25 03:58:08 +00:00
Pedro F. Giffuni
d9c9c81c08 sys: use our roundup2/rounddown2() macros when param.h is available.
rounddown2 tends to produce longer lines than the original code
and when the code has a high indentation level it was not really
advantageous to do the replacement.

This tries to strike a balance between readability using the macros
and flexibility of having the expressions, so not everything is
converted.
2016-04-21 19:57:40 +00:00
Edward Tomasz Napierala
bbe4eb6d54 Get rid of rctl_lock; use racct_lock where appropriate. The fast paths
already required both of them, so having a separate rctl_lock didn't
buy us anything.

Reviewed by:	mjg@
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D5914
2016-04-21 16:22:52 +00:00
Pedro F. Giffuni
8dfea46460 Remove slightly used const values that can be replaced with nitems().
Suggested by:	jhb
2016-04-21 15:38:28 +00:00
Konstantin Belousov
3e937c3a77 Arm and arm64 both have fueword() implemented for some time. Correct
the comment.

Sponsored by:	The FreeBSD Foundation
2016-04-20 17:28:21 +00:00
Pedro F. Giffuni
63b6b7a74a Indentation issues.
Contract some lines leftover from r298310.

Mea culpa.
2016-04-20 16:19:44 +00:00
Conrad Meyer
b483e111c4 kern_rctl: Fix resource leak in error path
Ordinarily, rctl_write_outbuf frees 'sb'.  However, if we are in low memory
conditions we skip past the rctl_write_outbuf.  In that case, free 'sb'.

Reported by:	Coverity
CID:		1338539
Sponsored by:	EMC / Isilon Storage Division
2016-04-20 02:09:38 +00:00
Pedro F. Giffuni
02abd40029 kernel: use our nitems() macro when it is available through param.h.
No functional change, only trivial cases are done in this sweep,

Discussed in:	freebsd-current
2016-04-19 23:48:27 +00:00
Edward Tomasz Napierala
74a7305a91 Fix debugging printf.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-04-19 13:36:31 +00:00
Konstantin Belousov
2cfddaa6ff Fix umtx lock/trylock for compat32.
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2016-04-19 11:37:43 +00:00
Mark Johnston
11748dae80 Use a loop instead of a goto in sysctl_kern_proc_kstack().
MFC after:	3 days
2016-04-17 23:22:32 +00:00
Konstantin Belousov
ccd0ec4066 The struct thread td_estcpu member is only used by the 4BSD scheduler.
Move it to the struct td_sched for 4BSD, removing always present
field, otherwise unused for ULE.

New scheduler method sched_estcpu() returns the estimation for
kinfo_proc consumption.  As before, it always returns 0 for ULE.

Remove sched_tick() scheduler method, unused both by 4BSD and ULE.

Update locking comment for the 4BSD struct td_sched, copying it from
the same comment for ULE.

Spell MAXPRI as PRI_MAX_TIMESHARE in the 4BSD comment.

Based on some notes from, and reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
2016-04-17 11:04:27 +00:00
Conrad Meyer
5dc5dab6eb Add 4Kn kernel dump support
(And 4Kn minidump support, but only for amd64.)

Make sure all I/O to the dump device is of the native sector size.  To
that end, we keep a native sector sized buffer associated with dump
devices (di->blockbuf) and use it to pad smaller objects as needed (e.g.
kerneldumpheader).

Add dump_write_pad() as a convenience API to dump smaller objects with
zero padding.  (Rather than pull in NPM leftpad, we wrote our own.)

Savecore(1) has been updated to deal with these dumps.  The format for
512-byte sector dumps should remain backwards compatible.

Minidumps for other architectures are left as an exercise for the
reader.

PR:		194279
Submitted by:	ambrisko@
Reviewed by:	cem (earlier version), rpokala
Tested by:	rpokala (4Kn/512 except 512 fulldump), cem (512 fulldump)
Relnotes:	yes
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D5848
2016-04-15 17:45:12 +00:00
Pedro F. Giffuni
b85f65af68 kern: for pointers replace 0 with NULL.
These are mostly cosmetical, no functional change.

Found with devel/coccinelle.
2016-04-15 16:10:11 +00:00
Edward Tomasz Napierala
23e6fff29d Allocate RACCT/RCTL zones without UMA_ZONE_NOFREE; no idea why it was there
in the first place.

MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-04-15 13:34:59 +00:00
Edward Tomasz Napierala
c1a43e73c5 Sort variable declarations.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-04-15 11:55:29 +00:00
Warner Losh
15be49f5a1 Create wrappers for uint64_t and int64_t for the tunables. While not
strictly necessary, it is more convenient.
2016-04-15 03:09:55 +00:00
Jamie Gritton
44c16975a2 Clean up some style(9) violations. 2016-04-14 17:07:26 +00:00
Jamie Gritton
adb023ae59 Separate POSIX mqueue objects in jails; actually, separate them by the
jail's root, so jails that don't have their own filesystem directory
also won't have their own mqueue namespace.

PR:		208082
2016-04-13 20:15:49 +00:00
Jamie Gritton
cc7b259a26 Separate POSIX sem/shm objects in jails, by prepending the jail's path
name to the object's "path".  While the objects don't have real path
names, it's a filesystem-like namespace, which allows jails to be
kept to their own space, but still allows the system / jail parent to
access a jail's IPC.

PR:		208082
2016-04-13 20:14:13 +00:00
Edward Tomasz Napierala
f459a81824 Fix overflow checking.
There are some other potential problems related to overflowing racct
counters; I'll revisit those later.

Submitted by:	Pieter de Goeje (earlier version)
Reviewed by:	emaste@
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-04-12 18:13:24 +00:00
Pedro F. Giffuni
74b8d63dcc Cleanup unnecessary semicolons from the kernel.
Found with devel/coccinelle.
2016-04-10 23:07:00 +00:00
John Baldwin
70e22add96 Add a function to lookup a device_t object by name.
This just walks the global list of devices looking for one with the
requested name.  The one use case outside of devctl2's implementation
is for DDB commands that wish to lookup devices by name.
2016-04-10 05:05:02 +00:00
John Baldwin
62d70a8174 Add more fine-grained kernel options for NUMA support.
VM_NUMA_ALLOC is used to enable use of domain-aware memory allocation in
the virtual memory system.  DEVICE_NUMA is used to enable affinity
reporting for devices such as bus_get_domain().

MAXMEMDOM must still be set to a value greater than for any NUMA support
to be effective.  Note that 'cpuset -gd' always works if MAXMEMDOM is
enabled and the system supports NUMA.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D5782
2016-04-09 13:58:04 +00:00
Bjoern A. Zeeb
029f99dcc4 Make the KASSERT message in hash destroy more informative.
While the pointer might not be too helpful, the malloc type might at
least give a good hint about which hashtbl we are talking.

MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Reviewed by:	gnn, emaste
Differential Revision:	https://reviews.freebsd.org/D5802
2016-04-09 09:24:05 +00:00
Edward Tomasz Napierala
8bd8c8f14c Make it possible to tweak RCTL throttling sysctls at runtime.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-04-08 18:15:31 +00:00
Andriy Gapon
a449bdba32 topo_set_pu_id: turn a check into an assertion
The new id must not be present in any cpu set in any topology element.

MFC after:	30 days
2016-04-08 11:59:11 +00:00
Konstantin Belousov
b715d9af68 Use the ABI-prescribed name for SHT_X86_64_UNWIND in the loader and
kernel linker, after the r297686.

Sponsored by:	The FreeBSD Foundation
2016-04-08 10:23:48 +00:00
Svatopluk Kraus
cf55df9f83 Fix intr_irq_shuffle(). After r297539, ISRCs doing IPI may be also
registered into global interrupt table. Thus, they must be filtered out
like per-cpu interrupts. Fortunately, it does not influence anything
on interrupt controllers which already use INTRNG.
2016-04-07 15:16:33 +00:00
Svatopluk Kraus
5b613c19b5 Implement intr_isrc_init_on_cpu() and use it to replace very same
code implemented in every interrupt controller driver running SMP.
This function returns true, if provided ISRC should be enabled on
given cpu.
2016-04-07 15:00:25 +00:00
Edward Tomasz Napierala
ae34b6ff96 Add four new RCTL resources - readbps, readiops, writebps and writeiops,
for limiting disk (actually filesystem) IO.

Note that in some cases these limits are not quite precise. It's ok,
as long as it's within some reasonable bounds.

Testing - and review of the code, in particular the VFS and VM parts - is
very welcome.

MFC after:	1 month
Relnotes:	yes
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D5080
2016-04-07 04:23:25 +00:00
Svatopluk Kraus
4be58cba48 Fix PIC lookup by device and xref. There was not taken into account
the situation that someone has a pointer to device but not its xref.
This situation is regular now, after r297539.
2016-04-06 12:48:45 +00:00
Edward Tomasz Napierala
4c230cdafd Use proper locking macros in RACCT in RCTL.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-04-05 11:30:52 +00:00
Andriy Gapon
c77702de74 x86 topo: add some comments, descriptions and references to documentation
Plus a minor cosmetic change.

MFC after:	1 month
2016-04-05 10:36:40 +00:00
Andriy Gapon
4725e6bff3 new x86 smp topology detection code
Previously, the code determined a topology of processing units
(hardware threads, cores, packages) and then deduced a cache topology
using certain assumptions.  The new code builds a topology that
includes both processing units and caches using the information
provided by the hardware.

At the moment, the discovered full topology is used only to creeate
a scheduling topology for SCHED_ULE.
There is no KPI for other kernel uses.

Summary:
- based on APIC ID derivation rules for Intel and AMD CPUs
- can handle non-uniform topologies
- requires homogeneous APIC ID assignment (same bit widths for ID
  components)
- topology for dual-node AMD CPUs may not be optimal
- topology for latest AMD CPU models may not be optimal as the code is
  several years old
- supports only thread/package/core/cache nodes

Todo:
  - AMD dual-node processors
  - latest AMD processors
  - NUMA nodes
  - checking for homogeneity of the APIC ID assignment across packages
  - more flexible cache placement within topology
  - expose topology to userland, e.g., via sysctl nodes

Long term todo:
  - KPI for CPU sharing and affinity with respect to various resources
    (e.g., two logical processors may share the same FPU, etc)

Reviewed by:	mav
Tested by:	mav
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D2728
2016-04-04 16:09:29 +00:00
Andrew Turner
6b42a1f4c0 Include sys/rman.h directly rather than relying on header pollution.
Obtained from:	ABT Systems Ltd
Sponsored by:	The FreeBSD Foundation
2016-04-04 10:52:43 +00:00
Svatopluk Kraus
bff6be3e9b Remove FDT specific parts from INTRNG. Change its interface to make it
universal.

(1) New struct intr_map_data is defined as a container for arbitrary
description of an interrupt used by a device. Typically, an interrupt
number and configuration relevant to an interrupt controller is encoded
in such description. However, any additional information may be encoded
too like a set of cpus on which an interrupt should be enabled or vendor
specific data needed for setup of an interrupt in controller. The struct
intr_map_data itself is meant to be opaque for INTRNG.

(2) An intr_map_irq() function is created which takes an interrupt
controller identification and struct intr_map_data as arguments and
returns global interrupt number which identifies an interrupt.

(3) A set of functions to be used by bus drivers is created as well as
a corresponding set of methods for interrupt controller drivers. These
sets take both struct resource and struct intr_map_data as one of the
arguments. There is a goal to keep struct intr_map_data in struct
resource, however, this way a final solution is not limited to that.

(4) Other small changes are done to reflect new situation.

This is only first step aiming to create stable interface for interrupt
controller drivers. Thus, some temporary solution is taken. Interrupt
descriptions for devices are stored in INTRNG and two specific mapping
function are created to be temporary used by bus drivers. That's why
the struct intr_map_data is not opaque for INTRNG now. This temporary
solution will be replaced by final one in next step.

Differential Revision:	https://reviews.freebsd.org/D5730
2016-04-04 09:15:25 +00:00
Edward Tomasz Napierala
f70c075e32 Add configurable rate limit for "log" and "devctl" actions.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-04-02 09:11:52 +00:00
Edward Tomasz Napierala
097e0da79d Fix mismerge.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-04-01 18:45:04 +00:00
Edward Tomasz Napierala
862d03fb7f Drop the 'resource' argument to racct_decay(); it wouldn't make sense
to iterate separately for each resource.

MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-04-01 18:36:10 +00:00
John Baldwin
611fcff994 Cap IOSIZE_MAX to INT_MAX for 32-bit processes.
Previously, freebsd32 binaries could submit read/write requests with lengths
greater than INT_MAX that a native kernel would have rejected.

Reviewed by:	kib
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D5788
2016-04-01 18:29:38 +00:00
Edward Tomasz Napierala
659e74662f Call rctl_enforce() in all cases the resource usage goes up, even when called
from racct_*_force() functions.  It makes the "log" and "devctl" actions work
in those cases.

MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-04-01 17:28:55 +00:00
Edward Tomasz Napierala
0b9f1ecb87 Reorder the functions; no functional changes.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-04-01 17:21:55 +00:00
Edward Tomasz Napierala
7cea96f606 Reduce code duplication.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-04-01 17:17:32 +00:00
Edward Tomasz Napierala
1028719823 Reduce code duplication. There should be no (intended) functional changes.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-04-01 17:05:46 +00:00
Sean Bruno
910938f079 Repair a overflow condition where a user could submit a string that was
not getting a proper bounds check.

Thanks to CTurt for pointing at this with a big red blinking neon sign.

PR:		206761
Submitted by:	sson
Reviewed by:	cturt@hardenedbsd.org
MFC after:	3 days
2016-04-01 16:16:26 +00:00
John Baldwin
b4f1d267b7 Rework handling of thread sleeps before timers are working.
Previously, calls to *sleep() and cv_*wait*() immediately returned during
early boot.  Instead, permit threads that request a sleep without a
timeout to sleep as wakeup() works during early boot.  Sleeps with
timeouts are harder to emulate without working timers, so just punt and
panic explicitly if any thread tries to use those before timers are
working.  Any threads that depend on timeouts should either wait until
SI_SUB_KICK_SCHEDULER to start or they should use DELAY() until timers
are available.

Until APs are started earlier this should be a no-op as other kthreads
shouldn't get a chance to start running until after timers are working
regardless of when they were created.

Reviewed by:	kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D5724
2016-03-31 18:10:29 +00:00
Edward Tomasz Napierala
ac3c9819ab Refactor; no functional changes.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-03-31 17:32:28 +00:00
John Baldwin
4d805eacfa Tidy up the unmapped I/O code in qphysio.
- Move some blocks around to reduce the number of 'if (unmap)' checks.
- Use 'pbuf == NULL' instead of 'unmap'.
- Use nitems.
- Pull an assignment out of an if expression.

Reviewed by:	kib
Sponsored by:	Chelsio Communications
2016-03-31 17:27:30 +00:00
Edward Tomasz Napierala
b450d4479d Fix overflows, making it impossible to add negative amounts using rctl(8).
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-03-31 17:00:47 +00:00
Jamie Gritton
320d842101 Add osd_reserve() and osd_set_reserved(), which allow M_WAITOK allocation
of an OSD array,
2016-03-30 16:57:28 +00:00
Gleb Smirnoff
9c64cfe56c The sendfile(2) allows to send extra data from userspace before the file
data (headers).  Historically the size of the headers was not checked
against the socket buffer space.  Application could easily overcommit the
socket buffer space.

With the new sendfile (r293439) the problem remained, but a KASSERT was
inserted that checked that amount of data written to the socket matches
its space.  In case when size of headers is bigger that socket space,
KASSERT fires.  Without INVARIANTS the new sendfile won't panic, but
would report incorrect amount of bytes sent.

o With this change, the headers copyin is moved down into the cycle, after
  the sbspace() check.  The uio size is trimmed by socket space there,
  which fixes the overcommit problem and its consequences.
o The compatibility handling for FreeBSD 4 sendfile headers API is pushed
  up the stack to syscall wrappers.  This required a copy and paste of the
  code, but in turn this allowed to remove extra stack carried parameter
  from fo_sendfile_t, and embrace entire compat code into #ifdef.  If in
  future we got more fo_sendfile_t function, the copy and paste level would
  even reduce.

Reviewed by:	emax, gallatin, Maxim Dounin <mdounin mdounin.ru>
Tested by:	Vitalij Satanivskij <satan ukr.net>
Sponsored by:	Netflix
2016-03-29 19:57:11 +00:00
Edward Tomasz Napierala
35030a5dd4 Remove some NULL checks for M_WAITOK allocations.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-03-29 13:56:59 +00:00
Jamie Gritton
859f4d0611 Move the various per-type arrays of OSD data into a single structure array. 2016-03-28 22:18:37 +00:00
Warner Losh
cb49b65481 Move pccard_safe_quote() up to subr_bus.c and rename to
devctl_safe_quote() so it can be used more generally.
2016-03-28 20:16:29 +00:00
Navdeep Parhar
fddd4f6273 Plug leak in m_unshare.
m_unshare passes on the source mbuf's flags as-is to m_getcl and this
results in a leak if the flags include M_NOFREE.  The fix is to clear
the bits not listed in M_COPYALL before calling m_getcl.  M_RDONLY
should probably be filtered out too but that's outside the scope of this
fix.

Add assertions in the zone_mbuf and zone_pack ctors to catch similar
bugs.

Update netmap_get_mbuf to not pass M_NOFREE to m_getcl.  It's not clear
what the original code was trying to do but it's likely incorrect.
Updated code is no different functionally but it avoids the newly added
assertions.

Reviewed by:	gnn@
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D5698
2016-03-26 23:39:53 +00:00
Conrad Meyer
c975a5d359 Add td_swinvoltick to track last involuntary context switch
Expose in DDB via "show thread."

Reviewed by:	markj
Sponsored by:	EMC / Isilon Storage Division
2016-03-25 19:35:29 +00:00
Gleb Smirnoff
c8f59118be Space and style(9) corrections for recent mbuf changes. 2016-03-24 20:06:52 +00:00
Svatopluk Kraus
61c8fde5d6 Generalize IPI support for ARM intrng and use it for interrupt
controller IPI provider.

New struct intr_ipi is defined which keeps all info about an IPI:
its name, counter, send and dispatch methods. Generic intr_ipi_setup(),
intr_ipi_send() and intr_ipi_dispatch() functions are implemented.

An IPI provider must implement two functions:
(1) an intr_ipi_send_t function which is able to send an IPI,
(2) a setup function which initializes itself for an IPI and
    calls intr_ipi_setup() with appropriate arguments.

Differential Revision:	https://reviews.freebsd.org/D5700
2016-03-24 09:55:11 +00:00
George V. Neville-Neil
dcd070d80a Move mbuf provider under SDT to indicate that it is FreeBSD specific
and not a stable interface.

Reviewed by:	markj
MFC after:	2 weeks
Sponsored by:	Rubicon Communications (Netgate)
Differential Revision:	https://reviews.freebsd.org/D5716
2016-03-24 08:26:06 +00:00
Bryan Drewery
1c9b29f95f Pass the expected struct radix_node_head * to vfs_free_netcred.
No functional change.

struct radix_node_head's first element is rh so this was already
referring to the same address.  It was likely an unintended
s/rnh/&rnh->rh/ change from r294706 as all other rnh_walktree() callers
pass the expected struct radix_node_head * rather than obscurely passing
the address of their first element.

Sponsored by:	EMC / Isilon Storage Division
2016-03-24 04:40:07 +00:00
Bryan Drewery
57a8e34141 Fix M_RTABLE memory leak from r274118 (11/2014).
Replace free(M_RTABLE) with rn_detachhead() to match rn_inithead().

This would trigger when reloading NFS exports and was similar to
problems with pf reload [1].

PR:		194078 [1]
Sponsored by:	EMC / Isilon Storage Division
2016-03-24 03:08:39 +00:00
Edward Tomasz Napierala
68d35798b9 Wait for root mount tokens before showing the root mount prompt.
This restores the pre-r290196 behaviour, eliminating the need to manually
press '.' a couple of times to get USB to finish probing.

Note that there's still something wrong with the console (character
echoing doesn't quite work), and there's also a reported problem with
BHyVe, but those two don't seem related to the problem above.

MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-03-22 13:46:01 +00:00
George V. Neville-Neil
480f4e946d Add an mbuf provider to DTrace.
The mbuf provider is made up of a set of Statically Defined Tracepoints
which help us look into mbufs as they are allocated and freed.  This can be
used to inspect the buffers or for a simplified mbuf leak detector.

New tracepoints are:

mbuf:::m-init
mbuf:::m-gethdr
mbuf:::m-get
mbuf:::m-getcl
mbuf:::m-clget
mbuf:::m-cljget
mbuf:::m-cljset
mbuf:::m-free
mbuf:::m-freem

There is also a translator for mbufs which gives some visibility into the structure,
see mbuf.d for more details.

Reviewed by:	bz, markj
MFC after:	2 weeks
Sponsored by:	Rubicon Communications (Netgate)
Differential Revision:	https://reviews.freebsd.org/D5682
2016-03-22 13:16:52 +00:00
John Baldwin
70f52fd69f Regen. 2016-03-21 21:38:35 +00:00
John Baldwin
bb430bc740 Fully handle size_t lengths in AIO requests.
First, update the return types of aio_return() and aio_waitcomplete() to
ssize_t.

POSIX requires aio_return() to return a ssize_t so that it can represent
all return values from read() and write().  aio_waitcomplete() should use
ssize_t for the same reason.

aio_return() has used ssize_t in <aio.h> since r31620 but the manpage and
system call entry were not updated.  aio_waitcomplete() has always
returned int.

Note that this does not require new system call stubs as this is
effectively only an API change in how the compiler interprets the return
value.

Second, allow aio_nbytes values up to IOSIZE_MAX instead of just INT_MAX.

aio_read/write should now honor the same length limits as normal read/write.

Third, use longs instead of ints in the aio_return() and aio_waitcomplete()
system call functions so that the 64-bit size_t in the in-kernel aiocb
isn't truncated to 32-bits before being copied out to userland or
being returned.

Finally, a simple test has been added to verify the bounds checking on the
maximum read size from a file.
2016-03-21 21:37:33 +00:00
Maxim Konovalov
ab77175063 o "avaliable" -> "available".
PR:		208141
Submitted by:	Tyler Littlefield
2016-03-21 08:03:50 +00:00
Pedro F. Giffuni
5166fdde7f aio_qphysio(): Avoid uninitialized pointer read on error.
For the !unmap case it may happen that pbuf gets called unreferenced
when vm_fault_quick_hold_pages() fails.
Initialize it so it doesn't cause trouble.

CID:		1352776
Reviewed by:	jhb
MFC after:	1 week
2016-03-18 19:04:01 +00:00
Justin Hibbits
da1b038af9 Use uintmax_t (typedef'd to rman_res_t type) for rman ranges.
On some architectures, u_long isn't large enough for resource definitions.
Particularly, powerpc and arm allow 36-bit (or larger) physical addresses, but
type `long' is only 32-bit.  This extends rman's resources to uintmax_t.  With
this change, any resource can feasibly be placed anywhere in physical memory
(within the constraints of the driver).

Why uintmax_t and not something machine dependent, or uint64_t?  Though it's
possible for uintmax_t to grow, it's highly unlikely it will become 128-bit on
32-bit architectures.  64-bit architectures should have plenty of RAM to absorb
the increase on resource sizes if and when this occurs, and the number of
resources on memory-constrained systems should be sufficiently small as to not
pose a drastic overhead.  That being said, uintmax_t was chosen for source
clarity.  If it's specified as uint64_t, all printf()-like calls would either
need casts to uintmax_t, or be littered with PRI*64 macros.  Casts to uintmax_t
aren't horrible, but it would also bake into the API for
resource_list_print_type() either a hidden assumption that entries get cast to
uintmax_t for printing, or these calls would need the PRI*64 macros.  Since
source code is meant to be read more often than written, I chose the clearest
path of simply using uintmax_t.

Tested on a PowerPC p5020-based board, which places all device resources in
0xfxxxxxxxx, and has 8GB RAM.
Regression tested on qemu-system-i386
Regression tested on qemu-system-mips (malta profile)

Tested PAE and devinfo on virtualbox (live CD)

Special thanks to bz for his testing on ARM.

Reviewed By: bz, jhb (previous)
Relnotes:	Yes
Sponsored by:	Alex Perez/Inertial Computing
Differential Revision: https://reviews.freebsd.org/D4544
2016-03-18 01:28:41 +00:00
Conrad Meyer
82e17adcb4 fail(9): Only gather/print stacks if STACK is enabled
This is a follow-up fix to the earlier r296927.

Reported by:	bz
Sponsored by:	EMC / Isilon Storage Division
2016-03-17 01:05:53 +00:00
Conrad Meyer
70e20d4e1a fail(9): Upstreaming some fail point enhancements
This is several year's worth of fail point upgrades done at EMC Isilon. They
are interdependent enough that it makes sense to put a single diff up for them.
Primarily, we added:

- Changing all mainline execution paths to be lockless, which lets us use fail
  points in more sleep-sensitive areas, and allows more parallel execution
- A number of additional commands, including 'pause' that lets us do some
  interesting deterministic repros of race conditions
- The ability to dump the stacks of all threads sleeping on a fail point
- A number of other API changes to allow marking up the fail point's context in
  the code, and firing callbacks before and after execution
- A man page update

Submitted by:	Matthew Bryan <matthew.bryan@isilon.com>
Reviewed by:	cem (earlier version), jhb, kib, pho
With feedback from:	bdrewery
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D5427
2016-03-16 04:22:32 +00:00
Gleb Smirnoff
1d52250154 Free the temporary buffer in sysctl_handle_counter_u64_array().
Submitted by:	mjg
2016-03-15 00:21:32 +00:00
Gleb Smirnoff
b5b7b142a7 Provide sysctl(9) macro to deal with array of counter(9). 2016-03-15 00:05:00 +00:00
Justin T. Gibbs
5405e7e2ee Provide high precision conversion from ns,us,ms -> sbintime in kevent
In timer2sbintime(), calculate the second and fractional second portions of
the sbintime separately. When calculating the the fractional second portion,
use a 64bit multiply to prevent excess truncation. This avoids the ~7% error
in the original conversion for ns, and smaller errors of the same type for us
and ms.

PR: 198139
Reviewed by: jhb
MFC after: 1 week
Differential Revision:    https://reviews.freebsd.org/D5397
2016-03-12 23:02:53 +00:00
John Baldwin
f479b2ac89 Do not include system call wrappers in libc for old FreeBSD system calls.
The base system libc is only used to run binaries built on FreeBSD 7.0 and
later.  It does not need to include system call wrappers for system calls
only used by FreeBSD binaries built on versions older than 7.0.  This was
already true for "COMPAT" system calls, but now wrappers for system calls
used on FreeBSD 4 and 6 are excluded as well.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D5597
2016-03-12 22:53:46 +00:00
Edward Tomasz Napierala
2a5a08cb38 Refactor the way we restore cn_lkflags; no functional changes.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-03-12 09:05:43 +00:00
Edward Tomasz Napierala
f69db55151 Remove cn_consume from 'struct componentname'. It was never set to anything
other than 0.

Reviewed by:	kib@
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D5611
2016-03-12 08:50:38 +00:00
Edward Tomasz Napierala
213ed83855 Fix autofs triggering problem. Assume you have an NFS server,
192.168.1.1, with share "share". This commit fixes a problem
where "mkdir /net/192.168.1.1/share/meh" would return spurious
error instead of creating the directory if the target filesystem
wasn't mounted yet; subsequent attempts would work correctly.

The failure scenario is kind of complicated to explain, but it all
boils down to calling VOP_MKDIR() for the target filesystem (NFS)
with wrong dvp - the autofs vnode instead of the filesystem root
mounted over it.

Reviewed by:	kib@
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D5442
2016-03-12 07:54:42 +00:00
John Baldwin
47cedcbd72 Use SI_SUB_LAST instead of SI_SUB_SMP as the "catch-all" subsystem.
Reviewed by:	kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D5515
2016-03-11 23:18:06 +00:00
John Baldwin
8d91aced32 Regen. 2016-03-09 19:06:46 +00:00
John Baldwin
399e8c1773 Simplify AIO initialization now that it is standard.
- Mark AIO system calls as STD and remove the helpers to dynamically
  register them.
- Use COMPAT6 for the old system calls with the older sigevent instead of
  an 'o' prefix.
- Simplify the POSIX configuration to note that AIO is always available.
- Handle AIO in the default VOP_PATHCONF instead of special casing it in
  the pathconf() system call.  fpathconf() is still hackish.
- Remove freebsd32_aio_cancel() as it just called the native one directly.

Reviewed by:	kib
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D5589
2016-03-09 19:05:11 +00:00
Konstantin Belousov
3cfce8e4df Convert all panics from the link_elf_obj kernel linker for object
files format into printfs and errors to caller.  Some leaks of
resources are there, but the same leaks are present in other error
pathes.  With the change, the kernel at least boots even when module
with unexpected or corrupted ELF structure is preloaded.

Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-03-07 18:44:06 +00:00
Konstantin Belousov
13f28d969a In the link_elf_obj.c, handle sections of type SHT_AMD64_UNWIND same
as SHT_PROGBITS.  This is needed after the clang 3.8 import, which
generates that type for .eh_frame section, which had SHT_PROGBITS type
before.

Reported by:	 Nikolai Lifanov <lifanov@mail.lifanov.com>
PR:	207729
Tested by:	dim (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-03-06 00:31:11 +00:00
Justin Hibbits
534ccd7bbf Replace all resource occurrences of '0UL/~0UL' with '0/~0'.
Summary:
The idea behind this is '~0ul' is well-defined, and casting to uintmax_t, on a
32-bit platform, will leave the upper 32 bits as 0.  The maximum range of a
resource is 0xFFF.... (all bits of the full type set).  By dropping the 'ul'
suffix, C type promotion rules apply, and the sign extension of ~0 on 32 bit
platforms gets it to a type-independent 'unsigned max'.

Reviewed By: cem
Sponsored by:	Alex Perez/Inertial Computing
Differential Revision: https://reviews.freebsd.org/D5255
2016-03-03 05:07:35 +00:00
Konstantin Belousov
5db9ed8062 If callout_stop_safe() noted that the callout is currently executing,
but next invocation is cancelled while migrating,
sleepq_check_timeout() needs to be informed that the callout is
stopped.  Otherwise the thread switches off CPU and never become
runnable, since running callout could have already raced with us,
while the migrating and cancelled callout could be one which is
expected to set TDP_TIMOFAIL flag for us.  This contradicts with the
expected behaviour of callout_stop() for other callers, which
e.g. decrement references from the callout callbacks.

Add a new flag CS_MIGRBLOCK requesting report of the situation as
'successfully stopped'.

Reviewed by:	jhb (previous version)
Tested by:	cognet, pho
PR:	200992
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D5221
2016-03-02 18:46:17 +00:00
Gleb Smirnoff
0ea37a86ba Fix regression in r296242 affecting several drivers. For EXT_NET_DRV,
EXT_MOD_TYPE, EXT_DISPOSABLE types we should first execute the free
callback, then free the mbuf, otherwise we will derefernce memory that
was just freed.

Reported and tested:	jhibbits
2016-03-02 02:12:01 +00:00
Bryan Drewery
3b77ac5f9e Correct a comment. 2016-03-01 23:58:53 +00:00
John Baldwin
7b8cfe26a0 Use SCHEDULER_STOPPED() in cv_*wait*() instead of checking panicstr.
Reviewed by:	kib
MFC after:	1 month
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D5516
2016-03-01 22:51:44 +00:00
John Baldwin
f3215338ef Refactor the AIO subsystem to permit file-type-specific handling and
improve cancellation robustness.

Introduce a new file operation, fo_aio_queue, which is responsible for
queueing and completing an asynchronous I/O request for a given file.
The AIO subystem now exports library of routines to manipulate AIO
requests as well as the ability to run a handler function in the
"default" pool of AIO daemons to service a request.

A default implementation for file types which do not include an
fo_aio_queue method queues requests to the "default" pool invoking the
fo_read or fo_write methods as before.

The AIO subsystem permits file types to install a private "cancel"
routine when a request is queued to permit safe dequeueing and cleanup
of cancelled requests.

Sockets now use their own pool of AIO daemons and service per-socket
requests in FIFO order.  Socket requests will not block indefinitely
permitting timely cancellation of all requests.

Due to the now-tight coupling of the AIO subsystem with file types,
the AIO subsystem is now a standard part of all kernels.  The VFS_AIO
kernel option and aio.ko module are gone.

Many file types may block indefinitely in their fo_read or fo_write
callbacks resulting in a hung AIO daemon.  This can result in hung
user processes (when processes attempt to cancel all outstanding
requests during exit) or a hung system.  To protect against this, AIO
requests are only permitted for known "safe" files by default.  AIO
requests for all file types can be enabled by setting the new
vfs.aio.enable_usafe sysctl to a non-zero value.  The AIO tests have
been updated to skip operations on unsafe file types if the sysctl is
zero.

Currently, AIO requests on sockets and raw disks are considered safe
and are enabled by default.  aio_mlock() is also enabled by default.

Reviewed by:	cem, jilles
Discussed with:	kib (earlier version)
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D5289
2016-03-01 18:12:14 +00:00
John Baldwin
cbc4d2db75 Remove taskqueue_enqueue_fast().
taskqueue_enqueue() was changed to support both fast and non-fast
taskqueues 10 years ago in r154167.  It has been a compat shim ever
since.  It's time for the compat shim to go.

Submitted by:	Howard Su <howard0su@gmail.com>
Reviewed by:	sephe
Differential Revision:	https://reviews.freebsd.org/D5131
2016-03-01 17:47:32 +00:00
Svatopluk Kraus
32d001e479 Remove an alternative way for dealing with root interrupt controller
which is not complete. Likely, it was forgotten and not removed before
committing.
2016-03-01 11:27:58 +00:00
Svatopluk Kraus
169e6abd8f Mark other parts of interrupt framework as INTR_SOLO option specific.
Note that isrc_arg member of struct intr_irqsrc is used only for
INTR_SOLO and IPI filter. This should be remembered if IPI filters
and their arguments will be stored on another place.

This option could be unusable very soon, if interrupt controllers
implementations will not be implemented considering it.
2016-03-01 10:57:29 +00:00
Gleb Smirnoff
56a5f52e80 New way to manage reference counting of mbuf external storage.
The m_ext.ext_cnt pointer becomes a union. It can now hold the refcount
value itself. To tell that m_ext.ext_flags flag EXT_FLAG_EMBREF is used.
The first mbuf to attach a cluster stores the refcount. The further mbufs
to reference the cluster point at refcount in the first mbuf. The first
mbuf is freed only when the last reference is freed.

The benefit over refcounts stored in separate slabs is that now refcounts
of different, unrelated mbufs do not share a cache line.

For EXT_EXTREF mbufs the zone_ext_refcnt is no longer needed, and m_extadd()
becomes void, making widely used M_EXTADD macro safe.

For EXT_SFBUF mbufs the sf_ext_ref() is removed, which was an optimization
exactly against the cache aliasing problem with regular refcounting.

Discussed with:		rrs, rwatson, gnn, hiren, sbruno, np
Reviewed by:		rrs
Differential Revision:	https://reviews.freebsd.org/D5396
Sponsored by:		Netflix
2016-03-01 00:17:14 +00:00
Konstantin Belousov
1bdbd70599 Implement process-shared locks support for libthr.so.3, without
breaking the ABI.  Special value is stored in the lock pointer to
indicate shared lock, and offline page in the shared memory is
allocated to store the actual lock.

Reviewed by:	vangyzen (previous version)
Discussed with:	deischen, emaste, jhb, rwatson,
	Martin Simmons <martin@lispworks.com>
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2016-02-28 17:52:33 +00:00
Svatopluk Kraus
5b70c08cdf Move IPI related parts back to (ARM) machine specific file now, when
the interrupt framework is also going to be used by another (MIPS)
architecture. IPI implementations may vary much across different
architectures.

An IPI implementation should still define INTR_IPI_COUNT and use
intr_ipi_setup_counters() to setup IPI counters which are inside of
intrcnt[] and intrnames[] arrays. Those are used for sysctl and ddb.
Then, intr_ipi_increment_count() should be used to increment obtained
counter.

Reviewed by:	imp
Differential Revision:	https://reviews.freebsd.org/D5459
2016-02-27 12:03:07 +00:00
Ed Schouten
afc055d90f Remove the errno argument from unp_drop().
While there, add a comment to clarify that ECONNRESET should always be
returned for POSIX conformance.

Suggested by:	Steven Hartland
2016-02-26 12:46:34 +00:00
Mark Johnston
0acf5d0bfd Improve error handling for posix_fallocate(2) and posix_fadvise(2).
- Set td_errno so that ktrace and dtrace can obtain the syscall error
  number in the usual way.
- Pass negative error numbers directly to the syscall layer, as they're
  not intended to be returned to userland.

Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D5425
2016-02-25 19:58:23 +00:00
Ed Schouten
72c8072ee5 Make asynchronous connection failures on UNIX sockets fail with ECONNRESET.
While making CloudABI work well on Linux, I discovered that I had a
FreeBSD-ism in one of my unit tests. The test did the following:

- Create UNIX socket 1, bind it, make it listen.
- Create UNIX socket 2, connect it to UNIX socket 1.
- Close UNIX socket 1.
- Obtain SO_ERROR from socket 2.

On FreeBSD this returns ECONNABORTED, while on Linux it returns
ECONNRESET. I dug through some of the relevant specifications[1] and it
looks like Linux is all right here. ECONNABORTED should only be returned
when the local connection (socket 2) is aborted; not the peer (socket 1).

It is of course slightly misleading: the function in which we set this
error is called uipc_abort(), but keep in mind that we're aborting the
peer, thus resetting the local socket.

[1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/connect.html

Reviewed by:	cem
Sponsored by:	Nuxi, the Netherlands
Differential Revision:	https://reviews.freebsd.org/D5419
2016-02-24 17:10:32 +00:00
Konstantin Belousov
0791e0c0e7 Provide more correct sizing of the KVA consumed by a vnode, used by
the virtvnodes calculation.  Include the size of fs-specific v_data as
the nfs nclnode inline, the NFS nclnode is bigger than either ZFS
znode or UFS inode.  Include the size of namecache_ts and short cache
path element, multiplied by the name cache population factor, again
inline.

Inline defines are used to avoid pollution of the vnode.h with the
subsystem-private objects.  Non-significant unsynchronized changes of
the definitions are fine, we do not care about that precision, and
e.g. ZFS consumes much malloced memory per vnode for reasons
unaccounted in the formula.

Lower the partition of kmem dedicated to vnodes, from 1/7 to 1/10.

The measures reduce vnode cache pressure on kmem and bring the vnode
cache memory use below some apparent thresholds that were exceeded by
r291244 due to more robust vnode reuse.

Reported and tested by:	marius (i386, previous version)
Reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-02-24 15:15:46 +00:00
Bryan Drewery
200241b504 Fix build after r295934. 2016-02-23 23:37:10 +00:00
Mariusz Zaborski
94e6fdd806 According to the sys/kern/capabilities.conf, gethostid(3) should be allowed.
Pointed out by:	Milosz Kaniewski <m.kaniewski@wheelsystems.com>
Approved by:	pjd (mentor)
MFC after:	3 days
Sponsored by:	Wheel Systems, http://wheelsystems.com
2016-02-23 22:02:25 +00:00
Ian Lepore
85143dd18d Allow a dynamic env to override a compiled-in static env by passing in the
override indication in the env data.

Submitted by:	bde
2016-02-21 18:35:01 +00:00
Justin Hibbits
7915adb560 Introduce a RMAN_IS_DEFAULT_RANGE() macro, and use it.
This simplifies checking for default resource range for bus_alloc_resource(),
and improves readability.

This is part of, and related to, the migration of rman_res_t from u_long to
uintmax_t.

Discussed with:	jhb
Suggested by:	marcel
2016-02-20 01:32:58 +00:00
Mark Johnston
88c2beac9c Ensure that we test the event condition when a disabled kevent is enabled.
r274560 modified kqueue_register() to only test the event condition if the
corresponding knote is not disabled. However, this check takes place before
the EV_ENABLE flag is used to clear the KN_DISABLED flag on the knote, so
enabling a previously-disabled kevent would not result in a notification for
a triggered event. This change fixes the problem by testing for EV_ENABLED
before possibly checking the event condition.

This change also updates a kqueue regression test to exercise this case.

PR:		206368
Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D5307
2016-02-19 01:49:33 +00:00
Mark Johnston
fe169828c3 Return an error if both EV_ENABLE and EV_DISABLE are specified for a kevent.
Currently, this combination results in EV_DISABLE being ignored.

Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D5307
2016-02-19 01:35:01 +00:00
Zbigniew Bodek
910905c74f Fix build for i386 and arm64 after r295755
- Take bus_space_tag_t type into consideration when returning
  default, zero value.
- Include missing rman.h required by ofw_pci.h
2016-02-18 15:44:45 +00:00
Zbigniew Bodek
b998c9656b Introduce bus_get_bus_tag() method
Provide bus_get_bus_tag() for sparc64, powerpc, arm, arm64 and mips
nexus and its children in order to return a platform specific default tag.

This is required to ensure generic correctness of the bus_space tag.
It is especially needed for arches where child bus tag does not match
the parent bus tag. This solves the problem with ppc architecture
where the PCI bus tag differs from parent bus tag which is big-endian.

This commit is a part of the following patch:
https://reviews.freebsd.org/D4879

Submitted by:  Marcin Mazurek <mma@semihalf.com>
Obtained from: Semihalf
Sponsored by:  Annapurna Labs
Reviewed by:   jhibbits, mmel
Differential Revision: https://reviews.freebsd.org/D4879
2016-02-18 13:00:04 +00:00
Konstantin Belousov
fa48f413ef In bnoreuselist(), check both ends of the specified logical block
numbers range.

This effectively skips indirect and extdata blocks on the buffer
queue.  Since their logical block numbers are negative, bnoreuselist()
could loop infinitely.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-02-17 19:39:57 +00:00
Warner Losh
c55f57071a Create an API to reset a struct bio (g_reset_bio). This is mandatory
for all struct bio you get back from g_{new,alloc}_bio. Temporary
bios that you create on the stack or elsewhere should use this before
first use of the bio, and between uses of the bio. At the moment, it
is nothing more than a wrapper around bzero, but that may change in
the future. The wrapper also removes one place where we encode the
size of struct bio in the KBI.
2016-02-17 17:16:02 +00:00
Andrew Turner
96922be6b5 Remove an unused FDT header, fdt_common.h should only be needed in a few
places, mostly in sys/dev/fdt and legacy code.

Sponsored by:	ABT Systems Ltd
2016-02-15 17:05:03 +00:00
Adrian Chadd
0cc5515a85 Allow MIPS INTRNG code to be built without FDT support.
This patch allows the newly imported INTRNG code to be built without necessarily
having FDT support in the kernel.  This may be useful for some MIPS platforms
that wish to move to INTRNG, but not to FDT at the same time.

Basically all the code is already within ifdef's where FDT is concerned,
it's just the headers that aren't.

Submitted by:	Stanislav Galabov <sgalabov@gmail.com>
Differential Revision:	https://reviews.freebsd.org/D5249
2016-02-15 14:34:35 +00:00
Gleb Smirnoff
5e4bc63b7c o Gather all mbuf(9) allocation functions into kern_mbuf.c, and all
mbuf(9) manipulation functions into uipc_mbuf.c.  This looks like
  the initial intent, but had diffused in the last decade.

o Gather all declarations in mbuf.h in one place and sort them.

o Uninline m_clget() and m_cljget().

There are no functional changes in this patch.

The patch comes from a larger version, where all mbuf(9) allocation was
uninlined, which allowed to make mbuf(9) UMA zones private to kern_mbuf.c.
The performance impact of the total uninlining is still unclear, so we
are holding on now with larger version.

Together with:	melifaro, olivier
2016-02-11 21:32:23 +00:00
Konstantin Belousov
0be1e0e879 Remove useless checks for NULL before calling free(9), in the kernel
elf linkers.

Found by:	Related PVS-Studio diagnostic
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-02-10 21:35:00 +00:00
Konstantin Belousov
86a448c3a4 Finish r173600. There is no need to test a condition if both cases
result in the same value.

Found by:	PVS-Studio
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-02-10 21:16:37 +00:00
Gleb Smirnoff
b4b12e52fb Garbage collect unused arguments of m_init(). 2016-02-10 18:54:18 +00:00
Gleb Smirnoff
b28cc462ad Include sys/_task.h into uma_int.h, so that taskqueue.h isn't a
requirement for uma_int.h.

Suggested by:	jhb
2016-02-09 20:22:35 +00:00
Konstantin Belousov
db57c70a5b Rename P_KTHREAD struct proc p_flag to P_KPROC.
I left as is an apparent bug in ntoskrnl_var.h:AT_PASSIVE_LEVEL()
definition.

Suggested by:	jhb
Sponsored by:	The FreeBSD Foundation
2016-02-09 16:30:16 +00:00
John Baldwin
fb1f4582ff Call kthread_exit() rather than kproc_exit() for a premature kthread exit.
Kernel threads (and processes) are supposed to call kthread_exit() (or
kproc_exit()) to terminate.  However, the kernel includes a fallback in
fork_exit() to force a kthread exit if a kernel thread's "main" routine
returns.  This fallback was added back when the kernel only had processes
and was not updated to call kthread_exit() instead of kproc_exit() when
threads were added to the kernel.

This mistake was particular exciting when the errant thread belonged to
proc0.  Due to the missing P_KTHREAD flag the fallback did not kick in
and instead tried to return to userland via whatever garbage was in the
trapframe.  With P_KTHREAD set it tried to terminate proc0 resulting in
other amusements.

PR:		204999
MFC after:	1 week
2016-02-08 23:11:23 +00:00
John Baldwin
6270fa5f72 Mark proc0 as a kernel process via the P_KTHREAD flag.
All other kernel processes have this flag set and all threads in proc0
(including thread0) have the similar TDP_KTHREAD flag set.

PR:		204999
Submitted by:	Oliver Pinter @ HardenedBSD
Reviewed by:	kib
MFC after:	1 week
2016-02-08 23:06:27 +00:00
Konstantin Belousov
2dab579bfc Remove the assert which outlived its usefulness, and, by default,
disable compilation of the code which made it possible to call
stop_all_proc() from usermode at all.

Move the comment to the preamble of stop_all_proc() and reword it to
give overview of the function intent.

proc0 has P_HADTHREADS flag set due to kthread_add(), but no
P_KTHREAD, which triggered the assert, which does not serve a purpose
now.

Reported by:	Oliver Pinter
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-02-08 10:54:27 +00:00
Jilles Tjoelker
f00fb5457e semget(): Check for [EEXIST] error first.
Although POSIX literally permits failing with [EINVAL] if IPC_CREAT and
IPC_EXCL were both passed, the semaphore set already exists and has fewer
semaphores than nsems, this does not allow an application to retry safely:
if the [EINVAL] is actually because of the semmsl limit, an infinite loop
would result.

PR:		206927
2016-02-07 22:12:39 +00:00
Pedro F. Giffuni
7559912039 Minor grammar fix in comment. 2016-02-07 16:18:12 +00:00
Kirk McKusick
b00b459084 Clarify a comment in kern_openat() about the use of falloc_noinstall().
Suggested by: Steve Jacobson
2016-02-07 01:04:47 +00:00
Mateusz Guzik
0c829a301d fork: ansify sys_pdfork
No functional changes.
2016-02-06 09:01:03 +00:00