Commit Graph

15465 Commits

Author SHA1 Message Date
Eric Badger
28d2efa983 sleepq_catch_signals: do thread suspension before signal check
Since locks are dropped when a thread suspends, it's possible for another
thread to deliver a signal to the suspended thread. If the thread awakens from
suspension without checking for signals, it may go to sleep despite having
a pending signal that should wake it up. Therefore the suspension check is
done first, so any signals sent while suspended will be caught in the
subsequent signal check.

Reviewed by:	kib
Approved by:	kib (mentor)
MFC after:	2 weeks
Sponsored by:	Dell EMC
Differential Revision:	https://reviews.freebsd.org/D9530
2017-02-14 17:13:23 +00:00
Andriy Gapon
937c1b0757 try to fix RACCT_RSS accounting
There could be a race between the vm daemon setting RACCT_RSS based on
the vm space and vmspace_exit (called from exit1) resetting RACCT_RSS to
zero.  In that case we can get a zombie process with non-zero RACCT_RSS.
If the process is jailed, that may break accounting for the jail.
There could be other consequences.

Fix this race in the vm daemon by updating RACCT_RSS only when a process
is in the normal state.  Also, make accounting a little bit more
accurate by refreshing the page resident count after calling
vm_pageout_map_deactivate_pages().
Finally, add an assert that the RSS is zero when a process is reaped.

PR:		210315
Reviewed by:	trasz
Differential Revision: https://reviews.freebsd.org/D9464
2017-02-14 13:54:05 +00:00
Konstantin Belousov
496ab0532d Rework r313352.
Rename kern_vm_* functions to kern_*.  Move the prototypes to
syscallsubr.h.  Also change Mach VM types to uintptr_t/size_t as
needed, to avoid headers pollution.

Requested by:	alc, jhb
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D9535
2017-02-13 09:04:38 +00:00
Konstantin Belousov
987ff18184 Consistently handle negative or wrapping offsets in the mmap(2) syscalls.
For regular files and posix shared memory, POSIX requires that
[offset, offset + size) range is legitimate.  At the maping time,
check that offset is not negative.  Allowing negative offsets might
expose the data that filesystem put into vm_object for internal use,
esp. due to OFF_TO_IDX() signess treatment.  Fault handler verifies
that the mapped range is valid, assuming that mmap(2) checked that
arithmetic gives no undefined results.

For device mappings, leave the semantic of negative offsets to the
driver.  Correct object page index calculation to not erronously
propagate sign.

In either case, disallow overflow of offset + size.

Update mmap(2) man page to explain the requirement of the range
validity, and behaviour when the range becomes invalid after mapping.

Reported and tested by:	royger (previous version)
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2017-02-12 21:05:44 +00:00
Konstantin Belousov
f2277b64ec Switch copyout_map() to use vm_mmap_object() instead of vm_mmap().
This is both a microoptimization and a move of the consumer to more
commonly used vm function.

Suggested and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2017-02-12 20:54:31 +00:00
Mateusz Guzik
c4a48867f1 lockmgr: implement fast path
The main lockmgr routine takes 8 arguments which makes it impossible to
tail-call it by the intermediate vop_stdlock/unlock routines.

The routine itself starts with an if-forest and reads from the lock itself
several times.

This slows things down both single- and multi-threaded. With the patch
single-threaded fstats go 4% up and multithreaded up to ~27%.

Note that there is still a lot of room for improvement.

Reviewed by:	kib
Tested by:	pho
2017-02-12 09:49:44 +00:00
John Baldwin
bb9b710477 Regenerate all the system call tables to drop "created from" lines.
One of the ibcs2 files contains some actual changes (new headers) as
it hasn't been regenerated after older changes to makesyscalls.sh.
2017-02-10 19:45:02 +00:00
John Baldwin
807a7231f2 Drop the "created from" line from files generated by makesyscalls.sh.
This information is less useful when the generated files are included in
source control along with the source.  If needed it can be reconstructed
from the $FreeBSD$ tag in the generated file.  Removing this information
from the generated output permits committing the generated files along
with the change to the system call master list without having inconsistent
metadata in the generated files.

Reviewed by:	emaste, kib
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D9497
2017-02-10 19:25:52 +00:00
Konstantin Belousov
e83a71c656 Fix r313495.
The file type DTYPE_VNODE can be assigned as a fallback if VOP_OPEN()
did not initialized file type.  This is a typical code path used by
normal file systems.

Also, change error returned for inappropriate file type used for
O_EXLOCK to EOPNOTSUPP, as declared in the open(2) man page.

Reported by:	cy, dhw, Iblis Lin <iblis@hs.ntnu.edu.tw>
Tested by:	dhw
Sponsored by:	The FreeBSD Foundation
MFC after:	13 days
2017-02-10 14:49:04 +00:00
Konstantin Belousov
e628e1b919 Increase a chance of devfs_close() calling d_close cdevsw method.
If a file opened over a vnode has an advisory lock set at close,
vn_closefile() acquires additional vnode use reference to prevent
freeing the vnode in vn_close().  Side effect is that for device
vnodes, devfs_close() sees that vnode reference count is greater than
one and refuses to call d_close().  Create internal version of
vn_close() which can avoid dropping the vnode reference if needed, and
use this to execute VOP_CLOSE() without acquiring a new reference.

Note that any parallel reference to the vnode would still prevent
d_close call, if the reference is not from an opened file, e.g. due to
stat(2).

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2017-02-09 23:36:50 +00:00
Konstantin Belousov
7903b00087 Do not establish advisory locks when doing open(O_EXLOCK) or open(O_SHLOCK)
for files which do not have DTYPE_VNODE type.

Both flock(2) and fcntl(2) syscalls refuse to acquire advisory lock on
a file which type is not DTYPE_VNODE.  Do the same when lock is
requested from open(2).

Restructure the block in vn_open_vnode() which handles O_EXLOCK and
O_SHLOCK open flags to make it easier to quit its execution earlier
with an error.

Tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2017-02-09 23:35:57 +00:00
Mateusz Guzik
8eaaf58a5f rwlock: fix r313454
The runlock slow path would update wrong variable before restarting the
loop, in effect corrupting the state.

Reported by:	pho
2017-02-09 13:32:19 +00:00
Mateusz Guzik
3b3cf014fc locks: tidy up unlock fallback paths
Update comments to note these functions are reachable if lockstat is
enabled.

Check if the lock has any bits set before attempting unlock, which saves
an unnecessary atomic operation.
2017-02-09 08:19:30 +00:00
Mateusz Guzik
834f70f32f sx: implement slock/sunlock fast path
See r313454.
2017-02-08 19:29:34 +00:00
Mateusz Guzik
b0a61642d4 rwlock: implemenet rlock/runlock fast path
This improves singlethreaded throughput on my test machine from ~247 mln
ops/s to ~328 mln.

It is mostly about avoiding the setup cost of lockstat.

Reviewed by:	jhb (previous version)
2017-02-08 19:28:46 +00:00
John Baldwin
885f13dc96 Copy the e_machine and e_flags fields from the binary into an ELF core dump.
In the kernel, cache the machine and flags fields from ELF header to use in
the ELF header of a core dump. For gcore, the copy these fields over from
the ELF header in the binary.

This matters for platforms which encode ABI information in the flags field
(such as o32 vs n32 on MIPS).

Reviewed by:	kib
Sponsored by:	DARPA / AFRL
Differential Revision:	https://reviews.freebsd.org/D9392
2017-02-07 20:34:03 +00:00
Emmanuel Vadot
beaa6e1e64 subr_sfbus.c need sys/proc.h for struct thread definition.
This fixes kernel build for armv6.

Discussed with: kib
2017-02-07 17:31:24 +00:00
Mateusz Guzik
dbccc8105c rwlock: implement RW_LOCK_WRITER_RECURSED bit
This moves recursion handling out of the inlined wunlock path and in
particular saves a read and a branch.

Discussed with:
2017-02-07 17:04:31 +00:00
Mateusz Guzik
f743ea9638 Bump struct thread alignment to 32.
This gives additional bits to use in locking primitives which store
the lock thread pointer in the lock value.

Discussed with:	kib
2017-02-07 17:03:22 +00:00
Mateusz Guzik
3c798b2b1f locks: follow up r313386
Unfinished diff was committed by accident. The loop in lock_delay
was changed to decrement, but the loop iterator was still incrementing.
2017-02-07 16:01:07 +00:00
Mateusz Guzik
8e5a3e9a9d locks: change backoff to exponential
Previous implementation would use a random factor to spread readers and
reduce chances of starvation. This visibly reduces effectiveness of the
mechanism.

Switch to the more traditional exponential variant. Try to limit starvation
by imposing an upper limit of spins after which spinning is half of what
other threads get. Note the mechanism is turned off by default.

Reviewed by:	kib (previous version)
2017-02-07 14:49:36 +00:00
Edward Tomasz Napierala
1110d0029a Make root_mount_hold() work after boot. This is important for two
reasons. First is rerooting into USB-mounted device that happens
to be not yet enumerated. The second is when mounting with (non-root)
filesystem on USB device on a hub that's enumerated later than the root
mount: the rc scripts explicitly mount for the root mount holds to be
released, but each USB bus takes the hold asynchronously, and if that
happens after root mount, it would just get ignored.

Reviewed by:	marcel
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D9388
2017-02-06 20:44:34 +00:00
Edward Tomasz Napierala
4f9d7bad48 In r290196 the root mount hold mechanism was changed to make it not wait
for mount hold release if the root device already exists.  So, unless your
rootdev is not on USB - ie in the usual case - the root mount won't wait
for USB.  However, the old behaviour was sometimes used as "wait until USB
is fully enumerated", and r290196 broke that.

This commit adds vfs.root_mount_always_wait tunable, to force the kernel
to always wait for root mount holds, even if the root is already there.

Reviewed by:	kib
MFC after:	2 weeks
Relnotes:	yes
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D9387
2017-02-06 20:36:59 +00:00
Andrew Turner
c0d5237034 Only allow the pic type to be either a PIC or MSI type. All interrupt
controller drivers handle either MSI/MSI-X interrupts, or regular
interrupts, as such enforce this in the interrupt handling framework.
If a later driver was to handle both it would need to create one of each.

This will allow future changes to allow the xref space to overlap, but
refer to different drivers.

Obtained from:	ABT Systems Ltd
Sponsored by:	The FreeBSD Foundation
X-Differential Revision:	https://reviews.freebsd.org/D8616
2017-02-06 13:08:48 +00:00
Mateusz Guzik
c1aaf63cb5 locks: fix recursion support after recent changes
When a relevant lockstat probe is enabled the fallback primitive is called with
a constant signifying a free lock. This works fine for typical cases but breaks
with recursion, since it checks if the passed value is that of the executing
thread.

Read the value if necessary.
2017-02-06 09:40:14 +00:00
Mateusz Guzik
993ddec44d rwlock: move lockstat handling out of inline primitives
See r313275 for details.

One difference here is that recursion handling was removed from the fallback
routine. As it is it was never supposed to see a recursed lock in the first
place. Future changes will move it out of inline variants, but right now
there is no easy to way to test if the lock is recursed without reading
additional words.
2017-02-05 13:37:23 +00:00
Edward Tomasz Napierala
96ee43103d Add kern_cpuset_getaffinity() and kern_cpuset_getaffinity(),
and use it in compats instead of their sys_*() counterparts.

Reviewed by:	kib, jhb, dchagin
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D9383
2017-02-05 13:24:54 +00:00
Mateusz Guzik
6ebb77b6a6 sx: move lockstat handling out of inline primitives
See r313275 for details.
2017-02-05 09:54:16 +00:00
Mateusz Guzik
dc0896512c mtx: fixup r313278, the assignemnt was supposed to go inside the loop 2017-02-05 09:53:13 +00:00
Mateusz Guzik
cae4ab7f37 mtx: fix up _mtx_obtain_lock_fetch usage in thread lock
Since _mtx_obtain_lock_fetch no longer sets the argument to MTX_UNOWNED,
callers have to do it on their own.
2017-02-05 09:35:17 +00:00
Mateusz Guzik
08da267775 mtx: move lockstat handling out of inline primitives
Lockstat requires checking if it is enabled and if so, calling a 6 argument
function. Further, determining whether to call it on unlock requires
pre-reading the lock value.

This is problematic in at least 3 ways:
- more branches in the hot path than necessary
- additional cacheline ping pong under contention
- bigger code

Instead, check first if lockstat handling is necessary and if so, just fall
back to regular locking routines. For this purpose a new macro is introduced
(LOCKSTAT_PROFILE_ENABLED).

LOCK_PROFILING uninlines all primitives. Fold in the current inline lock
variant into the _mtx_lock_flags to retain the support. With this change
the inline variants are not used when LOCK_PROFILING is defined and thus
can ignore its existence.

This results in:
   text	   data	    bss	    dec	    hex	filename
22259667	1303208	4994976	28557851	1b3c21b	kernel.orig
21797315	1303208	4994976	28095499	1acb40b	kernel.patched

i.e. about 3% reduction in text size.

A remaining action is to remove spurious arguments for internal kernel
consumers.
2017-02-05 08:04:11 +00:00
Mateusz Guzik
3ae56ce958 sx: add witness support missed in r313272 2017-02-05 06:51:45 +00:00
Mateusz Guzik
9d2e4290ff sx: uninline slock/sunlock
Shared locking routines explicitly read the value and test it. If the
change attempt fails, they fall back to a regular function which would
retry in a loop.

The problem is that with many concurrent readers the risk of failure is pretty
high and even the value returned by fcmpset is very likely going to be stale
by the time the loop in the fallback routine is reached.

Uninline said primitives. It gives a throughput increase when doing concurrent
slocks/sunlocks with 80 hardware threads from ~50 mln/s to ~56 mln/s.

Interestingly, rwlock primitives are already not inlined.
2017-02-05 05:20:29 +00:00
Mateusz Guzik
fa47404353 sx: switch to fcmpset
Discussed with:	jhb
Tested by:	pho (previous version)
2017-02-05 04:54:20 +00:00
Mateusz Guzik
c84f347985 rwlock: switch to fcmpset
Discussed with:	jhb
Tested by:	pho
2017-02-05 04:53:13 +00:00
Mateusz Guzik
90836c3270 mtx: switch to fcmpset
The found value is passed to locking routines in order to reduce cacheline
accesses.

mtx_unlock grows an explicit check for regular unlock. On ll/sc architectures
the routine can fail even if the lock could have been handled by the inline
primitive.

Discussed with:	jhb
Tested by:	pho (previous version)
2017-02-05 03:26:34 +00:00
Mateusz Guzik
2d78a5531e vfs: use atomic_fcmpset in vfs_refcount_* 2017-02-05 03:23:16 +00:00
Mark Johnston
69d2418faa Make witness_warn() always print to the console.
witness_warn() either breaks into the debugger or panics the system, so its
output should go to the console regardless of the witness(4) output channel
configuration.

MFC after:	1 week
Sponsored by:	Dell EMC Isilon
2017-02-05 02:27:04 +00:00
Mateusz Guzik
3a2f282532 fd: switch fget_unlocked to atomic_fcmpset 2017-02-05 01:40:27 +00:00
Jason A. Harmening
ad62ba6e96 Revert r313037
The switch to get_pcpu() in MI code seems to cause hangs on MIPS.
Back out until we can get a better idea of what's happening there.

Reported by:	kan, lidl
2017-02-04 06:24:49 +00:00
Hartmut Brandt
4b481ba0ed Merge filt_soread and filt_solisten and decide what to do when checking
for EVFILT_READ at the point of the check not when the event is registers.
This fixes a problem with asio when accepting a connection.

Reviewed by:	kib@, Scott Mitchell
2017-02-01 13:12:07 +00:00
Jason A. Harmening
65ed483615 Implement get_pcpu() for the remaining architectures and use it to
replace pcpu_find(curcpu) in MI code.
2017-02-01 03:32:49 +00:00
Edward Tomasz Napierala
b38b22b0b2 Add kern_pread() and kern_pwrite(), and use it in compats instead
of their sys_*() counterparts. The svr4 is left unchanged.

Reviewed by:	kib@
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D9379
2017-01-31 15:35:18 +00:00
Edward Tomasz Napierala
fc8bde8ffe Replace calls to sys_truncate() with kern_truncate().
Reviewed by:	kib@
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D9371
2017-01-31 15:19:44 +00:00
Edward Tomasz Napierala
ea2ebdc19e Add kern_cpuset_getid() and kern_cpuset_setid(), and use them
in compat32 instead of their sub_*() counterparts.

Reviewed by:	jhb@, kib@
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D9382
2017-01-31 15:11:23 +00:00
Andriy Gapon
826b3d3187 put very expensive sanity checks of advisory locks under DIAGNOSTIC
The checks have quadratic complexity over a number of advisory locks
active for a file and that could be a lot.  What's the worse is that the
checks are done while holding ls_lock.  That could lead to a long a very
long backlog and performance degradation even if all requested locks are
compatible (e.g. all shared locks).

The checks used to be under INVARIANTS.

Discussed with:	kib
MFC after:	2 weeks
Sponsored by:	Panzura
2017-01-30 15:20:13 +00:00
Edward Tomasz Napierala
d293f35c09 Add kern_listen(), kern_shutdown(), and kern_socket(), and use them
instead of their sys_*() counterparts in various compats. The svr4
is left untouched, because there's no point.

Reviewed by:	ed@, kib@
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D9367
2017-01-30 12:57:22 +00:00
Edward Tomasz Napierala
f67d6b5f12 Add kern_lseek() and use it instead of sys_lseek() in various compats.
I didn't touch svr4/, there's no point.

Reviewed by:	ed@, kib@
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D9366
2017-01-30 12:24:47 +00:00
Edward Tomasz Napierala
ae6b6ef6cb Replace sys_ftruncate() with kern_ftruncate() in various compats.
Reviewed by:	kib@
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D9368
2017-01-30 11:50:54 +00:00
Mateusz Guzik
dfecf51dd0 cache: use vrefact for '.' lookups and refing the rdir in fullpath 2017-01-30 03:20:05 +00:00
Mateusz Guzik
3071469d57 fd: sprinkle __read_mostly and __exclusive_cache_line 2017-01-30 03:07:32 +00:00
Baptiste Daroussin
b4b4b5304b Revert crap accidentally committed 2017-01-28 16:31:23 +00:00
Baptiste Daroussin
814aaaa7da Revert r312923 a better approach will be taken later 2017-01-28 16:30:14 +00:00
Mateusz Guzik
95839d3d25 hwpmc: annotate pmc_hook and pmc_intr as __read_mostly
MFC after:	1 month
2017-01-27 22:14:42 +00:00
Mateusz Guzik
f1f7f1cb29 hwpmc: partially depessimize mmap handling if the module is not loaded
In particular this means the pmc sx lock is no longer taken when an
executable mapping succeeds.

MFC after:	1 week
2017-01-27 22:13:15 +00:00
Mateusz Guzik
290511163d Sprinkle __read_mostly on backoff and lock profiling code.
MFC after:	1 month
2017-01-27 15:03:51 +00:00
Mateusz Guzik
17071ff298 cache: annotate with __read_mostly and __exclusive_cache_line
MFC after:	1 month
2017-01-27 14:56:36 +00:00
Sean Bruno
de414cfe14 A few more style bugs lying around in here.
Submitted by:	bde
2017-01-26 13:48:45 +00:00
Gleb Smirnoff
beb4b31200 For non-listening AF_UNIX sockets return error code EOPNOTSUPP to match
documentation and SUS.
2017-01-25 22:26:45 +00:00
Ed Maste
f27ac8e297 ANSIfy kern_ntptime.c 2017-01-25 20:22:32 +00:00
Sean Bruno
06bb7c507a Replace overlooked smp_started checks and variable use in a print
with the now used tqg_smp_started.

Submitted by:	bde
2017-01-25 15:54:44 +00:00
Ed Maste
77ebe276ba imgact_elf: refactor et_dyn_addr calculation
This simplifies the logic somewhat. It is extracted from the change in
review in D5603.

Differential Revision:	https://reviews.freebsd.org/D9321
2017-01-24 22:46:43 +00:00
Mateusz Guzik
543b2f425d proc: perform a lockless check in sys_issetugid
Discussed with:	kib
MFC after:	1 week
2017-01-24 21:48:57 +00:00
Conrad Meyer
90a79ac576 Use time_t for intermediate values to avoid overflow in clock_ts_to_ct
Add additionally safety and overflow checks to clock_ts_to_ct and the
BCD routines while we're here.

Perform a safety check in sys_clock_settime() first to avoid easy local
root panic, without having to propagate an error value back through
dozens of APIs currently lacking error returns.

PR:		211960, 214300
Submitted by:	Justin McOmie <justin.mcomie at gmail.com>, kib@
Reported by:	Tim Newsham <tim.newsham at nccgroup.trust>
Reviewed by:	kib@
Sponsored by:	Dell EMC Isilon, FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D9279
2017-01-24 18:05:29 +00:00
Sean Bruno
bd84f70044 iflib:
Add internal tracking of smp startup status to reliably figure out
     what methods are to be used to get gtaskqueue up and running.

e1000:
     Calculating this pointer gives undefined behaviour when (last == -1)
     (it is before the buffer).  The pointer is always followed.  Panics
     occurred when it points to an unmapped page.  Otherwise, the pointed-to
     garbage tends to not have the E1000_TXD_STAT_DD bit set in it, so in the
     broken case the loop was usually null and the function just returned, and
     this was acidentally correct.

Submitted by:	bde
Reported by:	Matt Macy <mmacy@nextbsd.org>
2017-01-24 16:05:42 +00:00
Sean Bruno
36fa5d5b64 Revert 312696 due to build tests. 2017-01-24 15:55:52 +00:00
Sean Bruno
562a3182f6 iflib:
Add internal tracking of smp startup status to reliably figure out
   what methods are to be used to get gtaskqueue up and running.

e1000:
   Calculating this pointer gives undefined behaviour when (last == -1)
   (it is before the buffer).  The pointer is always followed.  Panics
   occurred when it points to an unmapped page.  Otherwise, the pointed-to
   garbage tends to not have the E1000_TXD_STAT_DD bit set in it, so in the
   broken case the loop was usually null and the function just returned, and
   this was acidentally correct.

Submitted by:	bde
Reviewed by:	Matt Macy <mmacy@nextbsd.org>
2017-01-24 14:48:32 +00:00
Konstantin Belousov
3467f88cd6 Add comments explaining unobvious td_critnest adjustments in
critical_exit().

Based on the discussion with:	jhb
Reviewed by:	imp
Sponsored by:	The FreeBSD Foundation
Differential revision:	D9276
MFC after:	1 week
2017-01-22 19:41:42 +00:00
Konstantin Belousov
25c6816845 More style cleanup. Use ANSI C definition for vn_closefile(). Switch
to VNASSERT in _vn_lock(), simplify messages.

Sponsored by:	The FreeBSD Foundation
X-MFC with:	r312600, r312601, r312602, r312606
2017-01-22 19:38:45 +00:00
Konstantin Belousov
aec8391d46 Provide fallback VOP methods for crossmp vnode.
In particular, crossmp vnode might leak into rename code.

PR:	216380
Reported by:	fnacl@protonmail.com
Sponsored by:	The FreeBSD Foundation
X-MFC with:	r309425
2017-01-22 19:36:02 +00:00
Edward Tomasz Napierala
5c93966020 Remove redundant KASSERT. 2017-01-22 15:35:51 +00:00
Edward Tomasz Napierala
8acac5a9f5 Improve debugging printf. 2017-01-22 15:27:14 +00:00
Mateusz Guzik
eaf0969bda vfs: fix LK_RETRY logic braino in r312600 2017-01-21 20:34:20 +00:00
Mateusz Guzik
829857c893 vfs: __predict_false the need to handle F_HASLOCK
Also reorder the check with DTYPE_VNODE. Passed files are vnodes vast
majority of the time, so it is typically true.
2017-01-21 19:01:42 +00:00
Mateusz Guzik
abbc538d9a vfs: fix whitespace damage in r312600
While here wrap the previously overly long line so that it fits 80 chars.
2017-01-21 18:56:58 +00:00
Mateusz Guzik
1091fb52c1 vfs: refactor _vn_lock
Stop testing for LK_RETRY and error multiple times. Also postpone the
VI_DOOMED until after LK_RETRY was seen as it reads from the vnode.

No functional changes.
2017-01-21 18:38:16 +00:00
Mateusz Guzik
067115e050 vfs: hide the getvnode NULL mp message behind DIAGNOSTIC
Since crossmp vnode changes the message was being printed on each boot.

Reported by:	trasz
Discussed with:	kib
2017-01-21 16:59:50 +00:00
Hans Petter Selasky
10c8755706 Fix for race leading to endless timer interrupts related to
configtimer().

During normal operation "state->nextcallopt" will always be less than
or equal to "state->nextcall" and checking only "state->nextcallopt"
before calling "callout_process()" is sufficient. However when
"configtimer()" is called a race might happen requiring both of these
binary times to be checked.

Short description of race:

1) A configtimer() call will reset both "state->nextcall" and
"state->nextcallopt" to the same binary time.

2) If a "callout_reset()" call happens between "configtimer()" and the
next "callout_process()" call, "state->nextcallopt" will get updated
and "state->nextcall" will remain at the current time. Refer to logic
inside cpu_new_callout().

3) getnextcpuevent() only respects "state->nextcall" and returns this
value over and over again, even if it is in the past, until "now >=
state->nextcallopt" becomes true. Then these two time variables are
corrected by a "callout_process()" call and the situation goes back to
normal.

The problem manifests itself in different ways. The common factor is
the timer process(es) consume all CPU on one or more CPU cores for a
long time, blocking other kernel processes from getting execution
time. This can be seen by very high interrupt counts as displayed by
"vmstat -i | grep timer" right after boot.

When EARLY_AP_STARTUP was enabled in r310177 the likelyhood of hitting
this bug apparently increased.

Example output from "vmstat -i" before patch:
cpu0:timer                          7591         69
cpu9:timer                      39031773     358089
cpu4:timer                          9359         85
cpu3:timer                          9100         83
cpu2:timer                          9620         88

Example output from "vmstat -i" after patch:
cpu0:timer                          4242         34
cpu6:timer                          5531         44
cpu3:timer                          6450         52
cpu1:timer                          4545         36
cpu9:timer                          7153         58

Before the patch cpu9 in the example above, was spinning in a loop in
order to reach 39 million interrupts just a few seconds after
bootup. After the patch the timer interrupt counts are more or less
consistent.

Discussed with:		mav @
Reported by:		several people
MFC after:		1 week
Sponsored by:		Mellanox Technologies
2017-01-20 17:40:31 +00:00
Ed Maste
039644eca9 ANSYfy kern_ktrace.c and remove archaic register keyword
Sponsored by:	The FreeBSD Foundation
2017-01-20 14:59:56 +00:00
Andriy Gapon
c468ff880a don't abort writing of a core dump after EFAULT
It's possible to get EFAULT when writing a segment backed by a file
if the segment extends beyond the file.
The core dump could still be useful if we skip the rest of the segment
and proceed to other segements.
The skipped segment (or a portion of it) will be zero-filled.

While there, use 'const' to signify that core_write() only reads the
buffer and use __DECONST before calling vn_rdwr_inchunks() because it
can be used for both reading and writing.

Before the change:
kernel: Failed to write core file for process mmap_trunc_core (error 14)
kernel: pid 77718 (mmap_trunc_core), uid 1001: exited on signal 6

After the change:
kernel: Failed to fully fault in a core file segment at VA 0x800645000 with size 0x4000 to be written at offset 0x29000 for process mmap_trunc_core
kernel: pid 4901 (mmap_trunc_core), uid 1001: exited on signal 6 (core dumped)

Reviewed by:	julian, kib
Obtained from:	Panzura (older version of the change)
MFC after:	5 days
Sponsored by:	Panzura
Differential Revision: https://reviews.freebsd.org/D9233
2017-01-20 13:39:07 +00:00
Andriy Gapon
ad9dadc437 fix a thread preemption regression in schedulers introduced in r270423
Commit r270423 fixed a regression in sched_yield() that was introduced
in earlier changes.  Unfortunately, at the same time it introduced an
new regression.  The problem is that SWT_RELINQUISH (6), like all other
SWT_* constants and unlike SW_* flags, is not a bit flag.  So, (flags &
SWT_RELINQUISH) is true in cases where that was not really indended,
for example, with SWT_OWEPREEMPT (2) and SWT_REMOTEPREEMPT (11).

A straight forward fix would be to use (flags & SW_TYPE_MASK) ==
SWT_RELINQUISH, but my impression is that the switch types are designed
mostly for gathering statistics, not for influencing scheduling
decisions.

So, I decided that it would be better to check for SW_PREEMPT flag
instead.  That's also the same flag that was checked before r239157.
I double-checked how that flag is used and I am confident that the flag
is set only in the places where we really have the preemption:
- critical_exit + td_owepreempt
- sched_preempt in the ULE scheduler
- sched_preempt in the 4BSD scheduler

Reviewed by:	kib, mav
MFC after:	4 days
Sponsored by:	Panzura
Differential Revision: https://reviews.freebsd.org/D9230
2017-01-19 18:46:41 +00:00
Mateusz Guzik
c5f61e6f96 sx: reduce lock accesses similarly to r311172
Discussed with:	jhb
Tested by:	pho (previous version)
2017-01-18 17:55:08 +00:00
Mateusz Guzik
3f0a0612e8 rwlock: reduce lock accesses similarly to r311172
Discussed with:     jhb
Tested by:	pho (previous version)
2017-01-18 17:53:57 +00:00
Hans Petter Selasky
f3e7afe2d7 Implement kernel support for hardware rate limited sockets.
- Add RATELIMIT kernel configuration keyword which must be set to
enable the new functionality.

- Add support for hardware driven, Receive Side Scaling, RSS aware, rate
limited sendqueues and expose the functionality through the already
established SO_MAX_PACING_RATE setsockopt(). The API support rates in
the range from 1 to 4Gbytes/s which are suitable for regular TCP and
UDP streams. The setsockopt(2) manual page has been updated.

- Add rate limit function callback API to "struct ifnet" which supports
the following operations: if_snd_tag_alloc(), if_snd_tag_modify(),
if_snd_tag_query() and if_snd_tag_free().

- Add support to ifconfig to view, set and clear the IFCAP_TXRTLMT
flag, which tells if a network driver supports rate limiting or not.

- This patch also adds support for rate limiting through VLAN and LAGG
intermediate network devices.

- How rate limiting works:

1) The userspace application calls setsockopt() after accepting or
making a new connection to set the rate which is then stored in the
socket structure in the kernel. Later on when packets are transmitted
a check is made in the transmit path for rate changes. A rate change
implies a non-blocking ifp->if_snd_tag_alloc() call will be made to the
destination network interface, which then sets up a custom sendqueue
with the given rate limitation parameter. A "struct m_snd_tag" pointer is
returned which serves as a "snd_tag" hint in the m_pkthdr for the
subsequently transmitted mbufs.

2) When the network driver sees the "m->m_pkthdr.snd_tag" different
from NULL, it will move the packets into a designated rate limited sendqueue
given by the snd_tag pointer. It is up to the individual drivers how the rate
limited traffic will be rate limited.

3) Route changes are detected by the NIC drivers in the ifp->if_transmit()
routine when the ifnet pointer in the incoming snd_tag mismatches the
one of the network interface. The network adapter frees the mbuf and
returns EAGAIN which causes the ip_output() to release and clear the send
tag. Upon next ip_output() a new "snd_tag" will be tried allocated.

4) When the PCB is detached the custom sendqueue will be released by a
non-blocking ifp->if_snd_tag_free() call to the currently bound network
interface.

Reviewed by:		wblock (manpages), adrian, gallatin, scottl (network)
Differential Revision:	https://reviews.freebsd.org/D3687
Sponsored by:		Mellanox Technologies
MFC after:		3 months
2017-01-18 13:31:17 +00:00
Ed Maste
bf9ebe74e2 disambiguate msleep KASSERT diagnostics
Previously "panic: msleep" could happen for a few different reasons.
Break the KASSERTs out into individual cases to identify the failing
condition. Found during the investigation that resulted in r308288.

Reviewed by:	kib, jhb
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D8604
2017-01-16 20:34:42 +00:00
Sean Bruno
374f3e042c Remove Assert that seems to be hit in various configurations during
normal operations.
2017-01-16 19:01:41 +00:00
Maxim Sobolev
339efd75a4 Add a new socket option SO_TS_CLOCK to pick from several different clock
sources to return timestamps when SO_TIMESTAMP is enabled. Two additional
clock sources are:

o nanosecond resolution realtime clock (equivalent of CLOCK_REALTIME);
o nanosecond resolution monotonic clock (equivalent of CLOCK_MONOTONIC).

In addition to this, this option provides unified interface to get bintime
(equivalent of using SO_BINTIME), except it also supported with IPv6 where
SO_BINTIME has never been supported. The long term plan is to depreciate
SO_BINTIME and move everything to using SO_TS_CLOCK.

Idea for this enhancement has been briefly discussed on the Net session
during dev summit in Ottawa last June and the general input was positive.

This change is believed to benefit network benchmarks/profiling as well
as other scenarios where precise time of arrival measurement is necessary.

There are two regression test cases as part of this commit: one extends unix
domain test code (unix_cmsg) to test new SCM_XXX types and another one
implementis totally new test case which exchanges UDP packets between two
processes using both conventional methods (i.e. calling clock_gettime(2)
before recv(2) and after send(2)), as well as using setsockopt()+recv() in
receive path. The resulting delays are checked for sanity for all supported
clock types.

Reviewed by:    adrian, gnn
Differential Revision:  https://reviews.freebsd.org/D9171
2017-01-16 17:46:38 +00:00
Sean Bruno
227743cad4 Change startup order for the no EARLY_AP_STARTUP case to initialize
gtaskqueue bits at SI_SUB_INIT_IF instead of waiting until SI_SUB_SMP
which is far too late.

Add an assertion in taskqgroup_attach() to catch startup initialization
failures in the future.

Reported by:	kib bde
2017-01-16 16:58:12 +00:00
Hiren Panchasara
7d03ff1fe9 Add kevent EVFILT_EMPTY for notification when a client has received all data
i.e. everything outstanding has been acked.

Reviewed by:	bz, gnn (previous version)
MFC after:	3 days
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D9150
2017-01-16 08:25:33 +00:00
Conrad Meyer
db4fcadf52 "Buses" is the preferred plural of "bus"
Replace archaic "busses" with modern form "buses."

Intentionally excluded:
* Old/random drivers I didn't recognize
  * Old hardware in general
* Use of "busses" in code as identifiers

No functional change.

http://grammarist.com/spelling/buses-busses/

PR:		216099
Reported by:	bltsrc at mail.ru
Sponsored by:	Dell EMC Isilon
2017-01-15 17:54:01 +00:00
Enji Cooper
d75a788085 Revert r312119 and reword the intent to fix -Wshadow issues
between exp(3) and `exp` var.

The approach taken previously was not ideal for multiple
functional and stylistic reasons.

Add to existing sed call in Makefile to replace `exp` with
`exponent` instead.

MFC after:	13 days
Requested by:	bde
2017-01-15 09:25:33 +00:00
Mark Johnston
d53d6fa9a8 Suppress a warning about m_assertbuf being unused.
MFC after:	1 week
2017-01-15 03:53:20 +00:00
Sean Bruno
4ecb427a49 Fix hangs in a uniprocessor configuration (qemu, virtualbox, real hw).
sys/net/iflib.c:
  Add ctx to filter_info and don't skpi interrupt early on unless we're on an
  SMP system

sys/kern/subr_gtaskqueue.c:
  Skip smp check if we're running UP

Submitted by:	Matt Macy <mmacy@nextbsd.org>
Reported by:	emaste bde
2017-01-15 00:50:10 +00:00
Mark Johnston
42d33c1f4d Stop the scheduler upon panic even in non-SMP kernels.
This is needed for kernel dumps to work, as the panicking thread will call
into code that makes use of kernel locks.

Reported and tested by:	Eugene Grosbein
MFC after:	1 week
2017-01-14 22:16:03 +00:00
Enji Cooper
d467b2ee0c encode_long, encode_timeval: mechanically replace exp with exponent
This helps fix a -Wshadow issue with exp(3) with tests/sys/acct/acct_test,
which include math.h, which in turn defines exp(3)

MFC after:	2 weeks
Tested with:	clang, gcc 4.2.1, gcc 4.9
Sponsored by:	Dell EMC Isilon
2017-01-14 05:06:14 +00:00
Enji Cooper
66db8cca1a Clean up trailing whitespace
MFC after:	3 days
Sponsored by:	Dell EMC Isilon
2017-01-14 04:16:13 +00:00
Enji Cooper
5e8fcdfe1b Fix -Wunused on gcc 4.9 (x was set but not used)
MFC after:	3 days
Sponsored by:	Dell EMC Isilon
2017-01-14 04:13:28 +00:00
Gleb Smirnoff
4fce19da8d Remove deprecated fgetsock() and fputsock(). 2017-01-13 22:16:41 +00:00
Ian Lepore
d5b937680c Correct the comments about how much buffer is allocated. 2017-01-13 17:03:23 +00:00
Ian Lepore
a6f63533a7 Check tty_gone() after allocating IO buffers. The tty lock has to be
dropped then reacquired due to using M_WAITOK, which opens a window in
which the tty device can disappear.  Check for this and return ENXIO
back up the call chain so that callers can cope.

This closes a race where TF_GONE would get set while buffers were being
allocated as part of ttydev_open(), causing a subsequent call to
ttydevsw_modem() later in ttydev_open() to assert.

Reported by:	pho
Reviewed by:	kib
2017-01-13 16:37:38 +00:00
Ian Lepore
e046e8e680 Restructure the tty_drain loop so that device-busy is checked one more time
after tty_timedwait() returns an error only if the error is EWOULDBLOCK;
other errors cause an immediate return.  This fixes the case of the tty
disappearing while in tty_drain().

Reported by:	pho
2017-01-12 21:18:43 +00:00
Ravi Pokala
8e712af70b Remove writability requirement for single-mbuf, contiguous-range
m_pulldown()

m_pulldown() only needs to determine if a mbuf is writable if it is going to
copy data into the data region of an existing mbuf. It does this to create a
contiguous data region in a single mbuf from multiple mbufs in the chain. If
the requested memory region is already contiguous and nothing needs to
change, the mbuf does not need to be writeable.

Submitted by:	Brian Mueller <bmueller@panasas.com>
Reviewed by:	bz
MFC after:	1 week
Sponsored by:	Panasas
Differential Revision:	https://reviews.freebsd.org/D9053
2017-01-12 06:38:03 +00:00
Ian Lepore
f64342e354 Rework tty_drain() to poll the hardware for completion, and restore
drain timeout handling to historical freebsd behavior.

The primary reason for these changes is the need to have tty_drain() call
ttydevsw_busy() at some reasonable sub-second rate, to poll hardware that
doesn't signal an interrupt when the transmit shift register becomes empty
(which includes virtually all USB serial hardware).  Such hardware hangs
in a ttyout wait, because it never gets an opportunity to trigger a wakeup
from the sleep in tty_drain() by calling ttydisc_getc() again, after
handing the last of the buffered data to the hardware.

While researching the history of changes to tty_drain() I stumbled across
some email describing the historical BSD behavior of tcdrain() and close()
on serial ports, and the ability of comcontrol(1) to control timeout
behavior.  Using that and some advice from Bruce Evans as a guide, I've
put together these changes to implement the hardware polling and restore
the historical timeout behaviors...

 - tty_drain() now calls ttydevsw_busy() in a loop at 10 Hz to accomodate
   hardware that requires polling for busy state.

 - The "new historical" behavior for draining during close(2) is retained:
   the drain timeout is "1 second without making any progress".  When the
   1-second timeout expires, if the count of bytes remaining in the tty
   layer buffer is smaller than last time, the timeout is extended for
   another second.  Unfortunately, the same logic cannot be extended all
   the way down to the hardware, because the interface to that layer is a
   simple busy/not-busy indication.

 - Due to the previous point, an application that needs a guarantee that
   all data has been transmitted must use TIOCDRAIN/tcdrain(3) before
   calling close(2).

 - The historical behavior of honoring the drainwait setting for TIOCDRAIN
   (used by tcdrain(3)) is restored.

 - The historical kern.drainwait sysctl to control the global default
   drainwait time is restored, but is now named kern.tty_drainwait.

 - The historical default drainwait timeout of 300 seconds is restored.

 - Handling of TIOCGDRAINWAIT and TIOCSDRAINWAIT ioctls is restored
   (this also makes the comcontrol(1) drainwait verb work again).

 - Manpages are updated to document these behaviors.

Reviewed by:	bde (prior version)
2017-01-12 00:48:06 +00:00
Mark Johnston
90e17792c8 Do not set BIO_DONE if the BIO specifies a completion handler.
biowait() will otherwise race with completions of such BIOs. In-tree code
only calls biowait() on BIOs that do not specify a handler, so this change
should not have any functional impact.

Reviewed by:	mav
MFC after:	1 month
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D9070
2017-01-10 21:41:28 +00:00
John Baldwin
14da48cbe4 Set MORETOCOME for AIO write requests on a socket.
Add a MSG_MOREOTOCOME message flag. When this flag is set, sosend*
set PRUS_MOREOTOCOME when invoking the protocol send method. The aio
worker tasks for sending on a socket set this flag when there are
additional write jobs waiting on the socket buffer.

Reviewed by:	adrian
MFC after:	1 month
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D8955
2017-01-06 23:41:45 +00:00
Konstantin Belousov
6e89d383c7 Explicitely add "opt_compat.h" to kern_exec.c: fix powerpc LINT builds.
sys/ptrace.h includes sys/signal.h, which includes sys/_sigset.h.
Note that sys/_sigset.h only defines osigset_t if COMPAT_43 was defined.

Two lines later, sys/ptrace.h includes machine/reg.h, which in case of
powerpc, includes opt_compat.h.

After the include headers reordering in r311345, we have sys/ptrace.h
included before sys/sysproto.h.

If COMPAT_43 was requested in the kernel config, the result is that
sys/_sigset.h does not define osigset_t, but sys/sysproto.h sees
COMPAT_43 and uses osigset_t.

Fix this by explicitely including opt_compat.h to cover the whole
kern/kern_exec.c scope.

Sponsored by:	The FreeBSD Foundation
2017-01-06 16:56:24 +00:00
Konstantin Belousov
2f304845e2 Do not allocate struct statfs on kernel stack.
Right now size of the structure is 472 bytes on amd64, which is
already large and stack allocations are indesirable.  With the ino64
work, MNAMELEN is increased to 1024, which will make it impossible to have
struct statfs on the stack.

Extracted from:	ino64 work by gleb
Discussed with:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-01-05 17:19:26 +00:00
Konstantin Belousov
607fa849d2 Some style fixes for getfstat(2)-related code.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-01-05 17:03:35 +00:00
Mark Johnston
ec492b13f1 Add a small allocator for exec_map entries.
Upon each execve, we allocate a KVA range for use in copying data to the
new image. Pages must be faulted into the range, and when the range is
freed, the backing pages are freed and their mappings are destroyed. This
is a lot of needless overhead, and the exec_map management becomes a
bottleneck when many CPUs are executing execve concurrently. Moreover, the
number of available ranges is fixed at 16, which is insufficient on large
systems and potentially excessive on 32-bit systems.

The new allocator reduces overhead by making exec_map allocations
persistent. When a range is freed, pages backing the range are marked clean
and made easy to reclaim. With this change, the exec_map is sized based on
the number of CPUs.

Reviewed by:	kib
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D8921
2017-01-05 01:44:12 +00:00
Mark Johnston
eeeaa7ba22 Sort includes in kern_exec.c.
MFC after:	1 week
2017-01-05 01:28:08 +00:00
Gleb Smirnoff
bfc8c24c73 Move bogus_page declaration to vm_page.h and initialization to vm_page.c.
Reviewed by:	kib
2017-01-04 22:27:19 +00:00
Konstantin Belousov
6c4338f2ef The callers of kern_getfsstat(UIO_SYSSPACE) expect that *buf always
returns memory which must be freed, regardless of the error.  Assign
NULL to *buf in case we are not going to allocate any memory due to
invalid mode.

Reported and tested by:	pho
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks (together with r310638)
Differential revision:	https://reviews.freebsd.org/D9042
2017-01-04 16:09:45 +00:00
Edward Tomasz Napierala
5ec7cde488 Fix bug that would result in a kernel crash in some cases involving
a symlink and an autofs mount request.  The crash was caused by namei()
calling bcopy() with a negative length, caused by numeric underflow:
in lookup(), in the relookup path, the ni_pathlen was decremented too
many times.  The bug was introduced in r296715.

Big thanks to Alex Deiter for his help with debugging this.

Reviewed by:	kib@
Tested by:	Alex Deiter <alex.deiter at gmail.com>
MFC after:	1 month
2017-01-04 14:43:57 +00:00
Mateusz Guzik
391df78ad4 mtx: plug open-coded mtx_lock access missed in r311172 2017-01-04 02:25:31 +00:00
Mateusz Guzik
5e5ad162ad Reduce lock accesses in thread lock similarly to r311172. 2017-01-03 23:08:11 +00:00
Mateusz Guzik
2604eb9e17 mtx: reduce lock accesses
Instead of spuriously re-reading the lock value, read it once.

This change also has a side effect of fixing a performance bug:
on failed _mtx_obtain_lock, it was possible that re-read would find
the lock is unowned, but in this case the primitive would make a trip
through turnstile code.

This is diff reduction to a variant which uses atomic_fcmpset.

Discussed with:	jhb (previous version)
Tested by:	pho (previous version)
2017-01-03 21:36:15 +00:00
Konstantin Belousov
7ee34a31fd There is no need to use temporary statfs buffer for fsid obliteration
and prison enforcement.  Do it on the caller buffer directly.

Besides eliminating memory copies, this change also removes large
structure from the kernel stack.

Extracted from:	ino64 work by gleb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-01-02 18:59:23 +00:00
Konstantin Belousov
b961dc3193 Style.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-01-02 18:49:48 +00:00
Konstantin Belousov
f2af4041fa Move common code from kern_statfs() and kern_fstatfs() into a new helper.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-01-02 18:20:22 +00:00
Mark Johnston
b5442eba5c Factor out instances of a knote detach followed by a knote_drop() call.
Reviewed by:	kib (previous version)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D9015
2017-01-02 01:23:21 +00:00
Sean Bruno
1248952a50 2017 IFLIB updates in preparation for commits to e1000 and ixgbe.
- iflib - add checksum in place support (mmacy)
- iflib - initialize IP for TSO (going to be needed for e1000) (mmacy)
- iflib - move isc_txrx from shared context to softc context (mmacy)
- iflib - Normalize checks in TXQ drainage. (shurd)
- iflib - Fix queue capping checks (mmacy)
- iflib - Fix invalid assert, em can need 2 sentinels (mmacy)
- iflib - let the driver determine what capabilities are set and what
          tx csum flags are used (mmacy)
- add INVARIANTS debugging hooks to gtaskqueue enqueue (mmacy)
- update bnxt(4) to support the changes to iflib (shurd)

Some other various, sundry updates.  Slightly more verbose changelog:

Submitted by:	mmacy@nextbsd.org
Reviewed by:	shurd
mFC after:
Sponsored by:	LimeLight Networks and Dell EMC Isilon
2017-01-02 00:56:33 +00:00
Mateusz Guzik
d4db49c4c7 fd: access openfiles once in falloc_noinstall
This is similar to what's done with nprocs.

Note this is only a band aid.
2017-01-01 08:55:28 +00:00
Mateusz Guzik
41b0046a4d vfs: switch nodes_created, recycles_count and free_owe_inact to counter(9)
Reviewed by:	kib
2016-12-31 19:59:31 +00:00
Mateusz Guzik
0b3b55a0f2 Remove cpu_spinwait after seq_consistent.
It does not add any benefit as the read routine will do it as necessary.
2016-12-30 06:26:17 +00:00
Mateusz Guzik
4938d86764 cache: sprinkle __predict_false 2016-12-29 16:35:49 +00:00
Mateusz Guzik
b37707533e cache: move shrink lock init to nchinit
This gets rid of unnecesary sysinit usage.

While here also rename the lock to be consistent with the rest.
2016-12-29 12:01:54 +00:00
Mateusz Guzik
0569bc9ca9 cache: depessimize hashing macros/inlines
All hash sizes are power-of-2, but the compiler does not know that for sure
and 'foo % size' forces doing a division.

Store the size - 1 and use 'foo & hash' instead which allows mere shift.
2016-12-29 08:41:25 +00:00
Mateusz Guzik
6dd9661b77 cache: drop the NULL check from VP2VNODELOCK
Now that negative entries are annotated with a dedicated flag, NULL vnodes
are no longer passed.
2016-12-29 08:34:50 +00:00
John Baldwin
1fabda45c3 Regen after r310638.
Differential Revision:	https://reviews.freebsd.org/D8854
2016-12-27 20:22:17 +00:00
John Baldwin
34ed0c63c8 Rename the 'flags' argument to getfsstat() to 'mode' and validate it.
This argument is not a bitmask of flags, but only accepts a single value.
Fail with EINVAL if an invalid value is passed to 'flag'.  Rename the
'flags' argument to getmntinfo(3) to 'mode' as well to match.

This is a followup to r308088.

Reviewed by:	kib
MFC after:	1 month
2016-12-27 20:21:11 +00:00
Konstantin Belousov
fd30dd7c26 Make knote KN_INFLUX state counted. This is final fix for the issue
closed by r310302 for knote().

If KN_INFLUX | KN_SCAN flags are set for the note passed to knote() or
knote_fork(), i.e. the knote is scanned, we might erronously clear
INFLUX when finishing notification.  For normal knote() it was fixed
in r310302 simply by remembering the fact that we do not own
KN_INFLUX, since there we own knlist lock and scan thread cannot clear
KN_INFLUX until we drop the lock.  For knote_fork(), the situation is
more complicated, e must drop knlist lock AKA the process lock, since
we need to register new knotes.

Change KN_INFLUX into counter and allow shared ownership of the
in-flux state between scan and knote_fork() or knote().  Both in-flux
setters need to ensure that knote is not dropped in parallel.  Added
assert about kn_influx == 1 in knote_drop() verifies that in-flux state
is not shared when knote is destroyed.

Since KBI of the struct knote is changed by addition of the int
kn_influx field, reorder kn_hook and kn_hookid to fill pad on LP64
arches [1].  This keeps sizeof(struct knote) to same 128 bytes as it
was before addition of kn_influx, on amd64.

Reviewed by:	markj
Suggested by:	markj [1]
Tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D8898
2016-12-26 19:33:40 +00:00
Konstantin Belousov
5c36b2e8cb Change knlist_destroy() to assert that knlist is empty instead of
accepting the wrong state and printing warning.  Do not obliterate
kl_lock and kl_unlock pointers, they are often useful for post-mortem
analysis.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
X-Differential revision:	https://reviews.freebsd.org/D8898
2016-12-26 19:28:10 +00:00
Konstantin Belousov
34311568dc Style.
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
X-Differential revision:	https://reviews.freebsd.org/D8898
2016-12-26 19:26:40 +00:00
Konstantin Belousov
fc05543fa7 Some optimizations for kqueue timers.
There is no need to do two allocations per kqueue timer. Gather all
data needed by the timer callout into the structure and allocate it at
once.

Use the structure to preserve the result of timer2sbintime(), to not
perform repeated 64bit calculations in callout.

Remove tautological casts.
Remove now unused p_nexttime [1].

Noted by:	markj [1]
Reviewed by:	markj (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
X-MFC note:	do not remove p_nexttime
Differential revision:	https://reviews.freebsd.org/D8901
2016-12-25 19:49:35 +00:00
Konstantin Belousov
7611b72816 Some style.
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
X-Differential revision:	https://reviews.freebsd.org/D8901
2016-12-25 19:38:07 +00:00
Mark Johnston
eab80d9276 Add a comment explaining the race fixed by r310423.
Suggested and reviewed by: jhb
X-MFC With:	r310423
2016-12-23 05:02:17 +00:00
Mark Johnston
aa3c544349 Revert part of r300109.
The removal of TAILQ_FOREACH_SAFE introduced a small race: when the last
thread on a sleepqueue is awoken, it reclaims the sleepqueue and may begin
executing on a different CPU before sleepq_resume_thread() returns. This
leaves a window during which it may go back to sleep and incorrectly be
awoken again by the caller of sleepq_broadcast().

Reported and tested by:	pho
MFC after:	3 days
Sponsored by:	Dell EMC Isilon
2016-12-22 17:51:44 +00:00
John Baldwin
99bc7e4123 Don't spin in pause() during early boot for kthreads other than thread0.
pause() uses a spin loop to simulate a sleep during early boot.  However,
we only need this for thread0 to get far enough in the boot process to
enable timers (at which point pause() can sleep).  For other kthreads,
sleeping in pause() is ok as the callout will be scheduled and will
eventually fire once thread0 initializes timers.

Tested by: 	Steven Kargl
Sleuthing by:	markj
MFC after:	1 week
Sponsored by:	Netflix
2016-12-20 19:44:44 +00:00
Konstantin Belousov
4afd808be7 Do not clear KN_INFLUX when not owning influx state.
For notes in KN_INFLUX|KN_SCAN state, the influx bit is set by a
parallel scan.  When knote() reports event for the vnode filters,
which require kqueue unlocked, it unconditionally sets and then clears
influx to keep note around kqueue unlock.  There, do not clear influx
flag if a scan set it, since we do not own it, instead we prevent scan
from executing by holding knlist lock.

The knote_fork() function has somewhat similar problem, it might set
KN_INFLUX for scanned note, drop kqueue and list locks, and then clear
the flag after relock.  A solution there would be different enough, as
well as the test program, so close the reported issue first.

Reported and test case provided by:	yjh0502@gmail.com
PR:	214923
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-12-19 22:18:36 +00:00
Konstantin Belousov
69baec3619 Switch from stdatomic.h to atomic.h for kernel.
Apparently stdatomic.h implementation for gcc 4.2 on sparc64 does not
work properly.  This effectively reverts r251803.

Reported and tested by:	lidl
Discussed with:	ed
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-12-16 17:41:20 +00:00
Ed Schouten
669a25b50d Document the existence of the {0, 6, ...} sysctl. 2016-12-15 15:45:11 +00:00
Jilles Tjoelker
b9a6fb9343 reaper: Make REAPER_KILL_SUBTREE actually work.
MFC after:	2 weeks
2016-12-14 22:49:20 +00:00
Ed Schouten
ae15715360 Add a "device_index" label to all sysctls under dev.$driver.$index.
This way it becomes possible to graph a property for all instances of a
single driver. For example, graphing the number of packets across all
USB controllers, the amount of dropped packets on all NICs, etc.

Reviewed by:	cem
Differential Revision:	https://reviews.freebsd.org/D8775
2016-12-14 13:03:01 +00:00
Ed Schouten
fd0f59709d Add labels to sysctls related to clocks.
Sysctls like kern.eventtimer.et.*.quality currently embed the name of
the clock device. This is problematic for the Prometheus metrics
exporter for two reasons:

- Some of those clocks have dashes in their names, which Prometheus
  doesn't allow to be used in metric names.
- It doesn't allow for extracting the same property of all clocks on the
  system from within a single query.

Attach these nodes to have a label, so that the Prometheus metrics
exporter gives these metric a uniform name with the name of the clock
attached as a label.

Reviewed by:	cem
Differential Revision:	https://reviews.freebsd.org/D8775
2016-12-14 12:56:58 +00:00
Ed Schouten
1e1f3941e4 Add support for attaching aggregation labels to sysctl objects.
I'm currently working on writing a metrics exporter for the Prometheus
monitoring system to provide access to sysctl metrics. Prometheus and
sysctl have some structural differences:

- sysctl is a tree of string component names.
- Prometheus uses a flat namespace for its metrics, but allows you to
  attach labels with values to them, so that you can do aggregation.

An initial version of my exporter simply translated

    hw.acpi.thermal.tz1.temperature

to

    sysctl_hw_acpi_thermal_tz1_temperature_celcius

while we should ideally have

    sysctl_hw_acpi_thermal_temperature_celcius{thermal_zone="tz1"}

allowing you to graph all thermal zones on a system in one go.

The change presented in this commit adds support for accomplishing this,
by providing the ability to attach labels to nodes. In the example I
gave above, the label "thermal_zone" would be attached to "tz1". As this
is a feature that will only be used very rarely, I decided to not change
the KPI too aggressively.

Discussed on:	hackers@
Reviewed by:	cem
Differential Revision:	https://reviews.freebsd.org/D8775
2016-12-14 12:47:34 +00:00
Gleb Smirnoff
1276a8363c Zero return value when counter_rate() switches over to next second and
value is positive, but below the limit.
2016-12-13 20:11:45 +00:00
Mateusz Guzik
25e578de55 vfs: use vrefact in getcwd and fchdir 2016-12-12 19:16:35 +00:00
Edward Tomasz Napierala
e3d4c4dcde Undo r309891. Konstantin is right in that this condition normally
cannot happen - the um_dev field is assigned at mount and never written
to afterwards.
2016-12-12 19:11:04 +00:00
Mateusz Guzik
5afb134c32 vfs: add vrefact, to be used when the vnode has to be already active
This allows blind increment of relevant counters which under contention
is cheaper than inc-not-zero loops at least on amd64.

Use it in some of the places which are guaranteed to see already active
vnodes.

Reviewed by:	kib (previous version)
2016-12-12 15:37:11 +00:00
Edward Tomasz Napierala
223cb0e434 Avoid dereferencing NULL pointers in devtoname(). I've seen it panic,
called from ufs_print() in DDB.

MFC after:	1 month
2016-12-12 15:22:21 +00:00
Konstantin Belousov
778aa66a68 Enable lookup_cap_dotdot and lookup_cap_dotdot_nonlocal.
Requested and reviewed by:	cem
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D8746
2016-12-12 11:12:04 +00:00
Konstantin Belousov
545d312293 When a zombie gets reparented due to the parent exit, send SIGCHLD to
the reaper.

The traditional reaper init(8) is aware of zombies silently reparented
to it after the parents exit, it loops around waitpid(2) to collect
them.  For other reapers, the silent reparenting is surprising and
collecting zombies requires a thread blocking in waitpid(2) just for
that purpose.  It seems that sending second SIGCHLD is a better
workaround than forcing all reapers to obey the setup.

Reported by:	 Michael Zuo <muh.muhten@gmail.com>, jilles
PR:	213928
Reviewed by:	jilles (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-12-12 11:11:50 +00:00
Alan Cox
2d612d2dd2 When tmpfs and POSIX shm pagein a page for the sole purpose of performing
truncation, immediately queue the page for asynchronous laundering rather
than making the page pass through inactive queue first.

Reviewed by:	kib, markj
2016-12-11 19:24:41 +00:00
Konrad Witaszczyk
480f31c214 Add support for encrypted kernel crash dumps.
Changes include modifications in kernel crash dump routines, dumpon(8) and
savecore(8). A new tool called decryptcore(8) was added.

A new DIOCSKERNELDUMP I/O control was added to send a kernel crash dump
configuration in the diocskerneldump_arg structure to the kernel.
The old DIOCSKERNELDUMP I/O control was renamed to DIOCSKERNELDUMP_FREEBSD11 for
backward ABI compatibility.

dumpon(8) generates an one-time random symmetric key and encrypts it using
an RSA public key in capability mode. Currently only AES-256-CBC is supported
but EKCD was designed to implement support for other algorithms in the future.
The public key is chosen using the -k flag. The dumpon rc(8) script can do this
automatically during startup using the dumppubkey rc.conf(5) variable.  Once the
keys are calculated dumpon sends them to the kernel via DIOCSKERNELDUMP I/O
control.

When the kernel receives the DIOCSKERNELDUMP I/O control it generates a random
IV and sets up the key schedule for the specified algorithm. Each time the
kernel tries to write a crash dump to the dump device, the IV is replaced by
a SHA-256 hash of the previous value. This is intended to make a possible
differential cryptanalysis harder since it is possible to write multiple crash
dumps without reboot by repeating the following commands:
# sysctl debug.kdb.enter=1
db> call doadump(0)
db> continue
# savecore

A kernel dump key consists of an algorithm identifier, an IV and an encrypted
symmetric key. The kernel dump key size is included in a kernel dump header.
The size is an unsigned 32-bit integer and it is aligned to a block size.
The header structure has 512 bytes to match the block size so it was required to
make a panic string 4 bytes shorter to add a new field to the header structure.
If the kernel dump key size in the header is nonzero it is assumed that the
kernel dump key is placed after the first header on the dump device and the core
dump is encrypted.

Separate functions were implemented to write the kernel dump header and the
kernel dump key as they need to be unencrypted. The dump_write function encrypts
data if the kernel was compiled with the EKCD option. Encrypted kernel textdumps
are not supported due to the way they are constructed which makes it impossible
to use the CBC mode for encryption. It should be also noted that textdumps don't
contain sensitive data by design as a user decides what information should be
dumped.

savecore(8) writes the kernel dump key to a key.# file if its size in the header
is nonzero. # is the number of the current core dump.

decryptcore(8) decrypts the core dump using a private RSA key and the kernel
dump key. This is performed by a child process in capability mode.
If the decryption was not successful the parent process removes a partially
decrypted core dump.

Description on how to encrypt crash dumps was added to the decryptcore(8),
dumpon(8), rc.conf(5) and savecore(8) manual pages.

EKCD was tested on amd64 using bhyve and i386, mipsel and sparc64 using QEMU.
The feature still has to be tested on arm and arm64 as it wasn't possible to run
FreeBSD due to the problems with QEMU emulation and lack of hardware.

Designed by:	def, pjd
Reviewed by:	cem, oshogbo, pjd
Partial review:	delphij, emaste, jhb, kib
Approved by:	pjd (mentor)
Differential Revision:	https://reviews.freebsd.org/D4712
2016-12-10 16:20:39 +00:00
Mark Johnston
02315a6759 Use a consistent snapshot of the lock state in owner_mtx().
MFC after:	2 weeks
2016-12-10 02:59:34 +00:00
Mark Johnston
c365a2934e Return a non-NULL owner only if the lock is exclusively held in owner_sx().
Fix some whitespace bugs while here.

MFC after:	2 weeks
2016-12-10 02:56:44 +00:00
Gleb Smirnoff
5040da77c1 Use acquire write to cr_lock to complement with release write at end
of locked region.

Submitted by:	kib
2016-12-09 19:07:31 +00:00
Gleb Smirnoff
169170209c Provide counter_ratecheck(), a MP-friendly substitution to ppsratecheck().
When rated event happens at a very quick rate, the ppsratecheck() is not
only racy, but also becomes a performance bottleneck.

Together with:	rrs, jtl
2016-12-09 17:58:34 +00:00
Robert Watson
52b42f6287 Regnerate system-call definitions following r309677 correcting a whitespace
glitch in syscalls.master.
2016-12-07 16:12:27 +00:00
Robert Watson
82d8d2b8bc Replace spaces with tabs in definition of SCTP system calls, for consistency
with the remainder of the syscalls.master file.  This problem does not occur
in the freebsd32 version of the same system calls.
2016-12-07 16:11:55 +00:00
Eric van Gyzen
3d32d4a7c9 Export the whole thread name in kinfo_proc
kinfo_proc::ki_tdname is three characters shorter than
thread::td_name.  Add a ki_moretdname field for these three
extra characters.  Add the new field to kinfo_proc32, as well.
Update all in-tree consumers to read the new field and assemble
the full name, except for lldb's HostThreadFreeBSD.cpp, which
I will handle separately.  Bump __FreeBSD_version.

Reviewed by:	kib
MFC after:	1 week
Relnotes:	yes
Sponsored by:	Dell EMC
Differential Revision:	https://reviews.freebsd.org/D8722
2016-12-07 15:04:22 +00:00
Konstantin Belousov
435da98564 Restructure the code to handle reporting of non-exited processes from
wait(2).
- Do not acquire the process spinlock if neither WTRAPPED nor WUNTRACED
  options were passed [1].
- Extract the code to report alive process into a new helper
  report_alive_proc() and use it for trapped, stopped and continued
  childrens.

Note that the process spinlock is required around the WTRAPPED and
WUNTRACED tests, because P_STOPPED_TRACE and P_STOPPED_SIG flags are
set before other threads are stopped at the suspension point, and that
threads increment p_suspcount while owning only the process spinlock,
the process lock is dropped by them.  If the spinlock is not taken for
tests, the syscall thread might miss both p_suspcount increment and
wakeup in wakeup in thread_suspend_switch().

Based on the submission by:	mjg [1]
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-12-04 20:44:58 +00:00
Eric van Gyzen
ff07dd913e thr_set_name(): silently truncate the given name as needed
Instead of failing with ENAMETOOLONG, which is swallowed by
pthread_set_name_np() anyway, truncate the given name to MAXCOMLEN+1
bytes.  This is more likely what the user wants, and saves the
caller from truncating it before the call (which was the only
recourse).

Polish pthread_set_name_np(3) and add a .Xr to thr_set_name(2)
so the user might find the documentation for this behavior.

Reviewed by:	jilles
MFC after:	3 days
Sponsored by:	Dell EMC
2016-12-03 01:14:21 +00:00
Mateusz Guzik
a2d3554542 vfs: provide fake locking primitives for the crossmp vnode
Since the vnode is only expected to be shared locked, we can save a
little overhead by only pretending we are locking in the first place.

Reviewed by:	kib
Tested by:	pho
2016-12-02 18:03:15 +00:00
Mateusz Guzik
a4ce25b5b0 vfs: fix a whitespace nit in r309307 2016-11-30 02:17:03 +00:00
Mateusz Guzik
1babea0341 vfs: avoid VOP_ISLOCKED in the common case in lookup 2016-11-30 02:14:53 +00:00
Mark Johnston
64910ddbff Launder VPO_NOSYNC pages upon vnode deactivation.
As of r234483, vnode deactivation causes non-VPO_NOSYNC pages to be
laundered. This behaviour has two problems:

1. Dirty VPO_NOSYNC pages must be laundered before the vnode can be
   reclaimed, and this work may be unfairly deferred to the vnlru process
   or an unrelated application when the system is under vnode pressure.
2. Deactivation of a vnode with dirty VPO_NOSYNC pages requires a scan of
   the corresponding VM object's memq for non-VPO_NOSYNC dirty pages; if
   the laundry thread needs to launder pages from an unreferenced such
   vnode, it will reactivate and deactivate the vnode with each laundering,
   potentially resulting in a large number of expensive scans.

Therefore, ensure that all dirty pages are laundered upon deactivation,
i.e., when all maps of the vnode are removed and all references are
released.

Reviewed by:	alc, kib
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D8641
2016-11-26 21:00:27 +00:00
John Baldwin
9f3aabb9eb Permit timed sleeps for threads other than thread0 before timers are working.
The callout subsystem already handles early callouts and schedules
the first clock interrupt appropriately based on the currently pending
callouts.  The one nit to fix was that callouts scheduled via C_HARDCLOCK
during early boot could fire too early once timers were enabled as the
per-CPU base time is always zero until timers are initialized.  The change
in callout_when() handles this case by using the current uptime as the
base time of the callout during bootup if the per-CPU base time is zero.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	Netflix
2016-11-25 18:02:43 +00:00
Mateusz Guzik
746b6e8176 wait: avoid relocking the child if proc_to_reap returns 1
proc_to_reap would always unlock. However, if it returned 1, kern_wait6
would immediately lock it again. Save the dance.

Reviewed by:	kib
2016-11-24 18:21:48 +00:00
Mateusz Guzik
8b0e0c91e0 cache: ensure that the number of bucket locks does not exceed hash size
The size can be changed by side effect of modifying kern.maxvnodes.

Since numbucketlocks was not modified, setting a sufficiently low value
would give more locks than actual buckets, which would then lead to
corruption.

Force the number of buckets to be not smaller.

Note this should not matter for real world cases.

Reported and tested by:	pho
2016-11-23 19:50:12 +00:00
Mark Johnston
99e6e1930c Release laundered vnode pages to the head of the inactive queue.
The swap pager enqueues laundered pages near the head of the inactive queue
to avoid another trip through LRU before reclamation. This change adds
support for this behaviour to the vnode pager and makes use of it in UFS and
ext2fs. Some ioflag handling is consolidated into a common subroutine so
that this support can be easily extended to other filesystems which make use
of the buffer cache. No changes are needed for ZFS since its putpages
routine always undirties the pages before returning, and the laundry
thread requeues the pages appropriately in this case.

Reviewed by:	alc, kib
Differential Revision:	https://reviews.freebsd.org/D8589
2016-11-23 17:53:07 +00:00
Ruslan Bukin
dd7d4f199e Revert r306186 ("Adjust the sopt_val pointer on bigendian systems").
This logic doesn't work with bigger sopt_valsize (e.g. when ipfw
passing 2048 bytes rule).

Reported by:	adrian
Sponsored by:	DARPA, AFRL
2016-11-22 18:31:43 +00:00
Konstantin Belousov
eb962424ba Restore vnode pager statistic for buffer pagers.
Reviewed by:	alc, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D8585
2016-11-22 10:06:39 +00:00
John Baldwin
5d8cce1764 Initialize 'ticks' earlier in boot after 'hz' is set.
This avoids the time-warp after kthreads have started running and the
required fixup to td_slptick and td_blktick in the EARLY_AP_STARTUP
case.  Now, 'ticks' is initialized before any kthreads are created or
any context switches are performed.

Tested by:	gavin
MFC after:	2 weeks
Sponsored by:	Netflix
2016-11-22 01:02:59 +00:00
Robert Watson
1279fdafce Audit 'fd' and 'cmd' arguments to fcntl(2), and when generating BSM,
always audit the file-descriptor number and vnode information for all
fnctl(2) commands, not just locking-related ones.  This was likely an
oversight in the original adaptation of this code from XNU.

MFC after:	3 days
Sponsored by:	DARPA, AFRL
2016-11-22 00:41:24 +00:00
Gleb Smirnoff
00b5ffde8e Add flag SF_USER_READAHEAD to sendfile(2). When specified, the syscall won't
do any speculations about readahead, and use exactly the amount of readahead
specified by user.  E.g. setting SF_FLAGS(0, SF_USER_READAHEAD) will guarantee
that no readahead at all will be performed.
2016-11-17 21:36:18 +00:00
Gleb Smirnoff
5dba303d01 Use bogus_page to properly reduce number of I/Os in sendfile(2). The new
sendfile_swapin() loop works this way:

- Find first invalid page in the request.
- Do vm_pager_has_page() and get count of pages, that can be taken in
  single I/O.
- Trim valid pages from the end of the request.
- Cycle through the request and substitute to bogus_page all valid
  pages that are in the middle of the request.
- After I/O launched (pager copies array of pages into buf(9), it
  is important to restore proper page pointers with help vm_page_lookup().

Count bogus pages used and report them in sendfile stats.
2016-11-17 21:02:55 +00:00
Ruslan Bukin
6e18247a3d Fix build when no INET and INET6 in kernel config.
Submitted by:	kan
Sponsored by:	DARPA, AFRL
2016-11-17 16:13:30 +00:00
Alan Cox
7667839a7e Remove most of the code for implementing PG_CACHED pages. (This change does
not remove user-space visible fields from vm_cnt or all of the references to
cached pages from comments.  Those changes will come later.)

Reviewed by:	kib, markj
Tested by:	pho
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8497
2016-11-15 18:22:50 +00:00
Mateusz Guzik
6ce45c6ac3 cache: plug a write-only variable in cache_negative_zap_one 2016-11-15 03:43:10 +00:00
Mateusz Guzik
317cac6d5a cache: fix a race between entry removal and demotion
The negative list shrinker can demote an entry with only hotlist + neglist
locks held. On the other hand entry removal possibly sets the NCF_DVDROP
without aformentioned locks held prior to detaching it from the respective
netlist., which can lose the update made by the shrinker.

Reported and tested by:	truckman
2016-11-15 03:38:05 +00:00
Adrian Chadd
8ffa01a061 [mips] enable relbuf on mips for now to work around page aliasing in mips hardware.
Although the higher end MIPS hardware handles cache aliasing issues in
hardware, the older cores (r4k, etc) and some compile versions of the
newer cores (mips24k, mips34k, mips74k) don't have this feature.
This means we end up with some very unfortunate behaviour that was
made very obvious by some recent changes to the FFS pager by kib.

So, flip this off until we get our MIPS pmap/cache code upgraded to
handle aliased pages in software.

Discussed with: kib, bsdimp, juli
2016-11-15 01:41:45 +00:00
Adrian Chadd
0046bef85a [mips] make UMTX_CHAINS configurable at compile time.
The default (512) wastes quite a bit of space which doesn't really buy
us much on highly embedded systems which don't take a lot of locks in
parallel.

This makes it at least build time configurable so people can experiment.
2016-11-15 01:34:38 +00:00
Konstantin Belousov
ae44bb0146 Initialize reserved bytes in struct mq_attr and its 32compat
counterpart, to avoid kernel stack content leak in kmq_setattr(2)
syscall.  Also slightly simplify the checks around copyout()s.

Reported by:	Vlad Tsyrklevich <vlad902+spam@gmail.com>
PR:	214488
MFC after:	1 week
2016-11-14 13:20:10 +00:00
Konstantin Belousov
714b7df502 Provide simple mutual exclusion between mount point update and unmount.
Currently mount update keeps vfs_busy(9) reference on the mount point
during MNT_UPDATE VFS_MOUNT() vfsops call.  This already provides the
exclusion, but is problematic for filesystems which need to perform
namei(9) during VFS_MOUNT(MNT_UPDATE) operations, e.g. to refresh
mnt_from path, because namei(9) must not be called while the
vfs_busy(9) reference is owned.

Check for MNT_UPDATE flag before setting MNTK_UNMOUNT, and for
MNTK_UNMOUNT before entering innards of vfs_domount_update(), failing
syscalls with EBUSY if conflict is detected.  Keep vfs_busy(9)
reference around VFS_MOUNT(MNT_UPDATE) calls still to not change VFS
KPI.

In the update path in ffs_mount(), drop vfs_busy() reference around
namei(), which is now safe due to unmount never executing in parallel
with VFS_MOUNT(MNT_UPDATE), and which avoids the deadlock.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-11-13 21:49:51 +00:00
Konstantin Belousov
9eb8f495b8 Move common cleanup code into helper.
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-11-13 21:39:55 +00:00
Justin Hibbits
6487a709f3 Add two new ddb commands: show device/show all devices
Shows several useful pieces of information from the device including the softc
and ivars pointers.
2016-11-13 00:46:11 +00:00
John Baldwin
892f0ab0ab Allow scheduling during early boot.
- Send IPI wakeups once SMP is started even if cold is true.
- Permit preemptions when cold is true.

These changes are needed for EARLY_AP_STARTUP.

MFC after:	2 weeks
Sponsored by:	Netflix
2016-11-12 00:23:09 +00:00
John Baldwin
a6b91f0f45 Don't place threads on the run queue after waking up other CPUs.
The other CPU might resume and see a still-empty runq and go back to
sleep before sched_add() adds the thread to the runq.  This results
in a lost wakeup and a potential hang if the system is otherwise
completely idle.

The race originated due to a micro-optimization (my fault) in 4BSD in
that it avoided putting a thread on the run queue if the scheduler was
going to preempt to the new thread.  To avoid complexity while fixing
this race, just drop this optimization.  4BSD now always sets the
"owepreempt" flag when a preemption is warranted and defers the actual
preemption to the thread_unlock of the caller the same as ULE.

MFC after:	2 weeks
Sponsored by:	Netflix
2016-11-12 00:14:13 +00:00
Bryan Drewery
28323add09 Fix improper use of "its".
Sponsored by:	Dell EMC Isilon
2016-11-08 23:59:41 +00:00
Konstantin Belousov
9a639daf77 Tweaks for the buffer pager.
Pass current thread credentials instead of NOCRED.
Only allow unmapped buffers for filesystem which proclaimed the support.

For all filesystems which currently use buffer pager (UFS, msdosfs and
cd9660), the changes are effectively nop.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-11-08 10:10:55 +00:00
Konstantin Belousov
9bd4f0a2c6 vn_fullpath1() checked VV_ROOT and then unreferenced
vp->v_mount->mnt_vnodecovered unlocked.  This allowed unmount to race.
Lock vnode after we noticed the VV_ROOT flag.  See comments for
explanation why unlocked check for the flag is considered safe.

Reported and tested by:	avg
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-11-07 10:55:56 +00:00
Konstantin Belousov
75409ce1dd Remove remnants of the recursive sleep support. Instead assert that
we never try to sleep while the thread is on a sleepqueue.

Reviewed by:	jhb
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D8422
2016-11-02 20:57:20 +00:00
Konstantin Belousov
7359fdcf5f Allow some dotdot lookups in capability mode.
If dotdot lookup does not escape from the file descriptor passed as
the lookup root, we can allow the component traversal.  Track the
directories traversed, and check the result of dotdot lookup against
the recorded list of the directory vnodes.

Dotdot lookups are enabled by sysctl vfs.lookup_cap_dotdot, currently
disabled by default until more verification of the approach is done.

Disallow non-local filesystems for dotdot, since remote server might
conspire with the local process to allow it to escape the namespace.
This might be too cautious, provide the knob
vfs.lookup_cap_dotdot_nonlocal to override as well.

Idea by:	rwatson
Discussed with:	emaste, jonathan, rwatson
Reviewed by:	mjg (previous version)
Tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 week
Differential revision:	https://reviews.freebsd.org/D8110
2016-11-02 12:43:15 +00:00
Konstantin Belousov
1bf6a0900d Remove tautological casts.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-11-02 12:10:39 +00:00
Konstantin Belousov
ec84693535 Style fixes.
Discussed with:	emaste
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-11-02 12:02:31 +00:00
Edward Tomasz Napierala
53ae7e833c Fix getfsstat(2) with MNT_WAIT to not skip filesystems that are in the
process of being unmounted.  Previously it would skip them, even if the
unmount eventually failed eg due to the filesystem being busy.

This behaviour broke autounmountd(8) - if you tried to manually unmount
a mounted filesystem, using 'automount -u', and the autounmountd attempted
to refresh the filesystem list in that very moment, it would conclude that
the filesystem got unmounted and not try to unmount it afterwards.

Reviewed by:	kib@
Tested by:	pho@
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D8030
2016-11-02 09:43:19 +00:00
Conrad Meyer
8532d381a9 Add BUF_TRACKING and FULL_BUF_TRACKING buffer debugging
Upstream the BUF_TRACKING and FULL_BUF_TRACKING buffer debugging code.
This can be handy in tracking down what code touched hung bios and bufs
last. The full history is especially useful, but adds enough bloat that
it shouldn't be enabled in release builds.

Function names (or arbitrary string constants) are tracked in a
fixed-size ring in bufs. Bios gain a pointer to the upper buf for
tracking. SCSI CCBs gain a pointer to the upper bio for tracking.

Reviewed by:	markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8366
2016-10-31 23:09:52 +00:00
Mark Johnston
ac91917211 Fix WITNESS hints for pagequeue locks.
MFC after:	1 week
2016-10-29 20:01:48 +00:00
Edward Tomasz Napierala
6eeff7a7b2 Fix getfsstat(2) handling of flags. The 'flags' argument is an enum,
not a bitfield. For the intended usage - being passed either MNT_WAIT,
or MNT_NOWAIT - this shouldn't introduce any changes in behaviour.

Reviewed by:	jhb@
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D8373
2016-10-29 12:38:30 +00:00