Commit Graph

15258 Commits

Author SHA1 Message Date
Mateusz Guzik
bb697a20d7 cache: fix up a corner case in r307650
If no negative entry is found on the last list, the ncp pointer will be
left uninitialized and a non-null value will make the function assume an
entry was found.

Fix the problem by initializing to NULL on entry.

Reported by:	glebius
2016-10-20 19:55:50 +00:00
Kevin Lo
61f481fb7e Remove register keyword.
Reviewed by:	kib
2016-10-20 01:21:10 +00:00
Kevin Lo
7c68685366 Remove a sentence about putting initialization in init_proc.c or kern_proc.c
and useless comment.

Reviewed by:	kib
2016-10-20 01:19:37 +00:00
Sean Bruno
026204b4c6 Resolve whitespace diff to NextBSD.
Check to see that the taskqueue thread count requires us to acutally
iterate over the thread count to bind to cpus.

Submitted by:	mmacy@nextbsd.org
2016-10-19 21:01:24 +00:00
Mateusz Guzik
53dc58f2dc Mark a bunch of mpsafe sysctls as such.
This gives me a sysctl Giant-free buildworld.
2016-10-19 19:42:01 +00:00
Mateusz Guzik
a45a1a25b8 cache: split negative entry LRU into multiple lists
This splits the ncneg_mtx lock while preserving the hit ratio at least
during buildworld.

Create N dedicated lists for new negative entries.

Entries with at least one hit get promoted to the hot list, where they
get requeued every M hits.

Shrinking demotes one hot entry and performs a round-robin shrinking of
regular lists.

Reviewed by:	kib
2016-10-19 18:29:52 +00:00
Sean Bruno
abf38392c6 Assert that we're assigning a non-null taskqueue.
ref: 535865d02c

Fix cpu assignment by assuring stride is non-zero, assert that all tasks
have a valid taskqueue.
ref: db39817623

Start cpu assignment from zero.
ref: d99d39b6b6

Submitted by:	mmacy@nextbsd.org
2016-10-18 14:00:26 +00:00
Sean Bruno
12d1b8c9f3 Ensure that tasks with a specific cpu set prior to smp starting get
re-attached to a thread running on that cpu.

ref: fcc20e306b

Submitted by:	mmacy@nextbsd.org
2016-10-18 13:55:34 +00:00
Sean Bruno
dc35f36560 Tell gtask to what we've been bound.
ref: 54414984cf

Submitted by:	mmacy@nextbsd.org
2016-10-18 13:16:27 +00:00
Ed Maste
9e62195361 makesyscalls.sh: remove trailing space on the "created from" line
In r10905 and r10906 makesyscalls was modified to avoid emitting a
literal $Id$ string in the generated file, with:

    gsub("[$]Id: ", "", $0)
    gsub(" [$]", "", $0)

Then r11294 added some functionality and also tried to address the $Id$
problem in a different way, by removing every $:

    sed -e 's/\$//g ...

This rendered the gsub infeffective. The gsub was later updated to
track the $Id$ -> $FreeBSD$ switch, even though it did not do anything.

Revert the addition of the s/\$//g, and update the gsub to keep the
resulting format the same.

Discussed with:	bde
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2016-10-17 13:52:24 +00:00
Hans Petter Selasky
d3bf5efc1f Fix device delete child function.
When detaching device trees parent devices must be detached prior to
detaching its children. This is because parent devices can have
pointers to the child devices in their softcs which are not
invalidated by device_delete_child(). This can cause use after free
issues and panic().

Device drivers implementing trees, must ensure its detach function
detaches or deletes all its children before returning.

While at it remove now redundant device_detach() calls before
device_delete_child() and device_delete_children(), mostly in
the USB controller drivers.

Tested by:		Jan Henrik Sylvester <me@janh.de>
Reviewed by:		jhb
Differential Revision:	https://reviews.freebsd.org/D8070
MFC after:		2 weeks
2016-10-17 10:20:38 +00:00
Konstantin Belousov
5975e53d40 Fix a race in vm_page_busy_sleep(9).
Suppose that we have an exclusively busy page, and a thread which can
accept shared-busy page.  In this case, typical code waiting for the
page xbusy state to pass is
again:
	VM_OBJECT_WLOCK(object);
	...
	if (vm_page_xbusied(m)) {
		vm_page_lock(m);
 		VM_OBJECT_WUNLOCK(object);    <---1
		vm_page_busy_sleep(p, "vmopax");
 		goto again;
	}

Suppose that the xbusy state owner locked the object, unbusied the
page and unlocked the object after we are at the line [1], but before we
executed the load of the busy_lock word in vm_page_busy_sleep().  If it
happens that there is still no waiters recorded for the busy state,
the xbusy owner did not acquired the page lock, so it proceeded.

More, suppose that some other thread happen to share-busy the page
after xbusy state was relinquished but before the m->busy_lock is read
in vm_page_busy_sleep().  Again, that thread only needs vm_object lock
to proceed.  Then, vm_page_busy_sleep() reads busy_lock value equal to
the VPB_SHARERS_WORD(1).

In this case, all tests in vm_page_busy_sleep(9) pass and we are going
to sleep, despite the page being share-busied.

Update check for m->busy_lock == VPB_UNBUSIED in vm_page_busy_sleep(9)
to also accept shared-busy state if we only wait for the xbusy state to
pass.

Merge sequential if()s with the same 'then' clause in
vm_page_busy_sleep().

Note that the current code does not share-busy pages from parallel
threads, the only way to have more that one sbusy owner is right now
is to recurse.

Reported and tested by:	pho (previous version)
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D8196
2016-10-13 14:41:05 +00:00
Conrad Meyer
d9ce8a41ea kern_linker: Handle module-loading failures in preloaded .ko files
The runtime kernel loader, linker_load_file, unloads kernel files that
failed to load all of their modules. For consistency, treat preloaded
(loader.conf loaded) kernel files in the same way.

Reviewed by:	kib
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8200
2016-10-13 02:06:23 +00:00
Ed Maste
2a059700b6 Use correct size type in do_setopt_accept_filter
Submitted by:	ecturt@gmail.com
2016-10-12 00:56:49 +00:00
Oleksandr Tymoshenko
609b0fe966 INTRNG - fix MSI/MSIX release path
Use isrc in attached MSI data structure instead of using map's
isrc directly. map's isrc is set to NULL on IRQ deactivation
which happens prior to pci_release_msi so MSI_RELEASE_MSI
receives array of NULLs

Reviewed by:	mmel
Differential Revision:	https://reviews.freebsd.org/D8206
2016-10-11 17:00:29 +00:00
Sean Bruno
1ee17b070d Fix bug where malloc(.., M_NOWAIT) return value is not checked, Change to
M_WAITOK and move outside the mutex

Submitted by:	shurd
Reviewed by:	mmacy@nextbsd.org
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D7649
2016-10-11 14:08:53 +00:00
Mateusz Guzik
45571f8886 vfs: assert empty tmp free list on unmount 2016-10-08 13:38:05 +00:00
Mateusz Guzik
c6c44ff7eb vfs: clear the tmp free list flag before taking the free vnode list lock
Safe access is already guaranteed because of the mnt_listmx lock.
2016-10-08 13:36:59 +00:00
Konstantin Belousov
f71d08566c Limit scope of the optimization in r306608 to dounmount() caller only.
Other uses of cache_purgevfs() do rely on the cache purge for correct
operations, when paths are invalidated without unmount.

Reported and tested by:	jkim
Discussed with:	mjg
Sponsored by:	The FreeBSD Foundation
2016-10-07 11:38:28 +00:00
Bryan Drewery
32641585a9 vrefl: Assert that the interlock is held.
Sponsored by:	Dell EMC Isilon
MFC after:	2 weeks
2016-10-06 18:10:19 +00:00
Bryan Drewery
5a22c9582c Add vrecyclel() to vrecycle() a vnode with the interlock already held.
Obtained from:	OneFS
Sponsored by:	Dell EMC Isilon
MFC after:	2 weeks
2016-10-06 18:09:22 +00:00
Conrad Meyer
f43292ecf4 vfs_bio: Remove a leading space (style)
Introduced in r282085.

Sponsored by:	Dell EMC Isilon
2016-10-05 23:42:02 +00:00
Bryan Drewery
0617f64ec6 Correct some comments after r294299.
Sponsored by:	Dell EMC Isilon
2016-10-04 21:44:20 +00:00
Ed Maste
65eea7ede6 ANSIfy inflate.c
Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D8143
2016-10-04 17:57:30 +00:00
Konstantin Belousov
5420f76b59 Style.
Reviewed by:	emaste
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2016-10-04 15:23:03 +00:00
Mateusz Guzik
4876636eb7 cache: ignore purgevfs requests for filesystems with few vnodes
purgevfs is purely optional and induces lock contention in workloads
which frequently mount and unmount filesystems.

In particular, poudriere will do this for filesystems with 4 vnodes or
less. Full cache scan is clearly wasteful.

Since there is no explicit counter for namecache entries, the number of
vnodes used by the target fs is checked.

The default limit is the number of bucket locks.

Reviewed by:	kib
2016-10-03 00:02:32 +00:00
Mateusz Guzik
5bb81f9b2d vfs: batch free vnodes in per-mnt lists
Previously free vnodes would always by directly returned to the global
LRU list. With this change up to mnt_free_list_batch vnodes are collected
first.

syncer runs always return the batch regardless of its size.

While vnodes on per-mnt lists are not counted as free, they can be
returned in case of vnode shortage.

Reviewed by:	kib
Tested by:	pho
2016-09-30 17:27:17 +00:00
Mateusz Guzik
8660b707ff vfs: remove the __bo_vnode field from struct vnode
The pointer can be obtained using __containerof instead.

Reviewed by:	kib
2016-09-30 17:11:03 +00:00
Gleb Smirnoff
7ed6b78b92 Provide kern.maxphys sysctl, which returns MAXPHYS. Naming matches NetBSD. 2016-09-29 23:07:28 +00:00
Allan Jude
0176ca2ed5 Allow reading the following sysctl MIBs in capability mode:
kern.hostname, kern.domainname, and kern.hostuuid

This allows sandboxed applications to read these sysctls

Submitted by:	cem (original version)
Reviewed by:	cem, jonathan, rwatson (original version)
Sponsored by:	ScaleEngine Inc.
Differential Revision:	https://reviews.freebsd.org/D8015
2016-09-29 16:29:49 +00:00
Hans Petter Selasky
99eca1b2b3 While draining a timeout task prevent the taskqueue_enqueue_timeout()
function from restarting the timer.

Commonly taskqueue_enqueue_timeout() is called from within the task
function itself without any checks for teardown. Then it can happen
the timer stays active after the return of taskqueue_drain_timeout(),
because the timeout and task is drained separately.

This patch factors out the teardown flag into the timeout task itself,
allowing existing code to stay as-is instead of applying a teardown
flag to each and every of the timeout task consumers.

Add assert to taskqueue_drain_timeout() which prevents parallel
execution on the same timeout task.

Update manual page documenting the return value of
taskqueue_enqueue_timeout().

Differential Revision:	https://reviews.freebsd.org/D8012
Reviewed by:	kib, trasz
MFC after:	1 week
2016-09-29 10:38:20 +00:00
Hiren Panchasara
7c9a4d09d6 Revert r306337. dhw@ reproted a panic which seems related to this and bde@ has
raised some issues.
2016-09-26 15:45:30 +00:00
Eric van Gyzen
310ab671b8 Make no assertions about mutex state when the scheduler is stopped.
This changes the assert path to match the lock and unlock paths.

MFC after:	1 week
Sponsored by:	Dell EMC
2016-09-26 15:30:30 +00:00
Hiren Panchasara
41bb1a25a9 In sendit(), if mp->msg_control is present, then in sockargs() we are allocating
mbuf to store mp->msg_control. Later in kern_sendit(), call to getsock_cap(),
will check validity of file pointer passed, if this fails EBADF is returned but
mbuf allocated in sockargs() is not freed. Fix this possible leak.

Submitted by:	Lohith Bellad <lohith.bellad@me.com>
Reviewed by:	adrian
MFC after:	3 weeks
Differential Revision:	https://reviews.freebsd.org/D7910
2016-09-26 10:13:58 +00:00
Julian Elischer
1c8260b61d Give the user a clue as to which process hit maxfiles.
MFC after:	1 week
Sponsored by:	Panzura
2016-09-24 22:56:13 +00:00
Konstantin Belousov
939457e3e0 Add the foundation copyrights to procctl kernel sources.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-09-23 12:32:20 +00:00
Mariusz Zaborski
ad5e83dd3c fd: fix up fget_cap
If the kernel is not compiled with the CAPABILITIES kernel options
fget_unlocked doesn't return the sequence number so fd_modify will
always report modification, in that case we got infinity loop.

Reported by:	br
Reviewed by:	mjg
Tested by:	br, def
2016-09-23 08:13:46 +00:00
Mateusz Guzik
deffc4a026 fd: fix up fgetvp_rights after r306184
fget_cap_locked returns a referenced file, but the fgetvp_rights does
not need it. Instead, due to the filedesc lock being held, it can
ref the vnode after the file was looked up.

Fix up fget_cap_locked to be consistent with other _locked helpers and not
ref the file.

This plugs a leak introduced in r306184.

Pointy hat to: mjg, oshogbo
2016-09-23 06:51:46 +00:00
Mateusz Guzik
1d2541fd1a cache: get rid of the global lock
Add a table of vnode locks and use them along with bucketlocks to provide
concurrent modification support. The approach taken is to preserve the
current behaviour of the namecache and just lock all relevant parts before
any changes are made.

Lookups still require the relevant bucket to be locked.

Discussed with:		kib
Tested by:	pho
2016-09-23 04:45:11 +00:00
Gleb Smirnoff
a2d8f9d2fc Fix regression from r297400, which truncates headers in case of low socket
buffer and put a small optimization for low socket buffer case:

- Do not hack uio_resid, and let m_uiotombuf() properly take care of it. This
  fixes truncation of headers at low buffer.
- If headers ate all the space, jump right to the end of the cycle, to
  avoid doing single page I/O and allocating zero length mbuf.
- Clear hdr_uio only if space is positive, which indicates that all uio
  was copied in.

Reviewed by:	pluknet, jtl, emax, rrs, lstewart, emax, gallatin, scottl
2016-09-22 20:34:44 +00:00
Ruslan Bukin
30f3bfe58e Adjust the sopt_val pointer on bigendian systems (e.g. MIPS64EB).
sooptcopyin() checks if size of data provided by user is <= than we can
accept, else it strips down the size. On bigendian platforms we have to
move pointer as well so we copy the actual data.

Reviewed by:	gnn
Sponsored by:	DARPA, AFRL
Sponsored by:	HEIF5
Differential Revision:	https://reviews.freebsd.org/D7980
2016-09-22 12:41:53 +00:00
Mariusz Zaborski
6490bc6529 fd: simplify fgetvp_rights by using fget_cap_locked
Reviewed by:	mjg
2016-09-22 11:54:20 +00:00
Mariusz Zaborski
85b0f9de11 capsicum: propagate rights on accept(2)
Descriptor returned by accept(2) should inherits capabilities rights from
the listening socket.

PR:		201052
Reviewed by:	emaste, jonathan
Discussed with:	many
Differential Revision:	https://reviews.freebsd.org/D7724
2016-09-22 09:58:46 +00:00
Mark Johnston
bdaf6d6913 Regenerate syscall provider argument strings. 2016-09-22 04:50:03 +00:00
Mark Johnston
5a4dfc8d83 Annotate syscall provider pointer arguments with the "userland" keyword.
This causes dtrace to automatically copyin arguments from userland, so
one no longer has to explicitly use the copyin() action to do so. Moreover,
copyin() on userland addresses is a no-op, so existing scripts should be
unaffected by this change.

Discussed with:	rstone
MFC after:	2 weeks
2016-09-22 04:49:31 +00:00
Konstantin Belousov
851194715d Make resettodr_lock accessible outside subr_rtc.c. Protect
CLOCK_GETTIME() with the lock.

Now all time-related accesses to the CMOS for RTC should be under the
lock.  This is needed to allow upcoming EFI Runtime Services support
to provide required execution environment for the firmware calls.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-09-21 10:15:08 +00:00
Konstantin Belousov
643f6f47fd Add PROC_TRAPCAP procctl(2) controls and global sysctl kern.trap_enocap.
Both can be used to cause processes in capability mode to receive
SIGTRAP when ENOTCAPABLE or ECAPMODE errors are returned from
syscalls.

Idea by:	emaste
Reviewed by:	oshogbo (previous version), emaste
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D7965
2016-09-21 08:23:33 +00:00
Edward Tomasz Napierala
e313b4dd95 Fix bug introduced with r302388, which could cause processes accessing
automounted shares to hang with "vfs_busy" wchan.

(As a workaround one can run 'automount -u' from cron.)

Reviewed by:	kib@
MFC after:	1 month
2016-09-21 05:44:13 +00:00
Sepherosa Ziehau
a5ec35dfee Fix LINT building.
Sponsored by:	Microsoft
2016-09-18 07:37:00 +00:00
Ed Maste
69a2875821 Renumber license clauses in sys/kern to avoid skipping #3 2016-09-15 13:16:20 +00:00
Kevin Lo
c3bef61e58 Remove the 4.3BSD compatible macro m_copy(), use m_copym() instead.
Reviewed by:	gnn
Differential Revision:	https://reviews.freebsd.org/D7878
2016-09-15 07:41:48 +00:00
Mariusz Zaborski
6e70b4f058 fd: add fget_cap and fget_cap_locked primitives
They can be used to obtain capabilities along with a referenced fp.

Reviewed by:	mjg@
2016-09-12 22:46:19 +00:00
John Baldwin
71499f6a2d Make device_quiet() an attachment property.
In particular, reset the DF_QUIET flag when detaching from a device so
that a driver that marks a device quiet doesn't dictate policy for a
different driver that may claim the device in the future.

Reviewed by:	rpokala, wblock
MFC after:	2 weeks
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D7803
2016-09-12 18:06:42 +00:00
Mateusz Guzik
a27815330c cache: improve scalability by introducing bucket locks
An array of bucket locks is added.

All modifications still require the global cache_lock to be held for
writing. However, most readers only need the relevant bucket lock and in
effect can run concurrently to the writer as long as they use a
different lock. See the added comment for more details.

This is an intermediate step towards removal of the global lock.

Reviewed by:	kib
Tested by:	pho
2016-09-10 16:29:53 +00:00
Konstantin Belousov
2e4fd101fa Fix build 2016-09-10 09:00:12 +00:00
Jilles Tjoelker
d30e66e53a wait: Do not copyout uninitialized status/rusage/wrusage.
If wait4() or wait6() return 0 because of WNOHANG, the status, rusage and
wrusage information should not be returned.

PR:		212048
Reported by:	Casey Lucas
MFC after:	2 weeks
2016-09-09 21:58:48 +00:00
Mateusz Guzik
a0d45f0fc8 locks: add backoff for spin mutexes and thread lock
Reviewed by:	jhb
2016-09-09 19:13:02 +00:00
Ed Maste
82b3cec52b ANSIfy uipc_syscalls.c
Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D7839
2016-09-09 17:40:26 +00:00
Ed Maste
e62264e2dd Update capabilities.conf comment
getdtablesize is per-process state, not global state
2016-09-08 14:04:04 +00:00
Kevin Lo
cee4a05669 In m_devget(), if the data fits in a packet header mbuf, check the amount
of data is less than or equal to MHLEN instead of MLEN when placing initial
small packet header at end of mbuf.

Reviewed by:	glebius
MFC after:	3 days
2016-09-08 01:02:53 +00:00
Brooks Davis
ed6d876b19 Modernize the initalization of sigproptbl.
Use C99 designators to set the value of each slot and the nitems macro to
check for valid entries. In the process, switch to indexing by signal
number rather than signal-1 for improved clarity.

Obtained from:	CheriBSD (a6053c5abf)
Sponsored by:	DARPA, AFRL
Reviewed by:	kib
2016-09-06 22:03:53 +00:00
Mateusz Guzik
5b7d9ae2fd cv: do a lockless check for no waiters in cv_signal and cv_broadcastpri
In case of some consumers like zfs there are no waiters vast majority of
the time

Reviewed by:	jhb
MFC after:	1 week
2016-09-06 17:16:59 +00:00
Mateusz Guzik
591df14528 cache: defer freeing entries until after the global lock is dropped
This also defers vdrop for held vnodes.

Glanced at by:	kib
2016-09-04 16:52:14 +00:00
Mateusz Guzik
31977b420a cache: manage negative entry list with a dedicated lock
Since negative entries are managed with a LRU list, a hit requires a
modificaton.

Currently the code tries to upgrade the global lock if needed and is
forced to retry the lookup if it fails.

Provide a dedicated lock for use when the cache is only shared-locked.

Reviewed by:	kib
MFC after:	1 week
2016-09-04 08:58:35 +00:00
Mateusz Guzik
b9042ae1bf cache: put all negative entry management code into dedicated functions
Reviewed by:	kib
MFC after:	1 week
2016-09-04 08:55:15 +00:00
Mark Johnston
3da0f3c9ae Micro-optimize sleepq_signal().
Lift a comparison out of the loop that finds the highest-priority thread
on the queue.

MFC after:	1 week
2016-09-04 00:29:48 +00:00
Brooks Davis
fd50a70770 Merge from CheriBSD:
Rename sigprop-table constants to SIGPROP_ from SA_ to reduce the
impression of a namespace collision.

Submitted by:	rwatson
Reviewed by:	jhb, kib (slightly different versions)
Obtained from:	CheriBSD (814ec5771c)
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D7616
2016-09-02 18:22:56 +00:00
Ed Maste
dd38731e09 allow kern.proc.nfds sysctl in capability mode
Reviewed by:	allanjude
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D7733
2016-09-01 02:51:50 +00:00
Patrick Kelsey
da2ded6575 _taskqueue_start_threads() now fails if it doesn't actually start any threads.
Reviewed by:	jhb
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D7701
2016-09-01 02:05:46 +00:00
Mark Johnston
99ab95db4d Rename unp_dispose_so() to unp_dispose().
It implements the dom_dispose method for local socket domain, so its name
should match the method name.
2016-08-31 21:48:22 +00:00
Ed Maste
bce38b9f35 Regnerate after r305140, getdtablesize in capability mode
Sponsored by:	The FreeBSD Foundation
2016-08-31 18:37:51 +00:00
Ed Maste
ca380195ab Allow getdtablesize in capability mode
getdtablesize is "trivial global state" and is similar to
getrlimit(RLIMIT_NOFILE), so should be permitted in capability mode.

Reviewed by:	oshogbo
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D7719
2016-08-31 18:33:15 +00:00
Allan Jude
61bd7ae0ec Eliminate unnecessary loop in _cap_check()
Calling cap_rights_contains() several times with the same inputs is not
going to produce a different output. The variable being iterated, i, is
never used inside the for loop.

The loop is actually done in cap_rights_contains()

Submitted by:	Ryan Moeller <ryan@freqlabs.com>
Reviewed by:	oshogbo, ed
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D7369
2016-08-31 17:52:11 +00:00
Nathan Whitehorn
09c697016b Back out misfired extra file in r305108. 2016-08-31 04:03:55 +00:00
Nathan Whitehorn
c9a124dc9a Refix operation on sparse CPU mappings as in r302372, temporarily broken
by r304716.

PR:		kern/210106
MFC after:	2 days
2016-08-31 04:02:52 +00:00
Mateusz Guzik
4cbafea09c fd: add fdeget_locked and use in kern_descrip 2016-08-30 21:53:22 +00:00
Bryan Drewery
533f3e1026 Reduce duplicated logic for !SMP
Sponsored by:	EMC / Isilon Storage Division
2016-08-30 19:26:07 +00:00
John Baldwin
e05ec081fe Implement 'devctl clear driver' to undo a previous 'devctl set driver'.
Add a new 'clear driver' command for devctl along with the accompanying
ioctl and devctl_clear_driver() library routine to reset a device to
use a wildcard devclass instead of a fixed devclass.  This can be used
to undo a previous 'set driver' command.  After the device's name has
been reset to permit wildcard names, it is reprobed so that it can
attach to newly-available (to it) device drivers.

MFC after:	1 month
Sponsored by:	Chelsio Communications
2016-08-29 22:48:36 +00:00
Mateusz Guzik
11d3ad2eab vfs: provide a common exit point in namei for error cases
This shortens the function, adds the SDT_PROBE use for error cases and
consistenly unrefs rootdir last.

Reviewed by:	kib
MFC after:	2 weeks
2016-08-27 22:43:41 +00:00
Konstantin Belousov
9ce60e28fd Consistently delimit each vnode description block with two blank
lines.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2016-08-27 18:12:42 +00:00
Konstantin Belousov
0f2d97838d In both do_rw_wrlock() and do_rw_rdlock() after r304808, do not
obliterate possible error from sleep with errors from
umtxq_check_susp(), when looping to clear URWLOCK_{READ,WRITE}_WAITERS.

Noted and reviewed by:	vangyzen
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-08-25 19:15:02 +00:00
Konstantin Belousov
28e21133f3 Prevent leak of URWLOCK_READ_WAITERS flag for urwlocks.
If there was some error, e.g. the sleep was interrupted, as in the
referenced PR, do_rw_rdlock() did not cleared URWLOCK_READ_WAITERS.
Since unlock only wakes up write waiters when there is no read
waiters, for URWLOCK_PREFER_READER kind of locks, the result was
missed wakeups for writers.

In particular, the most visible victims are ld-elf.so locks in
processes which loaded libthr, because rtld locks are urwlocks in
prefer-reader mode.  Normal rwlocks fall into prefer-reader mode only
if thread already owns rw lock in read mode, which is not typical and
correspondingly less visible.  In the PR, unowned rtld bind lock was
waited for in the process where only one thread was left alive.

Note that do_rw_wrlock() correctly clears URWLOCK_WRITE_WAITERS in
case of errors.

Reported and tested by:	longwitz@incore.de
PR:	211947
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-08-25 16:35:42 +00:00
Bruce Evans
d350ce61cf Less-quick fix for locking fixes in r172250. r172250 added a second
syscons spinlock for the output routine alone.  It is better to extend
the coverage of the first syscons spinlock added in r162285.  2 locks
might work with complicated juggling, but no juggling was done.  What
the 2 locks actually did was to cover some of the missing locking in
each other and deadlock less often against each other than a single
lock with larger coverage would against itself.  Races are preferable
to deadlocks here, but 2 locks are still worse since they are harder
to understand and fix.

Prefer deadlocks to races and merge the second lock into the first one.

Extend the scope of the spinlocking to all of sc_cnputc() instead of
just the sc_puts() part.  This further prefers deadlocks to races.

Extend the kdb_active hack from sc_puts() internals for the second lock
to all spinlocking.  This reduces deadlocks much more than the other
changes increases them.  The s/p,10* test in ddb gets much further now.
Hide this detail in the SC_VIDEO_LOCK() macro.  Add namespace pollution
in 1 nested #include and reduce namespace pollution in other nested
#includes to pay for this.

Move the first lock higher in the witness order.  The second lock was
unnaturally low and the first lock was unnaturally high.  The second
lock had to be above "sleepq chain" and/or "callout" to avoid spurious
LORs for visual bells in sc_puts().  Other console driver locks are
already even higher (but not adjacent like they should be) except when
they are missing from the table.  Audio bells also benefit from the
syscons lock being high so that audio mutexes have chance of being
lower.  Otherwise, console drviver locks should be as low as possible.
Non-spurious LORs now occur if the bell code calls printf() or is
interrupted (perhaps by an NMI) and the interrupt handler calls
printf().  Previous commits turned off many bells in console i/o but
missed ones done by the teken layer.
2016-08-25 13:46:52 +00:00
Robert Watson
70a98c110e Audit the accepted (or rejected) username argument to setlogin(2).
(NB: This was likely a mismerge from XNU in audit support, where the
text argument to setlogin(2) is captured -- but as a text token,
whereas this change uses the dedicated login-name field in struct
audit_record.)

MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
2016-08-20 20:28:08 +00:00
Robert Watson
c3c0088bb0 Audit additional vnode information in the implementation of the
ftruncate(2) system call.  This was not required by the Common
Criteria, which needed only open-time audit.

MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
2016-08-20 18:51:48 +00:00
Mark Johnston
e5574e0966 Don't set P2_PTRACE_FSTP in a process that invokes ptrace(PT_TRACE_ME).
Such processes are stopped synchronously by a direct call to
ptracestop(SIGTRAP) upon exec. P2_PTRACE_FSTP causes the exec()ing thread
to suspend itself while waiting for a SIGSTOP that never arrives.

Reviewed by:	kib
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D7576
2016-08-19 17:57:14 +00:00
Michal Meloun
895c8b1c39 INTRNG: Rework handling with resources. Partially revert r301453.
- Read interrupt properties at bus enumeration time and store
   it into global mapping table.
 - At bus_activate_resource() time, given mapping entry is resolved and
   connected to real interrupt source. A copy of mapping entry is attached
   to given resource.
 - At bus_setup_intr() time, mapping entry stored in resource is used
   for delivery of requested interrupt configuration.
 - For MSI/MSIX interrupts, mapping entry is created within
   pci_alloc_msi()/pci_alloc_msix() call.
 - For legacy PCI interrupts, mapping entry must be created within
   pcib_route_interrupt() by pcib driver itself.

Reviewed by: nwhitehorn, andrew
Differential Revision: https://reviews.freebsd.org/D7493
2016-08-19 10:52:39 +00:00
Mark Johnston
7f649dda55 Correct a check for P2_PTRACE_FSTP in ptracestop().
MFC after:	1 day
2016-08-19 01:27:24 +00:00
George V. Neville-Neil
3e7e23332f Remove the obsolete and unused openbsd_poll system call. (Phase 2)
Reported by:	brooks
Reviewed by:	brooks, jhb
Differential Revision:	https://reviews.freebsd.org/D7548
2016-08-18 10:54:39 +00:00
George V. Neville-Neil
5cba398b0c Remove unusedd and obsolete openbsd_poll system call. (Phase 1)
Reported by:	brooks
Reviewed by:	brooks,jhb
Differential Revision:	https://reviews.freebsd.org/D7548
2016-08-18 10:50:40 +00:00
Bryan Drewery
b387915115 Garbage collect _umtx_lock(2)/_umtx_unlock(2) references removed in r263318.
This has no real impact on the resulting libc.so file.

MFC after:	3 days
Sponsored by:	EMC / Isilon Storage Division
2016-08-17 10:20:05 +00:00
Konstantin Belousov
e2a18110f0 Remove duplicated code.
aio_aqueue() calls aio_init_aioinfo() as the first action. There is no
need to duplicate the code in kern_aio_fsync().

Also fix indent for aio_aqueue() definition.

Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D7523
2016-08-17 10:14:22 +00:00
Konstantin Belousov
1680854946 Implement userspace gettimeofday(2) with HPET timecounter.
Right now, userspace (fast) gettimeofday(2) on x86 only works for
RDTSC.  For older machines, like Core2, where RDTSC is not C2/C3
invariant, and which fall to HPET hardware, this means that the call
has both the penalty of the syscall and of the uncached hw behind the
QPI or PCIe connection to the sought bridge.  Nothing can me done
against the access latency, but the syscall overhead can be removed.
System already provides mappable /dev/hpetX devices, which gives
straight access to the HPET registers page.

Add yet another algorithm to the x86 'vdso' timehands. Libc is updated
to handle both RDTSC and HPET.  For HPET, the index of the hpet device
to mmap is passed from kernel to userspace, index might be changed and
libc invalidates its mapping as needed.

Remove cpu_fill_vdso_timehands() KPI, instead require that
timecounters which can be used from userspace, to provide
tc_fill_vdso_timehands{,32}() methods.  Merge i386 and amd64
libc/<arch>/sys/__vdso_gettc.c into one source file in the new
libc/x86/sys location.  __vdso_gettc() internal interface is changed
to move timecounter algorithm detection into the MD code.

Measurements show that RDTSC even with the syscall overhead is faster
than userspace HPET access.  But still, userspace HPET is three-four
times faster than syscall HPET on several Core2 and SandyBridge
machines.

Tested by:	Howard Su <howard0su@gmail.com>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
Differential revision:	https://reviews.freebsd.org/D7473
2016-08-17 09:52:09 +00:00
Gleb Smirnoff
dc4ee9a895 Fix a stupid typo (or copy/paste buffer malfunction). 2016-08-16 23:00:22 +00:00
Gleb Smirnoff
c0f50fa012 We should not be allowing a timeout to reset when a drain is in progress on
it (either async or sync drain).

At this moment the only user of drain is TCP, but TCP wouldn't reschedule a
callout after it has drained it, since it drains only when a tcpcb is closed.
This for now the problem isn't observed.

Submitted by:	rrs
2016-08-16 21:55:34 +00:00
Ed Schouten
93d9ebd82e Eliminate use of sys_fsync() and sys_fdatasync().
Make the kern_fsync() function public, so that it can be used by other
parts of the kernel. Fix up existing consumers to make use of it.

Requested by:	kib
2016-08-15 20:11:52 +00:00
Eric Badger
b0f2185bbe sem_post(): wake up the sleeper only after adjusting has_waiters
If the caller of sem_post() wakes up a thread sleeping via sem_wait()
before it clears the has_waiters flag, the caller of sem_wait() has no way of
knowing when it is safe to destroy the semaphore and reuse the memory. This is
because the caller of sem_post() may be interrupted between the wake step and
the clearing of has_waiters. It will then write into the has_waiters flag in
userspace after being preempted for some unknown amount of time.

Reviewed by:	jhb, kib, vangyzen
Approved by:	kib (mentor), vangyzen (mentor)
MFC after:	2 weeks
Sponsored by:	Dell Inc.
Differential Revision:	https://reviews.freebsd.org/D7505
2016-08-15 20:09:09 +00:00
Konstantin Belousov
47e61f6cc6 Implement VOP_FDATASYNC() for msdosfs.
Standard VOP_FSYNC() implementation just syncs data buffers, and due
to this, is the correct and efficient implementation for msdosfs or
any other filesystem which uses bufer cache trivially.  Provide
globally visible wrapper vop_stdfdatasync_buf() for future consumption
by other filesystems.

Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D7471
2016-08-15 19:17:00 +00:00
Konstantin Belousov
1d2537a26a Regen after r304176, fdatasync(2) addition. 2016-08-15 19:15:46 +00:00
Konstantin Belousov
295af703a0 Add an implementation of fdatasync(2).
The syscall is a trivial wrapper around new VOP_FDATASYNC(), sharing
code with fsync(2).  For all filesystems, this commit provides the
implementation which delegates the work of VOP_FDATASYNC() to
VOP_FSYNC().  This is functionally correct but not efficient.

This is not yet POSIX-compliant implementation, because it does not
ensure that queued AIO requests are completed before returning.

Reviewed by:	mckusick
Discussed with:	avg (ZFS), jhb (AIO part)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D7471
2016-08-15 19:08:51 +00:00
Konstantin Belousov
c73fb33115 VOP_FSYNC() does not take cred as an argument. Correct comment.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-08-15 18:55:33 +00:00
Alan Cox
ce3ee09b53 Eliminate unneeded vm_page_xbusy() and vm_page_xunbusy() operations when
neither vm_pager_has_page() nor vm_pager_get_pages() is called.

Reviewed by:	kib, markj
MFC after:	3 weeks
2016-08-14 22:00:45 +00:00
Bruce Evans
99061149a3 Print the tid of curthread in "show pcpu" in ddb.
It was remarkably hard to trace all current threads.  "show pcpu" only
showed the pid, and there was nothing (?) better than searching ps output
to find the tids on CPUs.  This change simplifies the search, but you
still have to trace the tid for each CPU manually.
2016-08-14 15:52:00 +00:00
Alan Cox
fc9bbf2794 Eliminate two calls to vm_page_xunbusy() that are both unnecessary and
incorrect from the error cases in exec_map_first_page().  They are
unnecessary because we automatically unbusy the page in vm_page_free()
when we remove it from the object.  The calls are incorrect because they
happen after the page is freed, so we might actually unbusy the page
after it has been reallocated to a different object.  (This error was
introduced in r292373.)

Reviewed by:	kib
MFC after:	1 week
2016-08-13 18:10:32 +00:00
Edward Tomasz Napierala
f8acef5a3e Remove unused "X" vnode lock assertion, somehow missed in r303743.
MFC after:	1 month
2016-08-12 22:22:11 +00:00
Edward Tomasz Napierala
f83cc0aae4 Print vnode details when vnode locking assertion gets triggered.
MFC after:	1 month
2016-08-12 22:20:52 +00:00
Stephen Hurd
23ac9029f9 Update iflib to support more NIC designs
- Move group task queue into kern/subr_gtaskqueue.c
- Change intr_enable to return an int so it can be detected if it's not
  implemented
- Allow different TX/RX queues per set to be different sizes
- Don't split up TX mbufs before transmit
- Allow a completion queue for TX as well as RX
- Pass the RX budget to isc_rxd_available() to allow an earlier return
  and avoid multiple calls

Submitted by:	shurd
Reviewed by:	gallatin
Approved by:	scottl
Differential Revision:	https://reviews.freebsd.org/D7393
2016-08-12 21:29:44 +00:00
Mark Johnston
5004817335 Remove b_pin_count from struct buf.
It was added in r153192 for XFS and doesn't appear to have been used for
anything else. XFS was disconnected in r241607 and removed entirely in
r247631.

Reported by:	mlaier
Reviewed by:	imp, kib
Differential Revision:	https://reviews.freebsd.org/D7468
2016-08-11 07:58:23 +00:00
Edward Tomasz Napierala
411455a8fb Replace all remaining calls to vprint(9) with vn_printf(9), and remove
the old macro.

MFC after:	1 month
2016-08-10 16:12:31 +00:00
Mateusz Guzik
7c34b35b57 ktrace: do a lockless check on fork to see if tracing is enabled
This saves 2 lock acquisitions in the common case.
2016-08-10 15:25:44 +00:00
Mateusz Guzik
382172be68 sigio: do a lockless check in funsetownlist
There is no need to grab the lock first to see if sigio is used, and it
typically is not.
2016-08-10 15:24:15 +00:00
Konstantin Belousov
3a77833e87 Fix indentation.
Reported by:	hselasky
MFC after:	17 days
2016-08-10 14:41:53 +00:00
Konstantin Belousov
49c394a970 Re-schedule signals after kthread exits, since apparently there are
processes which combine kernel and non-kernel threads, e.g. nfsd.  For
such processes, termination of a kthread must recheck signal delivery
among other threads according to masks.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-08-10 13:47:12 +00:00
Jean-Sébastien Pédron
bd937497ea Consistently use device_t
Several files use the internal name of `struct device` instead of
`device_t` which is part of the public API. This patch changes all
`struct device *` to `device_t`.

The remaining occurrences of `struct device` are those referring to the
Linux or OpenBSD version of the structure, or the code is not built on
FreeBSD and it's unclear what to do.

Submitted by:	Matthew Macy <mmacy@nextbsd.org> (previous version)
Approved by:	emaste, jhibbits, sbruno
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D7447
2016-08-09 19:32:06 +00:00
Stephen J. Kiernan
0ce1624d0e Move IPv4-specific jail functions to new file netinet/in_jail.c
_prison_check_ip4 renamed to prison_check_ip4_locked

Move IPv6-specific jail functions to new file netinet6/in6_jail.c
_prison_check_ip6 renamed to prison_check_ip6_locked

Add appropriate prototypes to sys/sys/jail.h

Adjust kern_jail.c to call prison_check_ip4_locked and
prison_check_ip6_locked accordingly.

Add netinet/in_jail.c and netinet6/in6_jail.c to the list of files that
need to be built when INET and INET6, respectively, are configured in the
kernel configuration file.

Reviewed by:	jtl
Approved by:	sjg (mentor)
Sponsored by:	Juniper Networks, Inc.
Differential Revision:	https://reviews.freebsd.org/D6799
2016-08-09 02:16:21 +00:00
Mark Johnston
434ac8b6b7 Handle races with listening socket close when connecting a unix socket.
If the listening socket is closed while sonewconn() is executing, the
nascent child socket is aborted, which results in recursion on the
unp_link lock when the child's pru_detach method is invoked. Fix this
by using a flag to mark such sockets, and skip a part of the socket's
teardown during detach.

Reported by:	Raviprakash Darbha <rdarbha@juniper.net>
Tested by:	pho
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D7398
2016-08-08 20:25:04 +00:00
Bryan Drewery
c1fa440409 Regenerate after r303755.
MFC after:	3 days
X-MFC-With:	r303755
Sponsored by:	EMC / Isilon Storage Division
2016-08-04 19:15:51 +00:00
Bryan Drewery
417d5dec39 Still provide freebsd10_* symbols from libc for COMPAT10.
r296773 was done to only remove libc symbols for <7.  We want to provide
the syscall symbols going forward for 7+.

Discussed with:	jhb
MFC after:	3 days
Sponsored by:	EMC / Isilon Storage Division
2016-08-04 19:14:18 +00:00
Edward Tomasz Napierala
7b255097eb Remove unused - never actually implemented - vnode lock types
from vnode_if.src.

MFC after:	1 month
2016-08-04 13:45:18 +00:00
Bryan Drewery
78be18ae6e Correct some comments.
Sponsored by:	EMC / Isilon Storage Division
MFC after:	3 days
2016-08-03 18:48:56 +00:00
Mateusz Guzik
0453ade508 locks: fix sx compilation on mips after r303643
The kernel.h header is required for the SYSINIT macro, which apparently
was present on amd64 by accident.

Reported by:	kib
2016-08-03 09:15:10 +00:00
Konstantin Belousov
e69ba32f88 Remove mention of the Giant from the fork_return() description.
Making emphasis on this lock in the core function comment is confusing
for the modern kernel.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2016-08-03 07:10:09 +00:00
Ed Schouten
e938ebbc0c Regenerate system call tables for r303699 and r303700. 2016-08-03 06:36:45 +00:00
Ed Schouten
1b42875d4b Re-add traling slash that was removed in r303699.
I must have accidentally pressed some random key in vim.
2016-08-03 06:35:58 +00:00
Ed Schouten
a813fdc6c3 mprotect(): Change prototype to comply to POSIX.
Our mprotect() function seems to take a "const void *" address to the
pages whose permissions need to be adjusted. POSIX uses "void *". Simply
stick to the POSIX one to prevent us from writing unportable code.

PR:		211423 (exp-run)
Tested by:	antoine@ (Thanks!)
2016-08-03 06:33:04 +00:00
Mateusz Guzik
fa5000a4f3 locks: fix compilation for KDTRACE_HOOKS && !ADAPTIVE_* case
Reported by:	Michael Butler <imb protected-networks.net>
2016-08-02 03:05:59 +00:00
Mateusz Guzik
0412689595 locks: fix up ifdef guards introduced in r303643
Both sx and rwlocks had copy-pasted ADAPTIVE_MUTEXES instead of the correct
define.

MFC after:	1 week
2016-08-02 00:15:08 +00:00
Mateusz Guzik
1ada904147 Implement trivial backoff for locking primitives.
All current spinning loops retry an atomic op the first chance they get,
which leads to performance degradation under load.

One classic solution to the problem consists of delaying the test to an
extent. This implementation has a trivial linear increment and a random
factor for each attempt.

For simplicity, this first thouch implementation only modifies spinning
loops where the lock owner is running. spin mutexes and thread lock were
not modified.

Current parameters are autotuned on boot based on mp_cpus.

Autotune factors are very conservative and are subject to change later.

Reviewed by:	kib, jhb
Tested by:	pho
MFC after:	1 week
2016-08-01 21:48:37 +00:00
Mateusz Guzik
61852185ba locks: change sleep_cnt and spin_cnt types to u_int
Both variables are uint64_t, but they only count spins or sleeps.
All reasonable values which we can get here comfortably hit in 32-bit range.

Suggested by: kib
MFC after:	1 week
2016-07-31 12:11:55 +00:00
Mateusz Guzik
e0c45af904 sx: increment spin_cnt before cpu_spinwait in xlock
The change is a no-op only done for consistency with the rest of the file.
2016-07-30 22:23:31 +00:00
Mateusz Guzik
7a54be1870 rwlock: s/READER/WRITER/ in wlock lockstat annotation 2016-07-30 22:21:48 +00:00
Konstantin Belousov
50c22263bb Cache getbintime(9) answer in timehands, similarly to getnanotime(9)
and getmicrotime(9).

Suggested and reviewed by:	bde (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
2016-07-30 09:25:57 +00:00
John Baldwin
e2325d8285 Don't treat NOCPU as a valid CPU to CPU_ISSET.
If a thread is created bound to a cpuset it might already be bound before
it's very first timeslice, and td_lastcpu will be NOCPU in that case.

MFC after:	1 week
2016-07-29 20:19:14 +00:00
John Baldwin
005ce8e4e6 Fix locking issues with aio_fsync().
- Use correct lock in aio_cancel_sync when dequeueing job.
- Add _locked variants of aio_set/clear_cancel_function and use those
  to avoid lock recursion when adding and removing fsync jobs to the
  per-process sync queue.
- While here, add a basic test for aio_fsync().

PR:		211390
Reported by:	Randy Westlund <rwestlun@gmail.com>
MFC after:	1 week
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D7339
2016-07-29 18:26:15 +00:00
Brooks Davis
40018b91dd Don't create pointless backups of generated files in "make sysent".
Any sensible workflow will include a revision control system from which
to restore the old files if required.  In normal usage, developers just
have to clean up the mess.

Reviewed by:	jhb
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D7353
2016-07-28 21:29:04 +00:00
Ed Schouten
5590eb985e Regenerate system call table for r303435. 2016-07-28 12:22:34 +00:00
Ed Schouten
d9c4cd2fbc Change the return type of msgrcv() to ssize_t as required by POSIX.
It looks like the msgrcv() system call is already written in such a way
that the size is internally computed as a size_t and written into all of
td_retval[0]. This means that it is effectively already returning
ssize_t. It's just that the userspace prototype doesn't match up.
2016-07-28 12:22:01 +00:00
Konstantin Belousov
2d19b736ed Rewrite subr_sleepqueue.c use of callouts to not depend on the
specifics of callout KPI.  Esp., do not depend on the exact interface
of callout_stop(9) return values.

The main change is that instead of requiring precise callouts, code
maintains absolute time to wake up.  Callouts now should ensure that a
wake occurs at the requested moment, but we can tolerate both run-away
callout, and callout_stop(9) lying about running callout either way.

As consequence, it removes the constant source of the bugs where
sleepq_check_timeout() causes uninterruptible thread state where the
thread is detached from CPU, see e.g. r234952 and r296320.

Patch also removes dual meaning of the TDF_TIMEOUT flag, making code
(IMO much) simpler to reason about.

Tested by:	pho
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
Differential revision:	https://reviews.freebsd.org/D7137
2016-07-28 09:09:55 +00:00
Konstantin Belousov
a9e182e895 Extract the calculation of the callout fire time into the new function
callout_when(9).  See the man page update for the description of the
intended use.

Tested by:	pho
Reviewed by:	jhb, bjk (man page updates)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
X-Differential revision:	https://reviews.freebsd.org/D7137
2016-07-28 08:57:01 +00:00
Konstantin Belousov
b7a25e63b6 When a debugger attaches to the process, SIGSTOP is sent to the
target.  Due to a way issignal() selects the next signal to deliver
and report, if the simultaneous or already pending another signal
exists, that signal might be reported by the next waitpid(2) call.
This causes minor annoyance for debuggers, which must be prepared to
take any signal as the first event, then filter SIGSTOP later.

More importantly, for tools like gcore(1), which attach and then
detach without processing events, SIGSTOP might leak to be delivered
after PT_DETACH.  This results in the process being unintentionally
stopped after detach, which is fatal for automatic tools.

The solution is to force SIGSTOP to be the first signal reported after
the attach.  Attach code is modified to set P2_PTRACE_FSTP to indicate
that the attaching ritual was not yet finished, and issignal() prefers
SIGSTOP in that condition.  Also, the thread which handles
P2_PTRACE_FSTP is made to guarantee to own p_xthread during the first
waitpid(2).  All that ensures that SIGSTOP is consumed first.

Additionally, if P2_PTRACE_FSTP is still set on detach, which means
that waitpid(2) was not called at all, SIGSTOP is removed from the
queue, ensuring that the process is resumed on detach.

In issignal(), when acting on STOPing signals, remove the signal from
queue before suspending.  Otherwise parallel attach could result in
ptracestop() acting on that STOP as if it was the STOP signal from the
attach.  Then SIGSTOP from attach leaks again.

As a minor refactoring, some bits of the common attach code is moved
to new helper proc_set_traced().

Reported by:	markj
Reviewed by:	jhb, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D7256
2016-07-28 08:41:13 +00:00
Stephen J. Kiernan
4ac21b4f09 Prepare for network stack as a module
- Move cr_canseeinpcb to sys/netinet/in_prot.c in order to separate the
   INET and INET6-specific code from the rest of the prot code (It is only
   used by the network stack, so it makes sense for it to live with the
   other network stack code.)
 - Move cr_canseeinpcb prototype from sys/systm.h to netinet/in_systm.h
 - Rename cr_seeotheruids to cr_canseeotheruids and cr_seeothergids to
   cr_canseeothergids, make them non-static, and add prototypes (so they
   can be seen/called by in_prot.c functions.)
 - Remove sw_csum variable from ip6_forward in ip6_forward.c, as it is an
   unused variable.

Reviewed by:	gnn, jtl
Approved by:	sjg (mentor)
Sponsored by:	Juniper Networks, Inc.
Differential Revision:	https://reviews.freebsd.org/D2901
2016-07-27 20:34:09 +00:00
John Baldwin
b9a53e161b Adjust tests in fsync job scheduling loop to reduce indentation. 2016-07-27 19:31:25 +00:00
Ed Maste
18c23dffac ANSIfy kern_proc.c and delete register keyword
Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D6478
2016-07-27 14:27:08 +00:00
Konstantin Belousov
1822421c0b Remove Giant from settime(), tc_setclock_mtx guards tc_windup() calls,
and there is no other issues with parallel settime().  Remove spl()
vestiges there as well.

Tested by:	pho (as part of the whole patch)
Reviewed by:	jhb (same)
Discussed wit:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
Differential revision:	https://reviews.freebsd.org/D7302
2016-07-27 11:54:24 +00:00
Konstantin Belousov
5760b029ee Prevent parallel tc_windup() calls, both parallel top-level calls from
setclock() and from simultaneous top-level and interrupt.  For this,
tc_windup() is protected with a tc_setclock_mtx spinlock, in the try
mode when called from hardclock interrupt.  If spinlock cannot be
obtained without spinning from the interrupt context, this means that
top-level executes tc_windup() on other core and our try may be
avoided.

The boottimebin and boottime variables should be adjusted from
tc_windup().  To be correct, they must be part of the timehands and
read using lockless protocol.  Remove the globals and reimplement the
getboottime(9)/getboottimebin(9) KPI using the timehands read
protocol.

Tested by:	pho (as part of the whole patch)
Reviewed by:	jhb (same)
Discussed wit:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
X-Differential revision:	https://reviews.freebsd.org/D7302
2016-07-27 11:49:41 +00:00
Konstantin Belousov
4493f659e5 Fix a bug in r302252.
Change ntpadj_lock to spinlock always, and rename stuff removing
ADJ/adj from the names. ntp_update_second() requires ntp_lock and is
called from the tc_windup(), so ntp_lock must be a spinlock.  Add
missed lock to ntp_update_second().

Tested by:	pho (as part of the whole patch)
Reviewed by:	jhb (same)
Noted by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
X-Differential revision:	https://reviews.freebsd.org/D7302
2016-07-27 11:40:06 +00:00
Konstantin Belousov
21547fc7ca Reduce the resettodr_lock scope to only CLOCK_SETTIME() call.
Tested by:	pho (as part of the whole patch)
Reviewed by:	jhb (same)
Discussed with:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
X-Differential revision:	https://reviews.freebsd.org/D7302
2016-07-27 11:34:25 +00:00
Konstantin Belousov
4d29106e55 Style.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
X-Differential revision:	https://reviews.freebsd.org/D7302
2016-07-27 11:33:33 +00:00
Konstantin Belousov
a83c016f00 Reduce number of timehands to just two. This is useful because
consumers can now be only one tc_windup() call late.

Use C99 initialization.

Tested by:	pho (as part of the whole patch)
Reviewed by:	jhb (same)
Discussed with:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
X-Differential revision:	https://reviews.freebsd.org/D7302
2016-07-27 11:27:52 +00:00
Konstantin Belousov
584b675ed6 Hide the boottime and bootimebin globals, provide the getboottime(9)
and getboottimebin(9) KPI. Change consumers of boottime to use the
KPI.  The variables were renamed to avoid shadowing issues with local
variables of the same name.

Issue is that boottime* should be adjusted from tc_windup(), which
requires them to be members of the timehands structure.  As a
preparation, this commit only introduces the interface.

Some uses of boottime were found doubtful, e.g. NLM uses boottime to
identify the system boot instance.  Arguably the identity should not
change on the leap second adjustment, but the commit is about the
timekeeping code and the consumers were kept bug-to-bug compatible.

Tested by:	pho (as part of the bigger patch)
Reviewed by:	jhb (same)
Discussed with:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
X-Differential revision:	https://reviews.freebsd.org/D7302
2016-07-27 11:08:59 +00:00
Stephen J. Kiernan
cc37baea09 Add the NUM_CORE_FILES kernel config option which specifies the limit for the
number of core files allowed by a particular process when using the %I core
file name pattern.

Sanity check at compile time to ensure the value is within the valid range of
0-10.

Reviewed by:	jtl, sjg
Approved by:	sjg (mentor)
Sponsored by:	Juniper Networks, Inc.
Differential Revision:	https://reviews.freebsd.org/D6812
2016-07-27 03:21:02 +00:00
Ed Schouten
f63cd251b2 Add shmatt_t.
It looks like our "struct shmid_ds::shm_nattch" deviates from the
standard in the sense that it is a signed integer, whereas POSIX
requires that it is unsigned, having a special type shmatt_t.

Patch up our native and 32-bit copies to use a new shmatt_t that is an
unsigned integer. As it's unsigned, we can relax the comparisons that
are performed on it. Leave the Linux, iBCS2, etc. copies of the
structure alone.

Reviewed by:	ngie
Differential Revision:	https://reviews.freebsd.org/D6655
2016-07-26 17:23:49 +00:00
Conrad Meyer
af326ace9d devfs: Move most ioctl logic down to vnode layer
Devfs' file layer ioctl is now just a thin shim around the vnode layer.

Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D7286
2016-07-25 16:28:02 +00:00
Konstantin Belousov
90b581f2cc Implement mtx_trylock_spin(9).
Discussed with:	bde
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D7192
2016-07-23 05:30:55 +00:00
John Baldwin
9c20dc9963 Add more documentation regarding unsafe AIO requests.
The asynchronous I/O changes made previously result in different
behavior out of the box. Previously all AIO requests failed with
ENOSYS / SIGSYS unless aio.ko was explicitly loaded. Now, some AIO
requests complete and others ("unsafe" requests) fail with EOPNOTSUPP.

Reword the introductory paragraph in aio(4) to add a general
description of AIO before describing the vfs.aio.enable_unsafe sysctl.

Remove the ENOSYS error description from aio_fsync(2), aio_read(2),
and aio_write(2) and replace it with a description of EOPNOTSUPP.

Remove the ENOSYS error description from aio_mlock(2).

Log a message to the system log the first time a process requests an
"unsafe" AIO request that fails with EOPNOTSUPP. This is modeled on
the log message used for processes using the legacy pty devices.

Reviewed by:	kib (earlier version)
MFC after:	1 week
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D7151
2016-07-21 22:49:47 +00:00
Konstantin Belousov
492fe1b774 Hide counted_warning(9) under #ifdef _KERNEL braces, to allow building
subr_prf.c in userspace for libsbuf.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-07-21 17:59:30 +00:00
Konstantin Belousov
9fe297bbdc Declare aio requests on files from local filesystems safe.
Two notes:
- I allow AIO on reclaimed vnodes, since it is deterministically terminated
  fast.
- devfs mounts are marked as MNT_LOCAL, but device vnodes have type
  VCHR, so the slow device io is not allowed.

Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D7273
2016-07-21 17:07:06 +00:00
Konstantin Belousov
9837947b07 Provide counter_warning(9) KPI which allows to issue limited number of
warnings for some kernel events, mostly intended for the use of
obsoleted or otherwise undersired interfaces.

This is an abstracted and race-expelled code from compat pty driver.

Requested and reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D7270
2016-07-21 16:34:56 +00:00
Conrad Meyer
1005d8aff4 imgact_elf: Rename the segment iterator to match reality
The each_writable_segment routine evaluates segments on a slightly little more
nuanced metric than simply "writable" or not.  Rename the function to more
closely match its behavior (each_dumpable_segment).

Suggested by:	jhb
Sponsored by:	EMC / Isilon Storage Division
2016-07-20 22:51:33 +00:00
Conrad Meyer
f3325003d9 ANSI-fy imgact_elf.c
Sponsored by:	EMC / Isilon Storage Division
2016-07-20 22:46:56 +00:00
Conrad Meyer
07f825e871 Fix DEBUG build on 64-bit arch after r303099
Reported by:	Larry Rosenman <ler at lerctr.org>
2016-07-20 18:11:22 +00:00
Conrad Meyer
c17b0bd2a6 Extend ELF coredump to support more than 65535 segments
The ELF e_phnum field is only 16 bits wide. To support more than 65535 segments
(program headers), Sun's "Linker and Libraries Guide" table 7-7 (or 12-7,
depending on document version) prescribes a special first section header where
sh_info represents the real number of program headers.

Test code to follow, when it is ready.

Reference:	http://docs.oracle.com/cd/E18752_01/pdf/817-1984.pdf

Reviewed by:	emaste, markj
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D7255
2016-07-20 16:59:36 +00:00
Gleb Smirnoff
9f3391243b Redo the r302894: the very new value for a non-scheduled callout is -1.
This was recently added in r290664.

Noticed by:	hselasky
Tested by:	Larry Rosenman <ler lerctr.org>
PR:		210884
2016-07-20 16:48:25 +00:00
Gleb Smirnoff
47e4280922 Revert r303037. It re-introduces the panic with TCP timers.
Agreed by:	rrs, re (gjb)
2016-07-20 16:44:22 +00:00
Randall Stewart
3d84a18803 This reverts out Gleb's changes and adds three small
fixes that I think closes up the races Gleb was
looking for. This is running quite nicely in Netflix and
now no longer causes TCP-tcb leaks.

Differential Revision:	7135
2016-07-19 18:31:19 +00:00
John Baldwin
ccb83afd81 Include process IDs in core dumps.
When threads were added to the kernel, the pr_pid member of the
NT_PRSTATUS note was repurposed to store LWP IDs instead of process
IDs.  However, the process ID was no longer recorded in core dumps.
This change adds a pr_pid field to prpsinfo (NT_PRSINFO).  Rather than
bumping the prpsinfo version number, note parsers can use the note's
payload size to determine if pr_pid is present.

Reviewed by:	kib, emaste (older version)
MFC after:	2 months
Differential Revision:	https://reviews.freebsd.org/D7117
2016-07-18 15:14:23 +00:00
John Baldwin
fc4f075a1a Add PTRACE_VFORK to trace vfork events.
First, PL_FLAG_FORKED events now also set a PL_FLAG_VFORKED flag when
the new child was created via vfork() rather than fork().  Second, a
new PL_FLAG_VFORK_DONE event can now be enabled via the PTRACE_VFORK
event mask.  This new stop is reported after the vfork parent resumes
due to the child calling exit or exec.  Debuggers can use this stop to
reinsert breakpoints in the vfork parent process before it resumes.

Reviewed by:	kib
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D7045
2016-07-18 14:53:55 +00:00
Konstantin Belousov
77d6809483 The assertion re-added in r302614 was triggered when stopping signal
is delivered to vforked child.  Issue is that we avoid stopping such
children in issignal() to not block parents.  But executed AST, which
ignored stops, leaves the child with the signal pending but no AST
pending.

On first exec after vfork(), call signotify() to handle pending
reenabled signals.  Adjust the assert to not check vfork children
until exec.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-07-18 10:53:47 +00:00
Gleb Smirnoff
809a9d1353 Revert the last commit. It must get more review and testing first. 2016-07-18 09:29:08 +00:00
Gleb Smirnoff
ef58c6a7a3 Redo the r302894: the very new value for a non-scheduled callout is -1.
This was recently added in r290664.

Noticed by:	hselasky
PR:		210884
2016-07-18 09:26:06 +00:00
Konstantin Belousov
56d48d6dcb Fix another bug after r302350.
Reported and tested by:	pho
PR:	210884
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2016-07-18 04:30:34 +00:00
Konstantin Belousov
86f1146329 Another issue reported on http://seclists.org/oss-sec/2016/q3/68 is
that struct kevent member ident has uintptr_t type, which is silently
truncated to int in the call to fget().  Explicitely check for the
valid range.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-07-16 13:24:58 +00:00
Konstantin Belousov
f470cca578 In ptrace_vm_entry(), do not call vmspace_free() while owning a vm
object lock.

The vmspace_free() operations might need to lock map, object etc on
last dereference.  Postpone the free until object's inspection is
done.

Reported and tested by:	will
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-07-15 23:26:33 +00:00
John Baldwin
8d570f64aa Add a mask of optional ptrace() events.
ptrace() now stores a mask of optional events in p_ptevents.  Currently
this mask is a single integer, but it can be expanded into an array of
integers in the future.

Two new ptrace requests can be used to manipulate the event mask:
PT_GET_EVENT_MASK fetches the current event mask and PT_SET_EVENT_MASK
sets the current event mask.

The current set of events include:
- PTRACE_EXEC: trace calls to execve().
- PTRACE_SCE: trace system call entries.
- PTRACE_SCX: trace syscam call exits.
- PTRACE_FORK: trace forks and auto-attach to new child processes.
- PTRACE_LWP: trace LWP events.

The S_PT_SCX and S_PT_SCE events in the procfs p_stops flags have
been replaced by PTRACE_SCE and PTRACE_SCX.  PTRACE_FORK replaces
P_FOLLOW_FORK and PTRACE_LWP replaces P2_LWP_EVENTS.

The PT_FOLLOW_FORK and PT_LWP_EVENTS ptrace requests remain for
compatibility but now simply toggle corresponding flags in the
event mask.

While here, document that PT_SYSCALL, PT_TO_SCE, and PT_TO_SCX both
modify the event mask and continue the traced process.

Reviewed by:	kib
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D7044
2016-07-15 15:32:09 +00:00
Gleb Smirnoff
2138e263cb Fix regression introduced by r302350. The change of return value for a
callout that wasn't scheduled at all was unintentional and yielded in
several panics.

PR:		210884
2016-07-15 09:28:32 +00:00
Konstantin Belousov
d7e8cfd63d Do not allow creation of char or block special nodes with VNOVAL dev_t.
As was reported on http://seclists.org/oss-sec/2016/q3/68, tmpfs code
contains assertion that rdev != VNOVAL.  On FreeBSD, there is no other
consequences except triggering the assert.  To be compatible with
systems where device nodes have some significance, reject mknod(2)
call with dev == VNOVAL at the syscall level.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-07-15 09:23:18 +00:00
John Baldwin
c77547d2f9 Include command line arguments in core dump process info.
Fill in pr_psargs in the NT_PRSINFO ELF core dump note with command
line arguments.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D7116
2016-07-14 23:20:05 +00:00
Mark Johnston
0649caae32 Let DDB's buf printer handle NULL pointers in the buf page array.
A buf's b_pages and b_npages fields may be inconsistent after a panic.
For instance, vfs_vmio_invalidate() sets b_npages to zero only after all
pages are unwired and their page array entries are cleared.

MFC after:	1 week
Sponsored by:	EMC / Isilon Storage Division
2016-07-14 18:49:05 +00:00
Eric Badger
fdb6320d45 Add explicit detection of KVM hypervisor
Set vm_guest to a new enum value (VM_GUEST_KVM) when kvm is detected and use
vm_guest in conditionals testing for KVM.

Also, fix a conditional checking if we're running in a VM which caught only
the generic VM case, but not more specific VMs (KVM, VMWare, etc.).  (Spotted
by: vangyzen).

Differential revision:	https://reviews.freebsd.org/D7172
Sponsored by:	Dell Inc.
Approved by:	kib (mentor), vangyzen (mentor)
Reviewed by:	alc
MFC after:	4 weeks
2016-07-13 19:19:18 +00:00
Konstantin Belousov
de56aee0bf Trace timeval parameters to the getitimer(2) and setitimer(2) syscalls.
Reviewed by:	jhb
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D7158
2016-07-13 14:37:58 +00:00
Konstantin Belousov
8f01bee46b Revive the check, disabled in r197963.
Despite the implication (process has pending signals -> the current
thread marked for AST and has TDF_NEEDSIGCHK set) is not true due to
other thread might manipulate its signal blocking mask, it should still
hold for the single-threaded processes.  Enable check for the condition
for single-threaded case, and replicate it from userret() to ast() as
well, where we check that ast indeed has no signal to deliver.

Note that the check is under DIAGNOSTIC, it is not enabled for INVARIANTS
but !DIAGNOSTIC since it imposes too heavy-weight locking for day-to-day
used debugging kernel.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-07-12 03:53:15 +00:00
Konstantin Belousov
4f4c35bb38 Add assert to complement r302328.
AST must not execute with TDF_SBDRY or TDF_SEINTR/TDF_SERESTART thread
flags set, which is asserted in userret(). As the consequence, -1 return
from cursig() must not be possible.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-07-12 03:52:05 +00:00
Nathan Whitehorn
8c636a11dc Remove assumptions in MI code that the BSP is CPU 0.
MFC after:	2 weeks
2016-07-11 21:25:28 +00:00
Konstantin Belousov
725496ced5 Fix grammar.
Submitted by:	alc
MFC after:	2 weeks
2016-07-11 17:04:22 +00:00
Konstantin Belousov
19efd8a5a8 In vgonel(), postpone setting BO_DEAD until VOP_RECLAIM() is called,
if vnode is VMIO.  For VMIO vnodes, set BO_DEAD in vm_object_terminate().

The vnode_destroy_object(), when calling into vm_object_terminate(),
must be able to flush buffers.  BO_DEAD purpose is to quickly destroy
buffers on write when the underlying vnode is not operable any more
(one example is the devfs node after geom is gone).  Setting BO_DEAD
for reclaiming vnode before object is terminated is premature, and
results in unability to flush buffers with live SU dependencies from
vinvalbuf() in vm_object_terminate().

Reported by:	David Cross <dcrosstech@gmail.com>
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-07-11 14:19:09 +00:00
Robert Watson
5fa69ff015 In process-descriptor close(2) and fstat(2), audit target process
information.  pgkill(2) already audits target process ID.

MFC after:	3 days
Sponsored by:	DARPA, AFRL
2016-07-10 14:17:36 +00:00
Robert Watson
e5ec733909 Do allow auditing of read(2) and write(2) system calls, by assigning
those system calls audit event identifiers AUE_READ and AUE_WRITE.
While auditing file-descriptor I/O is not required by the Common
Criteria, in practice this proves useful for both live and forensic
analysis.

NB: freebsd32 already assigns AUE_READ and AUE_WRITE to read(2) and
write(2).

MFC after:	3 days
Sponsored by:	DARPA, AFRL
2016-07-10 13:42:33 +00:00
Robert Watson
8ec75c0fc3 Audit the file-descriptor number argument for openat(2). Remove a comment
about the desirability of auditing the number, as it was in fact in the
wrong place (in the common path for open(2) and openat(2), and only the
latter accepts a file-descriptor argument).  Where other ABIs support
openat(2), it may be necessary to do additional argument auditing as it is
not performed in kern_openat(9).

MFC after:	3 days
Sponsored by:	DARPA, AFRL
2016-07-10 09:50:21 +00:00
Robert Watson
51d1f69069 Audit file-descriptor arguments to I/O system calls such as
read(2), write(2), dup(2), and mmap(2).  This auditing is not
required by the Common Criteria (and hence was not being
performed), but is valuable in both contemporary live analysis
and forensic use cases.

MFC after:	3 days
Sponsored by:	DARPA, AFRL
2016-07-10 08:04:02 +00:00
Edward Tomasz Napierala
debc480e03 Add new unmount(2) flag, MNT_NONBUSY, to check whether there are
any open vnodes before proceeding. Make autounmound(8) use this flag.
Without it, even an unsuccessfull unmount causes filesystem flush,
which interferes with normal operation.

Reviewed by:	kib@
Approved by:	re (gjb@)
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D7047
2016-07-07 09:03:57 +00:00
Nathan Whitehorn
96c85efb4b Replace a number of conflations of mp_ncpus and mp_maxid with either
mp_maxid or CPU_FOREACH() as appropriate. This fixes a number of places in
the kernel that assumed CPU IDs are dense in [0, mp_ncpus) and would try,
for example, to run tasks on CPUs that did not exist or to allocate too
few buffers on systems with sparse CPU IDs in which there are holes in the
range and mp_maxid > mp_ncpus. Such circumstances generally occur on
systems with SMT, but on which SMT is disabled. This patch restores system
operation at least on POWER8 systems configured in this way.

There are a number of other places in the kernel with potential problems
in these situations, but where sparse CPU IDs are not currently known
to occur, mostly in the ARM machine-dependent code. These will be fixed
in a follow-up commit after the stable/11 branch.

PR:		kern/210106
Reviewed by:	jhb
Approved by:	re (glebius)
2016-07-06 14:09:49 +00:00
Gleb Smirnoff
d153eeee97 The paradigm of a callout is that it has three consequent states:
not scheduled -> scheduled -> running -> not scheduled. The API and the
manual page assume that, some comments in the code assume that, and looks
like some contributors to the code also did. The problem is that this
paradigm isn't true. A callout can be scheduled and running at the same
time, which makes API description ambigouous. In such case callout_stop()
family of functions/macros should return 1 and 0 at the same time, since it
successfully unscheduled future callout but the current one is running.
Before this change we returned 1 in such a case, with an exception that
if running callout was migrating we returned 0, unless CS_MIGRBLOCK was
specified.

With this change, we now return 0 in case if future callout was unscheduled,
but another one is still in action, indicating to API users that resources
are not yet safe to be freed.

However, the sleepqueue code relies on getting 1 return code in that case,
and there already was CS_MIGRBLOCK flag, that covered one of the edge cases.
In the new return path we will also use this flag, to keep sleepqueue safe.

Since the flag CS_MIGRBLOCK doesn't block migration and now isn't limited to
migration edge case, rename it to CS_EXECUTING.

This change fixes panics on a high loaded TCP server.

Reviewed by:	jch, hselasky, rrs, kib
Approved by:	re (gjb)
Differential Revision:	https://reviews.freebsd.org/D7042
2016-07-05 18:47:17 +00:00
Gleb Smirnoff
a0d20ecb94 Compile in the kassert_panic() function with INVARIANT_SUPPORT
option, not INVARIANTS.  The function is required if we want
to load in a module that is compiled with INVARIANTS.

Reviewed by:	jhb
Approved by:	re (gjb)
2016-07-05 18:34:34 +00:00
Mark Johnston
f61d6c5a5b Ensure that spinlock sections are balanced even after a panic.
vpanic() uses spinlock_enter() to disable interrupts before dumping core.
However, when the scheduler is stopped and INVARIANTS is not configured,
thread_lock() does not acquire a spinlock section, while thread_unlock()
releases one. This can result in interrupts staying enabled while the
kernel dumps core, complicating post-mortem analysis of the crash.

Approved by:	re (gjb)
MFC after:	1 week
Sponsored by:	EMC / Isilon Storage Division
2016-07-05 17:59:04 +00:00
Robert Watson
971711fb7c Call audit hooks to capture vnode attributes for three file-descriptor
method implementations: fstat(2), close(2), and poll(2).  This change
synchronises auditing here with similar auditing for VFS-specific system
calls such as stat(2) that audit more complete vnode information.

Sponsored by:	DARPA, AFRL
Approved by:	re (kib)
MFC after:	1 week
2016-07-05 16:37:01 +00:00
Ed Maste
1cbb879df8 add description for debug.elf{32,64}_legacy_coredump sysctl
Approved by:	re (kib)
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2016-07-05 14:46:06 +00:00
Konstantin Belousov
46e47c4f8d Provide helper macros to detect 'non-silent SBDRY' state and to
calculate appropriate return value for stops.  Simplify the code by
using them.

Fix typo in sig_suspend_threads().  The thread which sleep must be
aborted is td2. (*)

In issignal(), when handling stopping signal for thread in
TD_SBDRY_INTR state, do not stop, this is wrong and fires assert.
This is yet another place where execution should be forced out of
SBDRY-protected region.  For such case, return -1 from issignal() and
translate it to corresponding error code in sleepq_catch_signals().
Assert that other consumers of cursig() are not affected by the new
return value. (*)

Micro-optimize, mostly VFS and VOP methods, by avoiding calling the
functions when SIGDEFERSTOP_NOP non-change is requested. (**)

Reported and tested by:	pho (*)
Requested by:	bde (**)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Approved by:	re (gjb)
2016-07-03 18:19:48 +00:00
Konstantin Belousov
d12d6a8476 Remove racy assert. The thread which changes vnode usecount from 0 to 1
does it under the vnode interlock, but the interlock is not owned by the
asserting thread.  As result, we might read increased use counter but also
still see VI_OWEINACT.

In collaboration with: nwhitehorn
Hardware donated by: IBM LTC
Sponsored by:	The FreeBSD Foundation (kib)
Approved by:	re (gjb)
2016-07-03 01:56:48 +00:00
Konstantin Belousov
e18ee4957d When a process knote was attached to the process which is already exiting,
the knote is activated immediately.  If the exit1() later activates
knotes, such knote is attempted to be activated second time.  Detect
the condition by zeroed kn_ptr.p_proc pointer, and avoid excessive
activation.

Before r302235, such knotes were removed from the knlist immediately
upon activation.

Reported by:	truckman
Sponsored by:	The FreeBSD Foundation
Approved by:	re (gjb)
2016-07-01 20:11:28 +00:00
Konstantin Belousov
364c516cff Currently the ntptime code and resettodr() are Giant-locked. In
particular, the Giant is supposed to protect against parallel
ntp_adjtime(2) invocations.  But, for instance, sys_ntp_adjtime() does
copyout(9) under Giant and then examines time_status to return syscall
result.  Since copyout(9) could sleep, the syscall result might be
inconsistent.

Another and more important issue is that if PPS is configured,
hardpps(9) is executed without any protection against the parallel
top-level code invocation. Potentially, this may result in the
inconsistent state of the ntptime state variables, but I cannot say
how serious such distortion is. The non-functional splclock() call in
sys_ntp_adjtime() protected against clock interrupts calling hardpps()
in the pre-SMP era.

Modernize the locking. A mutex protects ntptime data.  Due to the
hardpps() KPI legitimately serving from the interrupt filters (and
e.g. uart(4) does call it from filter), the lock cannot be sleepable
mutex if PPS_SYNC is defined.  Otherwise, use normal sleepable mutex
to reduce interrupt latency.

Reviewed by:	  imp, jhb
Sponsored by:	  The FreeBSD Foundation
Approved by:	  re (gjb)
Differential revision:	https://reviews.freebsd.org/D6825
2016-06-28 16:43:23 +00:00