Commit Graph

15142 Commits

Author SHA1 Message Date
John Baldwin
5d8cce1764 Initialize 'ticks' earlier in boot after 'hz' is set.
This avoids the time-warp after kthreads have started running and the
required fixup to td_slptick and td_blktick in the EARLY_AP_STARTUP
case.  Now, 'ticks' is initialized before any kthreads are created or
any context switches are performed.

Tested by:	gavin
MFC after:	2 weeks
Sponsored by:	Netflix
2016-11-22 01:02:59 +00:00
Robert Watson
1279fdafce Audit 'fd' and 'cmd' arguments to fcntl(2), and when generating BSM,
always audit the file-descriptor number and vnode information for all
fnctl(2) commands, not just locking-related ones.  This was likely an
oversight in the original adaptation of this code from XNU.

MFC after:	3 days
Sponsored by:	DARPA, AFRL
2016-11-22 00:41:24 +00:00
Gleb Smirnoff
00b5ffde8e Add flag SF_USER_READAHEAD to sendfile(2). When specified, the syscall won't
do any speculations about readahead, and use exactly the amount of readahead
specified by user.  E.g. setting SF_FLAGS(0, SF_USER_READAHEAD) will guarantee
that no readahead at all will be performed.
2016-11-17 21:36:18 +00:00
Gleb Smirnoff
5dba303d01 Use bogus_page to properly reduce number of I/Os in sendfile(2). The new
sendfile_swapin() loop works this way:

- Find first invalid page in the request.
- Do vm_pager_has_page() and get count of pages, that can be taken in
  single I/O.
- Trim valid pages from the end of the request.
- Cycle through the request and substitute to bogus_page all valid
  pages that are in the middle of the request.
- After I/O launched (pager copies array of pages into buf(9), it
  is important to restore proper page pointers with help vm_page_lookup().

Count bogus pages used and report them in sendfile stats.
2016-11-17 21:02:55 +00:00
Ruslan Bukin
6e18247a3d Fix build when no INET and INET6 in kernel config.
Submitted by:	kan
Sponsored by:	DARPA, AFRL
2016-11-17 16:13:30 +00:00
Alan Cox
7667839a7e Remove most of the code for implementing PG_CACHED pages. (This change does
not remove user-space visible fields from vm_cnt or all of the references to
cached pages from comments.  Those changes will come later.)

Reviewed by:	kib, markj
Tested by:	pho
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8497
2016-11-15 18:22:50 +00:00
Mateusz Guzik
6ce45c6ac3 cache: plug a write-only variable in cache_negative_zap_one 2016-11-15 03:43:10 +00:00
Mateusz Guzik
317cac6d5a cache: fix a race between entry removal and demotion
The negative list shrinker can demote an entry with only hotlist + neglist
locks held. On the other hand entry removal possibly sets the NCF_DVDROP
without aformentioned locks held prior to detaching it from the respective
netlist., which can lose the update made by the shrinker.

Reported and tested by:	truckman
2016-11-15 03:38:05 +00:00
Adrian Chadd
8ffa01a061 [mips] enable relbuf on mips for now to work around page aliasing in mips hardware.
Although the higher end MIPS hardware handles cache aliasing issues in
hardware, the older cores (r4k, etc) and some compile versions of the
newer cores (mips24k, mips34k, mips74k) don't have this feature.
This means we end up with some very unfortunate behaviour that was
made very obvious by some recent changes to the FFS pager by kib.

So, flip this off until we get our MIPS pmap/cache code upgraded to
handle aliased pages in software.

Discussed with: kib, bsdimp, juli
2016-11-15 01:41:45 +00:00
Adrian Chadd
0046bef85a [mips] make UMTX_CHAINS configurable at compile time.
The default (512) wastes quite a bit of space which doesn't really buy
us much on highly embedded systems which don't take a lot of locks in
parallel.

This makes it at least build time configurable so people can experiment.
2016-11-15 01:34:38 +00:00
Konstantin Belousov
ae44bb0146 Initialize reserved bytes in struct mq_attr and its 32compat
counterpart, to avoid kernel stack content leak in kmq_setattr(2)
syscall.  Also slightly simplify the checks around copyout()s.

Reported by:	Vlad Tsyrklevich <vlad902+spam@gmail.com>
PR:	214488
MFC after:	1 week
2016-11-14 13:20:10 +00:00
Konstantin Belousov
714b7df502 Provide simple mutual exclusion between mount point update and unmount.
Currently mount update keeps vfs_busy(9) reference on the mount point
during MNT_UPDATE VFS_MOUNT() vfsops call.  This already provides the
exclusion, but is problematic for filesystems which need to perform
namei(9) during VFS_MOUNT(MNT_UPDATE) operations, e.g. to refresh
mnt_from path, because namei(9) must not be called while the
vfs_busy(9) reference is owned.

Check for MNT_UPDATE flag before setting MNTK_UNMOUNT, and for
MNTK_UNMOUNT before entering innards of vfs_domount_update(), failing
syscalls with EBUSY if conflict is detected.  Keep vfs_busy(9)
reference around VFS_MOUNT(MNT_UPDATE) calls still to not change VFS
KPI.

In the update path in ffs_mount(), drop vfs_busy() reference around
namei(), which is now safe due to unmount never executing in parallel
with VFS_MOUNT(MNT_UPDATE), and which avoids the deadlock.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-11-13 21:49:51 +00:00
Konstantin Belousov
9eb8f495b8 Move common cleanup code into helper.
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-11-13 21:39:55 +00:00
Justin Hibbits
6487a709f3 Add two new ddb commands: show device/show all devices
Shows several useful pieces of information from the device including the softc
and ivars pointers.
2016-11-13 00:46:11 +00:00
John Baldwin
892f0ab0ab Allow scheduling during early boot.
- Send IPI wakeups once SMP is started even if cold is true.
- Permit preemptions when cold is true.

These changes are needed for EARLY_AP_STARTUP.

MFC after:	2 weeks
Sponsored by:	Netflix
2016-11-12 00:23:09 +00:00
John Baldwin
a6b91f0f45 Don't place threads on the run queue after waking up other CPUs.
The other CPU might resume and see a still-empty runq and go back to
sleep before sched_add() adds the thread to the runq.  This results
in a lost wakeup and a potential hang if the system is otherwise
completely idle.

The race originated due to a micro-optimization (my fault) in 4BSD in
that it avoided putting a thread on the run queue if the scheduler was
going to preempt to the new thread.  To avoid complexity while fixing
this race, just drop this optimization.  4BSD now always sets the
"owepreempt" flag when a preemption is warranted and defers the actual
preemption to the thread_unlock of the caller the same as ULE.

MFC after:	2 weeks
Sponsored by:	Netflix
2016-11-12 00:14:13 +00:00
Bryan Drewery
28323add09 Fix improper use of "its".
Sponsored by:	Dell EMC Isilon
2016-11-08 23:59:41 +00:00
Konstantin Belousov
9a639daf77 Tweaks for the buffer pager.
Pass current thread credentials instead of NOCRED.
Only allow unmapped buffers for filesystem which proclaimed the support.

For all filesystems which currently use buffer pager (UFS, msdosfs and
cd9660), the changes are effectively nop.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-11-08 10:10:55 +00:00
Konstantin Belousov
9bd4f0a2c6 vn_fullpath1() checked VV_ROOT and then unreferenced
vp->v_mount->mnt_vnodecovered unlocked.  This allowed unmount to race.
Lock vnode after we noticed the VV_ROOT flag.  See comments for
explanation why unlocked check for the flag is considered safe.

Reported and tested by:	avg
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-11-07 10:55:56 +00:00
Konstantin Belousov
75409ce1dd Remove remnants of the recursive sleep support. Instead assert that
we never try to sleep while the thread is on a sleepqueue.

Reviewed by:	jhb
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D8422
2016-11-02 20:57:20 +00:00
Konstantin Belousov
7359fdcf5f Allow some dotdot lookups in capability mode.
If dotdot lookup does not escape from the file descriptor passed as
the lookup root, we can allow the component traversal.  Track the
directories traversed, and check the result of dotdot lookup against
the recorded list of the directory vnodes.

Dotdot lookups are enabled by sysctl vfs.lookup_cap_dotdot, currently
disabled by default until more verification of the approach is done.

Disallow non-local filesystems for dotdot, since remote server might
conspire with the local process to allow it to escape the namespace.
This might be too cautious, provide the knob
vfs.lookup_cap_dotdot_nonlocal to override as well.

Idea by:	rwatson
Discussed with:	emaste, jonathan, rwatson
Reviewed by:	mjg (previous version)
Tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 week
Differential revision:	https://reviews.freebsd.org/D8110
2016-11-02 12:43:15 +00:00
Konstantin Belousov
1bf6a0900d Remove tautological casts.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-11-02 12:10:39 +00:00
Konstantin Belousov
ec84693535 Style fixes.
Discussed with:	emaste
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-11-02 12:02:31 +00:00
Edward Tomasz Napierala
53ae7e833c Fix getfsstat(2) with MNT_WAIT to not skip filesystems that are in the
process of being unmounted.  Previously it would skip them, even if the
unmount eventually failed eg due to the filesystem being busy.

This behaviour broke autounmountd(8) - if you tried to manually unmount
a mounted filesystem, using 'automount -u', and the autounmountd attempted
to refresh the filesystem list in that very moment, it would conclude that
the filesystem got unmounted and not try to unmount it afterwards.

Reviewed by:	kib@
Tested by:	pho@
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D8030
2016-11-02 09:43:19 +00:00
Conrad Meyer
8532d381a9 Add BUF_TRACKING and FULL_BUF_TRACKING buffer debugging
Upstream the BUF_TRACKING and FULL_BUF_TRACKING buffer debugging code.
This can be handy in tracking down what code touched hung bios and bufs
last. The full history is especially useful, but adds enough bloat that
it shouldn't be enabled in release builds.

Function names (or arbitrary string constants) are tracked in a
fixed-size ring in bufs. Bios gain a pointer to the upper buf for
tracking. SCSI CCBs gain a pointer to the upper bio for tracking.

Reviewed by:	markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8366
2016-10-31 23:09:52 +00:00
Mark Johnston
ac91917211 Fix WITNESS hints for pagequeue locks.
MFC after:	1 week
2016-10-29 20:01:48 +00:00
Edward Tomasz Napierala
6eeff7a7b2 Fix getfsstat(2) handling of flags. The 'flags' argument is an enum,
not a bitfield. For the intended usage - being passed either MNT_WAIT,
or MNT_NOWAIT - this shouldn't introduce any changes in behaviour.

Reviewed by:	jhb@
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D8373
2016-10-29 12:38:30 +00:00
Konstantin Belousov
c39baa7480 Generalize UFS buffer pager to allow it serving other filesystems
which also use buffer cache.

Most important addition to the code is the handling of filesystems
where the block size is less than the machine page size, which might
require reading several buffers to validate single page.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-10-28 11:43:59 +00:00
Marcel Moolenaar
07f862a769 Include <stdarg.h> instead of <machine/stdarg.h> when compiled as
part of libsbuf. The former is the standard header, and allows us
to compile libsbuf on macOS/linux.
2016-10-24 18:03:04 +00:00
Konstantin Belousov
835c2787be Handle broadcast NMIs.
On several Intel chipsets, diagnostic NMIs sent from BMC or NMIs
reporting hardware errors are broadcasted to all CPUs.

When kernel is configured to enter kdb on NMI, the outcome is
problematic, because each CPU tries to enter kdb.  All CPUs are
executing NMI handlers, which set the latches disabling the nested NMI
delivery; this means that stop_cpus_hard(), used by kdb_enter() to
stop other cpus by broadcasting IPI_STOP_HARD NMI, cannot work.  One
indication of this is the harmless but annoying diagnostic "timeout
stopping cpus".

Much more harming behaviour is that because all CPUs try to enter kdb,
and if ddb is used as debugger, all CPUs issue prompt on console and
race for the input, not to mention the simultaneous use of the ddb
shared state.

Try to fix this by introducing a pseudo-lock for simultaneous attempts
to handle NMIs.  If one core happens to enter NMI trap handler, other
cores see it and simulate reception of the IPI_STOP_HARD.  More,
generic_stop_cpus() avoids sending IPI_STOP_HARD and avoids waiting
for the acknowledgement, relying on the nmi handler on other cores
suspending and then restarting the CPU.

Since it is impossible to detect at runtime whether some stray NMI is
broadcast or unicast, add a knob for administrator (really developer)
to configure debugging NMI handling mode.

The updated patch was debugged with the help from Andrey Gapon (avg)
and discussed with him.

Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D8249
2016-10-24 16:40:27 +00:00
Konstantin Belousov
55ee7a4c5f In the fueword64(9) wrapper for architectures which do not implemented
native fueword64(9) still, use proper type for local where fuword64()
result is stored.

Note that fueword64() is unused in the tree.

Submitted by:	Chunhui He <hchunhui@mail.ustc.edu.cn>
PR:	212520
MFC after:	1 week
2016-10-23 11:23:17 +00:00
Conrad Meyer
8798ef0679 ddb(4): Add sleepchains to "show allchains"
Reported by:	markj
Reviewed by:	markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8320
2016-10-22 18:02:20 +00:00
Hiren Panchasara
9d71a3975e Rework r306337.
In sendit(), if mp->msg_control is present, then in sockargs() we are
allocating mbuf to store mp->msg_control. Later in kern_sendit(), call
to getsock_cap(), will check validity of file pointer passed, if this
fails EBADF is returned but mbuf allocated in sockargs() is not freed.
Made code changes to free the same.

Since freeing control mbuf in sendit() after checking (control != NULL)
may lead to double freeing of control mbuf in sendit(), we can free
control mbuf in kern_sendit() if there are any errors in the routine.

Submitted by:		    Lohith Bellad <lohith.bellad@me.com>
Reviewed by:		    glebius
MFC after:		    3 weeks
Differential Revision:	    https://reviews.freebsd.org/D8152
2016-10-21 18:27:30 +00:00
Mariusz Zaborski
4b83a77606 capsicum: perform copyout without the fildesc lock held in sys_cap_ioctls_get
Reviewed by:	pjd
2016-10-21 16:12:23 +00:00
Mateusz Guzik
bb697a20d7 cache: fix up a corner case in r307650
If no negative entry is found on the last list, the ncp pointer will be
left uninitialized and a non-null value will make the function assume an
entry was found.

Fix the problem by initializing to NULL on entry.

Reported by:	glebius
2016-10-20 19:55:50 +00:00
Kevin Lo
61f481fb7e Remove register keyword.
Reviewed by:	kib
2016-10-20 01:21:10 +00:00
Kevin Lo
7c68685366 Remove a sentence about putting initialization in init_proc.c or kern_proc.c
and useless comment.

Reviewed by:	kib
2016-10-20 01:19:37 +00:00
Sean Bruno
026204b4c6 Resolve whitespace diff to NextBSD.
Check to see that the taskqueue thread count requires us to acutally
iterate over the thread count to bind to cpus.

Submitted by:	mmacy@nextbsd.org
2016-10-19 21:01:24 +00:00
Mateusz Guzik
53dc58f2dc Mark a bunch of mpsafe sysctls as such.
This gives me a sysctl Giant-free buildworld.
2016-10-19 19:42:01 +00:00
Mateusz Guzik
a45a1a25b8 cache: split negative entry LRU into multiple lists
This splits the ncneg_mtx lock while preserving the hit ratio at least
during buildworld.

Create N dedicated lists for new negative entries.

Entries with at least one hit get promoted to the hot list, where they
get requeued every M hits.

Shrinking demotes one hot entry and performs a round-robin shrinking of
regular lists.

Reviewed by:	kib
2016-10-19 18:29:52 +00:00
Sean Bruno
abf38392c6 Assert that we're assigning a non-null taskqueue.
ref: 535865d02c

Fix cpu assignment by assuring stride is non-zero, assert that all tasks
have a valid taskqueue.
ref: db39817623

Start cpu assignment from zero.
ref: d99d39b6b6

Submitted by:	mmacy@nextbsd.org
2016-10-18 14:00:26 +00:00
Sean Bruno
12d1b8c9f3 Ensure that tasks with a specific cpu set prior to smp starting get
re-attached to a thread running on that cpu.

ref: fcc20e306b

Submitted by:	mmacy@nextbsd.org
2016-10-18 13:55:34 +00:00
Sean Bruno
dc35f36560 Tell gtask to what we've been bound.
ref: 54414984cf

Submitted by:	mmacy@nextbsd.org
2016-10-18 13:16:27 +00:00
Ed Maste
9e62195361 makesyscalls.sh: remove trailing space on the "created from" line
In r10905 and r10906 makesyscalls was modified to avoid emitting a
literal $Id$ string in the generated file, with:

    gsub("[$]Id: ", "", $0)
    gsub(" [$]", "", $0)

Then r11294 added some functionality and also tried to address the $Id$
problem in a different way, by removing every $:

    sed -e 's/\$//g ...

This rendered the gsub infeffective. The gsub was later updated to
track the $Id$ -> $FreeBSD$ switch, even though it did not do anything.

Revert the addition of the s/\$//g, and update the gsub to keep the
resulting format the same.

Discussed with:	bde
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2016-10-17 13:52:24 +00:00
Hans Petter Selasky
d3bf5efc1f Fix device delete child function.
When detaching device trees parent devices must be detached prior to
detaching its children. This is because parent devices can have
pointers to the child devices in their softcs which are not
invalidated by device_delete_child(). This can cause use after free
issues and panic().

Device drivers implementing trees, must ensure its detach function
detaches or deletes all its children before returning.

While at it remove now redundant device_detach() calls before
device_delete_child() and device_delete_children(), mostly in
the USB controller drivers.

Tested by:		Jan Henrik Sylvester <me@janh.de>
Reviewed by:		jhb
Differential Revision:	https://reviews.freebsd.org/D8070
MFC after:		2 weeks
2016-10-17 10:20:38 +00:00
Konstantin Belousov
5975e53d40 Fix a race in vm_page_busy_sleep(9).
Suppose that we have an exclusively busy page, and a thread which can
accept shared-busy page.  In this case, typical code waiting for the
page xbusy state to pass is
again:
	VM_OBJECT_WLOCK(object);
	...
	if (vm_page_xbusied(m)) {
		vm_page_lock(m);
 		VM_OBJECT_WUNLOCK(object);    <---1
		vm_page_busy_sleep(p, "vmopax");
 		goto again;
	}

Suppose that the xbusy state owner locked the object, unbusied the
page and unlocked the object after we are at the line [1], but before we
executed the load of the busy_lock word in vm_page_busy_sleep().  If it
happens that there is still no waiters recorded for the busy state,
the xbusy owner did not acquired the page lock, so it proceeded.

More, suppose that some other thread happen to share-busy the page
after xbusy state was relinquished but before the m->busy_lock is read
in vm_page_busy_sleep().  Again, that thread only needs vm_object lock
to proceed.  Then, vm_page_busy_sleep() reads busy_lock value equal to
the VPB_SHARERS_WORD(1).

In this case, all tests in vm_page_busy_sleep(9) pass and we are going
to sleep, despite the page being share-busied.

Update check for m->busy_lock == VPB_UNBUSIED in vm_page_busy_sleep(9)
to also accept shared-busy state if we only wait for the xbusy state to
pass.

Merge sequential if()s with the same 'then' clause in
vm_page_busy_sleep().

Note that the current code does not share-busy pages from parallel
threads, the only way to have more that one sbusy owner is right now
is to recurse.

Reported and tested by:	pho (previous version)
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D8196
2016-10-13 14:41:05 +00:00
Conrad Meyer
d9ce8a41ea kern_linker: Handle module-loading failures in preloaded .ko files
The runtime kernel loader, linker_load_file, unloads kernel files that
failed to load all of their modules. For consistency, treat preloaded
(loader.conf loaded) kernel files in the same way.

Reviewed by:	kib
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8200
2016-10-13 02:06:23 +00:00
Ed Maste
2a059700b6 Use correct size type in do_setopt_accept_filter
Submitted by:	ecturt@gmail.com
2016-10-12 00:56:49 +00:00
Oleksandr Tymoshenko
609b0fe966 INTRNG - fix MSI/MSIX release path
Use isrc in attached MSI data structure instead of using map's
isrc directly. map's isrc is set to NULL on IRQ deactivation
which happens prior to pci_release_msi so MSI_RELEASE_MSI
receives array of NULLs

Reviewed by:	mmel
Differential Revision:	https://reviews.freebsd.org/D8206
2016-10-11 17:00:29 +00:00
Sean Bruno
1ee17b070d Fix bug where malloc(.., M_NOWAIT) return value is not checked, Change to
M_WAITOK and move outside the mutex

Submitted by:	shurd
Reviewed by:	mmacy@nextbsd.org
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D7649
2016-10-11 14:08:53 +00:00