Commit Graph

15875 Commits

Author SHA1 Message Date
Pedro F. Giffuni
7aa472731e ANSI-fy exec_shell_imgact().
Fix a stray space while here.

PR:	223317
MFC after:	3 days
2017-11-04 15:41:08 +00:00
Konstantin Belousov
30c438723d Convert explicit panic() call to assert.
Based on github pull request:	#113
Submitted by:	pmarillo@github
MFC after:	1 week
2017-11-04 10:49:34 +00:00
Mateusz Guzik
a2c36a24b6 Special-case pget lookups where pid == curproc->pid
Saves on allproc_lock acquires during buildworld, poudriere etc.

Submitted by:	Pawel Biernacki <pawel.biernacki@gmail.com>
Sponsored by:	Mysterious Code Ltd.
Differential Revision:	D12929
2017-11-03 19:21:36 +00:00
Mateusz Guzik
ac850e5a8d namecache: fix .. check broken after r324378
wtf by:	mjg
Diagnosed by:	avg
2017-11-01 08:40:04 +00:00
Mateusz Guzik
59e260f860 Fixup r325264, take #2
whack an unused variable
2017-11-01 06:46:58 +00:00
Mateusz Guzik
5644fffa25 namecache: ncnegfactor 16 -> 12
It is used on each new entry addition to decide whether to whack an existing
negative entry in order to prevent a blow out in size, but the parameter was
set years ago and never revisited.

Building with poudriere results in about 400 evictions per second which
unnecessarily grab entries from the hot list.

With the new parameter there are next to no evictions of the sort.
2017-11-01 06:45:41 +00:00
Mateusz Guzik
5d03f1e11f Fixup r325264
Accidentally committed an incomplete diff.
2017-11-01 06:38:46 +00:00
Mateusz Guzik
c0b5261b55 Save on loginclass list locking by checking if caller already uses the struct 2017-11-01 06:12:14 +00:00
Mateusz Guzik
5949c7e504 Save on uihash table locking by checking if the caller already uses the struct
In particular with poudriere this saves about 90% of lookups.
2017-11-01 05:51:20 +00:00
John Baldwin
e012fe34cb Discard the correct thread event reported for a ptrace stop.
When multiple threads wish to report a tracing event to a debugger,
both threads call ptracestop() and one thread will win the race to be
the reporting thread (p->p_xthread).  The debugger uses PT_LWPINFO
with the process ID to determine which thread / LWP is reporting an
event and the details of that event.  This event is cleared as a side
effect of the subsequent ptrace event that resumed the process
(PT_CONTINUE, PT_STEP, etc.).  However, ptrace() was clearing the
event identified by the LWP ID passed to the resume request even if
that wasn't the 'p_xthread'.  This could result in clearing an event
that had not yet been observed by the debugger and leaving the
existing event for 'p_thread' pending so that it was reported a second
time.

Specifically, if the debugger stopped due to a software breakpoint in
one thread, but then switched to another thread that was used to
resume (e.g. if the user switched to a different thread and issued a
step), the resume request (PT_STEP) cleared a pending event (if any)
for the thread being stepped.  However, the process immediately
stopped and the first thread reported it's breakpoint event a second
time.  The debugger decremented the PC for "both" breakpoint events
which resulted in the PC now pointing into the middle of an
instruction (on x86) and a SIGILL fault when the process was resumed a
second time.

To fix, always clear the pending event for 'p_xthread' when resuming a
process.  ptrace() still honors the requested LWP ID when enabling
single-stepping (PT_STEP) or setting a different PC (PT_CONTINUE).

Reported by:	GDB testsuite (gdb.threads/continue-pending-status.exp)
Reviewed by:	kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D12794
2017-10-27 03:16:19 +00:00
Alan Somers
df485bdb3c Fix aio_suspend in 32-bit emulation
An off-by-one error has been present since the system call was first present
in 185878.  It additionally became a memory corruption bug after change
324941.  The failure is actually revealed by our existing AIO tests.
However, apparently nobody's been running those in 32-bit emulation mode.

Reported by:	Coverity, cem
CID:		1382114
MFC after:	18 days
X-MFC-With:	324941
Sponsored by:	Spectra Logic Corp
2017-10-26 19:45:15 +00:00
Warner Losh
7d41b6f078 Handle RB_POWERCYCLE in the MI part of the kernel
Signal init with SIGWINCH in shutdown_nice for RB_POWERCYCLE.

Sponsored by: Netflix
2017-10-25 15:30:44 +00:00
Mark Johnston
64a16434d8 Add support for compressed kernel dumps.
When using a kernel built with the GZIO config option, dumpon -z can be
used to configure gzip compression using the in-kernel copy of zlib.
This is useful on systems with large amounts of RAM, which require a
correspondingly large dump device. Recovery of compressed dumps is also
faster since fewer bytes need to be copied from the dump device.

Because we have no way of knowing the final size of a compressed dump
until it is written, the kernel will always attempt to dump when
compression is configured, regardless of the dump device size. If the
dump is aborted because we run out of space, an error is reported on
the console.

savecore(8) is modified to handle compressed dumps and save them to
vmcore.<index>.gz, as it does when given the -z option.

A new rc.conf variable, dumpon_flags, is added. Its value is added to
the boot-time dumpon(8) invocation that occurs when a dump device is
configured in rc.conf.

Reviewed by:	cem (earlier version)
Discussed with:	def, rgrimes
Relnotes:	yes
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D11723
2017-10-25 00:51:00 +00:00
Alan Somers
913b932900 Remove artificial restriction on lio_listio's operation count
In r322258 I made p1003_1b.aio_listio_max a tunable. However, further
investigation shows that there was never any good reason for that limit to
exist in the first place. It's used in two completely different ways:

* To size a UMA zone, which globally limits the number of concurrent
  aio_suspend calls.

* To artifically limit the number of operations in a single lio_listio call.
  There doesn't seem to be any memory allocation associated with this limit.

This change does two things:

* Properly names aio_suspend's UMA zone, and sizes it based on a new constant.

* Eliminates the artifical restriction on lio_listio. Instead, lio_listio
  calls will now be limited by the more generous max_aio_queue_per_proc. The
  old p1003_1b.aio_listio_max is now an alias for
  vfs.aio.max_aio_queue_per_proc, so sysconf(3) will still work with
  _SC_AIO_LISTIO_MAX.

Reported by:	bde
Reviewed by:	jhb
MFC after:	3 weeks
Sponsored by:	Spectra Logic Corp
Differential Revision:	https://reviews.freebsd.org/D12120
2017-10-23 23:12:01 +00:00
Mateusz Guzik
5132933a08 Bump WITNESS_PENDLIST to accomodate sleepq chain bump.
Reported by:	ngie
2017-10-23 01:00:35 +00:00
Mateusz Guzik
9e68989764 Make the sleepq chain hash size configurable per-arch and bump on amd64.
While here cache-align chains.

This shortens longest found chain during poudriere -j 80 from 32 to 16.

Pushing this higher up will probably require allocation on boot.
2017-10-22 20:43:50 +00:00
Mateusz Guzik
5a17c5524f sdt: make all sdt probe sites test one variable
This saves on cache misses at the expense of a slight grow of .text.

Note this is a bandaid for lack of hotpatching.

Discussed with:	markj
2017-10-22 20:22:23 +00:00
Mateusz Guzik
614e1868d6 Change kdb_active type to u_char.
Fixes warnings from gcc and keeps the small size. Perhaps nesting should be moved
to another variablle.

Reported by:	ngie
2017-10-22 13:42:56 +00:00
Enji Cooper
f2374e0cc5 Clean up trailing whitespace in kdb_thr_ctx(..)
MFC after:	1 week
2017-10-22 12:12:52 +00:00
Konstantin Belousov
456a73ef01 Remove the support for mknod(S_IFMT), which created dummy vnodes with
VBAD type.

FFS ffs_write() VOP catches such vnodes and panics, other VOPs do not
check for the type and their behaviour is really undefined.  The
comment claims that this support was done for 'badsect' to flag bad
sectors, we do not have such facility in kernel anyway.

Reported by:	Dmitry Vyukov <dvyukov@google.com>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-22 08:11:45 +00:00
Mateusz Guzik
be49509eea mtx: implement thread lock fastpath
MFC after:	1 week
2017-10-21 22:40:09 +00:00
Michal Meloun
904d8c492f Add AT_HWCAP2 ELF auxiliary vector.
- allocate value for new AT_HWCAP2 auxiliary vector on all platforms.
 - expand 'struct sysentvec' by new 'u_long *sv_hwcap2', in exactly
   same way as for AT_HWCAP.

MFC after:	1 month
Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D12699
2017-10-21 12:05:01 +00:00
Mark Johnston
a3e8a25a52 Avoid the nbp lookup in the final loop iteration in flushbuflist().
The end of the loop must re-lookup the next buf since the bufobj lock
is dropped in the loop body. If the lookup fails, the loop is restarted.
This mechanism non-obviously also terminates the loop when the end of
the buf list is reached. Split up the two loops termination cases to
make the code a bit less fragile. No functional change intended.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D12730
2017-10-20 14:56:13 +00:00
Mateusz Guzik
62bf13cbf9 mtx: fix up UP build after r324778
Reported by:	Michael Butler
2017-10-20 14:04:01 +00:00
Mateusz Guzik
c48a94251d Mark kdb_active as __read_frequently and switch to bool to eat less space. 2017-10-20 04:02:53 +00:00
Mateusz Guzik
2567807c32 rwlock: reduce lockstat branches in the slowpath
MFC after:	1 week
2017-10-20 03:32:42 +00:00
Mateusz Guzik
cbc2d7c218 mtx: stop testing SCHEDULER_STOPPED in kabi funcs for spin mutexes
There is nothing panic-breaking to do in the unlock case and the lock
case will fallback to the slow path doing the check already.

MFC after:	1 week
2017-10-20 00:34:25 +00:00
Mateusz Guzik
0d74fe267b mtx: clean up locking spin mutexes
1) shorten the fast path by pushing the lockstat probe to the slow path
2) test for kernel panic only after it turns out we will have to spin,
in particular test only after we know we are not recursing

MFC after:	1 week
2017-10-20 00:30:35 +00:00
Mateusz Guzik
9b8de76beb sysctl: only take mem lock if oldlen is > 4 * PAGE_SIZE
The previous limit of just one page is hit by ps.

The entire mechanism should be reworked, if not whacked. It seems the intent
is to reduce kernel dos-ability - some handlers wire the amount of memory
passed here. Handlers should probably stop wiring in the first place or in
the worst case indicate they are doing so so that the check is done only if
necessary. It should also probably be a counter, not a lock.

MFC after:	1 week
2017-10-19 01:38:31 +00:00
Mateusz Guzik
e6b645ae89 execve: avoid one proc lock/unlock trip unless PTRACE_EXEC is set
MFC after:	1 week
2017-10-19 00:46:15 +00:00
Mateusz Guzik
80a2397a38 Tidy up pmc support at execve.
The proc-specific check is inherently racy, so the code can just unlock
beforehand.

MFC after:	1 week
2017-10-19 00:38:14 +00:00
Mateusz Guzik
cb1c79008e sysvsem: check if semu_list has anything on it before grabbing the lock
This should get a process-specific support instead.

MFC after:	1 week
2017-10-19 00:31:00 +00:00
Mateusz Guzik
c69a1a50cd Don't take Giant for SMP status and cpu topology sysctls.
Not only this lock doesn't play any role here, dirtying it slows down
other things a little bit as giant-held checks (e.g. DROP_GIANT) are
spread all over the kernel.

MFC after:	1 week
2017-10-18 22:00:44 +00:00
Mark Johnston
46fcd1af63 Move kernel dump offset tracking into MI code.
All of the kernel dump implementations keep track of the current offset
("dumplo") within the dump device. However, except for textdumps, they
all write the dump sequentially, so we can reduce code duplication by
having the MI code keep track of the current offset. The new
dump_append() API can be used to write at the current offset.

This is needed to implement support for kernel dump compression in the
MI kernel dump code.

Also simplify dump_encrypted_write() somewhat: use dump_write() instead
of duplicating its bounds checks, and get rid of the redundant offset
tracking.

Reviewed by:	cem
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D11722
2017-10-18 15:38:05 +00:00
Brooks Davis
39ed7f250a Remove mbpool(9) now that it has no consumers.
mbpool existed to support NICs with memory interfaces and all remaining
comsumers were removed earlier this year with NATM.

Reviewed by:	jhb
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D10513
2017-10-18 00:18:03 +00:00
Mark Johnston
fa00affd18 Fix a racy VI_DOOMED check in MNT_VNODE_FOREACH_ALL().
MNT_VNODE_FOREACH_ALL() is supposed to avoid returning doomed vnodes,
but the VI_DOOMED check it used was done without the vnode interlock
held, so it could race with a concurrent vgone().

Submitted by:	Don Morris <don.morris@isilon.com>
Reviewed by:	kib, mckusick
MFC after:	1 week
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D12704
2017-10-17 19:41:45 +00:00
Andriy Voskoboinyk
6623429867 mbuf(9): unbreak m_fragment()
- Fix it by replacing m_cat() with m_prev->m_next = m_new
(m_cat() will try to append data - as a result, there will be no
fragmentation).
- Move some constants out of the loop.

Was previously tested with D4077.

Differential Revision:	https://reviews.freebsd.org/D4090
2017-10-16 21:46:11 +00:00
Konstantin Belousov
e9445808a8 Re-evaluate thread' signal mask after ptracestop().
The stop drops process lock, which allows the signal mask to be
changed and our selected signal might become blocked, i.e. should be
returned to the process queue instead of delivery.

Also, for the existing check of the process no longer having an
attached debugger, we should not loose the signal, but requeue it.

Reported and tested by:	bdrewery
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-16 20:21:51 +00:00
Konstantin Belousov
cd735d8f5a Improve assertion that an ignored or blocked signal is not delivered.
Split two conditions into separate asserts.  Print additional details,
like the signal number and action value.

Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-16 20:15:19 +00:00
Konstantin Belousov
0167b33b81 Style.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-16 20:11:29 +00:00
Matt Joras
0d8e04054e Properly reset the fields in clean_unrhdr.
In r324542 I neglected to reset the first and last fields of struct
unrhdr. This causes a tmpfs to fail the unr(9) consistency checks with
DIAGNOSTIC on. Fix this by resetting the fields by calling init_unrhdr.
While here, change a loop to use TAILQ_FOREACH_SAFE since it is more
readable and equally fast.

Reported by:	David Wolfskill <david@catwhisker.org>
Approved by:	rstone (mentor)
Sponsored by:	Dell EMC Isilon
2017-10-16 16:14:50 +00:00
Tijl Coosemans
11ce4d9f39 When a Linux program tries to access a /path the kernel tries
/compat/linux/path before /path.  Stop following symbolic links when
looking up /compat/linux/path so dead symbolic links aren't ignored.
This allows syscalls like readlink(2) and lstat(2) to work on such links.
And open(2) will return an error now instead of trying /path.
2017-10-15 18:53:21 +00:00
Mateusz Guzik
e280ce465b mtx: fix up owner_mtx after r324609
Now that MTX_UNOWNED is 0 the test was alwayas false.
2017-10-14 00:47:30 +00:00
Alan Cox
41bf90bb78 Address two problems with sendfile(..., SF_NOCACHE) and apply one
"optimization".  First, sendfile(..., SF_NOCACHE) frees pages without
checking whether those pages are mapped.  This can leave the system
with mappings to free or repurposed pages.  Second, a page can be
busied between the time of the current busy test and acquiring the
object lock.  Essentially, the test performed before the object lock
is acquired can only be regarded as an optimization to short-circuit
further work on the page.  It cannot, however, be relied upon to prove
that it is safe to free the page.  Third, when sendfile(..., SF_NOCACHE)
was originally implemented, vm_page_deactivate_noreuse() did not yet
exist.  Use vm_page_deactivate_noreuse() instead of vm_page_deactivate(),
because it comes closer to freeing the page.

In collaboration with:	glebius
Discussed with:	gallatin, kib, markj
X-MFC after:	r324448
2017-10-13 16:31:50 +00:00
Andriy Gapon
f92e3400bc remove process and jail directory machinations from dounmount
The manipulations done by mountcheckdirs() are not that useful during
the unmount, they can bring about unexpected security consequences.

Thic change effectively reverts the change in r73241.

The change also allows to simplify the handling of rootvnode global
variable.

Discussed with:	mckusick, mjg, kib
Reviewed by:	trasz
MFC after:	1 month
Differential Revision: https://reviews.freebsd.org/D12366
2017-10-13 09:42:05 +00:00
Ed Maste
05e47051a2 regen init_sysent.c r324560 2017-10-12 15:48:37 +00:00
Ed Maste
5532aa9bb4 allow posix_fallocate in capability mode
posix_fallocate is logically equivalent to writing zero blocks to the
desired file size and there is no reason to prevent calling it in
capability mode. posix_fallocate already checked for the CAP_WRITE
right, so we merely need to list it in capabilities.conf.

Reviewed by:	allanjude
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D12640
2017-10-12 15:45:53 +00:00
Matt Joras
333dcaa498 Add clearing function for unr(9).
Previously before you could call unrhdr_delete you needed to
individually free every allocated unit. It is useful to be able to tear
down the unr without having to go through this process, as it is
significantly faster than freeing the individual units.

Reviewed by:	cem, lidl
Approved by:	rstone (mentor)
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D12591
2017-10-11 21:53:50 +00:00
Konstantin Belousov
70e3b262d1 The th_bintime, th_microtime and th_nanotime members of the timehand
all cache the last system time (uptime + boottime).  Only the format
differs.  Do not re-calculate the bintime and simply use the value
used to calculate the microtime and nanotime.

Group all the updates under the relevant comment.  Remove obsoleted
XXX part.

Submitted by:	Sebastian Huber <sebastian.huber@embedded-brains.de>
MFC after:	1 week
2017-10-11 11:03:11 +00:00
Sean Bruno
1f9916ed08 match sendfile() error handling to send().
Sendfile() should match the error checking order of send() which
is currently:

SBS_CANTSENDMORE
so_error
SS_ISCONNECTED

Submitted by:	Jason Eggleston <jason@eggnet.com>
Reviewed by:	glebius
MFC after:	2 weeks
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D12633
2017-10-10 22:21:05 +00:00
Sean Bruno
009ad5724d Revert r324405 at the request of the submitter pending better solution.
Submitted by:	Jason Eggleston <jason@eggnet.com>
Sponsored by:	Limelight Networks
2017-10-10 00:32:21 +00:00
Gleb Smirnoff
9c82bec42d Improvements to sendfile(2) mbuf free routine.
o Fall back to default m_ext free mech, using function pointer in
  m_ext_free, and remove sf_ext_free() called directly from mbuf code.
  Testing on modern CPUs showed no regression.
o Provide internally used flag EXT_FLAG_SYNC, to mark that I/O uses
  SF_SYNC flag.  Lack of the flag allows us not to dereference
  ext_arg2, saving from a cache line miss.
o Create function sendfile_free_page() that later will be used, for
  multi-page mbufs.  For now compiler will inline it into
  sendfile_free_mext().

In collaboration with:	gallatin
Differential Revision:	https://reviews.freebsd.org/D12615
2017-10-09 21:06:16 +00:00
Gleb Smirnoff
07e87a1d55 In mb_dupcl() don't copy full m_ext, to avoid cache miss. Respectively,
in mb_free_ext() always use fields from the original refcount holding
mbuf (see. r296242) mbuf.  Cuts another cache miss from mb_free_ext().

However, treat EXT_EXTREF mbufs differently, since they are different -
they don't have a refcount holding mbuf.

Provide longer comments in m_ext declaration to explain this change
and change from r296242.

In collaboration with:	gallatin
Differential Revision:	https://reviews.freebsd.org/D12615
2017-10-09 20:51:58 +00:00
Gleb Smirnoff
e8fd18f306 Shorten list of arguments to mbuf external storage freeing function.
All of these arguments are stored in m_ext, so there is no reason
to pass them in the argument list.  Not all functions need the second
argument, some don't even need the first one.  The second argument
lives in next cache line, so not dereferencing it is a performance
gain.  This was discovered in sendfile(2), which will be covered by
next commits.

The second goal of this commit is to bring even more flexibility
to m_ext mbufs, allowing to create more fields in m_ext, opaque to
the generic mbuf code, and potentially set and dereferenced by
subsystems.

Reviewed by:	gallatin, kbowling
Differential Revision:	https://reviews.freebsd.org/D12615
2017-10-09 20:35:31 +00:00
Hans Petter Selasky
32b413d7f0 When showing the sleepqueues from the in-kernel debugger,
properly dump all the sendqueues and not just the first one

History:
It appears that in the commit which introduced the code,
r165272, the array indexes of "sq_blocked[0]" and "td_name[i]"
were interchanged. In r180927 "td_name[i]" was corrected to
"td_name[0]", but "sq_blocked[0]" was left unchanged.

PR:		222624
Discussed with:	kmacy @
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2017-10-09 18:33:29 +00:00
Alan Cox
03ca213761 The recent change to initialization of blists (r324420) relied on '-1'
appearing only where the code explicitly set it, but since much of the
data was not initialized, '-1' appeared other places too, and led to
panics.  Clear the allocated data before initializing nonzero values by
allocating with M_ZERO.

Submitted by:	Doug Moore <dougm@rice.edu>
Reported by:	Oleg V. Nauman <oleg@theweb.org.ua>, cy
Tested by:	Oleg V. Nauman <oleg@theweb.org.ua>
MFC after:	1 week
X-MFC with:	r324420
Differential Revision:	https://reviews.freebsd.org/D12627
2017-10-09 18:19:06 +00:00
Alan Cox
8eefcd407b The blst_radix_init function has two purposes - to compute the number of
nodes to allocate for the blist, and to initialize them.  The computation
can be done much more quickly by identifying the terminating node, if any,
at every level of the tree and then summing the number of nodes at each
level that precedes the topmost terminator.  The initialization can also be
done quickly, since settings at the root mark the tree as all-allocated, and
only a few terminator nodes need to be marked in the rest of the tree.
Eliminate blst_radix_init, and perform its two functions more simply in
blist_create.

The allocation of the blist takes places in two pieces, but there's no good
reason to do so, when a single allocation is sufficient, and simpler.
Allocate the blist struct, and the array of nodes associated with it, with a
single allocation.

Submitted by:	Doug Moore <dougm@rice.edu>
Reviewed by:	markj (an earlier version)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D11968
2017-10-08 22:17:39 +00:00
Ian Lepore
7f92689427 Add eventhandler notifications for newbus device attach/detach.
The detach case is slightly complicated by the fact that some in-kernel
consumers may want to know before a device detaches (so they can release
related resources, stop using the device, etc), but the detach can fail. So
there are pre- and post-detach notifications for those consumers who need to
handle all cases.

A couple salient comments from the review, they amount to some helpful
documentation about these events, but there's currently no good place for
such documentation...

Note that in the current newbus locking model, DETACH_BEGIN and
DETACH_COMPLETE/FAILED sequence of event handler invocation might interweave
with other attach/detach events arbitrarily. The handlers should be prepared
for such situations.

Also should note that detach may be called after the parent bus knows the
hardware has left the building. In-kernel consumers have to be prepared to
cope with this race.

Differential Revision:	https://reviews.freebsd.org/D12557
2017-10-08 17:33:49 +00:00
Ian Lepore
fc09164658 Restore the ability to deregister an eventhandler from within the callback.
When the EVENTHANDLER(9) subsystem was created, it was a documented feature
that an eventhandler callback function could safely deregister itself. In
r200652 that feature was inadvertantly broken by adding drain-wait logic to
eventhandler_deregister(), so that it would be safe to unload a module upon
return from deregistering its event handlers.

There are now 145 callers of EVENTHANDLER_DEREGISTER(), and it's likely many
of them are depending on the drain-wait logic that has been in place for 8
years. So instead of creating a separate eventhandler_drain() and adding it
to some or all of those 145 call sites, this creates a separate
eventhandler_drain_nowait() function for the specific purpose of
deregistering a callback from within the running callback.

Differential Revision:	https://reviews.freebsd.org/D12561
2017-10-08 17:21:16 +00:00
Sean Bruno
75c8dfb6ae Check so_error early in sendfile() call. Prior to this patch, if a
connection was reset by the remote end, sendfile() would just report
ENOTCONN instead of ECONNRESET.

Submitted by:	Jason Eggleston <jason@eggnet.com>
Reviewed by:	glebius
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D12575
2017-10-07 23:30:57 +00:00
Mateusz Guzik
709939a7b7 namecache: factor out ~MAKEENTRY lookups from the common path
Lookups of the sort are rare compared to regular ones and succesfull ones
result in removing entries from the cache.

In the current code buckets are rlocked and a trylock dance is performed,
which can fail and cause a restart. Fixing it will require a little bit
of surgery and in order to keep the code maintaineable the 2 cases have
to split.

MFC after:	1 week
2017-10-06 23:05:55 +00:00
Mark Johnston
f38c0c46c5 Let stack_create(9) take a malloc flags argument.
Reviewed by:	cem
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D12614
2017-10-06 21:52:28 +00:00
Emmanuel Vadot
b1d51685ea vfs_export_lookup: Fix r324054
When using the default address list nam is still valid, the code in
r324054 assumed that is was NULL.

Reported by:	Guy Yur <guyyur@gmail.com>
Tested by:	Guy Yur <guyyur@gmail.com>
2017-10-06 09:02:36 +00:00
Mateusz Guzik
d07e22cdd8 locks: take the number of readers into account when waiting
Previous code would always spin once before checking the lock. But a lock
with e.g. 6 readers is not going to become free in the duration of once spin
even if they start draining immediately.

Conservatively perform one for each reader.

Note that the total number of allowed spins is still extremely small and is
subject to change later.

MFC after:	1 week
2017-10-05 19:18:02 +00:00
Stephen Hurd
1c0054d261 Fix "taskqgroup_attach: setaffinity failed: 3" with iflib drivers
Improved logging added in r323879 exposed an error during
attach. We need the irq, not the rid to work correctly. em uses
shared irqs, so it will use the same irq for TX as RX. bnxt does
not use shared irqs, or TX irqs at all, so there's no need to set
the TX irq affinity.

Reviewed by:	sbruno
Approved by:	sbruno (mentor)
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D12496
2017-10-05 14:43:30 +00:00
Mateusz Guzik
20a15d1752 locks: partially tidy up waiting on readers
spin first instant of instantly re-readoing and don't re-read after
spinning is finished - the state is already known.

Note the code is subject to significant changes later.

MFC after:	1 week
2017-10-05 13:01:18 +00:00
Andriy Gapon
693593b6f0 sysctl-s in a module should be accessible only when the module is initialized
A sysctl can have a custom handler that may access data that is initialized
via SYSINIT(9) or via a module event handler (also invoked via SYSINIT).
Thus, it is not safe to allow access to the module's sysctl-s until
the initialization is performed.  Likewise, we should not allow access
to teh sysctl-s after the module is uninitialized.
The latter is easy to achieve by properly ordering linker_file_unregister_sysctls
and linker_file_sysuninit.
The former is not as easy for two reasons:
- the initialization may depend on tunables which get set when sysctl-s are
  registered, so we need to set the tunables before running sysinit-s
- the initialization may try to dynamically add more sysctl-s under statically
  defined sysctl nodes
So, this change splits the sysctl setup into two phases.  In the first phase
the sysctl-s are registered as before but they are disabled and hidden from
consumers.  In the second phase, done after sysinit-s, normal access to the
sysctl-s is enabled.

The change should affect only dynamic module loading and unloading after
the system boot-up.  Nothing changes for sysctl-s compiled into the kernel
and sysctl-s in preloaded modules.

Discussed with:	hselasky, ian, jhb
Reviewed by:	julian, kib
MFC after:	2 weeks
Sponsored by:	Panzura
Differential Revision: https://reviews.freebsd.org/D12545
2017-10-05 12:32:14 +00:00
Gleb Smirnoff
0e229f343f Hide struct socket and struct unpcb from the userland.
Violators may define _WANT_SOCKET and _WANT_UNPCB respectively and
are not guaranteed for stability of the structures.  The violators
list is the the usual one: libprocstat(3) and netstat(1) internally
and lsof in ports.

In struct xunpcb remove the inclusion of kernel structure and add
a bunch of spare fields.  The xsocket already has socket not included,
but add there spares as well.  Embed xsockbuf into xsocket.

Sort declarations in sys/socketvar.h to separate kernel only from
userland available ones.

PR:		221820 (exp-run)
2017-10-02 23:29:56 +00:00
Alan Cox
0c0e1e96c6 Use vm_page_active() rather than directly accessing the page's queue
field.

Reviewed by:	kib, markj
MFC after:	2 weeks
X-MFC with:	r324146
2017-10-02 07:30:21 +00:00
Andriy Gapon
550374efe6 revert r324166, it has an unrelated change in it 2017-10-01 16:37:54 +00:00
Andriy Gapon
7d5c6491f0 MFV r323531: 8521 nvlist memory leak in get_clones_stat() and spa_load_best()
illumos/illumos-gate@7d3000f774
7d3000f774

https://www.illumos.org/issues/8521
  Yuri reported this to the mailing list:
  doing a `reboot -d` on current illumos-gate HEAD gives the following "::
  findleaks -dv" output:
  findleaks: maximum buffers => 301061
  findleaks: actual buffers => 297587
  findleaks:
  findleaks: potential pointers => 29289774
  findleaks: dismissals => 26242305 (89.5%)
  findleaks: misses => 331153 ( 1.1%)
  findleaks: dups => 2419681 ( 8.2%)
  findleaks: follows => 296635 ( 1.0%)
  findleaks:
  findleaks: peak memory usage => 7353 kB
  findleaks: elapsed CPU time => 1.5 seconds
  findleaks: elapsed wall time => 2.0 seconds
  findleaks:
  CACHE LEAKED BUFCTL CALLER
  ffffff03d222b008 120 ffffff03ef7ceb78 nv_alloc_sys+0x1f
  ffffff03d222a448 123 ffffff03f4150cc8 nv_alloc_sys+0x1f
  ffffff03d222b448 5 ffffff03f28bd598 nv_alloc_sys+0x1f
  ffffff03d222b888 87 ffffff03f28c10f0 nv_alloc_sys+0x1f
  ffffff03d222c008 21 ffffff03f4139310 nv_alloc_sys+0x1f
  ffffff03d222b888 43 ffffff040ef3f3e8 nv_alloc_sys+0x1f
  ffffff03d222c008 120 ffffff03f4591e58 nv_alloc_sys+0x1f
  ffffff03d222b008 121 ffffff03f352c068 nv_alloc_sys+0x1f
  ffffff03d222a448 112 ffffff03f414e5f8 nv_alloc_sys+0x1f
  ffffff03d222b008 119 ffffff03ee92fdc0 nv_alloc_sys+0x1f
  ffffff03d222b888 46 ffffff03f28c1378 nv_alloc_sys+0x1f
  ffffff03d222b448 4 ffffff03f28c7708 nv_alloc_sys+0x1f
  ffffff03d222c008 20 ffffff03f2a6e7e8 nv_alloc_sys+0x1f

Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Yuri Pankov <yuripv@gmx.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>

MFC after:	5 weeks
X-MFC after:	r324163
2017-10-01 16:34:16 +00:00
Mark Johnston
0ffc7ed7e3 Have uiomove_object_page() keep accessed pages in the active queue.
Previously, uiomove_object_page() would maintain LRU by requeuing the
accessed page. This involves acquiring one of the heavily contended page
queue locks. Moreover, it is unnecessarily expensive for pages in the
active queue.

As of r254304 the page daemon continually performs a slow scan of the
active queue, with the effect that unreferenced pages are gradually
moved to the inactive queue, from which they can be reclaimed. Prior to
that revision, the active queue was scanned only during shortages of
free and inactive pages, meaning that unreferenced pages could get
"stuck" in the queue. Thus, tmpfs was required to use the inactive queue
and requeue pages in order to maintain LRU. Now that this is no longer
the case, tmpfs I/O operations can use the active queue and avoid the
page queue locks in most cases, instead setting PGA_REFERENCED on
referenced pages to provide pseudo-LRU.

Reviewed by:	alc (previous version)
MFC after:	2 weeks
2017-09-30 23:41:28 +00:00
Konstantin Belousov
d3c968bf84 Revert r323722. A better fix will be committed shortly, as well as
some still useful bits of the reverted revision.

The problem with the committed fix is that there are still issues with
returning from NMI, when NMI interrupted kernel in a moment where the
kernel segments selectors were still not loaded into registers.  If
this happens, the NMI return would loose the userspace selectors
because r323722 does not reload segment registers on return to kernel
mode.

Fixing the problem is complicated.  Since an alternative approach to
handle the original bug exists, it makes sence to stop adding more
complexity.

Discussed with:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-09-28 08:38:24 +00:00
John Baldwin
c2dc6d5db1 Use UMA_ALIGNOF() for name cache UMA zones.
This fixes kernel crashes due to misaligned accesses to the 64-bit
time_t embedded in struct namecache_ts in MIPS n32 kernels.

MFC after:	1 week
Sponsored by:	DARPA / AFRL
2017-09-27 23:18:57 +00:00
Emmanuel Vadot
5e254379a8 vfs_export: Simplify vfs_export_lookup
If the filesystem is not exported directly return NULL.
If no address is given and filesystem is exported using some default
one return it directly, if it doesn't have a default one directly
return NULL.

Reviewed by:	kib, bapt
MFC after:	1 week
Sponsored by:	Gandi.net
Differential Revision:	https://reviews.freebsd.org/D12505
2017-09-27 09:39:16 +00:00
Mateusz Guzik
a79d52d739 sysctl: remove target buffer read/write checks prior to calling the handler
Said checks were inherently racy anyway as jokers could unmap target areas
before the handler got around to accessing them.

This saves time by avoiding locking the address space.

MFC after:	1 week
2017-09-27 01:31:52 +00:00
Mateusz Guzik
956713cb74 Annotate sysctlmemlock with __exclusive_cache_line.
MFC after:	1 week
2017-09-27 01:27:43 +00:00
Mateusz Guzik
2f1ddb89fc mtx: drop the tid argument from _mtx_lock_sleep
tid must be equal to curthread and the target routine was already reading
it anyway, which is not a problem. Not passing it as a parameter allows for
a little bit shorter code in callers.

MFC after:	1 week
2017-09-27 00:57:05 +00:00
John Baldwin
09f3bb8756 Log signal number passed to PT_STEP requests in KTR_PTRACE traces.
MFC after:	1 week
2017-09-25 20:38:55 +00:00
Conrad Meyer
f41b85a63c ddb(4): Add 'show badstacks' command to show witness badstacks
Add a DDB command that mirrors sysctl debug.witness.badstacks.

Reapply r323935 after fixing trivial deficiency.  I forgot to compile with
WITNESS enabled.  Thanks emaste@ for fixing the build while I was asleep.

Reported by:	rstone
Reviewed by:	rstone (previous version)
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D12468
2017-09-23 17:48:49 +00:00
Ed Maste
4c087f8a83 Revert r323935 as it broke the build
subr_witness.c:2577:4: error: use of undeclared identifier 'req'
                        req->oldidx = 0;
                        ^
2017-09-23 12:35:46 +00:00
Stephen Hurd
d57a78580e Make struct grouptask gt_name member a char array
Previously, it was just a pointer which was copied, but
some callers pass in a stack variable which will go out of scope.
Add GROUPTASK_NAMELEN macro (32) and snprintf() the name into it,
using "grouptask" if name is NULL. We can now safely include
gtask->gt_name in console messages.

Reviewed by:	sbruno
Approved by:	sbruno (mentor)
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D12449
2017-09-23 01:39:16 +00:00
Conrad Meyer
6fec2d2cce ddb(4): Add 'show badstacks' command to show witness badstacks
Add a DDB command that mirrors sysctl debug.witness.badstacks.

Reported by:	rstone
Reviewed by:	rstone
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D12468
2017-09-22 20:01:12 +00:00
Kirk McKusick
75e3597abb Continuing efforts to provide hardening of FFS, this change adds a
check hash to cylinder groups. If a check hash fails when a cylinder
group is read, no further allocations are attempted in that cylinder
group until it has been fixed by fsck. This avoids a class of
filesystem panics related to corrupted cylinder group maps. The
hash is done using crc32c.

Check hases are added only to UFS2 and not to UFS1 as UFS1 is primarily
used in embedded systems with small memories and low-powered processors
which need as light-weight a filesystem as possible.

Specifics of the changes:

sys/sys/buf.h:
    Add BX_FSPRIV to reserve a set of eight b_xflags that may be used
    by individual filesystems for their own purpose. Their specific
    definitions are found in the header files for each filesystem
    that uses them. Also add fields to struct buf as noted below.

sys/kern/vfs_bio.c:
    It is only necessary to compute a check hash for a cylinder
    group when it is actually read from disk. When calling bread,
    you do not know whether the buffer was found in the cache or
    read. So a new flag (GB_CKHASH) and a pointer to a function to
    perform the hash has been added to breadn_flags to say that the
    function should be called to calculate a hash if the data has
    been read. The check hash is placed in b_ckhash and the B_CKHASH
    flag is set to indicate that a read was done and a check hash
    calculated. Though a rather elaborate mechanism, it should
    also work for check hashing other metadata in the future. A
    kernel internal API change was to change breada into a static
    fucntion and add flags and a function pointer to a check-hash
    function.

sys/ufs/ffs/fs.h:
    Add flags for types of check hashes; stored in a new word in the
    superblock. Define corresponding BX_ flags for the different types
    of check hashes. Add a check hash word in the cylinder group.

sys/ufs/ffs/ffs_alloc.c:
    In ffs_getcg do the dance with breadn_flags to get a check hash and
    if one is provided, check it.

sys/ufs/ffs/ffs_vfsops.c:
    Copy across the BX_FFSTYPES flags in background writes.
    Update the check hash when writing out buffers that need them.

sys/ufs/ffs/ffs_snapshot.c:
    Recompute check hash when updating snapshot cylinder groups.

sys/libkern/crc32.c:
lib/libufs/Makefile:
lib/libufs/libufs.h:
lib/libufs/cgroup.c:
    Include libkern/crc32.c in libufs and use it to compute check
    hashes when updating cylinder groups.

Four utilities are affected:

sbin/newfs/mkfs.c:
    Add the check hashes when building the cylinder groups.

sbin/fsck_ffs/fsck.h:
sbin/fsck_ffs/fsutil.c:
    Verify and update check hashes when checking and writing cylinder groups.

sbin/fsck_ffs/pass5.c:
    Offer to add check hashes to existing filesystems.
    Precompute check hashes when rebuilding cylinder group
    (although this will be done when it is written in fsutil.c
    it is necessary to do it early before comparing with the old
    cylinder group)

sbin/dumpfs/dumpfs.c
    Print out the new check hash flag(s)

sbin/fsdb/Makefile:
    Needs to add libufs now used by pass5.c imported from fsck_ffs.

Reviewed by: kib
Tested by: Peter Holm (pho)
2017-09-22 12:45:15 +00:00
Stephen Hurd
bf227542f3 Fix undeclared identifier error introduced in r323879
It doesn't appear to be safe to use gtask->gt_name.

Reported by:	Mark Johnston, Jenkins
Reviewed by:	sbruno
Approved by:	sbruno (mentor)
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D12448
2017-09-21 23:27:35 +00:00
John Baldwin
e1d15b892a Only handle _PC_MAX_CANON, _PC_MAX_INPUT, and _PC_VDISABLE for TTY devices.
Move handling of these three pathconf() variables out of vop_stdpathconf()
and into devfs_pathconf() as TTY devices can only be devfs files.  In
addition, only return settings for these three variables for devfs devices
whose device switch has the D_TTY flag set.

Discussed with:	bde, kib
Sponsored by:	Chelsio Communications
2017-09-21 23:05:32 +00:00
Stephen Hurd
326aacb0e3 Improved logging of gtaskqueue failues
Check the return code of intr_setaffinity() and log any errors
it returns. When a qid is not located, log an error before returning
failure.  Also, use __func__ rather than hardcoding the function name

Reviewed by:	sbruno
Approved by:	sbruno (mentor)
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D12436
2017-09-21 21:14:48 +00:00
Stephen Hurd
a0fcc37122 Fix M_GTASKQUEUE definition
Previously had the same short and long description as taskqueues.
This could cause problems with memguard(9) and vmstat -m which use
the short description as a unique identifier.

Reviewed by:	sbruno
Approved by:	sbruno (mentor)
MFC after:	3 days
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D12438
2017-09-21 20:34:33 +00:00
Konstantin Belousov
9770475ce7 Do not vrele() covered vnode under the mp mutex.
If vrele() changes the hold count to zero, it needs to acquire the
vnode lock.

Sponsored by:	The FreeBSD Foundation
Discussed with:	avg
X-MFC with:	r323578
2017-09-19 16:49:45 +00:00
Konstantin Belousov
5bf949377e For unlinked files, do not msync(2) or sync on the vnode deactivation.
One consequence of the patch is that msyncing unlinked file mappings
no longer reduces the amount of the dirty memory in the system, but I
do not think that there are users of msync(2) that utilize it for such
side-effect.

Reported and tested by:	tjil
PR:	222356
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D12411
2017-09-19 16:46:37 +00:00
Konstantin Belousov
5efe338f3d Fix handling of the segment registers on i386.
Suppose that userspace is executing with the non-standard segment
descriptors.  Then, until exception or interrupt handler executed
SET_KERNEL_SEGS, kernel is still executing with user %ds, %es and %fs.
If an interrupt occurs in this window, the interrupt handler is
executed unsafely, relying on usability of the usermode registers.  If
the interrupt results in the context switch on return, the
contamination of the kernel state spreads to the thread we switched
to.  As result, kernel data accesses might fault or, if only the base
is changed, completely messed up.

More, if the user segment was allocated in LDT, another thread might
mark the descriptor as invalid before doreti code tried to reload
them.  In this case kernel panics.

The issue exists for all exception entry points which use trap gate,
and thus do not automatically disable interrupts on entry, and for
lcall_handler.

Fix is two-fold: first, we need to disable interrupts for all kernel
entries, changing the IDT descriptor types from trap gate to interrupt
gate.  Interrupts are re-enabled not earlier than the kernel segments
are loaded into the segment registers.  Second, we only load the
segment registers from the trap frame when returning to usermode.  For
the later, all interrupt return paths must happen through the doreti
common code.

There is no way to disable interrupts on call gate, which is the
supposed mode of servicing for lcall $7,$0 syscalls.  Change the LDT
descriptor 0 into a code segment type and point it to the userspace
trampoline which redirects the syscall to int $0x80.

All the measures make the segment register handling similar to that of
amd64.  We do not apply amd64 optimizations of not reloading segment
registers on return from the syscall.

Reported by:	Maxime Villard <max@m00nbsd.net>
Tested by:	pho (the non-lcall part)
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D12402
2017-09-18 20:22:42 +00:00
Alan Cox
ec371b57e8 Modify blst_leaf_alloc to take only the cursor argument.
Modify blst_leaf_alloc to find allocations that cross the boundary between
one leaf node and the next when those two leaves descend from the same
meta node.

Update the hint field for leaves so that it represents a bound on how
large an allocation can begin in that leaf, where it currently represents
a bound on how large an allocation can be found within the boundaries of
the leaf.

The first phase of blst_leaf_alloc currently shrinks sequences of
consecutive 1-bits in mask until each has been shrunken by count-1 bits,
so that any bits remaining show where an allocation can begin, or until
all the bits have disappeared, in which case the allocation fails. This
change amends that so that the high-order bit is copied, as if, when the
last block was free, it was followed by an endless stream of free
blocks. It also amends the early stopping condition, so that the shrinking
of 1-sequences stops early when there are none, or there is only one
unbounded one remaining.

The search for the first set bit is unchanged, and the code path
thereafter is mostly unchanged unless the first set bit is in a position
that makes some of those copied sign bits matter. In that case, we look
for a next leaf, and at what blocks it can provide, to see if a
cross-boundary allocation is possible.

The hint is updated on a successful allocation that clears the last bit,
but it not updated on a failed allocation that leaves the last bit
set. So, as long as the last block is free, the hint value for the leaf is
large. As long as the last block is free, and there's a next leaf, a large
allocation can begin here, perhaps. A stricter rule than this would mean
that allocations and frees in one leaf could require hint updates to the
preceding leaf, and this change seeks to leave the freeing code
unmodified.

Define BLIST_BMAP_MASK, and use it for bit masking in blst_leaf_free and
blist_leaf_fill, as well as in blst_leaf_alloc.

Correct a panic message in blst_leaf_free.

Submitted by:	Doug Moore <dougm@rice.edu>
Reviewed by:	markj (an earlier version)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D11819
2017-09-16 18:12:15 +00:00
Stephen Hurd
ab2e3f7958 Revert r323516 (iflib rollup)
This was really too big of a commit even if everything worked, but there
are multiple new issues introduced in the one huge commit, so it's not
worth keeping this until it's fixed.

I'll work on splitting this up into logical chunks and introduce them one
at a time over the next week or two.

Approved by:	sbruno (mentor)
Sponsored by:	Limelight Networks
2017-09-16 02:41:38 +00:00
Gleb Smirnoff
584ab65a75 Fix locking in soisconnected().
When a newborn socket moves from incomplete queue to complete
one, we need to obtain the listening socket lock after the child,
which is a wrong order.  The old code did that in potentially
endless loop of mtx_trylock().  The new one does only one attempt
of mtx_trylock(), and in case of failure references listening
socket, unlocks child and locks everything in right order.  In
case if listening socket shuts down during that, just bail out.

Reported & tested by:	Jason Eggleston <jeggleston llnw.com>
Reported & tested by:	Jason Wolfe <jason llnw.com>
2017-09-14 18:05:54 +00:00
John Baldwin
c2f37b9245 Add AT_HWCAP and AT_EHDRFLAGS on all platforms.
A new 'u_long *sv_hwcap' field is added to 'struct sysentvec'.  A
process ABI can set this field to point to a value holding a mask of
architecture-specific CPU feature flags.  If an ABI does not wish to
supply AT_HWCAP to processes the field can be left as NULL.

The support code for AT_EHDRFLAGS was already present on all systems,
just the #define was not present.  This is a step towards unifying the
AT_* constants across platforms.

Reviewed by:	kib
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D12290
2017-09-14 14:26:55 +00:00
Andriy Gapon
cbc785c293 dounmount: do not release the mount point's reference on the covered vnode
As long as mnt_ref is not zero there can be a consumer that might try
to access mnt_vnodecovered.  For this reason the covered vnode must not
be freed until mnt_ref goes to zero.
So, move the release of the covered vnode to vfs_mount_destroy.

Reviewed by:	kib
MFC after:	3 weeks
Differential Revision: https://reviews.freebsd.org/D12329
2017-09-14 08:47:06 +00:00
Gleb Smirnoff
d37aa3ccce Use soref() in sendfile(2) instead fhold() to reference a socket.
The problem is that fdrop() requires syscall context, as it may
enter sleep in some cases.  The reason to use it in the original
non-blocking sendfile implementation, was to avoid use of global
ACCEPT_LOCK() on every I/O completion. Now in head sorele() no
longer requires this lock.
2017-09-13 22:11:05 +00:00
Gleb Smirnoff
100db364eb Fix two issues with not ready data in sockets (read: sendfile)
in UNIX sockets.

o Check that socket is still connected in uipc_ready(). If not
  we are responsible to free mbufs.
o In uipc_send() if socket appears to be disconnected, but we
  are sending data with pending I/Os, don't free mbufs.

Reported by:	Kevin Bowling <kbowling llnw.com>
Tested by:	Kevin Bowling <kbowling llnw.com>
PR:		222259
Reported by:	Mark Martinec <Mark.Martinec ijs.si>
MFC after:	3 days
2017-09-13 16:47:23 +00:00
Stephen Hurd
d300df0182 Roll up iflib commits from github. This pulls in most of the work done
by Matt Macy as well as other changes which he has accepted via pull
request to his github repo at https://github.com/mattmacy/networking/

This should bring -CURRENT and the github repo into close enough sync to
allow small feature branches rather than a large chain of interdependant
patches being developed out of tree.  The reset of the synchronization
should be able to be completed on github by splitting the remaining
changes that are not yet ready into short feature branches for later
review as smaller commits.

Here is a summary of changes included in this patch:

1)  More checks when INVARIANTS are enabled for eariler problem
    detection
2)  Group Task Queue cleanups
    - Fix use of duplicate shortdesc for gtaskqueue malloc type.
      Some interfaces such as memguard(9) use the short description to
      identify malloc types, so duplicates should be avoided.
3)  Allow gtaskqueues to use ithreads in addition to taskqueues
    - In some cases, this can improve performance
4)  Better logging when taskqgroup_attach*() fails to set interrupt
    affinity.
5)  Do not start gtaskqueues until they're needed
6)  Have mp_ring enqueue function enter the ABDICATED rather than BUSY
    state.  This moves the TX to the gtaskq and allows processing to
    continue faster as well as make TX batching more likely.
7)  Add an ift_txd_errata function to struct if_txrx.  This allows
    drivers to inspect/modify mbufs before transmission.
8)  Add a new IFLIB_NEED_ZERO_CSUM for drivers to indicate they need
    checksums zeroed for checksum offload to work.  This avoids modifying
    packet data in the TX path when possible.
9)  Use ithreads for iflib I/O instead of taskqueues
10) Clean up ioctl and support async ioctl functions
11) Prefetch two cachlines from each mbuf instead of one up to 128B.  We
    often need to parse packet header info beyond 64B.
12) Fix potential memory corruption due to fence post error in
    bit_nclear() usage.
13) Improved hang detection and handling
14) If the packet is smaller than MTU, disable the TSO flags.
    This avoids extra packet parsing when not needed.
15) Move TCP header parsing inside the IS_TSO?() test.
    This avoids extra packet parsing when not needed.
16) Pass chains of mbufs that are not consumed by lro to if_input()
    rather call if_input() for each mbuf.
17) Re-arrange packet header loads to get as much work as possible done
    before a cache stall.
18) Lock the context when calling IFDI_ATTACH_PRE()/IFDI_ATTACH_POST()/
    IFDI_DETACH();
19) Attempt to distribute RX/TX tasks across cores more sensibly,
    especially when RX and TX share an interrupt.  RX will attempt to
    take the first threads on a core, and TX will attempt to take
    successive threads.
20) Allow iflib_softirq_alloc_generic() to request affinity to the same
    cpus an interrupt has affinity with.  This allows TX queues to
    ensure they are serviced by the socket the device is on.
21) Add new iflib sysctls to net.iflib:
    - timer_int - interval at which to run per-queue timers in ticks
    - force_busdma
22) Add new per-device iflib sysctls to dev.X.Y.iflib
    - rx_budget allows tuning the batch size on the RX path
    - watchdog_events Count of watchdog events seen since load
23) Fix error where netmap_rxq_init() could get called before
    IFDI_INIT()
24) e1000: Fixed version of r323008: post-cold sleep instead of DELAY
    when waiting for firmware
    - After interrupts are enabled, convert all waits to sleeps
    - Eliminates e1000 software/firmware synchronization busy waits after
      startup
25) e1000: Remove special case for budget=1 in em_txrx.c
    - Premature optimization which may actually be incorrect with
      multi-segment packets
26) e1000: Split out TX interrupt rather than share an interrupt for
    RX and TX.
    - Allows better performance by keeping RX and TX paths separate
27) e1000: Separate igb from em code where suitable
    Much easier to understand separate functions and "if (is_igb)" than
    previous tests like "if (reg_icr & (E1000_ICR_RXSEQ | E1000_ICR_LSC))"

#blamebruno

Reviewed by:	sbruno
Approved by:	sbruno (mentor)
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D12235
2017-09-13 01:18:42 +00:00
Alan Cox
d027ed2e7a To analyze the allocation of swap blocks by blist functions, add a method
for analyzing the radix tree structures and reporting on the number, and
sizes, of maximal intervals of free blocks.  The report includes the number
of maximal intervals, and also the number of them in each of several size
ranges, from small (size 1, or 3 to 4) to large (28657 to 46367) with size
boundaries defined by Fibonacci numbers.  The report is written in the test
tool with the 's' command, or in a running kernel by sysctl.

The analysis of the radix tree frequently computes the position of the lone
bit set in a u_daddr_t, a computation that also appears in leaf allocation.
That computation has been moved into a function of its own, and optimized
for cases where an inlined machine instruction can replace the usual binary
search.

Submitted by:	Doug Moore <dougm@rice.edu>
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D11906
2017-09-10 17:46:03 +00:00
Dag-Erling Smørgrav
008a09355b If the user tries to set kern.randompid to 1 (which is meaningless), set
it to a random value between 100 and 1123, rather than 0 as before.

Submitted by:	Marie Helene Kvello-Aune <marieheleneka@gmail.com>
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D5336
2017-09-10 15:01:29 +00:00
Mateusz Guzik
0bbae6f364 namecache: clean up struct namecache_ts handling
namecache_ts differs from mere namecache by few fields placed mid struct.
The access to the last element (the name) is thus special-cased.

The standard solution is to put new fields at the very beginning anad
embedd the original struct. The pointer shuffled around points to the
embedded part. If needed, access to new fields can be gained through
__containerof.

MFC after:	1 week
2017-09-10 11:17:32 +00:00
Mateusz Guzik
dad74ce924 namecache: fold the unlock label into the only consumer
No functional changes.

MFC after:	1 week
2017-09-08 06:57:11 +00:00
Mateusz Guzik
da8f32a7f1 namecache: factor out dot lookup into a dedicated function
The intent is to move uncommon cases out of the way.

MFC after:	1 week
2017-09-08 06:51:33 +00:00
Mateusz Guzik
6a569d3525 Annotate Giant with __exclusive_cache_line 2017-09-08 06:46:24 +00:00
Mateusz Guzik
3e72c8449b Annotate global process locks with __exclusive_cache_line
MFC after:	1 week
2017-09-08 06:46:02 +00:00
Mateusz Guzik
574adb65c8 Sprinkle __read_frequently on few obvious places.
Note that some of annotated variables should probably change their types
to something smaller, preferably bit-sized.
2017-09-06 20:33:33 +00:00
Mateusz Guzik
fe933c1d88 Start annotating global _padalign locks with __exclusive_cache_line
While these locks are guarnteed to not share their respective cache lines,
their current placement leaves unnecessary holes in lines which preceeded them.

For instance the annotation of vm_page_queue_free_mtx allows 2 neighbour
cachelines (previously separate by the lock) to be collapsed into 1.

The annotation is only effective on architectures which have it implemented in
their linker script (currently only amd64). Thus locks are not converted to
their not-padaligned variants as to not affect the rest.

MFC after:	1 week
2017-09-06 20:28:18 +00:00
Edward Tomasz Napierala
b0618cda03 Make root_mount_rel(9) ignore NULL arguments, like it used to before r313351.
It would be better to fix API consumers to not pass NULL there - most of them,
such as gmirror, already contain the neccessary checks - but this is easier
and much less error-prone.

One known user-visible result is that it fixes panic on a failed "graid label".

PR:		221846
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
2017-09-05 14:32:56 +00:00
Warner Losh
519772814d Add CAM/NVMe support for CAM_DATA_SG
This adds support in pass(4) for data to be described with a
scatter-gather list (sglist) to augment the existing (single) virtual
address.

Differential Revision: https://reviews.freebsd.org/D11361
Submitted by: Chuck Tuffli
Reviewed by: imp@, scottl@, kenm@
2017-08-29 15:29:57 +00:00
Bryan Drewery
8359a6b7b3 Allow vdrop() of a vnode not yet on the per-mount list after r306512.
The old code allowed calling vdrop() before insmntque() to place the vnode back
onto the freelist for later recycling.  Some downstream consumers may rely on
this support.  Normally insmntque() failing is fine since is uses vgone() and
immediately frees the vnode rather than attempting to add it to the freelist if
vdrop() were used instead.

Also assert that vhold() cannot be used on such a vnode.

Reviewed by:	kib, cem, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D12126
2017-08-28 19:29:51 +00:00
Conrad Meyer
4ae2ade114 Enhance debugibility of sysctl leaf re-use warnings
Print the full conflicting oid path, and include the function name in the
warning so it is clear that the warnings are sysctl-related.

PR:		221853
Submitted by:	Fabian Keil <fk AT fabiankeil.de> (earlier version)
Sponsored by:	Dell EMC Isilon
2017-08-27 17:12:30 +00:00
Conrad Meyer
eee87314d3 Improve scheduler performance
Improve scheduler performance by flattening nonsensical topology layers
(layers with only one child don't serve any purpose).

This is especially relevant on non-AMD Zen systems after r322776.  On my
dual core Intel laptop, this brings the kern.sched.topology_spec table down
from three levels to two.

Submitted by:	jeff
Reviewed by:	attilio
Sponsored by:	Dell EMC Isilon
2017-08-27 05:14:48 +00:00
John Baldwin
12fb14f36d Don't grab SOCK_LOCK for soref() when queuing an AIO request.
The AIO job holds a reference on the associated file descriptor, so the
socket's count should already be > 0.  This fixes a LOR with the socket
buffer lock after recent socket locking changes in HEAD.

Sponsored by:	Chelsio Communications
2017-08-25 23:10:27 +00:00
Alan Cox
a0ae476b7e Correct a regression in the previous change, r322459. Specifically, the
removal of the "blk" parameter from blst_meta_alloc() had the unintended
effect of generating an out-of-range allocation when the cursor reaches
the end of the tree if the number of managed blocks in the tree equals
the so-called "radix" (which in the blist code is not the standard notion
of what a radix is but rather the maximum number of leaves in a tree of
the current height.)  In other words, only certain swap configurations
were affected, which is why earlier testing did not reveal the problem.

Submitted by:	Doug Moore <dougm@rice.edu>
Reported by:	pho, kib
Tested by:	pho
X-MFC with:	r322459
Differential Revision:	https://reviews.freebsd.org/D12106
2017-08-25 18:47:23 +00:00
Gleb Smirnoff
555b3e2f2c Third take on the r319685 and r320480. Actually allow for call soisconnected()
via soisdisconnected(), and in the earlier unlock earlier to avoid lock
recursion.

This fixes a situation when a socket on accept queue is reset before being
accepted.

Reported by:	Jason Eggleston <jeggleston llnw.com>
2017-08-24 20:49:19 +00:00
Conrad Meyer
d2e155a4f0 Remove unused declaration and update ddb.4
A follow-up to r322836.

Warnings for the unused declaration were breaking some second tier
architectures, but did not show up in Clang on x86.

Reported by:	markj (ddb.4), emaste (declaration)
Sponsored by:	Dell EMC Isilon
2017-08-24 19:16:25 +00:00
Conrad Meyer
0c1d923efb Merge print_lockchain and print_sleepchain
When debugging a deadlock, it is useful to follow the full chain of locks as
far as possible.

Reviewed by:	jhb
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D12115
2017-08-24 15:12:16 +00:00
Jung-uk Kim
a1d0659ca9 Fix size to copyout(9) for cpuset_getid(2).
MFC after:	3 days
2017-08-22 20:46:29 +00:00
Conrad Meyer
bb14d5643b subr_smp: Clean up topology analysis, add additional layers
Rather than repeatedly nesting loops, separate concerns with a single loop
per call stack level.  Use a table to drive the recursive routine.  Handle
missing topology layers more gracefully (infer a single unit).

Analyze some additional optional layers which may be present on e.g. AMD Zen
systems (groups, aka dies, per package; and cachegroups, aka CCXes, per
group).

Display that additional information in the boot-time topology information,
when it is relevent (non-one).

Reviewed by:	markj@, mjoras@ (earlier version)
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D12019
2017-08-22 00:10:15 +00:00
Konstantin Belousov
b59ea73029 Allow vinvalbuf() to operate with the shared vnode lock.
This mode allows other clean buffers to arrive while we flush the buf
lists for the vnode, which is fine for the targeted use.  We only need
that all buffers existed at the time of the function start were
flushed.  In fact, only one assert has to be relaxed.

In collaboration with:	pho
Reviewed by:	rmacklem
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
X-Differential revision:	https://reviews.freebsd.org/D12083
2017-08-20 10:07:45 +00:00
Mark Johnston
e9666bf645 Remove some unneeded subroutines for padding writes to dump devices.
Right now we only need to pad when writing kernel dump headers, so
flatten three related subroutines into one. The encrypted kernel dump
code already writes out its key in a dumper.blocksize-sized block.

No functional change intended.

Reviewed by:	cem, def
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D11647
2017-08-18 04:07:25 +00:00
Mark Johnston
01938d3666 Rename mkdumpheader() and group EKCD functions in kern_shutdown.c.
This helps simplify the code in kern_shutdown.c and reduces the number
of globally visible functions.

No functional change intended.

Reviewed by:	cem, def
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D11603
2017-08-18 04:04:09 +00:00
Mark Johnston
50ef60dabe Factor out duplicated kernel dump code into dump_{start,finish}().
dump_start() and dump_finish() are responsible for writing kernel dump
headers, optionally writing the key when encryption is enabled, and
initializing the initial offset into the dump device.

Also remove the unused dump_pad(), and make some functions static now that
they're only called from kern_shutdown.c.

No functional change intended.

Reviewed by:	cem, def
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D11584
2017-08-18 03:52:35 +00:00
Lawrence Stewart
9a61faf67d An off-by-one error exists in sbuf_vprintf()'s use of SBUF_HASROOM() when an
sbuf is filled to capacity by vsnprintf(), the loop exits without error, and
the sbuf is not marked as auto-extendable.

SBUF_HASROOM() evaluates true if there is room for one or more non-NULL
characters, but in the case that the sbuf was filled exactly to capacity,
SBUF_HASROOM() evaluates false. Consequently, sbuf_vprintf() incorrectly
assigns an ENOMEM error to the sbuf when in fact everything is fine, in turn
poisoning the buffer for all subsequent operations.

Correct by moving the ENOMEM assignment into the loop where it can be made
unambiguously.

As a related safety net change, explicitly check for the zero bytes drained
case in sbuf_drain() and set EDEADLK as the error. This avoids an infinite loop
in sbuf_vprintf() if a drain function were to inadvertently return a value of
zero to sbuf_drain().

Reviewed by:	cem, jtl, gallatin
MFC after:	2 weeks
Sponsored by:	Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D8535
2017-08-18 02:06:28 +00:00
Lawrence Stewart
a8ec96af28 Implement simple record boundary tracking in sbuf(9) to avoid record splitting
during drain operations. When an sbuf is configured to use this feature by way
of the SBUF_DRAINTOEOR sbuf_new() flag, top-level sections started with
sbuf_start_section() create a record boundary marker that is used to avoid
flushing partial records.

Reviewed by:	cem,imp,wblock
MFC after:	2 weeks
Sponsored by:	Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D8536
2017-08-17 07:20:09 +00:00
Ian Lepore
ce44a73667 Fix compile error with option DEBUG. This is fallout from some long-ago
INTRNG refactoring that didn't get caught at the time because code in a
debugf() statement isn't compiled unless DEBUG is defined.

PR:		221557
2017-08-16 16:51:55 +00:00
Conrad Meyer
f3fed04372 Fix a couple of comment typos
No functional change.

Submitted by:	Anton Rang <anton.rang AT isilon.com>
Sponsored by:	Dell EMC Isilon
2017-08-15 02:21:02 +00:00
Ian Lepore
2db14f97de Add config_intrhook_oneshot(): schedule an intrhook function and unregister
it automatically after it runs.

The config_intrhook mechanism allows a driver to stall the boot process
until device(s) required for booting are available, by not allowing system
inits to proceed until all intrhook functions have been unregistered.
Virtually all existing code simply unregisters from within the hook function
when it gets called.

This new function makes that common usage more convenient. Instead of
allocating and filling in a struct, passing it to a function that might (in
theory) fail, and checking the return code, now a driver can simply call
this cannot-fail routine, passing just the intrhook function and its arg.

Differential Revision:	https://reviews.freebsd.org/D11963
2017-08-13 18:10:24 +00:00
Alan Cox
bee93d3cf0 The *_meta_* functions include a radix parameter, a blk parameter, and
another parameter that identifies a starting point in the memory address
block.  Radix is a power of two, blk is a multiple of radix, and the
starting point is in the range [blk, blk+radix), so that blk can always be
computed from the other two.  This change drops the blk parameter from the
meta functions and computes it instead.  (On amd64, for example, this
change reduces subr_blist.o's text size by 7%.)

It also makes the radix parameters unsigned to address concerns that the
calculation of '-radix' might overflow without the -fwrapv option.  (See
https://reviews.freebsd.org/D11819.)

Submitted by:	Doug Moore <dougm@rice.edu>
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D11964
2017-08-13 16:39:49 +00:00
Mark Johnston
af0460beda Have sendfile_swapin() use vm_page_grab_pages().
Reviewed by:	alc, kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D11942
2017-08-11 16:32:24 +00:00
Mark Johnston
9df950b35d Modify vm_page_grab_pages() to handle VM_ALLOC_NOWAIT.
This will allow its use in sendfile_swapin().

Reviewed by:	alc, kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D11942
2017-08-11 16:29:22 +00:00
Alan Cox
6921451dab An invalid page can't be dirty.
Reviewed by:	kib
MFC after:	1 week
2017-08-11 16:27:54 +00:00
Andrew Turner
a92a2f00b1 Only return the current cpu if it's in the cpumask. When we restrict the
cpumask it probably means we are unable to sent interrupts to CPUs outside
the map. As such only return the current CPU when it's within the mask
otherwise return the first valid CPU.

This is needed on ThunderX as, in a dual socket configuration, we are
unable to send MSI/MSI-X interrupts between sockets.

Reviewed by:	mmel
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D11957
2017-08-11 12:45:58 +00:00
Gleb Smirnoff
ef3266d58a Plug uninitialized stack variable leak in sendfile(2).
Reported by:	Ilja Van Sprundel <ivansprundel ioactive.com>
Submitted by:	Domagoj Stolfa <domagoj.stolfa gmail.com>
MFC after:	1 week
Security:	uninitialized stack variable leak
2017-08-09 17:48:38 +00:00
Alan Cox
5471caf6f1 Introduce vm_page_grab_pages(), which is intended to replace loops calling
vm_page_grab() on consecutive page indices.  Besides simplifying the code
in the caller, vm_page_grab_pages() allows for batching optimizations.
For example, the current implementation replaces calls to vm_page_lookup()
on consecutive page indices by cheaper calls to vm_page_next().

Reviewed by:	kib, markj
Tested by:	pho (an earlier version)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D11926
2017-08-09 04:23:04 +00:00
Alan Somers
c45796d54e Make p1003_1b.aio_listio_max a tunable
p1003_1b.aio_listio_max is now a tunable. Its value is reflected in the
sysctl of the same name, and the sysconf(3) variable _SC_AIO_LISTIO_MAX.
Its value will be bounded from below by the compile-time constant
AIO_LISTIO_MAX and from above by the compile-time constant
MAX_AIO_QUEUE_PER_PROC and the tunable vfs.aio.max_aio_queue.

Reviewed by:	jhb, kib
MFC after:	3 weeks
Relnotes:	yes
Sponsored by:	Spectra Logic Corp
Differential Revision:	https://reviews.freebsd.org/D11601
2017-08-08 16:14:31 +00:00
Ruslan Bukin
ca20f8ec29 o Replace __riscv__ with __riscv
o Replace __riscv64 with (__riscv && __riscv_xlen == 64)

This is required to support new GCC 7.1 compiler.
This is compatible with current GCC 6.1 compiler.

RISC-V is extensible ISA and the idea here is to have built-in define
per each extension, so together with __riscv we will have some subset
of these as well (depending on -march string passed to compiler):

__riscv_compressed
__riscv_atomic
__riscv_mul
__riscv_div
__riscv_muldiv
__riscv_fdiv
__riscv_fsqrt
__riscv_float_abi_soft
__riscv_float_abi_single
__riscv_float_abi_double
__riscv_cmodel_medlow
__riscv_cmodel_medany
__riscv_cmodel_pic
__riscv_xlen

Reviewed by:	ngie
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D11901
2017-08-07 14:09:57 +00:00
Alan Cox
ba98e6a2d7 In case readers are misled by expressions that combine multiplication and
division, add parentheses to make the precedence explicit.

Submitted by:	Doug Moore <dougm@rice.edu>
Requested by:	imp
Reviewed by:	imp
MFC after:	1 week
X-MFC after:	r321840
Differential Revision:	https://reviews.freebsd.org/D11815
2017-08-04 04:23:23 +00:00
Mark Johnston
526b5fe16c Amend r321884 to check the refcount and update the class with w_mtx held.
Reviewed by:	jhb
X-MFC with:	r321884
2017-08-01 23:14:38 +00:00
Mark Johnston
57688b6e4e Fix a witness assertion that fires when a lock type's class changes.
When all instances of a lock type are destroyed (for example, after a
module unload), the corresponding witness entry remains associated with
that lock type. In this case, we shouldn't panic if a new instance of the
lock type is created and its lock class does not match that recorded in the
witness entry.

Reviewed by:	jhb
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D11788
2017-08-01 17:50:28 +00:00
Alan Cox
2ac0c7c37c The blist_meta_* routines that process a subtree take arguments 'radix' and
'skip', which denote, respectively, the largest number of blocks that can be
managed by a subtree of that height, and one less than the number of nodes
in a subtree of that height.  This change removes the 'skip' argument from
those functions because 'skip' can be trivially computed from 'radius'.
This change also redefines 'skip' so that it denotes the number of nodes in
the subtree, and so changes loop upper bound tests from '<= skip' to '<
skip' to account for the change.

The 'skip' field is also removed from the blist struct.

The self-test program is changed so that the print command includes the
cursor value in the output.

Submitted by:	Doug Moore <dougm@rice.edu>
MFC after:	1 week
2017-08-01 03:51:26 +00:00
Dmitry Chagin
77d3337c9f Implement proper Linux /dev/fd and /proc/self/fd behavior by adding
Linux specific things to the native fdescfs file system.

Unlike FreeBSD, the Linux fdescfs is a directory containing a symbolic
links to the actual files, which the process has open.
A readlink(2) call on this file returns a full path in case of regular file
or a string in a special format (type:[inode], anon_inode:<file-type>, etc..).
As well as in a FreeBSD, opening the file in the Linux fdescfs directory is
equivalent to duplicating the corresponding file descriptor.

Here we have mutually exclusive requirements:
- in case of readlink(2) call fdescfs lookup() method should return VLNK
vnode otherwise our kern_readlink() fail with EINVAL error;
- in the other calls fdescfs lookup() method should return non VLNK vnode.

For what new vnode v_flag VV_READLINK was added, which is set if fdescfs has beed
mounted with linrdlnk option an modified kern_readlinkat() to properly handle it.

For now For Linux ABI compatibility mount fdescfs volume with linrdlnk option:

    mount -t fdescfs -o linrdlnk null /compat/linux/dev/fd

Reviewed by:	kib@
MFC after:	1 week
Relnotes:	yes
2017-08-01 03:40:19 +00:00
Mark Johnston
6c7ebc242b Batch v_wire_count decrements in vm_hold_free_pages().
Atomic updates to v_wire_count are a significant source of contention, so
combine multiple updates into one in this easy case. Also remove an old
printf that gets executed if the page is shared-busied, which is a case
that will lead to a panic anyway.

Reviewed by:	alc, kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D11791
2017-07-31 18:48:58 +00:00
Ian Lepore
cd9d9e5417 Add clock_schedule(), a feature that allows realtime clock drivers to
request that their clock_settime() methods be called at a given offset
from top-of-second.  This adds a timeout_task to the rtc_instance so that
each clock can be separately added to taskqueue_thread with the scheduling
it prefers, instead of looping through all the clocks at once with a
single task on taskqueue_thread.  If a driver doesn't call clock_schedule()
the default is the old behavior: clock_settime() is queued immediately.

The motivation behind this is that I was on the path of adding identical
code to a third RTC driver to figure out a delta to top-of-second and
sleep for that amount of time because writing the the RTC registers resets
the hardware's concept of top-of-second.  (Sometimes it's not top-of-second,
some RTC clocks tick over a half second after you set their time registers.)
Worst-case would be to sleep for almost a full second, which is a rude thing
to do on a shared task queue thread.
2017-07-31 01:18:21 +00:00
Mark Johnston
d0f68f913b Correct the predicates on which lockstat:::{thread,spin}-spin fire.
In particular, they should fire only if the lock was owned by another
thread when we first attempted to acquire that lock.

MFC after:	1 week
2017-07-31 00:59:28 +00:00
Ian Lepore
f37b7fc2d4 Add taskqueue_enqueue_timeout_sbt(), because sometimes you want more control
over the scheduling precision than 'ticks' can offer, and because sometimes
you're already working with sbintime_t units and it's dumb to convert them
to ticks just so they can get converted back to sbintime_t under the hood.
2017-07-31 00:54:50 +00:00
Conrad Meyer
ca3fec5042 kldstat: Use sizeof in place of named constants for sizing
No functional change.

This is handy for FreeBSD derivatives that want to modify the value of
MAXPATHLEN, but not the kld_file_stat ABI.

Submitted by:	Siddhant Agarwal <sagarwal AT isilon.com>
Sponsored by:	Dell EMC Isilon
2017-07-29 23:31:21 +00:00
Konstantin Belousov
7ceeb35bd8 Make it possible to request nosys logging to console.
New kern.lognosys values are
1 - log to ctty
2 - log to console
3 - log to both.

Inspired by:	eugen
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-07-27 20:45:41 +00:00
Konstantin Belousov
0948519ded Make the number of children for pctrie node available outside subr_pctrie.c.
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
X-Differential revision:	https://reviews.freebsd.org/D11435
2017-07-27 16:40:14 +00:00
Alan Cox
a396c83a5e Change the interactions of the interface functions with the "meta" and
"leaf" functions for alloc, free, and fill.  After the change, the interface
functions call "meta" unconditionally, and the "meta" functions recur
unconditionally in looping over their descendants. The "meta" functions
start with a validity test, and then a test for the "leaf" case, before
falling into the general recursive case.  This simplifies and shrinks the
code, and, for "free" and "fill" moves panic tests that check the same meta
node repeatedly in a loop to a place that will have each node tested once.

Remove irrelevant null checks from blist_free and blist_fill.

Make the code that initializes a meta node the same in blist_meta_alloc and
blist_meta_fill.

Parenthesize return expressions in blst_meta_fill.

Submitted by:	Doug Moore <dougm@rice.edu>
MFC after:	1 week
2017-07-24 17:23:53 +00:00
Ian Lepore
f1b21e2c92 Add common code to support realtime clocks that store year without century.
Most realtime clocks store the year as 2 BCD digits.  Some add a century bit
to extend the range another hundred years.  Every clock driver has its own
code to determine the century and pass a full year value to clock_ct_to_ts().
Now clock drivers can just convert BCD to bin and store the result in the
clocktime struct and let the common code figure out the century.  Clocks
with a century bit can just add 100 to year if the century bit is on.
2017-07-23 21:28:00 +00:00
Michael Tuexen
27d8bea898 Fix getsockopt() for listening sockets when using SO_SNDBUF, SO_RCVBUF,
SO_SNDLOWAT, SO_RCVLOWAT. Since r31972 it only worked for non-listening
sockets.

Sponsored by:	Netflix, Inc.
2017-07-21 07:44:43 +00:00
Enji Cooper
d700db78ad Fix whitespace regression accidentally checked in via ^/head@r280149
MFC after:	now
2017-07-18 06:51:27 +00:00
Alan Cox
b411efa49b Tidy up before making another round of functional changes: Remove end-
of-line whitespace, remove excessive whitespace and blank lines, remove
dead code, follow our standard style for function definitions, and
correct grammatical and factual errors in some of the comments.

Submitted by:	Doug Moore <dougm@rice.edu>
MFC after:	1 week
2017-07-17 23:16:33 +00:00
John Baldwin
c7af789360 Set the current vnet pointer in the socket buffer AIO handler.
This fixes panics when using AIO under VIMAGE.

Reported by:	kp
MFC after:	3 days
Sponsored by:	Chelsio Communications
2017-07-17 16:59:22 +00:00
Ian Lepore
f7afe7679c Minor optimization: instead of converting between days and years using loops
that start in 1970, assume most conversions are going to be for recent dates
and use a precomputed number of days through the end of 2016.

This is a do-over of r320997, hopefully this time with 100% more workiness.

The first attempt had an off-by-one error, but instead of just adding
another mysterious +1 adjustment, this rearranges the relationship between
recent_base_year and recent_base_days so that the latter is the number of
days that occurred before the start of the associated year (instead of the
count thru the end of that year).  This makes the recent_base stuff work
more like the original loop logic that didn't need any +1 adjustments.
2017-07-16 16:54:03 +00:00
Mark Johnston
ab384d75db Revert r320918 and have mkdumpheader() handle version string truncation.
Reported by:	jhb
MFC after:	1 week
2017-07-15 20:53:08 +00:00
Ian Lepore
cfcdbe4b52 Revert r320997. There are reports of it getting the wrong results, so
clearly my testing was insuffficent, and it's best to just revert it
until I get it straightened out.
2017-07-15 00:45:22 +00:00
Brooks Davis
6aacc69847 Add 32-bit compat for kinfo_proc's ki_tdaddr.
This appears to have been an oversight in r213536.

Reviewed by:	markj
MFC after:	1 week
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D11521
2017-07-14 21:13:05 +00:00
Ian Lepore
32cd61b793 Minor optimization: instead of converting between days and years using
loops that start in 1970, assume most conversions are going to be for recent
dates and use a precomputed number of days through the end of 2016.
2017-07-14 18:36:15 +00:00
Ian Lepore
932d14c6e7 Allow setting debug.clocktime as a tunable. Print 64-bit time_t correctly
on 32-bit systems.
2017-07-14 18:13:54 +00:00
Warner Losh
df4245150a This adds CAM pass(4) support for NVMe IO's. Applications indicate
the IO type (Admin or NVM) using XPT op-codes XPT_NVME_ADMIN or
XPT_NVME_IO.

Submitted by:   Chuck Tuffli <chuck@tuffli.net>
Differential Revision:  https://reviews.freebsd.org/D10247
2017-07-14 14:52:20 +00:00
Konstantin Belousov
5cead59181 Correct sysent flags for dynamically loaded syscalls.
Using the https://github.com/google/capsicum-test/ suite, the
PosixMqueue.CapModeForked test was failing due to an ECAPMODE after
calling kmq_notify(). On further inspection, the dynamically
loaded syscall entry was initialized with sy_flags zeroed out, since
SYSCALL_INIT_HELPER() left sysent.sy_flags with the default value.

Add a new helper SYSCALL{,32}_INIT_HELPER_F() which takes an
additional argument to specify the sy_flags value.

Submitted by:	Siva Mahadevan <smahadevan@freebsdfoundation.org>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D11576
2017-07-14 09:34:44 +00:00
Ryan Libby
808a9c8646 kvprintf %b enhancements
Make the %b formatter accept number formatting flags. It will now accept
alternate form, precision, and length modifiers. It also now partially
supports field width (but forces left justification).

Reviewed by:	markj
Approved by:	markj (mentor)
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D11284
2017-07-12 07:30:14 +00:00
Ian Lepore
a8d7b9d3bb Support multiple realtime clocks, and remove locking/sleeping restrictions
on clock drivers.

This tracks multiple concurrent realtime clock drivers in a list sorted by
clock resolution.  When system time changes (and periodically) the
clock_settime() methods of all registered clocks are invoked.

To initialize system time, each driver is tried in turn from best to worst
resolution, until one succesfully returns a valid time.

The code no longer holds a mutex while calling the clock_settime() and
clock_gettime() methods of the registered clocks. This allows clock drivers
to do whatever kind of locking or sleeping is necessary (this is especially
important for i2c clock chips since i2c drivers often need to sleep).

A new clock_register_flags() function allows the clock driver to pass
flags. The flags currently defined help support drivers that use their own
techniques to avoid roundoff errors (prevents the 4/5 rounding done by the
subr_rtc code). A driver which may need to wait for resources (such as bus
ownership) may pass a flag to indicate that it will obtain system time for
itself after waiting for resources; this is merely an optimization to avoid
the common code retrieving a timespec that will never get used.

Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D11484
2017-07-12 02:53:54 +00:00
Andrew Gallatin
a5169546ee Simplify UIO_SYSSPACE and UIO_NOCOPY paths in uiomove
Uiomove can only block when the segflag is UIO_USERSPACE,
otherwise we end up just doing a bcopy (or nothing) and
moving cursors. So only emit witness warnings and
set deadlock thread flags in the UIO_USERSPACE case.

Reviewed by:	kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D11489
2017-07-06 15:03:54 +00:00
Hans Petter Selasky
fe715b8090 After r319722 two fields were left uninitialized when transforming a
socket structure into a listening socket. This resulted in an invalid
instruction fault for all 32-bit platforms.

When INVARIANTS is set the union where the two uninitialized fields
reside gets properly zeroed. This patch ensures the two uninitialized
fields are zeroed when INVARIANTS is undefined.

For 64-bit platforms this issue was not visible because so->sol_upcall
which is uninitialized overlaps with so->so_rcv.sb_state which is
already zero during soalloc();

For 32-bit platforms this issue was visible and resulted in an invalid
instruction fault, because so->sol_upcall overlaps with
so->so_rcv.sb_sel which is always initialized to a valid data pointer
during soalloc().

Verifying the offset locations mentioned above are identical is left
as an exercise to the reader.

PR: 220452
PR: 220358
Reviewed by:	ae (network), gallatin
Differential Revision:	https://reviews.freebsd.org/D11475
Sponsored by:	Mellanox Technologies
2017-07-04 18:23:17 +00:00
Konstantin Belousov
467b3975a3 Resolve confusion between different error code spaces.
The vm_map_fixed() and vm_map_stack() VM functions return Mach error
codes.  Convert them into errno values before returning result from
exec_new_vmspace().

While there, modernize the comment and do minor style adjustments.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-07-03 20:44:01 +00:00
Mateusz Guzik
3f7830a31e rwlock: perform the typically false td_rw_rlocks check later
Check if the lock is available first instead.

MFC after:	1 week
2017-07-02 01:05:16 +00:00
Alan Cox
545414213d Change blst_leaf_alloc() to handle a cursor argument, and to improve
performance.

To find in the leaf bitmap all ranges of sufficient length, use a doubling
strategy with shift-and-and until each bit still set represents a bit
sequence of length 'count', or until the bitmask is zero.  In the latter
case, update the hint based on the first bit sequence length not found to
be available.  For example, seeking an interval of length 12, the set bits
of the bitmap would represent intervals of length 1, then 2, then 3, then
6, then 12.  If no bits are set at the point when each bit represents an
interval of length 6, then the hint can be updated to 5 and the search
terminated.

If long-enough intervals are found, discard those before the cursor.  If
any remain, use binary search to find the position of the first of them,
and allocate that interval.

Submitted by:	Doug Moore <dougm@rice.edu>
Reviewed by:	kib, markj
MFC after:	3 weeks
Differential Revision:	https://reviews.freebsd.org/D11426
2017-07-01 05:27:40 +00:00
Konstantin Belousov
9fb8c888f1 Define ino64_trunc_error under same conditions as the code which uses
the variable.

Noted by:	bde
Sponsored by:	The FreeBSD Foundation
2017-06-30 16:10:21 +00:00
Alan Cox
8056df6e25 Clear the MAP_WIREFUTURE flag on the vm map in exec_new_vmspace() when it
recycles the current vm space.  Otherwise, an mlockall(MCL_FUTURE) could
still be in effect on the process after an execve(2), which violates the
specification for mlockall(2).

It's pointless for vm_map_stack() to check the MEMLOCK limit.  It will
never be asked to wire the stack.  Moreover, it doesn't even implement
wiring of the stack.

Reviewed by:	kib, markj
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D11421
2017-06-30 15:49:36 +00:00
John Baldwin
51645e836d Store a 32-bit PT_LWPINFO struct for 32-bit process core dumps.
Process core notes for a 32-bit process running on a 64-bit host need to
use 32-bit structures so that the note layout matches the layout of notes
of a core dump of a 32-bit process under a 32-bit kernel.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D11407
2017-06-29 21:31:13 +00:00
Navdeep Parhar
98c9236978 Adjust sowakeup post-r319685 so that it continues to make upcalls but
still avoids calling soconnected during sodisconnected.

Discussed with:	glebius@
Sponsored by:	Chelsio Communications
2017-06-29 19:43:27 +00:00
Konstantin Belousov
d137278838 Do not cast struct kevent_args or struct freebsd11_kevent_args to
struct g_kevent_args.

On some architectures, e.g. PowerPC, there is additional padding in uap.

Reported and tested by:	andreast
Sponsored by:	The FreeBSD Foundation
2017-06-29 14:40:33 +00:00
Konstantin Belousov
34d3e89f33 Do not ignore an error from vm_mmap_object().
Found and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-06-27 20:12:13 +00:00
Alan Cox
d37837249b Address the remaining integer overflow issues with the "skip" parameters
and "next_skip" variables.  The "skip" value in struct blist has long been
a 64-bit quantity but various functions have implicitly truncated this
value to 32 bits.  Now, all arithmetic involving the "skip" value is 64
bits wide.  (This should allow us to relax the size limit on a swap device
in the swap pager.)

Maintain the ability to test this allocator as a user-space application by
including <stdbool.h>.

Remove an unused variable from blst_radix_print().

Reviewed by:	kib, markj
MFC after:	4 weeks
Differential Revision:	https://reviews.freebsd.org/D11358
2017-06-27 17:45:26 +00:00
Conrad Meyer
f5b7359a00 Fix one more place uio_resid is truncated to int
A follow-up to r231949 and r194990.

Reported by:	pho@
Reviewed by:	kib@, markj@
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D11373
2017-06-27 17:23:20 +00:00
Gleb Smirnoff
64290befc1 Provide sbsetopt() that handles socket buffer related socket options.
It distinguishes between data flow sockets and listening sockets, and
in case of the latter doesn't change resource limits, since listening
sockets don't hold any buffers, they only carry values to be inherited
by their children.
2017-06-25 01:41:07 +00:00
Mark Johnston
704cb42f2a Fix the !TD_IS_IDLETHREAD(curthread) locking assertions.
Most of the lock slowpaths assert that the calling thread isn't an idle
thread. However, this may not be true if the system has panicked, and in
some cases the assertion appears before a SCHEDULER_STOPPED() check.

MFC after:	3 days
Sponsored by:	Dell EMC Isilon
2017-06-19 21:09:50 +00:00
Konstantin Belousov
711dba24d7 Allow negative aio_offset only for the read and write LIO ops on
device nodes.

Otherwise, the current check of aio_offset == -1LL makes it possible
to pass negative file offsets down to the filesystems. This trips
assertions and is even unsafe for e.g. FFS which keeps metadata at
negative offsets.

Reported and tested by:	pho
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D11266
2017-06-19 15:17:17 +00:00
Alan Cox
d4e3484bd9 Change blist_alloc()'s allocation policy from first-fit to next-fit so
that disk writes are more likely to be sequential.  This change is
beneficial on both the solid state and mechanical disks that I've
tested.  (A similar change in allocation policy was made by DragonFly
BSD in 2013 to speed up Poudriere with "stressful memory parameters".)

Increase the width of blst_meta_alloc()'s parameter "skip" and the local
variables whose values are derived from it to 64 bits.  (This matches the
width of the field "skip" that is stored in the structure "blist" and
passed to blst_meta_alloc().)

Eliminate a pointless check for a NULL blist_t.

Simplify blst_meta_alloc()'s handling of the ALL-FREE case.

Address nearby style errors.

Reviewed by:	kib, markj
MFC after:	5 weeks
Differential Revision:	https://reviews.freebsd.org/D11247
2017-06-18 18:23:39 +00:00
Rick Macklem
d1c5e240a8 Make MAXBCACHEBUF a tunable called vfs.maxbcachebuf.
By making MAXBCACHEBUF a tunable, it can be increased to allow for
larger read/write data sizes for the NFS client.
The tunable is limited to MAXPHYS, which is currently 128K.
Making MAXPHYS a tunable or increasing its value is being discussed,
since it would be nice to support a read/write data size of 1Mbyte
for the NFS client when mounting the AmazonEFS file service.

Reviewed by:	kib
MFC after:	2 weeks
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D10991
2017-06-17 22:24:19 +00:00
Konstantin Belousov
eb84ca643c Regen. 2017-06-17 00:58:19 +00:00
Konstantin Belousov
2b34e84335 Add abstime kqueue(2) timers and expand struct kevent members.
This change implements NOTE_ABSTIME flag for EVFILT_TIMER, which
specifies that the data field contains absolute time to fire the
event.

To make this useful, data member of the struct kevent must be extended
to 64bit.  Using the opportunity, I also added ext members.  This
changes struct kevent almost to Apple struct kevent64, except I did
not changed type of ident and udata, the later would cause serious API
incompatibilities.

The type of ident was kept uintptr_t since EVFILT_AIO returns a
pointer in this field, and e.g. CHERI is sensitive to the type
(discussed with brooks, jhb).

Unlike Apple kevent64, symbol versioning allows us to claim ABI
compatibility and still name the new syscall kevent(2).  Compat shims
are provided for both host native and compat32.

Requested by:	bapt
Reviewed by:	bapt, brooks, ngie (previous version)
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D11025
2017-06-17 00:57:26 +00:00
Konstantin Belousov
f2eb97b2cd Style.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
X-Differential revision:	https://reviews.freebsd.org/D11025
2017-06-16 23:41:13 +00:00
Gleb Smirnoff
2b8e036bfc Plug read(2) and write(2) on listening sockets. 2017-06-15 20:11:29 +00:00
Ryan Libby
c74ae2ca93 ddb show socket debugging
Display the mbuf/cluster count for a sockbuf and fix a couple whitespace
issues in the output.

Reviewed by:	jhb, markj (both previous version)
Approved by:	markj (mentor)
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D11062
2017-06-15 04:49:12 +00:00
Ryan Libby
76f2c27264 ddb show files: fix up file types and whitespace
This makes ddb show files more descriptive and also adjusts the
whitespace to align the columns for non-32-bit architectures.

Reviewed by:	cem (previous version), jhb
Approved by:	markj (mentor)
Differential Revision:	https://reviews.freebsd.org/D11061
2017-06-14 07:46:52 +00:00
Konstantin Belousov
81c3737d95 Remove stray return.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-06-13 19:02:12 +00:00
Alan Cox
4be4fd5d5f Reduce the frequency of hint updates on allocation without incurring
additional allocation overhead.  Previously, blst_meta_alloc() updated the
hint after every successful allocation.  However, these "eager" hint
updates are of no actual benefit if, instead, the "lazy" hint update at
the start of blst_meta_alloc() is generalized to handle all cases where
the number of available blocks is less than the requested allocation.
Previously, the lazy hint update at the start of blst_meta_alloc() only
handled the ALL-FULL case.  (I would also note that this change provides
consistency between blist_alloc() and blist_fill() in that their hint
maintenance is now entirely lazy.)

Eliminate unnecessary checks for terminators in blst_meta_alloc() and
blst_meta_fill() when handling ALL-FREE meta nodes.

Eliminate the field "bl_free" from struct blist.  It is redundant.  Unless
the entire radix tree is a single leaf, the count of free blocks is stored
in the root node.  Instead, provide a function blist_avail() for obtaining
the number of free blocks.

In blst_meta_alloc(), perform a sanity check on the allocation once rather
than repeating it in a loop over the meta node's children.

In blst_leaf_fill(), use the optimized bitcount*() function instead of a
loop to count the blocks being allocated.

Add or improve several comments.

Address some nearby style errors.

Reviewed by:	kib
MFC after:	6 weeks
Differential Revision:	https://reviews.freebsd.org/D11146
2017-06-13 17:49:49 +00:00
Mark Johnston
46514e7d76 Hint at the intended usage for the "ll" field of struct uuid_private.
Discussed with:	kib
MFC after:	1 week
2017-06-13 15:37:04 +00:00
Konstantin Belousov
b43ce76c77 Add ptrace(PT_GET_SC_ARGS) command to return debuggee' current syscall
arguments.

Reviewed by:	jhb (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
Differential revision:	https://reviews.freebsd.org/D11080
2017-06-12 21:15:43 +00:00
Konstantin Belousov
f5a077c390 Print unimplemented syscall number to the ctty on SIGSYS, if enabled
by the knob kern.lognosys.

Discussed with:	imp
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
X-Differential revision:	https://reviews.freebsd.org/D11080
2017-06-12 21:11:11 +00:00
Konstantin Belousov
2d88da2f06 Move struct syscall_args syscall arguments parameters container into
struct thread.

For all architectures, the syscall trap handlers have to allocate the
structure on the stack.  The structure takes 88 bytes on 64bit arches
which is not negligible.  Also, it cannot be easily found by other
code, which e.g. caused duplication of some members of the structure
to struct thread already.  The change removes td_dbg_sc_code and
td_dbg_sc_nargs which were directly copied from syscall_args.

The structure is put into the copied on fork part of the struct thread
to make the syscall arguments information correct in the child after
fork.

This move will also allow several more uses shortly.

Reviewed by:	jhb (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
X-Differential revision:	https://reviews.freebsd.org/D11080
2017-06-12 21:03:23 +00:00
Mark Johnston
56060a373e Add a helper function for comparing struct uuids.
Submitted by:	Domagoj Stolfa <domagoj.stolfa@gmail.com>
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D11138
2017-06-12 20:14:44 +00:00
Alan Cox
015d7db6b6 Remove an unnecessary field from struct blist. (The comment describing
what this field represented was also inaccurate.)  Suggested by: kib

In r178792, blist_create() grew a malloc flag, allowing M_NOWAIT to be
specified.  However, blist_create() was not modified to handle the
possibility that a malloc() call failed.  Address this omission.

Increase the width of the local variable "radix" to 64 bits.  (This
matches the width of the corresponding field in struct blist.)

Reviewed by:	kib
MFC after:	6 weeks
2017-06-10 16:11:39 +00:00
Alan Cox
a7249a6c85 blist_fill()'s return type is too narrow. blist_fill() accepts a 64-bit
quantity as the size of the range to fill, but returns a 32-bit quantity
as the number of blocks that were allocated to fill that range.  This
revision corrects that mismatch.  Currently, swaponsomething() limits
the size of a swap area to prevent arithmetic arithmetic overflow in
other parts of the blist allocator.  That limit has also prevented this
type mismatch from causing problems.

Reviewed by:	kib, markj
MFC after:	6 weeks
Differential Revision:	https://reviews.freebsd.org/D11096
2017-06-09 16:19:24 +00:00
Gleb Smirnoff
438902edbd Fix stat(2) on a listening socket. 2017-06-09 15:54:48 +00:00
Konstantin Belousov
7abe0df223 Enhance vfs.ino64_trunc_error sysctl.
Provide a new mode "2" which returns a special overflow indicator in
the non-representable field instead of the silent truncation (mode
"0") or EOVERFLOW (mode "1").

In particular, the typical use of st_ino to detect hard links with
mode "2" reports false positives, which might be more suitable for
some uses.

Discussed with:	bde
Sponsored by:	The FreeBSD Foundation
2017-06-09 11:17:08 +00:00
Gleb Smirnoff
779f106aa1 Listening sockets improvements.
o Separate fields of struct socket that belong to listening from
  fields that belong to normal dataflow, and unionize them.  This
  shrinks the structure a bit.
  - Take out selinfo's from the socket buffers into the socket. The
    first reason is to support braindamaged scenario when a socket is
    added to kevent(2) and then listen(2) is cast on it. The second
    reason is that there is future plan to make socket buffers pluggable,
    so that for a dataflow socket a socket buffer can be changed, and
    in this case we also want to keep same selinfos through the lifetime
    of a socket.
  - Remove struct struct so_accf. Since now listening stuff no longer
    affects struct socket size, just move its fields into listening part
    of the union.
  - Provide sol_upcall field and enforce that so_upcall_set() may be called
    only on a dataflow socket, which has buffers, and for listening sockets
    provide solisten_upcall_set().

o Remove ACCEPT_LOCK() global.
  - Add a mutex to socket, to be used instead of socket buffer lock to lock
    fields of struct socket that don't belong to a socket buffer.
  - Allow to acquire two socket locks, but the first one must belong to a
    listening socket.
  - Make soref()/sorele() to use atomic(9).  This allows in some situations
    to do soref() without owning socket lock.  There is place for improvement
    here, it is possible to make sorele() also to lock optionally.
  - Most protocols aren't touched by this change, except UNIX local sockets.
    See below for more information.

o Reduce copy-and-paste in kernel modules that accept connections from
  listening sockets: provide function solisten_dequeue(), and use it in
  the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4),
  infiniband, rpc.

o UNIX local sockets.
  - Removal of ACCEPT_LOCK() global uncovered several races in the UNIX
    local sockets.  Most races exist around spawning a new socket, when we
    are connecting to a local listening socket.  To cover them, we need to
    hold locks on both PCBs when spawning a third one.  This means holding
    them across sonewconn().  This creates a LOR between pcb locks and
    unp_list_lock.
  - To fix the new LOR, abandon the global unp_list_lock in favor of global
    unp_link_lock.  Indeed, separating these two locks didn't provide us any
    extra parralelism in the UNIX sockets.
  - Now call into uipc_attach() may happen with unp_link_lock hold if, we
    are accepting, or without unp_link_lock in case if we are just creating
    a socket.
  - Another problem in UNIX sockets is that uipc_close() basicly did nothing
    for a listening socket.  The vnode remained opened for connections.  This
    is fixed by removing vnode in uipc_close().  Maybe the right way would be
    to do it for all sockets (not only listening), simply move the vnode
    teardown from uipc_detach() to uipc_close()?

Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D9770
2017-06-08 21:30:34 +00:00
Alan Cox
86dd278f03 When allocating swap blocks, if the available number of free blocks in a
subtree is already zero, then setting the "largest contiguous free block"
hint for that subtree to anything other than zero makes no sense.  To be
clear, assigning a value to the hint that is too large is not a correctness
problem, only a pessimization.

Dragonfly BSD has applied the same change to blst_meta_alloc() but not
blst_meta_fill().

MFC after:	6 weeks
2017-06-08 15:48:54 +00:00
Gleb Smirnoff
8d40bada3e Fix a degenerate case when soisdisconnected() would call soisconnected().
This happens when closing a socket with upcall, and trace is: soclose()->
... protocol ... -> soisdisconnected() -> socantrcvmore_locked() ->
sowakeup() -> soisconnected().

Right now this case is innocent for two reasons.  First, soisconnected()
doesn't clear SS_ISDISCONNECTED flag.  Second, the mutex to lock the
socket is the socket receive buffer mutex, and sodisconnected() first
disables the receive buffer. But in future code, the mutex to lock
socket is different to buffer mutex, and we would get undesired mutex
recursion.

The fix is to check SS_ISDISCONNECTED flag before calling upcall.
2017-06-08 06:16:47 +00:00
Marcelo Araujo
e0a6a23c6d Allow sysctl kern.vm_guest to return bhyve when running under bhyve.
Submitted by:	Sean Fagan <sef@ixsystems.com>
Reviewed by:	grehan
MFH:		4 weeks.
Sponsored by:	iXsystems, Inc.
Differential Revision:	https://reviews.freebsd.org/D11090
2017-06-08 04:02:14 +00:00
Alan Cox
d90bf7d850 Originally, this file could be compiled as a user-space application for
testing purposes.  However, over the years, various changes to the kernel
have broken this feature.  This revision applies some fixes to get user-
space compilation working again.  There are no changes in this revision
to code that is used by the kernel.

MFC after:	3 days
2017-06-07 16:04:34 +00:00
Gleb Smirnoff
b3244df799 Provide typedef for socket upcall function.
While here change so_gen_t type to modern uint64_t.
2017-06-07 01:48:11 +00:00
Gleb Smirnoff
b94f68dc52 Remove a piece of dead code. 2017-06-07 01:21:34 +00:00
Alan Cox
03bdd65f18 When the function blist_fill() was added to the kernel in r107913, the swap
pager used a different scheme for striping the allocation of swap space
across multiple devices.  And, although blist_fill() was intended to support
fill operations with large counts, the old striping scheme never performed a
fill larger than the stripe size.  Consequently, the misplacement of a
sanity check in blst_meta_fill() went undetected.  Now, moving forward in
time to r118390, a new scheme for striping was introduced that maintained a
blist allocator per device, but as noted in r318995, swapoff_one() was not
fully and correctly converted to the new scheme.  This change completes what
was started in r318995 by fixing the underlying bug in blst_meta_fill() that
stops swapoff_one() from simply performing a single blist_fill() operation.

Reviewed by:	kib
MFC after:	5 days
Differential Revision:	https://reviews.freebsd.org/D11043
2017-06-06 03:32:17 +00:00
Allan Jude
e28f9b7d03 Jails: Optionally prevent jailed root from binding to privileged ports
You may now optionally specify allow.noreserved_ports to prevent root
inside a jail from using privileged ports (less than 1024)

PR:		217728
Submitted by:	Matt Miller <mattm916@pulsar.neomailbox.ch>
Reviewed by:	jamie, cem, smh
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D10202
2017-06-06 02:15:00 +00:00
Alan Cox
064650c180 Halve the memory being internally allocated by the blist allocator. In
short, half of the memory that is allocated to implement the radix tree is
wasted because we did not change "u_daddr_t" to be a 64-bit unsigned int
when we changed "daddr_t" to be a 64-bit (signed) int.  (See r96849 and
r96851.)

Reviewed by:	kib, markj
Tested by:	pho
MFC after:	5 days
Differential Revision:	https://reviews.freebsd.org/D11028
2017-06-05 17:14:16 +00:00
Konstantin Belousov
3df7ebc4ed Add sysctl vfs.ino64_trunc_error controlling action on truncating
inode number or link count for the ABI compat binaries.

Right now, and by default after the change, too large 64bit values are
silently truncated to 32 bits.  Enabling the knob causes the system to
return EOVERFLOW for stat(2) family of compat syscalls when some
values cannot be completely represented by the old structures.  For
getdirentries(2), knob skips the dirents which would cause non-trivial
truncation of d_ino.

EOVERFLOW error is specified by the X/Open 1996 LFS document
('Adding Support for Arbitrary File Sizes to the Single UNIX
Specification').

Based on the discussion with:	bde
Sponsored by:	The FreeBSD Foundation
2017-06-05 11:40:30 +00:00
Alan Cox
d712b799b5 The data type returned by vmoff() is too narrow in its range. This could
break the transmission of files longer than 4 GB on 32-bit architectures.

Reviewed by:	glebius, kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D10019
2017-06-03 16:19:33 +00:00
Konstantin Belousov
a7ca2c6ad0 Ensure that cached struct thread does not keep spurious td_su
reference on an UFS mount point.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-06-03 14:12:17 +00:00
Gleb Smirnoff
971af2a311 Rename accept filter getopt/setopt functions, so that they are prefixed
with module name and match other functions in the module.  There is no
functional change.
2017-06-02 17:49:21 +00:00
Gleb Smirnoff
810951ddc9 Style: unwrap lines that doesn't have a good reason to be wrapped. 2017-06-02 17:43:47 +00:00
Gleb Smirnoff
bd617e3b98 Remove write only flag UNP_HAVEPCCACHED. 2017-06-02 17:39:05 +00:00
Gleb Smirnoff
0c3c207ffd For UNIX sockets make vnode point not to the socket, but to the UNIX PCB,
since the latter is the thing that links together VFS and sockets.

While here, make the union in the struct vnode anonymous.
2017-06-02 17:31:25 +00:00
Mateusz Guzik
c7a6a1b325 mtx: fix whitespace damage in _mtx_trylock_flags_
MFC after:	3 days
2017-05-30 02:25:47 +00:00
Konstantin Belousov
03311f117b Use whole mnt_stat.f_fsid bits for st_dev.
Since ino64 expanded dev_t to 64bit, make VOP_GETATTR(9) provide all
bits of mnt_stat.f_fsid as va_fsid for vnodes on filesystems which use
f_fsid.  In particular, NFSv3 and sometimes NFSv4, and ZFS use this
method or reporting st_dev by stat(2).

Provide a new helper vn_fsid() to avoid duplicating code to copy
f_fsid to va_fsid.

Note that the change is mostly cosmetic.  Its motivation is to avoid
sign-extension of f_fsid[0] into 64bit dev_t value which happens after
dev_t becomes 64bit..

Reviewed by:	avg(zfs), rmacklem (nfs) (both for previous version)
Sponsored by:	The FreeBSD Foundation
2017-05-27 17:00:30 +00:00
Conrad Meyer
95b978955c procstat(1): Add TCP socket send/recv buffer size
Add TCP socket send and receive buffer size to procstat -f output.

Reviewed by:	kib, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D10689
2017-05-26 22:17:44 +00:00
Allan Jude
c20feae640 Followup to r318765 (capsicumize cpuset_*affinity)
Update *sysent files
2017-05-24 01:01:57 +00:00
Allan Jude
f299c47b52 Allow cpuset_{get,set}affinity in capabilities mode
bhyve was recently sandboxed with capsicum, and needs to be able to
control the CPU sets of its vcpu threads

Reviewed by:	emaste, oshogbo, rwatson
MFC after:	2 weeks
Sponsored by:	ScaleEngine Inc.
Differential Revision:	https://reviews.freebsd.org/D10170
2017-05-24 00:58:30 +00:00
Steve Wills
a4aaba3b0a Add security.bsd.see_jail_proc
Add security.bsd.see_jail_proc sysctl to hide jail processes from non-root
users

Reviewed by:	jamie
Approved by:	allanjude
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D10770
2017-05-23 16:59:24 +00:00
Konstantin Belousov
ec95c622ff Regen. 2017-05-23 09:30:42 +00:00
Konstantin Belousov
6992112349 Commit the 64-bit inode project.
Extend the ino_t, dev_t, nlink_t types to 64-bit ints.  Modify
struct dirent layout to add d_off, increase the size of d_fileno
to 64-bits, increase the size of d_namlen to 16-bits, and change
the required alignment.  Increase struct statfs f_mntfromname[] and
f_mntonname[] array length MNAMELEN to 1024.

ABI breakage is mitigated by providing compatibility using versioned
symbols, ingenious use of the existing padding in structures, and
by employing other tricks.  Unfortunately, not everything can be
fixed, especially outside the base system.  For instance, third-party
APIs which pass struct stat around are broken in backward and
forward incompatible ways.

Kinfo sysctl MIBs ABI is changed in backward-compatible way, but
there is no general mechanism to handle other sysctl MIBS which
return structures where the layout has changed. It was considered
that the breakage is either in the management interfaces, where we
usually allow ABI slip, or is not important.

Struct xvnode changed layout, no compat shims are provided.

For struct xtty, dev_t tty device member was reduced to uint32_t.
It was decided that keeping ABI compat in this case is more useful
than reporting 64-bit dev_t, for the sake of pstat.

Update note: strictly follow the instructions in UPDATING.  Build
and install the new kernel with COMPAT_FREEBSD11 option enabled,
then reboot, and only then install new world.

Credits: The 64-bit inode project, also known as ino64, started life
many years ago as a project by Gleb Kurtsou (gleb).  Kirk McKusick
(mckusick) then picked up and updated the patch, and acted as a
flag-waver.  Feedback, suggestions, and discussions were carried
by Ed Maste (emaste), John Baldwin (jhb), Jilles Tjoelker (jilles),
and Rick Macklem (rmacklem).  Kris Moore (kris) performed an initial
ports investigation followed by an exp-run by Antoine Brodin (antoine).
Essential and all-embracing testing was done by Peter Holm (pho).
The heavy lifting of coordinating all these efforts and bringing the
project to completion were done by Konstantin Belousov (kib).

Sponsored by:	The FreeBSD Foundation (emaste, kib)
Differential revision:	https://reviews.freebsd.org/D10439
2017-05-23 09:29:05 +00:00
Ed Maste
bd309b323a Regen sysent after r318634, no open(2) in capability mode
Sponsored by:	The FreeBSD Foundation
2017-05-22 11:45:45 +00:00
Ed Maste
68fc8f3934 disallow open(2) in capability mode
Previously open(2) was allowed in capability mode, with a comment that
suggested this was likely the case to facilitate debugging. The system
call would still fail later on, but it's better to disallow the syscall
altogether.

We now have the kern.trap_enotcap sysctl or PROC_TRAPCAP_CTL proccontrol
to aid in debugging.

In any case libc has translated open() to the openat syscall since
r277032.

Reviewed by:	kib, rwatson
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D10850
2017-05-22 11:43:19 +00:00
Mark Johnston
3bd485f968 Avoid open-coding PRI_UNCHANGED.
MFC after:	1 week
2017-05-18 18:24:11 +00:00
Ed Maste
3e85b721d6 Remove register keyword from sys/ and ANSIfy prototypes
A long long time ago the register keyword told the compiler to store
the corresponding variable in a CPU register, but it is not relevant
for any compiler used in the FreeBSD world today.

ANSIfy related prototypes while here.

Reviewed by:	cem, jhb
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D10193
2017-05-17 00:34:34 +00:00
John Baldwin
00f6cd3f56 Add sglist_append_sglist().
This function permits a range of one scatter/gather list to be appended to
another sglist.  This can be used to construct a scatter/gather list that
reorders or duplicates ranges from one or more existing scatter/gather
lists.

Sponsored by:	Chelsio Communications
2017-05-16 23:31:52 +00:00
Konstantin Belousov
391aba32e6 mnt_vnode_next_active: use conventional lock order when trylock fails.
Previously, when the VI_TRYLOCK failed, we would spin under the mutex
that protects the vnode active list until we either succeeded or
noticed that we had hogged the CPU. Since we were violating the lock
order, this would guarantee that we would become a hog under any
deadlock condition (e.g. a race with vdrop(9) on the same vnode). In
the presence of many concurrent threads in sync(2) or vdrop etc, the
victim could hang for a long time.

Now, avoid spinning by dropping and reacquiring the locks in the
conventional lock order when the trylock fails. This requires a dance
with the vnode hold count.

Submitted by:	Tom Rix <trix@juniper.net>
Tested by:	pho
Differential revision:	https://reviews.freebsd.org/D10692
2017-05-15 10:02:45 +00:00
Konstantin Belousov
396a0d4455 Do not wake up sleeping thread in reschedule_signals() if the signal
is blocked.  The spurious wakeup might result in spurious EINTR.

The reschedule_signals() function is called when the calling thread
has the signal mask changed.  For each newly blocked signal, we try to
find a thread which might have the signal not blocked.  If no such
thread exists, sigtd() returns random thread, which must not be waken
up.  I decided that re-checking, as suggested by PR submitter, is more
reasonable change than to change sigtd() interface, due to other uses
of sigtd().  signotify() already performs this check.

Submitted by:	Duane <parakleta@darkreality.org>
PR:	219228
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-05-12 15:34:59 +00:00
Mark Johnston
0c46712ca1 Let ptracestop() suspend threads sleeping in an SBDRY section.
When a thread enters ptracestop(), for example because it had received
SIGSTOP from ptrace(PT_ATTACH), it attempts to suspend other threads in
the same process. In the case of a thread sleeping interruptibly in an
SBDRY section, sig_suspend_threads() must wake the thread and allow it to
reach the user-mode boundary. However, sig_suspend_threads() would
erroneously avoid waking up such threads, resulting in an apparent hang.

Reviewed by:	kib
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	Dell EMC Isilon
2017-05-11 17:03:45 +00:00
Marius Strobl
26d877f5b8 - Also outside of the KOBJOPLOOKUP macro - which in turn is used by
the code auto-generated for *.m - kobj_lookup_method(9) is useful;
  for example in back-ends or base class device drivers in order to
  determine whether a default method has been overridden. Thus, allow
  for the kobj_method_t pointer argument - used by KOBJOPLOOKUP in
  order to update the cache entry - of kobj_lookup_method(9), to be
  NULL. Actually, that pointer is redundant as it's just set to the
  same kobj_method_t that the kobj_lookup_method(9) function returns
  in the first place, but probably it serves to reduce the number of
  instructions generated for KOBJOPLOOKUP.
- For the same reason, move updating kobj_lookup_{hits,misses} (if
  KOBJ_STATS is defined) from kobj_lookup_method(9) to KOBJOPLOOKUP.
  As a side-effect, this gets rid of the convoluted approach of always
  incrementing kobj_lookup_hits in KOBJOPLOOKUP and then in case of
  a cache miss, decrementing it in kobj_lookup_method(9) again.
2017-05-08 21:08:39 +00:00
Brooks Davis
f19351aad8 Provide a freebsd32 implementation of sigqueue()
The previous misuse of sys_sigqueue() was sending random register or
stack garbage to 64-bit targets.  The freebsd32 implementation preserves
the sival_int member of value when signaling a 64-bit process.

Document the mixed ABI implementation of union sigval and the
incompability of sival_ptr with pointer integrity schemes.

Reviewed by:	kib, wblock
MFC after:	1 week
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D10605
2017-05-05 18:49:39 +00:00
Mateusz Guzik
8066a14a3c cache: stop holding the ncneg_hot lock across purging
Only non-hot entries are purged so the lock is not needed in the first place.
This saves one lock/unlock pair.

MFC after:	1 week
2017-05-04 03:11:59 +00:00
Conrad Meyer
29dfb631d8 Extend cpuset_get/setaffinity() APIs
Add IRQ placement-only and ithread-only API variants. intr_event_bind
has been extended with sibling methods, as it has many more callsites in
existing code.

Reviewed by:	kib@, adrian@ (earlier version)
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D10586
2017-05-03 18:41:08 +00:00
Konstantin Belousov
acd9f51725 Add asserts to verify stability of struct proc and struct thread layouts.
Some notes:
- Only i386 and amd64 layouts are checked, other Tier-1 (or close to
  it) architectures would benefit from the same check.
- Unconditional enabling of the asserts depend on the stability of locks
  memory layout.  If locks are optimized to avoid bloat when some debugging
  or profiling features turned off, it makes sense to only assert layout
  for production configs.

Reviewed by:	badger, emaste, jhb, vangyzen
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D10526
2017-04-27 21:24:50 +00:00
Patrick Kelsey
1431521236 Remove unnecessary check for NULL mbuf in soreceive_generic().
This check has been redundant since it was introduced in r162554.

Reviewed by:	emaste, glebius
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D10322
2017-04-25 19:54:34 +00:00
Edward Tomasz Napierala
04005c2f92 Make it possible to terminate "show lockedbufs" by pressing "q".
MFC after:	2 weeks
2017-04-23 22:20:25 +00:00
Edward Tomasz Napierala
10be945708 Improve BUF_TRACKING by not displaying NULL entries.
Reviewed by:	cem
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D10443
2017-04-23 17:39:31 +00:00
Gleb Smirnoff
83c9dea1ba - Remove 'struct vmmeter' from 'struct pcpu', leaving only global vmmeter
in place.  To do per-cpu stats, convert all fields that previously were
  maintained in the vmmeters that sit in pcpus to counter(9).
- Since some vmmeter stats may be touched at very early stages of boot,
  before we have set up UMA and we can do counter_u64_alloc(), provide an
  early counter mechanism:
  o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter.
  o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter,
    so that at early stages of boot, before counters are allocated we already
    point to a counter that can be safely written to.
  o For sparc64 that required a whole dummy pcpu[MAXCPU] array.

Further related changes:
- Don't include vmmeter.h into pcpu.h.
- vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit,
  to match kernel representation.
- struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exclusion.

This is based on benno@'s 4-year old patch:
https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html

Reviewed by:	kib, gallatin, marius, lidl
Differential Revision:	https://reviews.freebsd.org/D10156
2017-04-17 17:34:47 +00:00
Gleb Smirnoff
fef0991322 Typo! 2017-04-17 17:07:51 +00:00
Gleb Smirnoff
9ed01c32e0 All these files need sys/vmmeter.h, but now they got it implicitly
included via sys/pcpu.h.
2017-04-17 17:07:00 +00:00
Gleb Smirnoff
6286dc78d4 Remove unneeded include of vm_phys.h. 2017-04-17 16:51:04 +00:00
Edward Tomasz Napierala
b66f26e931 Don't try to write out bufs that have already failed with ENXIO.
This fixes some panics after disconnecting mounted disks.

Submitted by:	imp (slightly different version, which I've then lost)
Reviewed by:	kib, imp, mckusick
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D9674
2017-04-14 20:15:34 +00:00
Maxim Sobolev
63649db042 Restore ability to shutdown DGRAM sockets, still forcing ENOTCONN to be returned
by the shutdown(2) system call. This ability has been lost as part of the svn
revision 285910.

Reviewed by:	ed, rwatson, glebius, hiren
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D10351
2017-04-14 17:23:28 +00:00
Andrey V. Elsukov
57386f5dce Fix the build.
Reported by:	lwhsu
2017-04-14 10:21:38 +00:00
Andrey V. Elsukov
c33a231337 Rework r316770 to make it protocol independent and general, like we
do for streaming sockets.

And do more cleanup in the sbappendaddr_locked_internal() to prevent
leak information from existing mbuf to the one, that will be possible
created later by netgraph.

Suggested by:	glebius
Tested by:	Irina Liakh <spell at itl ua>
MFC after:	1 week
2017-04-14 09:00:48 +00:00