Commit Graph

15682 Commits

Author SHA1 Message Date
mjg
630e712410 mtx: implement thread lock fastpath
MFC after:	1 week
2017-10-21 22:40:09 +00:00
mmel
be76b77ce7 Add AT_HWCAP2 ELF auxiliary vector.
- allocate value for new AT_HWCAP2 auxiliary vector on all platforms.
 - expand 'struct sysentvec' by new 'u_long *sv_hwcap2', in exactly
   same way as for AT_HWCAP.

MFC after:	1 month
Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D12699
2017-10-21 12:05:01 +00:00
markj
ce2d698d24 Avoid the nbp lookup in the final loop iteration in flushbuflist().
The end of the loop must re-lookup the next buf since the bufobj lock
is dropped in the loop body. If the lookup fails, the loop is restarted.
This mechanism non-obviously also terminates the loop when the end of
the buf list is reached. Split up the two loops termination cases to
make the code a bit less fragile. No functional change intended.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D12730
2017-10-20 14:56:13 +00:00
mjg
002d36b454 mtx: fix up UP build after r324778
Reported by:	Michael Butler
2017-10-20 14:04:01 +00:00
mjg
7c60734b41 Mark kdb_active as __read_frequently and switch to bool to eat less space. 2017-10-20 04:02:53 +00:00
mjg
abcba1b168 rwlock: reduce lockstat branches in the slowpath
MFC after:	1 week
2017-10-20 03:32:42 +00:00
mjg
cb00d2eba3 mtx: stop testing SCHEDULER_STOPPED in kabi funcs for spin mutexes
There is nothing panic-breaking to do in the unlock case and the lock
case will fallback to the slow path doing the check already.

MFC after:	1 week
2017-10-20 00:34:25 +00:00
mjg
96ff69bdc9 mtx: clean up locking spin mutexes
1) shorten the fast path by pushing the lockstat probe to the slow path
2) test for kernel panic only after it turns out we will have to spin,
in particular test only after we know we are not recursing

MFC after:	1 week
2017-10-20 00:30:35 +00:00
mjg
2b6199af16 sysctl: only take mem lock if oldlen is > 4 * PAGE_SIZE
The previous limit of just one page is hit by ps.

The entire mechanism should be reworked, if not whacked. It seems the intent
is to reduce kernel dos-ability - some handlers wire the amount of memory
passed here. Handlers should probably stop wiring in the first place or in
the worst case indicate they are doing so so that the check is done only if
necessary. It should also probably be a counter, not a lock.

MFC after:	1 week
2017-10-19 01:38:31 +00:00
mjg
a5e44043ca execve: avoid one proc lock/unlock trip unless PTRACE_EXEC is set
MFC after:	1 week
2017-10-19 00:46:15 +00:00
mjg
7b0fc84b9d Tidy up pmc support at execve.
The proc-specific check is inherently racy, so the code can just unlock
beforehand.

MFC after:	1 week
2017-10-19 00:38:14 +00:00
mjg
e611bbff76 sysvsem: check if semu_list has anything on it before grabbing the lock
This should get a process-specific support instead.

MFC after:	1 week
2017-10-19 00:31:00 +00:00
mjg
8f29191fc8 Don't take Giant for SMP status and cpu topology sysctls.
Not only this lock doesn't play any role here, dirtying it slows down
other things a little bit as giant-held checks (e.g. DROP_GIANT) are
spread all over the kernel.

MFC after:	1 week
2017-10-18 22:00:44 +00:00
markj
c87fb69add Move kernel dump offset tracking into MI code.
All of the kernel dump implementations keep track of the current offset
("dumplo") within the dump device. However, except for textdumps, they
all write the dump sequentially, so we can reduce code duplication by
having the MI code keep track of the current offset. The new
dump_append() API can be used to write at the current offset.

This is needed to implement support for kernel dump compression in the
MI kernel dump code.

Also simplify dump_encrypted_write() somewhat: use dump_write() instead
of duplicating its bounds checks, and get rid of the redundant offset
tracking.

Reviewed by:	cem
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D11722
2017-10-18 15:38:05 +00:00
brooks
fb0dea4f57 Remove mbpool(9) now that it has no consumers.
mbpool existed to support NICs with memory interfaces and all remaining
comsumers were removed earlier this year with NATM.

Reviewed by:	jhb
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D10513
2017-10-18 00:18:03 +00:00
markj
9fce266679 Fix a racy VI_DOOMED check in MNT_VNODE_FOREACH_ALL().
MNT_VNODE_FOREACH_ALL() is supposed to avoid returning doomed vnodes,
but the VI_DOOMED check it used was done without the vnode interlock
held, so it could race with a concurrent vgone().

Submitted by:	Don Morris <don.morris@isilon.com>
Reviewed by:	kib, mckusick
MFC after:	1 week
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D12704
2017-10-17 19:41:45 +00:00
avos
17ed4f0710 mbuf(9): unbreak m_fragment()
- Fix it by replacing m_cat() with m_prev->m_next = m_new
(m_cat() will try to append data - as a result, there will be no
fragmentation).
- Move some constants out of the loop.

Was previously tested with D4077.

Differential Revision:	https://reviews.freebsd.org/D4090
2017-10-16 21:46:11 +00:00
kib
49a0c8651e Re-evaluate thread' signal mask after ptracestop().
The stop drops process lock, which allows the signal mask to be
changed and our selected signal might become blocked, i.e. should be
returned to the process queue instead of delivery.

Also, for the existing check of the process no longer having an
attached debugger, we should not loose the signal, but requeue it.

Reported and tested by:	bdrewery
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-16 20:21:51 +00:00
kib
30cdab3f84 Improve assertion that an ignored or blocked signal is not delivered.
Split two conditions into separate asserts.  Print additional details,
like the signal number and action value.

Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-16 20:15:19 +00:00
kib
27e9b9ded2 Style.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-10-16 20:11:29 +00:00
mjoras
88e3a903b6 Properly reset the fields in clean_unrhdr.
In r324542 I neglected to reset the first and last fields of struct
unrhdr. This causes a tmpfs to fail the unr(9) consistency checks with
DIAGNOSTIC on. Fix this by resetting the fields by calling init_unrhdr.
While here, change a loop to use TAILQ_FOREACH_SAFE since it is more
readable and equally fast.

Reported by:	David Wolfskill <david@catwhisker.org>
Approved by:	rstone (mentor)
Sponsored by:	Dell EMC Isilon
2017-10-16 16:14:50 +00:00
tijl
5b4d8cee9d When a Linux program tries to access a /path the kernel tries
/compat/linux/path before /path.  Stop following symbolic links when
looking up /compat/linux/path so dead symbolic links aren't ignored.
This allows syscalls like readlink(2) and lstat(2) to work on such links.
And open(2) will return an error now instead of trying /path.
2017-10-15 18:53:21 +00:00
mjg
e99dca871b mtx: fix up owner_mtx after r324609
Now that MTX_UNOWNED is 0 the test was alwayas false.
2017-10-14 00:47:30 +00:00
alc
e64f2e0930 Address two problems with sendfile(..., SF_NOCACHE) and apply one
"optimization".  First, sendfile(..., SF_NOCACHE) frees pages without
checking whether those pages are mapped.  This can leave the system
with mappings to free or repurposed pages.  Second, a page can be
busied between the time of the current busy test and acquiring the
object lock.  Essentially, the test performed before the object lock
is acquired can only be regarded as an optimization to short-circuit
further work on the page.  It cannot, however, be relied upon to prove
that it is safe to free the page.  Third, when sendfile(..., SF_NOCACHE)
was originally implemented, vm_page_deactivate_noreuse() did not yet
exist.  Use vm_page_deactivate_noreuse() instead of vm_page_deactivate(),
because it comes closer to freeing the page.

In collaboration with:	glebius
Discussed with:	gallatin, kib, markj
X-MFC after:	r324448
2017-10-13 16:31:50 +00:00
avg
13558f906a remove process and jail directory machinations from dounmount
The manipulations done by mountcheckdirs() are not that useful during
the unmount, they can bring about unexpected security consequences.

Thic change effectively reverts the change in r73241.

The change also allows to simplify the handling of rootvnode global
variable.

Discussed with:	mckusick, mjg, kib
Reviewed by:	trasz
MFC after:	1 month
Differential Revision: https://reviews.freebsd.org/D12366
2017-10-13 09:42:05 +00:00
emaste
a97448391f regen init_sysent.c r324560 2017-10-12 15:48:37 +00:00
emaste
5d14c78c8e allow posix_fallocate in capability mode
posix_fallocate is logically equivalent to writing zero blocks to the
desired file size and there is no reason to prevent calling it in
capability mode. posix_fallocate already checked for the CAP_WRITE
right, so we merely need to list it in capabilities.conf.

Reviewed by:	allanjude
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D12640
2017-10-12 15:45:53 +00:00
mjoras
1bac95586a Add clearing function for unr(9).
Previously before you could call unrhdr_delete you needed to
individually free every allocated unit. It is useful to be able to tear
down the unr without having to go through this process, as it is
significantly faster than freeing the individual units.

Reviewed by:	cem, lidl
Approved by:	rstone (mentor)
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D12591
2017-10-11 21:53:50 +00:00
kib
49c3eee83d The th_bintime, th_microtime and th_nanotime members of the timehand
all cache the last system time (uptime + boottime).  Only the format
differs.  Do not re-calculate the bintime and simply use the value
used to calculate the microtime and nanotime.

Group all the updates under the relevant comment.  Remove obsoleted
XXX part.

Submitted by:	Sebastian Huber <sebastian.huber@embedded-brains.de>
MFC after:	1 week
2017-10-11 11:03:11 +00:00
sbruno
89fa9e690e match sendfile() error handling to send().
Sendfile() should match the error checking order of send() which
is currently:

SBS_CANTSENDMORE
so_error
SS_ISCONNECTED

Submitted by:	Jason Eggleston <jason@eggnet.com>
Reviewed by:	glebius
MFC after:	2 weeks
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D12633
2017-10-10 22:21:05 +00:00
sbruno
1b75921a14 Revert r324405 at the request of the submitter pending better solution.
Submitted by:	Jason Eggleston <jason@eggnet.com>
Sponsored by:	Limelight Networks
2017-10-10 00:32:21 +00:00
glebius
4bbcc4b9b3 Improvements to sendfile(2) mbuf free routine.
o Fall back to default m_ext free mech, using function pointer in
  m_ext_free, and remove sf_ext_free() called directly from mbuf code.
  Testing on modern CPUs showed no regression.
o Provide internally used flag EXT_FLAG_SYNC, to mark that I/O uses
  SF_SYNC flag.  Lack of the flag allows us not to dereference
  ext_arg2, saving from a cache line miss.
o Create function sendfile_free_page() that later will be used, for
  multi-page mbufs.  For now compiler will inline it into
  sendfile_free_mext().

In collaboration with:	gallatin
Differential Revision:	https://reviews.freebsd.org/D12615
2017-10-09 21:06:16 +00:00
glebius
492b1ec8b6 In mb_dupcl() don't copy full m_ext, to avoid cache miss. Respectively,
in mb_free_ext() always use fields from the original refcount holding
mbuf (see. r296242) mbuf.  Cuts another cache miss from mb_free_ext().

However, treat EXT_EXTREF mbufs differently, since they are different -
they don't have a refcount holding mbuf.

Provide longer comments in m_ext declaration to explain this change
and change from r296242.

In collaboration with:	gallatin
Differential Revision:	https://reviews.freebsd.org/D12615
2017-10-09 20:51:58 +00:00
glebius
a9152a7f99 Shorten list of arguments to mbuf external storage freeing function.
All of these arguments are stored in m_ext, so there is no reason
to pass them in the argument list.  Not all functions need the second
argument, some don't even need the first one.  The second argument
lives in next cache line, so not dereferencing it is a performance
gain.  This was discovered in sendfile(2), which will be covered by
next commits.

The second goal of this commit is to bring even more flexibility
to m_ext mbufs, allowing to create more fields in m_ext, opaque to
the generic mbuf code, and potentially set and dereferenced by
subsystems.

Reviewed by:	gallatin, kbowling
Differential Revision:	https://reviews.freebsd.org/D12615
2017-10-09 20:35:31 +00:00
hselasky
69fd8ab233 When showing the sleepqueues from the in-kernel debugger,
properly dump all the sendqueues and not just the first one

History:
It appears that in the commit which introduced the code,
r165272, the array indexes of "sq_blocked[0]" and "td_name[i]"
were interchanged. In r180927 "td_name[i]" was corrected to
"td_name[0]", but "sq_blocked[0]" was left unchanged.

PR:		222624
Discussed with:	kmacy @
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2017-10-09 18:33:29 +00:00
alc
6a7e568f99 The recent change to initialization of blists (r324420) relied on '-1'
appearing only where the code explicitly set it, but since much of the
data was not initialized, '-1' appeared other places too, and led to
panics.  Clear the allocated data before initializing nonzero values by
allocating with M_ZERO.

Submitted by:	Doug Moore <dougm@rice.edu>
Reported by:	Oleg V. Nauman <oleg@theweb.org.ua>, cy
Tested by:	Oleg V. Nauman <oleg@theweb.org.ua>
MFC after:	1 week
X-MFC with:	r324420
Differential Revision:	https://reviews.freebsd.org/D12627
2017-10-09 18:19:06 +00:00
alc
008747750e The blst_radix_init function has two purposes - to compute the number of
nodes to allocate for the blist, and to initialize them.  The computation
can be done much more quickly by identifying the terminating node, if any,
at every level of the tree and then summing the number of nodes at each
level that precedes the topmost terminator.  The initialization can also be
done quickly, since settings at the root mark the tree as all-allocated, and
only a few terminator nodes need to be marked in the rest of the tree.
Eliminate blst_radix_init, and perform its two functions more simply in
blist_create.

The allocation of the blist takes places in two pieces, but there's no good
reason to do so, when a single allocation is sufficient, and simpler.
Allocate the blist struct, and the array of nodes associated with it, with a
single allocation.

Submitted by:	Doug Moore <dougm@rice.edu>
Reviewed by:	markj (an earlier version)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D11968
2017-10-08 22:17:39 +00:00
ian
5cc2194826 Add eventhandler notifications for newbus device attach/detach.
The detach case is slightly complicated by the fact that some in-kernel
consumers may want to know before a device detaches (so they can release
related resources, stop using the device, etc), but the detach can fail. So
there are pre- and post-detach notifications for those consumers who need to
handle all cases.

A couple salient comments from the review, they amount to some helpful
documentation about these events, but there's currently no good place for
such documentation...

Note that in the current newbus locking model, DETACH_BEGIN and
DETACH_COMPLETE/FAILED sequence of event handler invocation might interweave
with other attach/detach events arbitrarily. The handlers should be prepared
for such situations.

Also should note that detach may be called after the parent bus knows the
hardware has left the building. In-kernel consumers have to be prepared to
cope with this race.

Differential Revision:	https://reviews.freebsd.org/D12557
2017-10-08 17:33:49 +00:00
ian
69fb8e033a Restore the ability to deregister an eventhandler from within the callback.
When the EVENTHANDLER(9) subsystem was created, it was a documented feature
that an eventhandler callback function could safely deregister itself. In
r200652 that feature was inadvertantly broken by adding drain-wait logic to
eventhandler_deregister(), so that it would be safe to unload a module upon
return from deregistering its event handlers.

There are now 145 callers of EVENTHANDLER_DEREGISTER(), and it's likely many
of them are depending on the drain-wait logic that has been in place for 8
years. So instead of creating a separate eventhandler_drain() and adding it
to some or all of those 145 call sites, this creates a separate
eventhandler_drain_nowait() function for the specific purpose of
deregistering a callback from within the running callback.

Differential Revision:	https://reviews.freebsd.org/D12561
2017-10-08 17:21:16 +00:00
sbruno
692f29cb6e Check so_error early in sendfile() call. Prior to this patch, if a
connection was reset by the remote end, sendfile() would just report
ENOTCONN instead of ECONNRESET.

Submitted by:	Jason Eggleston <jason@eggnet.com>
Reviewed by:	glebius
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D12575
2017-10-07 23:30:57 +00:00
mjg
6523dff125 namecache: factor out ~MAKEENTRY lookups from the common path
Lookups of the sort are rare compared to regular ones and succesfull ones
result in removing entries from the cache.

In the current code buckets are rlocked and a trylock dance is performed,
which can fail and cause a restart. Fixing it will require a little bit
of surgery and in order to keep the code maintaineable the 2 cases have
to split.

MFC after:	1 week
2017-10-06 23:05:55 +00:00
markj
08a37c1513 Let stack_create(9) take a malloc flags argument.
Reviewed by:	cem
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D12614
2017-10-06 21:52:28 +00:00
manu
13125fb720 vfs_export_lookup: Fix r324054
When using the default address list nam is still valid, the code in
r324054 assumed that is was NULL.

Reported by:	Guy Yur <guyyur@gmail.com>
Tested by:	Guy Yur <guyyur@gmail.com>
2017-10-06 09:02:36 +00:00
mjg
6f516b023d locks: take the number of readers into account when waiting
Previous code would always spin once before checking the lock. But a lock
with e.g. 6 readers is not going to become free in the duration of once spin
even if they start draining immediately.

Conservatively perform one for each reader.

Note that the total number of allowed spins is still extremely small and is
subject to change later.

MFC after:	1 week
2017-10-05 19:18:02 +00:00
shurd
bb993c697a Fix "taskqgroup_attach: setaffinity failed: 3" with iflib drivers
Improved logging added in r323879 exposed an error during
attach. We need the irq, not the rid to work correctly. em uses
shared irqs, so it will use the same irq for TX as RX. bnxt does
not use shared irqs, or TX irqs at all, so there's no need to set
the TX irq affinity.

Reviewed by:	sbruno
Approved by:	sbruno (mentor)
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D12496
2017-10-05 14:43:30 +00:00
mjg
a059e2d8eb locks: partially tidy up waiting on readers
spin first instant of instantly re-readoing and don't re-read after
spinning is finished - the state is already known.

Note the code is subject to significant changes later.

MFC after:	1 week
2017-10-05 13:01:18 +00:00
avg
341d9df85b sysctl-s in a module should be accessible only when the module is initialized
A sysctl can have a custom handler that may access data that is initialized
via SYSINIT(9) or via a module event handler (also invoked via SYSINIT).
Thus, it is not safe to allow access to the module's sysctl-s until
the initialization is performed.  Likewise, we should not allow access
to teh sysctl-s after the module is uninitialized.
The latter is easy to achieve by properly ordering linker_file_unregister_sysctls
and linker_file_sysuninit.
The former is not as easy for two reasons:
- the initialization may depend on tunables which get set when sysctl-s are
  registered, so we need to set the tunables before running sysinit-s
- the initialization may try to dynamically add more sysctl-s under statically
  defined sysctl nodes
So, this change splits the sysctl setup into two phases.  In the first phase
the sysctl-s are registered as before but they are disabled and hidden from
consumers.  In the second phase, done after sysinit-s, normal access to the
sysctl-s is enabled.

The change should affect only dynamic module loading and unloading after
the system boot-up.  Nothing changes for sysctl-s compiled into the kernel
and sysctl-s in preloaded modules.

Discussed with:	hselasky, ian, jhb
Reviewed by:	julian, kib
MFC after:	2 weeks
Sponsored by:	Panzura
Differential Revision: https://reviews.freebsd.org/D12545
2017-10-05 12:32:14 +00:00
glebius
7168fac388 Hide struct socket and struct unpcb from the userland.
Violators may define _WANT_SOCKET and _WANT_UNPCB respectively and
are not guaranteed for stability of the structures.  The violators
list is the the usual one: libprocstat(3) and netstat(1) internally
and lsof in ports.

In struct xunpcb remove the inclusion of kernel structure and add
a bunch of spare fields.  The xsocket already has socket not included,
but add there spares as well.  Embed xsockbuf into xsocket.

Sort declarations in sys/socketvar.h to separate kernel only from
userland available ones.

PR:		221820 (exp-run)
2017-10-02 23:29:56 +00:00
alc
513b841b01 Use vm_page_active() rather than directly accessing the page's queue
field.

Reviewed by:	kib, markj
MFC after:	2 weeks
X-MFC with:	r324146
2017-10-02 07:30:21 +00:00
avg
8e8e725268 revert r324166, it has an unrelated change in it 2017-10-01 16:37:54 +00:00