15888 Commits

Author SHA1 Message Date
Brooks Davis
91a743004c Use umtx_copyin_umtx_time32() in __umtx_op_lock_umutex_compat32().
Non-NULL timeouts were copied in improperly and could produce failures
due to incompatible data structures.

Reviewed by:	kib
MFC after:	3 days
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14587
2018-03-06 01:52:04 +00:00
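
A minimal sketch of why the dedicated 32-bit copyin matters: the compat
layout uses 32-bit time fields, so the kernel has to widen them field by
field instead of copying raw bytes into the native struct (the types and
names below are illustrative, not the actual umtx code):

	struct timespec32 {
		int32_t tv_sec;
		int32_t tv_nsec;
	};
	struct timespec32 ts32;
	struct timespec ts;
	int error;

	/* uaddr: the user-space pointer argument */
	error = copyin(uaddr, &ts32, sizeof(ts32));
	if (error != 0)
		return (error);
	ts.tv_sec = ts32.tv_sec;	/* widen 32-bit seconds to time_t */
	ts.tv_nsec = ts32.tv_nsec;
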
Brooks Davis
aec37bad99 Regen after r330517. 2018-03-05 17:02:50 +00:00
Brooks Davis
1c1b4c66b6 Remove remnants of 1990s efforts to let us run Net/OpenBSD binaries.
No functional change (comments change in some generated files.)

Reviewed by:	kib
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14571
2018-03-05 17:02:16 +00:00
Mateusz Guzik
0ad122a966 lockmgr: save on sleepq when cmpset fails 2018-03-05 00:30:07 +00:00
Mateusz Guzik
93d41967da lockmgr: whack unused lockmgr_note_exclusive_upgrade 2018-03-04 22:14:20 +00:00
Mateusz Guzik
9f4e008d4a mtx: tidy up recursion handling in thread lock
Normally, after grabbing the lock, it has to be verified that we got the
right one to begin with. However, if we are recursing, the lock cannot have
changed, so the check can be avoided. In particular, this avoids a re-read
of the lock in the non-recursing case when the lock turns out to have changed.

While here, avoid an irq trip if this happens.

Tested by:	pho (previous version)
2018-03-04 22:01:23 +00:00
Mateusz Guzik
a8e747c5e7 sx: don't do an atomic op in upgrade if it cannot succeed
The code already pays the cost of reading the lock to obtain the waiters
flag. Checking whether there is more than one reader is not a problem and
avoids dirtying the line.

This also fixes a small corner case: if waiters were to show up between
reading the flag and upgrading the lock, the operation would fail even
though it should not. No correctness change here though.
2018-03-04 21:41:05 +00:00
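
A generic sketch of the pattern: the lock word is read anyway for the
waiters flag, so only attempt the upgrade cmpset when a single reader
holds the lock (all names below are illustrative, not the sx internals):

	uintptr_t v;
	int success;

	v = lk->lk_word;		/* read already paid for */
	if (READERS(v) == 1)		/* sole reader: upgrade can succeed */
		success = atomic_cmpset_acq_ptr(&lk->lk_word, v, v | WRITER);
	else
		success = 0;		/* skip the futile cmpset and keep
					   the cache line clean */
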
Mateusz Guzik
d94df98c5c locks: fix a corner case in r327399
If there were exactly rowner_retries/asx_retries (by default: 10) transitions
between read and write state and the waiters still did not get the lock, the
next owner -> reader transition would result in the code correctly falling
back to turnstile/sleepq where it would incorrectly think it was waiting
for a writer and decide to leave turnstile/sleepq to loop back. From this
point it would take ts/sq trips until the lock gets released.

The bug sometimes manifested itself in stalls during -j 128 package builds.

Refactor the code to fix the bug; while here, remove some of the gratuitous
differences between rw and sx locks.
2018-03-04 21:38:30 +00:00
Mateusz Guzik
1c6987ebc5 lockmgr: start decomposing the main routine
The main routine takes 8 args, 3 of which are almost the same for most uses.
This in particular pushes it above the limit of 6 arguments passable through
registers on amd64 making it impossible to tail call.

This is a prerequisite for further cleanups.

Tested by:	pho
2018-03-04 19:12:54 +00:00
Hans Petter Selasky
2077229b56 Allow pause_sbt() to catch signals during sleep by passing C_CATCH flag.
Define the pause_sig() helper function macro similarly to other kernel
functions that catch signals. Update the outdated function description.

Discussed with:	kib@
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2018-03-03 18:36:38 +00:00
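
A hedged usage sketch: pause_sig() behaves like pause(9) but wakes up
early when a signal arrives (the wmesg and timeout are illustrative):

	int error;

	error = pause_sig("zzz", hz);	/* sleep for up to one second */
	if (error == EINTR || error == ERESTART)
		;	/* interrupted by a signal */
	else if (error == EWOULDBLOCK)
		;	/* the full timeout elapsed */
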
Hans Petter Selasky
54fc03834a Correct the return code from pause() during cold startup from zero to
EWOULDBLOCK. This also matches the description in pause(9).

Discussed with:	kib@
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2018-03-03 18:12:21 +00:00
Brooks Davis
93e48a303a Rename kernel-only members of semid_ds and msgid_ds.
This deliberately breaks the API in preparation for future syscall
revisions which will remove these nonstandard members.

In an exp-run a single port (devel/qemu-user-static) was found to
use them, which it did because it emulates system calls.  This has
been fixed in the ports tree.

PR:		224443 (exp-run)
Reviewed by:	kib, jhb (previous version)
Exp-run by:	antoine
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14490
2018-03-02 22:10:48 +00:00
Mateusz Guzik
c505b59961 sx: fix adaptive spinning broken in r327397
The condition was flipped.

In particular heavy multithreaded kernel builds on zfs started suffering
due to nested sx locks.

For instance make -s -j 128 buildkernel:

before: 3326.67s user 1269.62s system 6981% cpu 1:05.84 total
after: 3365.55s user 911.27s system 6871% cpu 1:02.24 total

ps.
      .-'---`-.			      .-'---`-.
    ,'          `.		    ,'          `.
    |             \		    |             \
    |              \		    |              \
    \           _  \		    \           _  \
    ,\  _    ,'-,/-)\		    ,\  _    ,'-,/-)\
    ( * \ \,' ,' ,'-)		    ( * \ \,' ,' ,'-)
     `._,)     -',-')		     `._,)     -',-')
       \/         ''/		       \/         ''/
        )        / /		        )        / /
       /       ,'-'		       /       ,'-'
2018-03-02 21:26:27 +00:00
Mateusz Guzik
9d4e369ae8 Don't generate data in sysctl_out_proc unless we intend to copy out.
The first call is used to gauge how much space is needed. Just computing
the size instead of generating the output allows us to avoid taking the
proctree lock.
2018-02-25 15:16:58 +00:00
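
A sketch of the two-pass sysctl idiom this builds on: when req->oldptr is
NULL the caller only wants a size estimate, so the handler can account for
the length without generating the data or taking locks (struct item,
item_count(), and fill_item() are hypothetical):

	static int
	sysctl_example(SYSCTL_HANDLER_ARGS)
	{
		struct item it;

		if (req->oldptr == NULL) {
			/* Sizing pass: report the space needed without
			   generating output (or taking the big lock). */
			return (SYSCTL_OUT(req, NULL,
			    item_count() * sizeof(it)));
		}
		/* Output pass: generate the data and copy it out. */
		fill_item(&it);
		return (SYSCTL_OUT(req, &it, sizeof(it)));
	}
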
Jeff Roberson
1c2529ab32 Fix issues with sparse cpu allocation. Consistently use mp_maxid + 1.
Reported by:	pho
Reviewed by:	markj
Sponsored by:	Netflix, Dell/EMC Isilon
2018-02-25 00:35:21 +00:00
Conrad Meyer
63901c0171 kern/sys_generic.c: style(9) return(foo) -> return (foo)
No functional change.

Sponsored by:	Dell EMC Isilon
2018-02-24 01:15:33 +00:00
Jeff Roberson
5f8cd1c0bf Add a generic Proportional Integral Derivative (PID) controller algorithm and
use it to regulate page daemon output.

This provides much smoother and more responsive page daemon output, anticipating
demand and avoiding pageout stalls by increasing the number of pages to match
the workload.  This is a reimplementation of work done by myself and mlaier at
Isilon.

Reviewed by:	bsdimp
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14402
2018-02-23 22:51:51 +00:00
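
A minimal, self-contained sketch of a PID step as it might regulate
pageout targets; this is not the committed kernel interface, and the
struct layout, divisor-style gains, and names are assumptions:

	struct pidctl {
		int pc_setpoint;	/* e.g. the free page target */
		int pc_integral;	/* accumulated error */
		int pc_olderr;		/* previous error, for the derivative */
		int pc_pd, pc_id, pc_dd;	/* gains as integer divisors */
	};

	static int
	pidctl_step(struct pidctl *pc, int input)
	{
		int error, deriv, output;

		error = pc->pc_setpoint - input;
		pc->pc_integral += error;
		deriv = error - pc->pc_olderr;
		pc->pc_olderr = error;
		/* P + I + D terms, fixed point via integer divisors. */
		output = error / pc->pc_pd + pc->pc_integral / pc->pc_id +
		    deriv / pc->pc_dd;
		return (output > 0 ? output : 0);	/* pages to scan */
	}
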
Kirk McKusick
16680b6af5 Include the error number in the "fsync: giving up on dirty" message
(in case it ever starts happening again in spite of r328444).

Submitted by: Andreas Longwitz <longwitz at incore.de>
2018-02-23 21:57:10 +00:00
Konstantin Belousov
4c8a8cfcde Restore UP build.
Reviewed by:	truckman
Sponsored by:	The FreeBSD Foundation
2018-02-23 18:26:31 +00:00
Ed Maste
315fbaeca2 Correct misspellings of "pseudo" in sys/ comments
contrib code and a #define in intel_ata.h are unchanged.
2018-02-23 18:15:50 +00:00
Don Lewis
97e9382d56 Decrease latency by not wrapping the idle loop's potentially lengthy
search for a thread to steal inside a critical section.  Since this
allows the search to be preempted, restart the search if preemption
happens since the search results found earlier may no longer be
valid.

Decrease the latency of starting a thread that may be assigned to
this CPU during the search by polling for incoming threads during
the search and switching to that thread instead of continuing the
search.

Test for stale search results and restart the search before going
through the expense of calling tdq_lock_pair().  Retry some tests
after grabbing the locks since things may have changed while waiting
to get both locks.

Eliminate special case handling for stealing from an SMT peer that
uses 1 as the steal threshold.  This can only succeed if a thread
has been assigned but our SMT peer has not yet started executing
it.  This is quite rare and when it happens the other SMT thread
is generally waiting for the same tdq lock that we hold.  Basically
both SMT threads are racing to grab the same spin lock.

Add the kern.sched.always_steal knob from a ULE patch by jeff@.

Incorporate another idea from Jeff's ULE patch.  If sched_switch()
detects that the CPU is about to go idle, try to steal a thread
before switching to the idle thread.  Since the search for a thread
to steal has to be done inside a critical section in this context,
limit the impact on latency by adding the knob kern.sched.trysteal_limit
to limit the topological distance of the search and don't restart
the search if we detect stale results.  If this search can't find
a stealable thread, the idle loop can do a more complete search.
Also poll for threads being assigned to this CPU during the search
and switch to them instead of continuing the search.  This change
is responsible for the majority of the improvement in parallel
buildworld times.

In sched_balance_group(), change the minimum threshold for stealing
a thread from 1 to 2.  Poaching a newly assigned thread from a CPU
that is waking up but hasn't yet switched to that thread from idle
is likely very rare and is likely to have the same lock race as is
seen when stealing threads in the idle loop.  Also use tdq_notify()
to kick the destination CPU instead of always sending an IPI.
Update a stale comment; the number of transferable threads is not
calculated.

Reviewed by:	kib (earlier version)
Comments by:	avg, jeff, mav
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D12130
2018-02-23 00:12:51 +00:00
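
A rough sketch of the restart-on-stale-results loop described above;
every helper name here is hypothetical, standing in for the tdq
bookkeeping the real code uses:

	struct tdq *victim;
	u_int gen;

	for (;;) {
		gen = my_switch_count();	/* staleness token */
		victim = search_for_loaded_cpu(steal_limit); /* preemptible */
		if (my_tdq_load() > 0)
			break;		/* a thread arrived for us; run it */
		if (victim == NULL)
			break;		/* nothing stealable; go idle */
		if (gen != my_switch_count())
			continue;	/* preempted: results may be stale */
		if (try_steal_from(victim))
			break;		/* got one */
	}
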
Mateusz Guzik
a0c722bdbf Fix up sysctl vfs.buffercache broken in r329612
Sample problem:
top: sysctl(vfs.bufspace...) expected 8, got 4

Reported by:	O. Hartmann <ohartmann walstatt.org>
2018-02-22 20:39:25 +00:00
Eric van Gyzen
0127914caa sched_ule: update a comment to reflect reality
MFC after:	3 days
Sponsored by:	Dell EMC
2018-02-22 17:09:26 +00:00
Jeff Roberson
683ca3a432 Fix the broken subqueue assignment for the cleanq.
Reported by:	pho
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
2018-02-20 21:27:17 +00:00
Mateusz Guzik
500ca73d43 mtx: add debug assertions to mtx_spin_wait_unlocked 2018-02-20 20:39:34 +00:00
Mateusz Guzik
862db53fb5 Fix reaping on process fd close broken after r329449
The only consumer of proc_reap() other than proc_to_reap() was not
updated to stop taking PROC_SLOCK.

Reported by:    Juan Ramon Molina Menor <listjm club.fr>
2018-02-20 20:19:38 +00:00
Brooks Davis
b81e88d296 Reduce duplication in dynamic syscall registration code.
Remove the unused syscall_(de)register() functions in favor of the
better documented and easier to use syscall_helper_(un)register(9)
functions.

The default and freebsd32 versions differed in which array of struct
sysents they used, and the 32-bit code was missing a few of the updates
made to the main code as features were added.

Reviewed by:	cem
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14337
2018-02-20 18:08:57 +00:00
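
A hedged example of the surviving interface, syscall_helper_register(9),
for a hypothetical syscall named mycall (module glue elided):

	static struct syscall_helper_data mycall_syscalls[] = {
		SYSCALL_INIT_HELPER(mycall),	/* hypothetical syscall */
		SYSCALL_INIT_LAST
	};

	static int
	mycall_load(void)
	{
		return (syscall_helper_register(mycall_syscalls, 0));
	}

	static void
	mycall_unload(void)
	{
		syscall_helper_unregister(mycall_syscalls);
	}
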
Mateusz Guzik
681a1b752c Make killpg1 perform process validity checks without proc lock held. 2018-02-20 10:52:07 +00:00
Mateusz Guzik
81d68271d7 Reduce contention on the proctree lock during heavy package build.
There is a proctree -> allproc ordering established.

Most of the time it is either xlock -> xlock or slock -> slock.

On fork, however, there is a slock -> xlock pair which results in
pathological wait times due to threads keeping proctree held for
reading while all waiting on allproc. Switch this to xlock -> xlock.
A longer-term fix would get rid of proctree in this place to begin with.
Right now it is necessary to walk the session/process group lists to
determine which id is free. The walk can be avoided e.g. with bitmaps.

The exit path used to have one place which dealt with allproc and
then with proctree. Move the allproc acquire into the section protected
by proctree. This reduces contention against threads waiting on proctree
in the fork codepath - the fork proctree holder does not have to wait
for allproc as often.

Finally, move tidhash manipulation outside of the area protected by
either of these locks. The removal from the hash was already unprotected.
There is no legitimate reason to look up thread ids for a process still
under construction.

This results in about 50% wait time reduction during -j 128 package build.
2018-02-20 02:18:30 +00:00
Jeff Roberson
06220fa737 Further parallelize the buffer cache.
Provide multiple clean queues partitioned into 'domains'.  Each domain manages
its own bufspace and has its own bufspace daemon.  Each domain has a set of
subqueues indexed by the current cpuid to reduce lock contention on the cleanq.

Refine the sleep/wakeup around the bufspace daemon to use atomics as much as
possible.

Add a B_REUSE flag that is used to requeue bufs during the scan to approximate
LRU rather than locking the queue on every use of a frequently accessed buf.

Implement bufspace_reserve with only atomic_fetchadd to avoid loop restarts.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14274
2018-02-20 00:06:07 +00:00
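
A sketch of the loop-free reservation idea using only atomic_fetchadd
(struct, field, and limit names illustrative, not the committed
bufspace_reserve()):

	struct bufdomain_sketch {
		u_long	bd_space;	/* bytes currently reserved */
		u_long	bd_maxspace;	/* hard cap */
	};

	static int
	bufspace_reserve_sketch(struct bufdomain_sketch *bd, u_long size)
	{
		u_long prev;

		/* Optimistically reserve; back out on overshoot. */
		prev = atomic_fetchadd_long(&bd->bd_space, size);
		if (prev + size > bd->bd_maxspace) {
			atomic_subtract_long(&bd->bd_space, size);
			return (ENOSPC);
		}
		return (0);
	}
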
Mateusz Guzik
2ca66c1ef5 Fix process exit vs reap race introduced in r329449
The race manifested itself mostly in terms of crashes with "spin lock
held too long".

Relevant parts of respective code paths:

exit:				reap:
PROC_LOCK(p);
PROC_SLOCK(p);
p->p_state == PRS_ZOMBIE
PROC_UNLOCK(p);
				PROC_LOCK(p);
/* exit work */
				if (p->p_state == PRS_ZOMBIE) /* true */
					proc_reap()
					free proc
/* more exit work */
PROC_SUNLOCK(p);

Thus a still exiting process is reaped.

Prior to the change, the zombie check was followed by a slock/sunlock
trip which prevented the problem.

Even the code prior to this commit has a bug: the proc is still accessed
for statistics collection purposes. However, the severity is rather small
and the bug may be fixed in a future commit.

Reported by:	many
Tested by:	allanjude
2018-02-19 00:54:08 +00:00
Mateusz Guzik
d257698833 mtx: add mtx_spin_wait_unlocked
The primitive can be used to wait for the lock to be released. The
intended usage is for locks in structures which are about to be freed.

The benefit is the avoided interrupt enable/disable trip plus the atomic
op needed to grab the lock, and a shorter wait if the lock is held (since
there is no worry that someone will contend on the lock, re-reads can be
more aggressive).

Briefly discussed with:	 kib
2018-02-19 00:38:14 +00:00
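
Illustrative semantics of the primitive, a sketch rather than the
committed body: spin until the lock is observed unowned, never acquiring
it and never touching the interrupt state:

	while (atomic_load_acq_ptr(&m->mtx_lock) != MTX_UNOWNED)
		cpu_spinwait();		/* no atomic op, no irq trip */
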
Mateusz Guzik
7beb60820f exit: get rid of PROC_SLOCK when checking a process to report, take #2
The suspension counter needs synchronisation through slock, but we don't
need it to check if inspecting the counter is necessary to begin with.
In the common case it is not, thus avoid the lock if possible.

Reviewed by:	kib
Tested by:	pho
2018-02-18 21:07:15 +00:00
Mariusz Zaborski
965cd21173 Fix broken assertion in r329520.
Reported by:	pho@ lwhsu@
2018-02-18 20:04:39 +00:00
Brooks Davis
7a095112b2 Correct/improve the descriptions of kern.ipc.(shmsegs,sema,msqids).
The description of kern.ipc.shmsegs had been wrong since 2005.  I updated the
others (which were more correct) to match.

PR:		225933
Reviewed by:	cem
MFC after:	3 days
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14391
2018-02-18 19:19:36 +00:00
Mariusz Zaborski
20641651ec Use the fdeget_locked() function instead of fget_locked() in
sys_capability.c.

Reviewed by:	pjd@ (earlier version)
Discussed with:	mjg@
2018-02-18 15:27:24 +00:00
Mateusz Guzik
8bf6ff2226 Revert r329448.
Turns out it is actually racy, reproducible with stress2/misc/truss.sh.

Requested by:	kib
2018-02-17 17:23:43 +00:00
Mateusz Guzik
e4ccf57fdc Undo LOCK_PROFILING pessimisation after r313454 and r313455
With the option used to compile the kernel, both sx and rw shared ops would
always go to the slow path, which added avoidable overhead even when the
facility is disabled.

Furthermore the increased time spent doing uncontested shared lock acquire
would be bogusly added to total wait time, somewhat skewing the results.

Restore old behaviour of going there only when profiling is enabled.

This change is a no-op for kernels without LOCK_PROFILING (which is the
default).
2018-02-17 12:07:09 +00:00
Mateusz Guzik
ad58e5e86c exit: stop doing PROC_SLOCK just to call proc_reap
It immediately does PROC_SUNLOCK anyway and the lock plays no role.
2018-02-17 09:03:11 +00:00
Mateusz Guzik
9c0e785c58 exit: get rid of PROC_SLOCK when checking a process to report
All accessed fields are protected with already held process lock.
2018-02-17 08:48:45 +00:00
Mateusz Guzik
015cd8dc93 On process exit signal the parent after dropping the proctree lock. 2018-02-17 00:24:50 +00:00
Mateusz Guzik
7e588b9219 Unref the prison after proctree is dropped. 2018-02-17 00:23:56 +00:00
Mateusz Guzik
65f29b9caa Postpone sx_sunlock(&proctree_lock) on fork until after allproc is dropped.
There is a significant contention on the lock during -j 128 package build.
This change drops total wait time on this lock by 60%.
2018-02-17 00:23:28 +00:00
Mateusz Guzik
6776bfeb8f Tidy up kern_wait6
- don't relock curproc in msleep
- don't relock proctree if P_STATCHILD is spotted
- reformat the proc_to_reap call in the main loop
2018-02-17 00:21:50 +00:00
Brooks Davis
aff4f2d315 Reduce duplication in __acl_*_(file|link).
Add const to new kern_ functions and push down as required.

Reviewed by:	rwatson
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14174
2018-02-15 21:24:43 +00:00
Mark Johnston
05f0f0e9ea Fix the test for SET_FOREACH termination.
Unlike the queue(3) _FOREACH macros, the iterator for a SET_FOREACH is
not NULL after the end of the set is reached.
2018-02-15 17:35:40 +00:00
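
A sketch of the corrected idiom: since SET_FOREACH() leaves its iterator
non-NULL at the end, track the result explicitly (set and struct names
illustrative):

	SET_DECLARE(handler_set, struct handler);
	struct handler **hpp, *found;

	found = NULL;
	SET_FOREACH(hpp, handler_set) {
		if ((*hpp)->h_id == want) {
			found = *hpp;
			break;
		}
	}
	if (found == NULL)	/* don't test hpp: it is never NULL here */
		return (ENOENT);
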
Mateusz Guzik
f795032b47 rwlock: diff-reduction of runlock compared to sx sunlock 2018-02-14 20:37:33 +00:00
Bryan Drewery
70c144dc78 nanosleep(2): Fix bogus incrementing of rmtp by tc_tick_sbt on [EINTR].
sbt is the time in the future at which the tsleep_sbt() is expected to
complete.  sbtt is the current time.  Depending on the precision allowed by
the sysctl kern.timecounter.alloweddeviation, the start time may be
incremented by tc_tick_sbt.  The same increment is needed for the current
time sbtt before calculating the difference.  The impact of missing this
increment is that rmtp may increase by one tc_tick_sbt on every early
[EINTR] return.  If the same struct is passed in for rqtp as rmtp, this can
result in rqtp effectively incrementing by tc_tick_sbt and sleeping longer
than originally intended.

This problem was introduced in r247797.

Reviewed by:	kib, markj, vangyzen (all on an older version of the test)
MFC after:	2 weeks
Sponsored by:	Dell EMC
Differential Revision:	https://reviews.freebsd.org/D14362
2018-02-14 18:43:50 +00:00
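
The arithmetic of the fix as a sketch (names follow the description
above; this is illustrative, not the committed diff): if the deadline
sbt was bumped by tc_tick_sbt when the sleep started, apply the same
bias to the current time before computing the remainder:

	if (rounded_up)			/* sbt got +tc_tick_sbt at start */
		sbtt += tc_tick_sbt;	/* bias "now" identically */
	if (sbtt >= sbt)
		rmt = 0;		/* slept long enough; no remainder */
	else
		rmt = sbt - sbtt;	/* remainder reported via rmtp */
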
Mark Johnston
6026dcd7ca Add support for zstd-compressed user and kernel core dumps.
This works similarly to the existing gzip compression support, but
zstd is typically faster and gives better compression ratios.

Support for this functionality must be configured by adding ZSTDIO to
one's kernel configuration file. dumpon(8)'s new -Z option is used to
configure zstd compression for kernel dumps. savecore(8) now recognizes
and saves zstd-compressed kernel dumps with a .zst extension.

Submitted by:	cem (original version)
Relnotes:	yes
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D13101,
			https://reviews.freebsd.org/D13633
2018-02-13 19:28:02 +00:00
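
A hedged configuration example using the knobs named above (the dump
device path is illustrative):

	# kernel configuration: build in zstd compression support
	options 	ZSTDIO

	# enable zstd-compressed kernel crash dumps
	dumpon -Z /dev/ada0p3
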
Ian Lepore
157f3d7649 Fix bad indentation. Whitespace only, no functional changes.
Reported by:	bde@
2018-02-13 17:38:08 +00:00