Commit Graph

14479 Commits

Author SHA1 Message Date
Konstantin Belousov
b2557db607 Use per-cpu values for base and last in tc_cpu_ticks(). The values
are updated lockess, different CPUs write its own view of timecounter
state.  The critical section is done for safety, callers of
tc_cpu_ticks() are supposed to already enter critical section, or to
own a spinlock.

The change fixes sporadical reports of too high values reported for
the (W)CPU on platforms that do not provide cpu ticker and use
tc_cpu_ticks(), in particular, arm*.

Diagnosed and reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-09-25 13:03:57 +00:00
Mateusz Guzik
3c44a3495f kqueue: simplify kern_kqueue by not refing/unrefing creds too early
No functional changes.
2015-09-23 12:45:08 +00:00
Jeff Roberson
589c956a5a - Fix a nonsense reordering that somehow slipped into my last diff.
Reported by:	pho
2015-09-23 07:44:07 +00:00
Jeff Roberson
8264830c95 Some refactoring of the buf/vm interface.
- Eliminate bogus page replacement that is inconsistently applied in the
   invalidation loop in brelse.  This has been a no-op in modern times as
   biodone() is responsible for cleaning up after bogus pages.  This
   would've spammed the console with printfs at a minimum.
 - Allow the compiler and human readers alike to reason about allocbuf()
   by splitting it into constituent parts.
 - Separate the VM manipulating and buf manipulating code in brelse() and
   bufdone() so that the intentions are clear.  This makes it evident that
   there are several duplicated buf pages loops that will be consolidated
   at a later time.

Reviewed by:	kib
Tested by:	pho
Sponsored by:	EMC / Isilon Storage Division
2015-09-22 23:57:52 +00:00
Alan Cox
15aaea7892 Change vm_page_unwire() such that it (1) accepts PQ_NONE as the specified
queue and (2) returns a Boolean indicating whether the page's wire count
transitioned to zero.

Exploit this change in vfs_vmio_release() to avoid pointlessly enqueueing
a page that is about to be freed.

(An earlier version of this change was developed by attilio@ and kmacy@.
Any errors in this version are my own.)

Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
2015-09-22 18:16:52 +00:00
Hans Petter Selasky
c55f4c9445 Revert r287780 until more developers have their say.
Differential Revision:	https://reviews.freebsd.org/D3521
Requested by:		gnn
2015-09-22 06:51:55 +00:00
Bryan Drewery
6c5c24c98c vfs_mountroot_shuffle() never returns non-zero. 2015-09-22 03:34:07 +00:00
Konstantin Belousov
1f57d8c66b Ensure that maxproc does not exceed pid_max, at the time of boot.
Changes to kern.pid_max mib after the boot can break this relation.

The maxfiles value was calculated by the MAXFILES formula based on
maxproc value, but this change decouples them, and MAXFILES now
references maxusers.  Without manual tuning, the maxfiles default
value remains as it was prior to this commit.  But for systems which
have tuned maxproc and rely on maxfiles to adjust, additional
reconfiguration is needed.

Reported by:	rwatson
Reviewed by:	emaste
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-09-21 15:02:59 +00:00
Konstantin Belousov
cff8c6f2d1 Add support for weak symbols to the kernel linkers. It means that
linkers no longer raise an error when undefined weak symbols are
found, but relocate as if the symbol value was 0.  Note that we do not
repeat the mistake of userspace dynamic linker of making the symbol
lookup prefer non-weak symbol definition over the weak one, if both
are available.  In fact, kernel linker uses the first definition
found, and ignores duplicates.

Signature of the elf_lookup() and elf_obj_lookup() functions changed
to split result/error code and the symbol address returned.
Otherwise, it is impossible to return zero address as the symbol
value, to MD relocation code.  This explains the mechanical changes in
elf_machdep.c sources.

The powerpc64 R_PPC_JMP_SLOT handler did not checked error from the
lookup() call, the patch leaves the code as is (untested).

Reported by:	glebius
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-09-20 01:27:59 +00:00
Edward Tomasz Napierala
0d3d0cc358 Kernel part of reroot support - a way to change rootfs without reboot.
Note that the mountlist manipulations are somewhat fragile, and not very
pretty.  The reason for this is to avoid changing vfs_mountroot(), which
is (obviously) rather mission-critical, but not very well documented,
and thus hard to test properly.  It might be possible to rework it to use
its own simple root mount mechanism instead of vfs_mountroot().

Reviewed by:	kib@
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D2698
2015-09-18 17:32:22 +00:00
John Baldwin
bdd64116b0 Always clear TDB_USERWR before fetching system call arguments. The
TDB_USERWR flag may still be set after a debugger detaches from a
process via PT_DETACH.  Previously the flag would never be cleared
forcing a double fetch of the system call arguments for each system
call.  Note that the flag cannot be cleared at PT_DETACH time in case
one of the threads in the process is currently stopped in
syscallenter() and the debugger has modified the arguments for that
pending system call before detaching.

Reviewed by:	kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D3678
2015-09-16 20:55:00 +00:00
John Baldwin
4295bec16a When a process group leader exits, all of the processes in the group are
sent SIGHUP and SIGCONT if any of the processes are stopped.  Currently this
behavior is triggered for any type of process stop including ptrace() stops
and transient stops for single threading during exit() and execve().
Thus, if a debugger is attached to a process in a group when the leader
exits, the entire group can be HUPed.  Instead, only send the signals if a
process in the group is stopped due to SIGSTOP.

PR:		201149
Reviewed by:	kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D3681
2015-09-16 16:40:07 +00:00
Mateusz Guzik
7665e341ca sysctl: switch sysctllock to a sleepable rmlock, take 2
This restores r285125. Previous attempt was reverted due to a bug in rmlocks,
which is fixed since r287833.
2015-09-15 23:06:56 +00:00
John Baldwin
e89d5f43da Threads holding a read lock of a sleepable rm lock are not permitted
to sleep.  The rmlock implementation enforces this by disabling
sleeping when a read lock is acquired. To simplify the implementation,
sleeping is disabled for most of the duration of rm_rlock.  However,
it doesn't need to be disabled until the lock is acquired.  If a
sleepable rm lock is contested, then rm_rlock may need to acquire the
backing sx lock.  This tripped the overly-broad assertion.  Fix by
relaxing the assertion around the call to sx_xlock().

Reported by:	mjg
Reviewed by:	kib, mjg
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D3324
2015-09-15 22:16:21 +00:00
Conrad Meyer
55d33667ee kevent(2): Note DOOMED vnodes with NOTE_REVOKE
In poll mode, check for and wake VBAD vnodes.  (Vnodes that are VBAD at
registration will never be woken by the RECLAIM trigger.)

Add post-VOP_RECLAIM hook to trigger notes on vnode reclamation.  (Vnodes that
were fine at registration but are vgoned while being monitored should signal
waiters.)

Reviewed by:	kib
Approved by:	markj (mentor)
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D3675
2015-09-15 20:22:30 +00:00
Hans Petter Selasky
9acc0eafd7 Implement callout_drain_async(), inspired by the projects/hps_head
branch.

This function is used to drain a callout via a callback instead of
blocking the caller until the drain is complete. Refer to the
callout_drain_async() manual page for a detailed description.

Limitation: If a lock is used with the callout, the callout can only
be drained asynchronously one time unless the callout_init_mtx()
function is called again. This limitation is not present in
projects/hps_head and will require more invasive changes to the
timeout code, which was not in the scope of this patch.

Differential Revision:	https://reviews.freebsd.org/D3521
Reviewed by:		wblock
MFC after:		1 month
2015-09-14 10:52:26 +00:00
Warner Losh
7297c5e535 bufdonebio is now unused. Retire it too. 2015-09-11 04:20:04 +00:00
Mark Johnston
610141cebb Add stack_save_td_running(), a function to trace the kernel stack of a
running thread.

It is currently implemented only on amd64 and i386; on these
architectures, it is implemented by raising an NMI on the CPU on which
the target thread is currently running. Unlike stack_save_td(), it may
fail, for example if the thread is running in user mode.

This change also modifies the kern.proc.kstack sysctl to use this function,
so that stacks of running threads are shown in the output of "procstat -kk".
This is handy for debugging threads that are stuck in a busy loop.

Reviewed by:	bdrewery, jhb, kib
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D3256
2015-09-11 03:54:37 +00:00
Warner Losh
ad8d57a99d dev_strategy and dev_strategy_csw are unused since r281825. Remove
them.

Differential Revision: https://reviews.freebsd.org/D3620
2015-09-11 00:38:58 +00:00
Adrian Chadd
32766cd281 Also make kern.maxfilesperproc a boot time tunable.
Auto-tuning threshold discussions aside, it turns out that if you want
to lower this on say, rather memory-packed machines, you either set maxusers
or kern.maxfiles, or you set it in sysctl.  The former is a non-exact
way to tune this; the latter doesn't actually affect anything in the
startup scripts.

This first occured because I wondered why the hell screen would take upwards
of 10 seconds to spawn a new screen.  I then found python doing the same
thing during fork/exec of child processes - it calls close() on each FD
up to the current openfiles limit.  On a 1TB machine this is like, 26 million
FDs per process.  Ugh.

So:

* This allows it to be set early in /boot/loader.conf;
* It can be used to work around the ridiculous situation of
  screen, python, etc doing a close() on potentially millions of FDs
  even though you only have four open.

Tested:

* 4GB, 32GB, 64GB, 128GB, 384GB, 1TB systems with autotune, ensuring
  screen and python forking doesn't result in some pretty hilariously
  bad behaviour.

TODO:

* Note that the default login.conf sets openfiles-cur to unlimited,
  effectively obeying kern.maxfilesperproc.  Perhaps we should fix
  this.

* .. and even if we do, we need to also ensure that daemons get
  a soft limit of something reasonable and capped - they can request
  more FDs themselves.

MFC after:	1 week
Sponsored by:	Norse Corp, Inc.
2015-09-10 04:05:58 +00:00
Konstantin Belousov
9e18c9eb27 For open("name", O_DIRECTORY | O_CREAT), do not try to create the
named node, open(2) cannot create directories.  But do allow the flag
combination to succeed if the directory already exists.

Declare the open("name", O_DIRECTORY | O_CREAT | O_EXCL) always
invalid for the same reason, since open(2) cannot create directory.

Note that there is an argument that O_DIRECTORY | O_CREAT should be
invalid always, regardless of the target directory existence or
O_EXCL.  The current fix is conservative and allows the call to
succeed in the situation where it succeeded before the patch.

Reported by:	Tom Ridge <freebsd@tom-ridge.com>
Reviewed by:	rwatson
PR:	 202892
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-09-09 19:31:08 +00:00
Mateusz Guzik
9af8c8b72b fd: make rights a mandatory argument to fgetvp_rights
The only caller already always passes rights.
2015-09-07 20:05:56 +00:00
Mateusz Guzik
d7832811a7 fd: make the common case in filecaps_copy work lockless
The filedesc lock is only needed if ioctls caps are present, which is a
rare situation. This is a step towards reducing the scope of the filedesc
lock.
2015-09-07 20:02:56 +00:00
Conrad Meyer
bcb60d52e6 Follow-up to r287442: Move sysctl to compiled-once file
Avoid duplicate sysctl nodes.

Found by:	tijl
Approved by:	markj (mentor)
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D3586
2015-09-07 16:44:28 +00:00
Kirk McKusick
17518b1a2b Track changes to kern.maxvnodes and appropriately increase or decrease
the size of the name cache hash table (mapping file names to vnodes)
and the vnode hash table (mapping mount point and inode number to vnode).
An appropriate locking strategy is the key to changing hash table sizes
while they are in active use.

Reviewed by: kib
Tested by:   Peter Holm
Differential Revision: https://reviews.freebsd.org/D2265
MFC after:   2 weeks
2015-09-06 05:50:51 +00:00
Xin LI
28ffe927c2 Expose an interface to determine if an ACE is inherited.
Submitted by:	sef
Reviewed by:	trasz
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D3540
2015-09-04 00:14:20 +00:00
Conrad Meyer
14bdbaf2e4 Detect badly behaved coredump note helpers
Coredump notes depend on being able to invoke dump routines twice; once
in a dry-run mode to get the size of the note, and another to actually
emit the note to the corefile.

When a note helper emits a different length section the second time
around than the length it requested the first time, the kernel produces
a corrupt coredump.

NT_PROCSTAT_FILES output length, when packing kinfo structs, is tied to
the length of filenames corresponding to vnodes in the process' fd table
via vn_fullpath.  As vnodes may move around during dump, this is racy.

So:

 - Detect badly behaved notes in putnote() and pad underfilled notes.

 - Add a fail point, debug.fail_point.fill_kinfo_vnode__random_path to
   exercise the NT_PROCSTAT_FILES corruption.  It simply picks random
   lengths to expand or truncate paths to in fo_fill_kinfo_vnode().

 - Add a sysctl, kern.coredump_pack_fileinfo, to allow users to
   disable kinfo packing for PROCSTAT_FILES notes.  This should avoid
   both FILES note corruption and truncation, even if filenames change,
   at the cost of about 1 kiB in padding bloat per open fd.  Document
   the new sysctl in core.5.

 - Fix note_procstat_files to self-limit in the 2nd pass.  Since
   sometimes this will result in a short write, pad up to our advertised
   size.  This addresses note corruption, at the risk of sometimes
   truncating the last several fd info entries.

 - Fix NT_PROCSTAT_FILES consumers libutil and libprocstat to grok the
   zero padding.

With suggestions from:	bjk, jhb, kib, wblock
Approved by:	markj (mentor)
Relnotes:	yes
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D3548
2015-09-03 20:32:10 +00:00
Mateusz Guzik
7e8f566c0c fd: remove UMA_ZONE_ZINIT argument from Files zone
Originally it was added in order to prevent trashing of objects with
INVARIANTS enabled. The same effect is now provided with mere UMA_ZONE_NOFREE.

This reverts r286921.

Discussed with:		kib
2015-09-02 23:14:39 +00:00
John Baldwin
ded3e7f08e The 'sa' argument to syscallret() is not unused. 2015-09-01 22:28:23 +00:00
John Baldwin
183b68f74f Export current system call code and argument count for system call entry
and exit events. procfs stop events for system call tracing report these
values (argument count for system call entry and code for system call exit),
but ptrace() does not provide this information. (Note that while the system
call code can be determined in an ABI-specific manner during system call
entry, it is not generally available during system call exit.)

The values are exported via new fields at the end of struct ptrace_lwpinfo
available via PT_LWPINFO.

Reviewed by:	kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D3536
2015-09-01 22:24:54 +00:00
Konstantin Belousov
6ae26d06dc Exit notification for EVFILT_PROC removes knote from the knlist. In
particular, this invalidates the knote kn_link linkage, making the
SLIST_FOREACH() loop accessing undefined values (e.g. trashed by
QUEUE_MACRO_DEBUG).  If the knote is freed by other thread when kq
lock is released or when influx is cleared, e.g. by knote_scan() for
kqueue owning the knote, the iteration step would access freed memory.

Use SLIST_FOREACH_SAFE() to fix iteration.

Diagnosed by:	avg
Tested by:	avg, lstewart, pawel
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-09-01 14:05:29 +00:00
Konstantin Belousov
78b9afe121 Clean up the kqueue use of the uma KPI.
Explain why it is fine to not check for M_NOWAIT failures in
kqueue_register().  Remove unneeded check for NULL result from
waitable allocation in kqueue_scan().  uma_free(9) handles NULL
argument correctly, remove checks for NULL.  Remove useless cast and
adjust style in knote_alloc().

Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-09-01 13:21:32 +00:00
Andriy Gapon
378d5c6c89 callout_reset: fix a reversed check for cc_exec_cancel
The typo was introduced in r278469 / 344ecf88af.

As a result of the bug there was a timing window where callout_reset()
would fail to cancel a concurrent execution of a callout that is about
to start and would schedule the callout again.
The callout would fire more times than it is scheduled.
That would happen even if the callout is initialized with a lock.

For example, the bug triggered the "Stray timeout" assertion in
taskqueue_timeout_func().

MFC after:	5 days
2015-09-01 09:27:14 +00:00
Konstantin Belousov
8d830e02f5 Use P1B_PRIO_MAX to designate max posix priority for the RR/FIFO
scheduler types.  It was intended to be used there, compare with the
min value, and with the test for correctness in ksched_setscheduler().

Note that P1B_PRIO_MAX and RTP_PRIO_MAX do have the same numerical
values, the change is cosmetical.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-08-30 18:02:57 +00:00
Konstantin Belousov
44e629f18d Remove single-use macros obfuscating malloc(9) and free(9) calls.
Style.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-08-30 17:58:11 +00:00
Julien Charbon
2ea3089cb1 Revert r286880: If at first this change made sense, it turns out
it helps only the TCP timers callout(9) usage.  As the benefit for
others callout(9) usages did not reach a consensus the historical
usage should prevail.

Differential Revision:      https://reviews.freebsd.org/D3078
2015-08-30 13:44:46 +00:00
Warner Losh
de830d432c Remove now obsolete comment.
MFC After: 2 days
2015-08-28 20:06:58 +00:00
Warner Losh
3f27281613 Per overwhelming sentiment in the code review, use FEATURE instead.
Differential Revision: https://reviews.freebsd.org/D3488
MFC After: 2 days
2015-08-28 19:53:19 +00:00
Ed Schouten
bc1ace0b96 Decompose linkat()/renameat() rights to source and target.
To make it easier to understand how Capsicum interacts with linkat() and
renameat(), rename the rights to CAP_{LINK,RENAME}AT_{SOURCE,TARGET}.

This also addresses a shortcoming in Capsicum, where it isn't possible
to disable linking to files stored in a directory. Creating hardlinks
essentially makes it possible to access files with additional rights.

Reviewed by:	rwatson, wblock
Differential Revision:	https://reviews.freebsd.org/D3411
2015-08-27 15:16:41 +00:00
Julien Charbon
cd252ea74d Silent a compilation warning on callout_stop() 2015-08-27 10:43:35 +00:00
Julien Charbon
682d0e15b5 In callout_stop(), do not forget to initialize not_running variable.
Thanks to hselasky for noticing that.

Differential Revision:	https://reviews.freebsd.org/D3078 (Updated)
Submitted by:		hselasky
Pointy hat to:		jch
2015-08-27 08:58:03 +00:00
Julien Charbon
0cfae4b4bc In callout_stop(), if a callout is both pending and currently
being serviced return 0 (fail) but it is applicable only
mpsafe callouts.  Thanks to hselasky for finding this.

Differential Revision:	https://reviews.freebsd.org/D3078 (Updated)
Submitted by:		hselasky
Reviewed by:		jch
2015-08-27 08:15:32 +00:00
Marcel Moolenaar
898b510468 An error of -1 from parse_mount() indicates that the specification
was invalid. Don't trigger a mount failure (which by default means
a panic), but instead just move on to the next directive in the
configuration. This typically has us ask for the root mount.

PR:		163245
2015-08-27 04:25:27 +00:00
Warner Losh
135342777c When the kernel is compiled with INVARIANTS, export that as
debug.invariants.

Differential Revision: https://reviews.freebsd.org/D3488
MFC after: 3 days
2015-08-26 23:58:03 +00:00
George V. Neville-Neil
57031f7912 Summary: Add the interactivity equations to the header comment for our
interactivity calculation routine.

Suggested by: rwatson
2015-08-26 16:36:41 +00:00
Edward Tomasz Napierala
c9ba65040f Make vfs_unmountall() unmount /dev after /, not before. The only
reason this didn't result in an unclean shutdown is that devfs ignores
MNT_FORCE flag.

Reviewed by:	kib@
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D3467
2015-08-24 13:18:13 +00:00
Edward Tomasz Napierala
6e572e084b After r286237 it should be fine to call vgone(9) on a busy GEOM vnode;
remove KASSERT that would prevent forced devfs unmount from working.

MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2015-08-23 14:53:54 +00:00
Roger Pau Monné
e8234cfef6 preload_search_info: make sure mod is set
Add a check to preload_search_info to make sure mod is set. Most of the
callers of preload_search_info don't check that the mod parameter is
set, which can cause page faults. While at it, remove some now unnecessary
checks before calling preload_search_info.

Sponsored by:		Citrix Systems R&D
Reviewed by:		kib
Differential Revision:	https://reviews.freebsd.org/D3440
2015-08-21 15:57:57 +00:00
Konstantin Belousov
41d50cd6b7 If process becomes reaper (procctl(PROC_REAP_ACQUIRE)) while already
having some children, the children' reaper is not reset to the parent.
This allows for the situation where reaper has children but not
descendands and the too strict asserts in the reap_status() fire.

Remove the wrong asserts, add some clarification for the situation to
the procctl(2) REAP_STATUS.

Reported and tested by:	feld
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-08-20 22:44:26 +00:00
Konstantin Belousov
fe5ec54b50 fget_unlocked() depends on the freed struct file f_count field being
zero.  The file_zone if no-free, but r284861 added trashing of the
freed memory.  Most visible manifestation of the issue were 'memory
modified after free' panics for the file zone, triggered from
falloc_noinstall().

Add UMA_ZONE_ZINIT flag to turn off trashing.  Mjg noted that it makes
sense to not trash freed memory for any non-free zone, which will be
done later.

Reported and tested by:	pho
Discussed with:	mjg
Sponsored by:	The FreeBSD Foundation
2015-08-19 11:53:32 +00:00