15781 Commits

Author SHA1 Message Date
Mark Johnston
78f57a9cde Generalize the gzio API.
We currently use a set of subroutines in kern_gzio.c to perform
compression of user and kernel core dumps. In the interest of adding
support for other compression algorithms (zstd) in this role without
complicating the API consumers, add a simple compressor API which can be
used to select an algorithm.

Also change the (non-default) GZIO kernel option to not enable
compressed user cores by default. It's not clear that such a default
would be desirable with support for multiple algorithms implemented,
and it's inconsistent in that it isn't applied to kernel dumps.

Reviewed by:	cem
Differential Revision:	https://reviews.freebsd.org/D13632
2018-01-08 21:27:41 +00:00
Ian Lepore
ac579135b0 Use EVENTHANDLER_DIRECT_INVOKE for [un]mount events, for better performance. 2018-01-07 18:07:22 +00:00
Ian Lepore
f031a3b25f Use EVENTHANDLER_DIRECT_INVOKE() for device events, for better performance. 2018-01-07 18:06:30 +00:00
Kristof Provost
fd91e076c1 Introduce mallocarray() in the kernel
Similar to calloc() the mallocarray() function checks for integer
overflows before allocating memory.
It does not zero memory, unless the M_ZERO flag is set.

Reviewed by:	pfg, vangyzen (previous version), imp (previous version)
Obtained from:	OpenBSD
Differential Revision:	https://reviews.freebsd.org/D13766
2018-01-07 13:21:01 +00:00
Gleb Smirnoff
b4f55763ce In sendfile_iodone() both pru_abort and sorele need to be executed
with proper VNET context set.

Reported by:	sbruno
MFC after:	2 weeks
2018-01-05 20:21:46 +00:00
John Baldwin
2da93c21ec Always use atomic_fetchadd() when updating per-user accounting values.
This avoids re-reading a variable after it has been updated via an
atomic op.  It is just a cosmetic cleanup as the read value was only
used to control a diagnostic printf that should rarely occur (if ever).

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D13768
2018-01-04 22:07:58 +00:00
John Baldwin
3160862437 Report offset relative to the backing object for kinfo_vmentry structures.
For the pathname reported in kinfo_vmentry structures (kve_path), the
sysctl handlers walk the object chain to find the bottom-most VM object.
This permits a COW mapping of a file with dirty pages to report the
pathname of the originally mapped file.  Do the same for the object
offset (kve_offset) computing a cumulative offset during the same object
walk so that the reported offset is relative to the reported pathname.

Note that ptrace(PT_VM_ENTRY) already returns a cumulative offset
rather than the raw offset of the VM map entry.

Note also that this does not affect procstat -v output (even structured
output) since that output does not include the kve_offset field.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	DARPA / AFRL
Differential Revision:	https://reviews.freebsd.org/D13767
2018-01-04 21:59:34 +00:00
Mike Karels
d626b50b9d make SW_WATCHDOG dynamic
Enable the hardclock-based watchdog previously conditional on the
SW_WATCHDOG option whenever hardware watchdogs are not found, and
watchdogd attempts to enable the watchdog. The SW_WATCHDOG option
still causes the sofware watchdog to be enabled even if there is a
hardware watchdog. This does not change the other software-based
watchdog enabled by the --softtimeout option to watchdogd.

Note that the code to reprime the watchdog during kernel core dumps is
no longer conditional on SW_WATCHDOG. I think this was previously a bug.

Reviewed by:	imp alfred bjk
MFC after:	1 week
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D13713
2018-01-03 00:56:30 +00:00
Antoine Brodin
1b25176cbc sysctl_kern_proc_args: do not take the fast path if p_args is NULL
In this case it falls back to reading ps_strings
2018-01-01 21:25:01 +00:00
Colin Percival
d5d7606c0c Use the TSLOG framework to record entry/exit timestamps for DELAY and
_vprintf; these functions are called in many places and can contribute
meaningfully to the total time spent booting.
2017-12-31 09:24:41 +00:00
Colin Percival
49a4e3b4b4 Instrument thread creations for the the benefit of the TSLOG framework.
This assists in tracking time spent while the boot is being "held" waiting
for something to happen.
2017-12-31 09:24:11 +00:00
Colin Percival
8b8a7c43a9 Instrument "boot holds" for the benefit of the TSLOG framework. These
are places where the "main thread" of the booting kernel (either the
thread which later becomes swapper or the thread which later becomes
init) has to stop and wait for action to take place in another thread
before continuing.

There are currently three such holds:
1. The intr_config_hooks SYSINIT waits for hooks registered via the
config_intrhook_establish function; this allows (typically) devices
which need interrupts enabled to complete their initialization to do
so before root is mounted.

2. The g_waitidle function waits for the GEOM event queue to be empty;
this ensures that all of the disks which have been attached have been
tasted before we attempt to mount root.

3. The vfs_mountroot_wait function (in addition to calling g_waitidle)
waits for holds registered via root_mount_hold; among other things, this
is used by the USB subsystem to ensure that we don't fail to mount root
if it's located on a USB disk which takes a while to probe.
2017-12-31 09:23:52 +00:00
Colin Percival
a21a2da599 Teach makeobjops.awk to accept PROLOG and EPILOG blocks before
METHOD and STATICMETHOD declarations; that code will be inserted
into the dispatch function before and after the method call.

Use this functionality and the TSLOG framework to record DEVICE_ATTACH
and DEVICE_PROBE entry/exit timestamps.
2017-12-31 09:23:19 +00:00
Colin Percival
6032e08810 Use the TSLOG framework to record entry/exit timestamps for machine
independent functions with important roles in the early boot process:
mi_startup (with the "exit" recorded when it becomes swapper),
start_init (with the "exit" recorded when the thread is about to
"return" into the newly created init process), vfs_mountroot, and
vfs_mountroot_wait.
2017-12-31 09:22:31 +00:00
Colin Percival
e31e71991a Code for recording timestamps of events, especially function entries/exits.
This is a very primitive system, intended for use in measuring performance
during the early system boot, before more sophisticated tools like DTrace
or infrastructure like kernel memory allocation and mutexes are available.

Because this code records pointers to strings rather than copying strings
(in order to keep the memory usage more manageable), if a kernel module is
unloaded after logging an event, Bad Things can happen.  Users are advised
to not do that.

Since cycle counts from the early kernel boot are used as an initial entropy
source, publishing this information to userland could result in inadequate
entropy being kept private to the kernel RNG.  Users are advised to not
enable this on systems with untrusted users.

Discussed on:	freebsd-current
2017-12-31 09:21:01 +00:00
Pedro F. Giffuni
0879ca728a sysv_{ipc|shm}: update the NetBSD VCS tags to match nearer our files.
Both files originated in NetBSD:

sysv_ipc.c CVS 1.9:
Most of their changes don't apply to us as we already have similar
changes. This is a better reference for future merges.

sysv_shm.c CVS 1.39:
Most of their changes don't apply to our code but interestingly this
revision merged our changes and is a better point for reference.

Move the VCS tags to the position recommended in our committers guide
(section 8),

No functional change.
2017-12-31 03:34:00 +00:00
Mateusz Guzik
efa9f177f5 locks: adjust loop limit check when waiting for readers
The check was for the exact value, but since the counter started being
incremented by the number of readers it could have jumped over.
2017-12-31 02:31:01 +00:00
Mateusz Guzik
cde25ed4cd sx: fix up non-smp compilation after r327397 2017-12-31 01:59:56 +00:00
Mateusz Guzik
28f1a9e3ff locks: re-check the reason to go to sleep after locking sleepq/turnstile
In both rw and sx locks we always go to sleep if the lock owner is not
running.

We do spin for some time if the lock is read-locked.

However, if we decide to go to sleep due to the lock owner being off cpu
and after sleepq/turnstile gets acquired the lock is read-locked, we should
fallback to the aforementioned wait.
2017-12-31 00:47:04 +00:00
Mateusz Guzik
fb10612355 sx: read the SX_NOADAPTIVE flag and Giant ownership only once
These used to be read multiple times when waiting for the lock the become
free, which had the potential to issue completely avoidable traffic.
2017-12-31 00:37:50 +00:00
Mateusz Guzik
15140a8ade mtx: deduplicate indefinite wait check in spinlocks and thread lock 2017-12-31 00:34:29 +00:00
Mateusz Guzik
1f4d28c7ea mtx: pre-read the lock value in thread_lock_flags_
Since this function is effectively slow path, if we get here the lock is most
likely already taken in which case it is cheaper to not blindly attempt the
atomic op.

While here move hwpmc probe out of the loop to match other primitives.
2017-12-31 00:33:28 +00:00
Mateusz Guzik
80c39f6c37 rwlock: tidy up __rw_runlock_hard similarly to r325921 2017-12-31 00:31:14 +00:00
Konstantin Belousov
baaa79699a Make kern_proc_vmmap_resident() externally accesible, and move the
vmmap_skip_res_cnt control check inside it.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D13595
2017-12-28 13:16:32 +00:00
Eitan Adler
caa7e52f3f kernel: Fix several typos and minor errors
- duplicate words
- typos
- references to old versions of FreeBSD

Reviewed by:	imp, benno
2017-12-27 03:23:21 +00:00
Alexander Kabaev
151ba7933a Do pass removing some write-only variables from the kernel.
This reduces noise when kernel is compiled by newer GCC versions,
such as one used by external toolchain ports.

Reviewed by: kib, andrew(sys/arm and sys/arm64), emaste(partial), erj(partial)
Reviewed by: jhb (sys/dev/pci/* sys/kern/vfs_aio.c and sys/kern/kern_synch.c)
Differential Revision: https://reviews.freebsd.org/D10385
2017-12-25 04:48:39 +00:00
Alexander Kabaev
6d41588b6b Reverse the check to allocate the buffer if cached pointer is NULL.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D13596
2017-12-23 17:55:19 +00:00
Alexander Kabaev
4daa09f343 Remove dead store to local variable. 2017-12-23 16:49:57 +00:00
Bruce Evans
da9fba5447 Use resume_cpus() instead of restart_cpus() to resume from ACPI suspension.
restart_cpus() worked well enough by accident.  Before this set of fixes,
resume_cpus() used the same cpuset (started_cpus, meaning CPUs directed to
restart) as restart_cpus().  resume_cpus() waited for the wrong cpuset
(stopped_cpus) to become empty, but since mixtures of stopped and suspended
CPUs are not close to working, stopped_cpus must be empty when resuming so
the wait is null -- restart_cpus just allows the other CPUs to restart and
returns without waiting.

Fix resume_cpus() to wait on a non-wrong cpuset for the ACPI case, and
add further kludges to try to keep it working for the XEN case.  It
was only used for XEN.  It waited on suspended_cpus.  This works for
XEN.  However, for ACPI, resuming is a 2-step process.  ACPI has already
woken up the other CPUs and removed them from suspended_cpus.  This
fix records the move by putting them in a new cpuset resuming_cpus.
Waiting on suspended_cpus would give the same null wait as waiting on
stopped_cpus.  Wait on resuming_cpus instead.

Add a cpuset toresume_cpus to map the CPUs being told to resume to keep
this separate from the cpuset started_cpus for mapping the CPUs being told
to restart.  Mixtures of stopped and suspended/resuming CPUs are still far
from working.  Describe new and some old cpusets in comments.

Add further kludges to cpususpend_handler() to try to avoid breaking it
for XEN.  XEN doesn't use resumectx(), so it doesn't use the second
return path for savectx(), and it goes from the suspended state directly
to the restarted state, while ACPI resume goes through the resuming state.
Enter the resuming state early for all cases so that resume_cpus can test
for being in this state and not have to worry about the intermediate
!suspended state for ACPI only.

Reviewed by:	kib
2017-12-21 09:17:48 +00:00
John Baldwin
b501cc5da6 Rework pathconf handling for FIFOs.
On the one hand, FIFOs should respect other variables not supported by
the fifofs vnode operation (such as _PC_NAME_MAX, _PC_LINK_MAX, etc.).
These values are fs-specific and must come from a fs-specific method.
On the other hand, filesystems that support FIFOs are required to
support _PC_PIPE_BUF on directory vnodes that can contain FIFOs.
Given this latter requirement, once the fs-specific VOP_PATHCONF
method supports _PC_PIPE_BUF for directories, it is also suitable for
FIFOs permitting a single VOP_PATHCONF method to be used for both
FIFOs and non-FIFOs.

To that end, retire all of the FIFO-specific pathconf methods from
filesystems and change FIFO-specific vnode operation switches to use
the existing fs-specific VOP_PATHCONF method.  For fifofs, set it's
VOP_PATHCONF to VOP_PANIC since it should no longer be used.

While here, move _PC_PIPE_BUF handling out of vop_stdpathconf() so that
only filesystems supporting FIFOs will report a value.  In addition,
only report a valid _PC_PIPE_BUF for directories and FIFOs.

Discussed with:	bde
Reviewed by:	kib (part of a larger patch)
MFC after:	1 month
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D12572
2017-12-19 22:39:05 +00:00
John Baldwin
599afe53a8 Move NAME_MAX, LINK_MAX, and CHOWN_RESTRICTED out of vop_stdpathconf().
Having all filesystems fall through to default values isn't always correct
and these values can vary for different filesystem implementations.  Most
of these changes just use the existing default values with a few exceptions:
- Don't report CHOWN_RESTRICTED for ZFS since it doesn't do the exact
  permissions check this claims for chown().
- Use NANDFS_NAME_LEN for NAME_MAX for nandfs.
- Don't report a LINK_MAX of 0 on smbfs.  Now fail with EINVAL to
  indicate hard links aren't supported.

Requested by:	bde (though perhaps not this exact implementation)
Reviewed by:	kib (earlier version)
MFC after:	1 month
Sponsored by:	Chelsio Communications
2017-12-19 19:51:36 +00:00
John Baldwin
dd688800e1 Add a custom VOP_PATHCONF method for fdescfs.
The method handles NAME_MAX and LINK_MAX explicitly.  For all other
pathconf variables, the method passes the request down to the underlying
file descriptor.  This requires splitting a kern_fpathconf() syscallsubr
routine out of sys_fpathconf().  Also, to avoid lock order reversals with
vnode locks, the fdescfs vnode is unlocked around the call to
kern_fpathconf(), but with the usecount of the vnode bumped.

MFC after:	1 month
Sponsored by:	Chelsio Communications
2017-12-19 18:20:38 +00:00
Konstantin Belousov
6f697994fd Use atomic_load(9) to read ppsinfo sequence numbers.
In this case volatile qualifiers enusre that a compiler does not
optimize the accesses out.

Reviewed by:	alc, jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D13534
2017-12-19 10:05:45 +00:00
Pedro F. Giffuni
62cf53fdac SPDX: some uses of the RSA-MD license. 2017-12-13 16:30:39 +00:00
Fedor Uporov
4ba058c0cf Fix kernel build if MAC is not defined.
Reported by:    Ravi Pokala, Andrew Turner
Approved by:    pfg (mentor)
MFC after:      1 week
2017-12-13 16:14:38 +00:00
Fedor Uporov
61b214f338 Move buffer size checks outside of the vnode locks.
Reviewed by:    kib, cem, pfg (mentor)
Approved by:    pfg (mentor)
MFC after:      1 weeks

Differential Revision:    https://reviews.freebsd.org/D13405
2017-12-12 20:15:57 +00:00
Bruce Evans
fb3cc1c37d Move instantiation of msgbufp from 9 MD files to subr_prf.c.
This variable should be pure MI except possibly for reading it in MD
dump routines.  Its initialization was pure MD in 4.4BSD, but FreeBSD
changed this in r36441 in 1998.  There were many imperfections in
r36441.  This commit fixes only a small one, to simplify fixing the
others 1 arch at a time.  (r47678 added support for
special/early/multiple message buffer initialization which I want in
a more general form, but this was too fragile to use because hacking
on the msgbufp global corrupted it, and was only used for 5 hours in
-current...)
2017-12-07 07:55:38 +00:00
Mark Johnston
e1703ef5ae Plug a name cache lock leak.
Reviewed by:	mjg
MFC after:	1 week
Sponsored by:	Dell EMC Isilon
2017-12-01 22:51:02 +00:00
Konstantin Belousov
36bce27be9 Destroy seltd st_mtx and st_wait in seltdfini().
A correct destruction is important for WITNESS(4) and LOCK_PROFILING(9).

Submitted by:	Sebastian Huber <sebastian.huber@embedded-brains.de>
MFC after:	1 week
2017-12-01 11:18:19 +00:00
Pedro F. Giffuni
64de3fdd58 SPDX: use the Beerware identifier. 2017-11-30 20:33:45 +00:00
Hans Petter Selasky
1408b84a26 The sched_add() function is not only used when the thread is initially
started, but also by the turnstiles to mark a thread as runnable for
all locks, for instance sleepqueues do:
setrunnable()->sched_wakeup()->sched_add()

In r326218 code was added to allow booting from non-zero CPU numbers
by setting the ts_cpu field inside the ULE scheduler's sched_add()
function. This had an undesired side-effect that prior sched_pin() and
sched_bind() calls got disregarded. This patch fixes the
initialization of the ts_cpu field for the ULE scheduler to only
happen once when the initial thread is constructed during system
init. Forking will then later on ensure that a valid ts_cpu value gets
copied to all children.

Reviewed by:	jhb, kib
Discussed with:	nwhitehorn
MFC after:	1 month
Differential revision:	https://reviews.freebsd.org/D13298
Sponsored by:	Mellanox Technologies
2017-11-29 23:28:40 +00:00
Alexey Dokuchaev
2c9ec07528 Fix several noticed style issues.
Reviewed by:	bde
Approved by:	bapt
2017-11-29 12:49:22 +00:00
Jeff Roberson
2e47807c21 Eliminate kmem_arena and kmem_object in preparation for further NUMA commits.
The arena argument to kmem_*() is now only used in an assert.  A follow-up
commit will remove the argument altogether before we freeze the API for the
next release.

This replaces the hard limit on kmem size with a soft limit imposed by UMA.  When
the soft limit is exceeded we periodically wakeup the UMA reclaim thread to
attempt to shrink KVA.  On 32bit architectures this should behave much more
gracefully as we exhaust KVA.  On 64bit the limits are likely never hit.

Reviewed by:	markj, kib (some objections)
Discussed with:	alc
Tested by:	pho
Sponsored by:	Netflix / Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D13187
2017-11-28 23:40:54 +00:00
Brooks Davis
5cd667e65f Disable vim syntax highlighting.
Vim's default pick doesn't understand that ';' is a comment character
and the result looks horrible.

Reviewed by:	emaste
2017-11-28 18:23:17 +00:00
Edward Tomasz Napierala
212ff84f4a Make kdb_reenter() silent when explicitly called from db_error().
This removes the useless backtrace on various ddb(4) user errors.

Reviewed by:	jhb@
Obtained from:	CheriBSD
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D13212
2017-11-28 12:53:55 +00:00
Nathan Whitehorn
51de47e3f8 Remove assertion that a CPU be present before returning a PCPU for it. It
is up to the caller to check for a NULL return value. The assert was meant
to catch buggy code that did not check the return value. Some code, however,
was smart and used the return value to see if a CPU existed, which this
broke.

Requested by:	jhb@
2017-11-28 05:39:48 +00:00
Pedro F. Giffuni
8a36da99de sys/kern: adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
2017-11-27 15:20:12 +00:00
Mateusz Guzik
e57b2b1830 rw: fix runlock_hard when new readers show up
When waiters/writer spinner flags are set no new readers can show up unless
they already have a different rw rock read locked. The change in r326195 failed
to take that into account - in presence of new readers it would spin until
they all drain, which would be lead to trouble if e.g. they go off cpu and
can get scheduled because of this thread.

Reported by:	pho
2017-11-26 21:10:47 +00:00
Nathan Whitehorn
efe67753cc Remove some, but not all, assumptions that the BSP is CPU 0 and that CPUs
are numbered densely from there to n_cpus.

MFC after:	1 month
2017-11-25 23:41:05 +00:00
Mateusz Guzik
2c50bafef5 Add the missing lockstat check for thread lock. 2017-11-25 20:49:27 +00:00