15808 Commits

Author SHA1 Message Date
Wojciech Macek
4d249cdd4c ULE: provide defaults to ts_cpu
Fix a bug when the system has no CPU 0. When created, threads were implicitly assigned to CPU 0.
This had no practical effect since a real CPU was chosen immediately by the scheduler. However,
on systems without a CPU 0, sched_ule attempted to access the scheduler queue of the "old" CPU
when assigned the initial choice of the old one. This caused an attempt to use illegal memory
and a crash (or, more usually, a deadlock). Fix this by assigning new threads to the BSP
explicitly and add some asserts to see that this problem does not recur.

Authored by:           Nathan Whitehorn <nwhitehorn@freebsd.org>
Submitted by:          Wojciech Macek <wma@semihalf.com>
Obtained from:         Semihalf
Differential revision: https://reviews.freebsd.org/D13932
2018-01-24 07:54:05 +00:00
Pedro F. Giffuni
44c514b142 Forgot to sort here in r328238. 2018-01-22 02:26:10 +00:00
Pedro F. Giffuni
d821d36419 Unsign some values related to allocation.
When allocating memory through malloc(9), we always expect the amount of
memory requested to be unsigned as a negative value would either stand for
an error or an overflow.
Unsign some values, found when considering the use of mallocarray(9), to
avoid unnecessary casting. Also consider that indexes should be of
at least the same size/type as the upper limit they are meant to index.

MFC after:	3 weeks
2018-01-22 02:08:10 +00:00
Pedro F. Giffuni
ac2fffa4b7 Revert r327828, r327949, r327953, r328016-r328026, r328041:
Uses of mallocarray(9).

The use of mallocarray(9) has sharply increased the swap required to build FreeBSD.
This is likely caused by the allocation size attributes which put extra pressure
on the compiler.

Given that most of these checks are superfluous we have to choose better
where to use mallocarray(9). We still have more uses of mallocarray(9) but
hopefully this is enough to bring swap usage to a reasonable level.

Reported by:	wosch
PR:		225197
2018-01-21 15:42:36 +00:00
Nathan Whitehorn
9a8196ce19 Remove SFBUF_OPTIONAL_DIRECT_MAP and such hacks, replacing them across the
kernel by PHYS_TO_DMAP() as previously present on amd64, arm64, riscv, and
powerpc64. This introduces a new MI macro (PMAP_HAS_DMAP) that can be
evaluated at runtime to determine if the architecture has a direct map;
if the answer is unconditional, so that PMAP_HAS_DMAP is a constant 0 or
1, the compiler can remove the conditional logic.

As part of this, implement PHYS_TO_DMAP() on sparc64 and mips64, which had
similar things but spelled differently. 32-bit MIPS has a partial direct-map
that maps poorly to this concept and is unchanged.

Reviewed by:		kib
Suggestions from:	marius, alc, kib
Runtime tested on:	amd64, powerpc64, powerpc, mips64
2018-01-19 17:46:31 +00:00
Andriy Gapon
3e4f610dad correct read-ahead calculations in vfs_bio_getpages
Previously the calculations were done as if the requested region
ended at the start of the last requested page, not its end.
The problem was actually quite minor as it affected only stats and
page prefaulting, not the actual page data, and only with specific
parameters.

Reviewed by:	kib (previous version)
MFC after:	2 weeks
2018-01-18 12:59:04 +00:00
Wojciech Macek
5b3e8b0725 KDB: restart only CPUs stopped by KDB
There is a case when not all CPUs went online. In that situation,
restart only APs which were operational before entering KDB.

Created by:            Wojciech Macek <wma@semihalf.com>
Obtained from:         Semihalf
Reviewed by:           nwhitehorn
Differential revision: https://reviews.freebsd.org/D13949
Sponsored by:          QCM Technologies
2018-01-18 07:38:54 +00:00
John Baldwin
58c4aee0d7 Require the SHF_ALLOC flag for program sections from kernel object modules.
ELF object files can contain program sections which are not supposed
to be loaded into memory (e.g. .comment).  Normally the static linker
uses these flags to decide which sections are allocated to loadable
program segments in ELF binaries and shared objects (including kernels
on all architectures and kernel modules on architectures other than
amd64).

Mapping ELF object files (such as amd64 kernel modules) into memory
directly is a bit of a grey area.  ELF object files are intended to be
used as inputs to the static linker.  As a result, there is not a
standardized definition for what the memory layout of an ELF object
should be (none of the section headers have valid virtual memory
addresses for example).

The kernel and loader were not checking the SHF_ALLOC flag but loading
any program sections with certain types such as SHT_PROGBITS.  As a
result, the kernel and loader would load into RAM some sections that
weren't marked with SHF_ALLOC such as .comment that are not loaded
into RAM for kernel modules on other architectures (which are
implemented as ELF shared objects).  Aside from possibly requiring
slightly more RAM to hold a kernel module this does not affect runtime
correctness as the kernel relocates symbols based on the layout it
uses.

Debuggers such as gdb and lldb do not extract symbol tables from a
running process or kernel.  Instead, they replicate the memory layout
of ELF executables and shared objects and use that to construct their
own symbol tables.  For executables and shared objects this works
fine.  For ELF objects the current logic in kgdb (and probably lldb
based on a simple reading) assumes that only sections with SHF_ALLOC
are memory resident when constructing a memory layout.  If the
debugger constructs a different memory layout than the kernel, then it
will compute different addresses for symbols causing symbols in the
debugger to appear to have the wrong values (though the kernel itself
is working fine).  The current port of mdb does not check SHF_ALLOC as
it replicates the kernel's logic in its existing kernel support.

The bfd linker sorts the sections in ELF object files such that all of
the allocated sections (sections with SHF_ALLOC) are placed first
followed by unallocated sections.  As a result, when kgdb composed a
memory layout using only the allocated sections, this layout happened
to match the layout used by the kernel and loader.  The lld linker
does not sort the sections in ELF object files and mixed allocated and
unallocated sections.  This resulted in kgdb composing a different
memory layout than the kernel and loader.

We could either patch kgdb (and possibly in the future lldb) to use
custom handling when generating memory layouts for kernel modules that
are ELF objects, or we could change the kernel and loader to check
SHF_ALLOC.  I chose the latter as I feel we shouldn't be loading
things into RAM that the module won't use.  This should mostly be a
NOP when linking with bfd but will allow the existing kgdb to work
with amd64 kernel modules linked with lld.

Note that we only require SHF_ALLOC for "program" sections for types
like SHT_PROGBITS and SHT_NOBITS.  Other section types such as symbol
tables, string tables, and relocations must also be loaded and are not
marked with SHF_ALLOC.
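
The loading rule described above can be sketched as a small predicate. This
is only an illustration, not the kernel's or loader's actual code: the EX_
constants repeat the standard ELF values under hypothetical names so the
sketch does not depend on a system <elf.h>, and ex_section_is_loaded() is
an assumed helper name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Standard ELF values, redefined under hypothetical EX_ names. */
#define EX_SHT_PROGBITS	1u
#define EX_SHT_SYMTAB	2u
#define EX_SHT_STRTAB	3u
#define EX_SHT_RELA	4u
#define EX_SHT_NOBITS	8u
#define EX_SHT_REL	9u
#define EX_SHF_ALLOC	0x2u

/* Hypothetical helper: should the loader bring this section into memory? */
static bool
ex_section_is_loaded(uint32_t sh_type, uint64_t sh_flags)
{
	switch (sh_type) {
	case EX_SHT_PROGBITS:
	case EX_SHT_NOBITS:
		/* "Program" sections are loaded only when marked SHF_ALLOC. */
		return ((sh_flags & EX_SHF_ALLOC) != 0);
	case EX_SHT_SYMTAB:
	case EX_SHT_STRTAB:
	case EX_SHT_REL:
	case EX_SHT_RELA:
		/* Needed for linking even though they lack SHF_ALLOC. */
		return (true);
	default:
		return (false);
	}
}
```

Under this rule a .comment section (SHT_PROGBITS without SHF_ALLOC) is
skipped, while a symbol table is still loaded.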

Reported by:	np
Reviewed by:	kib, emaste
MFC after:	1 month
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D13926
2018-01-17 22:51:59 +00:00
John Baldwin
b1288166e0 Use long for the last argument to VOP_PATHCONF rather than a register_t.
pathconf(2) and fpathconf(2) both return a long.  The kern_[f]pathconf()
functions now accept a pointer to a long value rather than modifying
td_retval directly.  Instead, the system calls explicitly store the
returned long value in td_retval[0].

Requested by:	bde
Reviewed by:	kib
Sponsored by:	Chelsio Communications
2018-01-17 22:36:58 +00:00
Pedro F. Giffuni
a18a2290cd kern: make some use of mallocarray(9).
Focus on code where we are doing multiplications within malloc(9). None of
these are likely to overflow, however the change is still useful as some
static checkers can benefit from the allocation attributes we use for
mallocarray.

This initial sweep only covers malloc(9) calls with M_NOWAIT. No good
reason but I started doing the changes before r327796 and at that time it
was convenient to make sure the surrounding code could handle NULL values.

X-Differential revision: https://reviews.freebsd.org/D13837
2018-01-15 21:18:04 +00:00
Ian Lepore
862993757a Add RTC clock conversions for BCD values, with non-panic validation.
RTC clock hardware frequently uses BCD numbers.  Currently the low-level
bcd2bin() and bin2bcd() functions will KASSERT if given out-of-range BCD
values.  Every RTC driver must implement its own code for validating the
unreliable data coming from the hardware to avoid a potential kernel panic.

This change introduces two new functions, clock_bcd_to_ts() and
clock_ts_to_bcd().  The former validates its inputs and returns EINVAL if any
values are out of range. The latter guarantees the returned data will be
valid BCD in a known format (4-digit years, etc).
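
A minimal userland sketch of the "validate, then convert" idea follows. It
is not the code from this commit: ex_bcd_valid() and ex_bcd_to_bin() are
hypothetical helpers that handle a single BCD byte, whereas the real
clock_bcd_to_ts() works on a whole bcd_clocktime structure.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical helper: true if both nibbles are decimal digits (0-9). */
static int
ex_bcd_valid(uint8_t bcd)
{
	return ((bcd & 0x0f) <= 9 && (bcd >> 4) <= 9);
}

/*
 * Hypothetical helper: convert one BCD byte to binary and range-check it.
 * Returns 0 and stores the value in *out, or returns EINVAL without
 * asserting or panicking when the hardware supplied garbage.
 */
static int
ex_bcd_to_bin(uint8_t bcd, int min, int max, int *out)
{
	int val;

	if (!ex_bcd_valid(bcd))
		return (EINVAL);
	val = (bcd >> 4) * 10 + (bcd & 0x0f);
	if (val < min || val > max)
		return (EINVAL);
	*out = val;
	return (0);
}
```

A driver reading a minutes register could call ex_bcd_to_bin(reg, 0, 59, &min)
and propagate EINVAL instead of tripping an assertion.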

A new bcd_clocktime structure is used with the new functions.  It is similar
to the original clocktime structure, but defines the fields holding BCD
values as uint8_t (uint16_t for year), and adds a PM flag for handling hours
using AM/PM mode.

PR:		224813
Differential Revision:	https://reviews.freebsd.org/D13730 (no reviewers)
2018-01-14 17:01:37 +00:00
Bjoern A. Zeeb
8e23158af7 Remove trailing whitespace.
No functional change.
2018-01-14 15:01:25 +00:00
Konstantin Belousov
fd94177c70 Add sysctl debug.kdb.stack_overflow to conveniently test kernel
handling of the kstack overflow.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-01-13 11:59:49 +00:00
Mateusz Guzik
1b54ffc8d2 sx: retry hard shared unlock just like in r327905 for rwlocks 2018-01-13 09:26:24 +00:00
Mateusz Guzik
84f2a8a4b4 rwlock: try regular read unlock even in the hard path
Saves on turnstile trips if the lock got more readers.
2018-01-13 00:05:31 +00:00
Jeff Roberson
ab3185d15e Implement NUMA support in uma(9) and malloc(9). Allocations from specific
domains can be done by the _domain() API variants.  UMA also supports a
first-touch policy via the NUMA zone flag.

The slab layer is now segregated by VM domains and is precise.  It handles
iteration for round-robin directly.  The per-cpu cache layer remains
a mix of domains according to where memory is allocated and freed.  Well
behaved clients can achieve perfect locality with no performance penalty.

The direct domain allocation functions have to visit the slab layer and
so require per-zone locks which come at some expense.

Reviewed by:	Attilio (a slightly older version)
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
2018-01-12 23:25:05 +00:00
Jeff Roberson
7a469c8ef3 Implement NUMA policy for kmem_*(9). This maintains compatibility with
reservations by giving each memory domain its own KVA space in vmem that
is naturally aligned on superpage boundaries.

Reviewed by:	alc, markj, kib  (some objections)
Sponsored by:	Netflix, Dell/EMC Isilon
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D13289
2018-01-12 23:13:55 +00:00
Jeff Roberson
af80820a57 Regenerate auto-generated files 2018-01-12 23:06:35 +00:00
Jeff Roberson
3f289c3fcf Implement 'domainset', a cpuset based NUMA policy mechanism. This allows
userspace to control NUMA policy administratively and programmatically.

Implement domainset based iterators in the page layer.

Remove the now legacy numa_* syscalls.

Clean up some header pollution created by having seq.h in proc.h.

Reviewed by:	markj, kib
Discussed with:	alc
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D13403
2018-01-12 22:48:23 +00:00
Mateusz Guzik
310f24d72a mtx: use fcmpset to cover setting MTX_CONTESTED 2018-01-12 13:40:50 +00:00
Mateusz Guzik
31c2c6e95e vfs: tidy up vdrop
Skip vfs_refcount_release_if_not_last if the interlock is held and just
go straight to refcount_release.

While here do cosmetic rearrangement of _vhold to better show it contains
equivalent behaviour.
2018-01-12 13:39:02 +00:00
Michael Tuexen
ce076a1f58 Ensure that the vnet is set when calling pru_sockaddr() and
pru_peeraddr().

This is already true when called via kern_getsockname() and
kern_getpeername(). This patch also sets it when they are called
via soo_fill_kinfo(). This is necessary, since the corresponding
functions for SCTP require the vnet to be set. Without this,
if a process having a wildcard-bound SCTP socket is
terminated and a core is written, the kernel panics.

Reviewed by:		bz
Differential Revision:	https://reviews.freebsd.org/D13652
2018-01-11 20:26:17 +00:00
Conrad Meyer
c02fc9607a mallocarray(9): panic if the requested allocation would overflow
Additionally, move the overflow check logic out to WOULD_OVERFLOW() for
consumers to have a common means of testing for overflowing allocations.
WOULD_OVERFLOW() should be a secondary check -- on 64-bit platforms, just
because an allocation won't overflow size_t does not mean it is a sane size
to request.  Callers should be imposing reasonable allocation limits far,
far, below overflow.
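
The two-level check recommended above might look like the following
userland sketch. All names are hypothetical (ex_would_overflow(),
ex_alloc_array(), EX_MAX_ALLOC), the limit is an arbitrary example, and the
sketch returns NULL where the kernel's mallocarray(9) would panic.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical policy limit imposed by the caller, far below overflow. */
#define EX_MAX_ALLOC	((size_t)1 << 20)

/* Hypothetical check: true if nmemb * size does not fit in a size_t. */
static bool
ex_would_overflow(size_t nmemb, size_t size)
{
	return (nmemb != 0 && size > SIZE_MAX / nmemb);
}

/* Hypothetical allocator: overflow is the secondary check, size the first. */
static void *
ex_alloc_array(size_t nmemb, size_t size)
{
	if (ex_would_overflow(nmemb, size))
		return (NULL);
	if (nmemb * size > EX_MAX_ALLOC)
		return (NULL);	/* representable, but not a sane request */
	return (malloc(nmemb * size));
}
```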

Discussed with:	emaste, jhb, kp
Sponsored by:	Dell EMC Isilon
2018-01-10 21:49:45 +00:00
John Baldwin
86bbef4379 Don't store shadow copies of per-process AIO limits.
Previously the AIO subsystem would save a snapshot of the currently
configured per-process limits the first time a process used AIO.  The
process would continue to use the snapshotted limits ignoring any
changes to the global limits during the rest of its lifetime.  This
change removes the snapshotted values and changes the AIO code to
always check the global values which can be toggled at runtime.
This means an administrator can now change the effective limits of
existing processes.  This is more consistent with how other limits
configured via sysctl work in FreeBSD.

Reviewed by:	asomers, kib
MFC after:	2 weeks
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D13819
2018-01-10 21:18:46 +00:00
John Baldwin
f54c5606b3 Allow the fast-path for disk AIO requests to fail requests.
- If aio_qphysio() returns a non-zero error code, fail the request rather
  than queueing it to the AIO kproc pool to be retried via the slow path.
  Currently this means that if vm_fault_quick_hold_pages() reports an
  error, EFAULT is returned from the fast-path rather than retrying the
  request in the slow path where it will still fail with EFAULT.
- If aio_qphysio() wishes to use the fast path for a device that doesn't
  support unmapped I/O but there are already the maximum number of
  such requests in flight, fail with EAGAIN as we do for other AIO
  resource limits rather than queueing the request to the AIO kproc pool.
- Move the opcode check for aio_qphysio() out of the caller and into
  aio_qphysio() to simplify some logic and remove two goto's while here.
  It also uses a whitelist (only supported for LIO_READ / LIO_WRITE)
  rather than a blacklist (skipped for LIO_SYNC).

PR:		217261
Submitted by:	jkim (an earlier version)
MFC after:	2 weeks
Sponsored by:	Chelsio Communications
2018-01-10 00:18:47 +00:00
John Baldwin
7e40918452 Simplify some logic by merging an if test with a subsequent switch.
Specifically, in aio_queue_file() the code was doing this:

   if (opcode == LIO_SYNC) {
       ...
   }

   switch (opcode) {
   ...
   case LIO_SYNC:
       ...
   }

This moves the body of the if statement into the LIO_SYNC case of the
switch statement.

MFC after:	2 weeks
Sponsored by:	Chelsio Communications
2018-01-10 00:02:06 +00:00
John Baldwin
8091e52b42 Add a counter to track in-flight AIO requests using unmapped I/O.
MFC after:	2 weeks
Sponsored by:	Chelsio Communications
2018-01-09 23:57:29 +00:00
Mark Johnston
78f57a9cde Generalize the gzio API.
We currently use a set of subroutines in kern_gzio.c to perform
compression of user and kernel core dumps. In the interest of adding
support for other compression algorithms (zstd) in this role without
complicating the API consumers, add a simple compressor API which can be
used to select an algorithm.

Also change the (non-default) GZIO kernel option to not enable
compressed user cores by default. It's not clear that such a default
would be desirable with support for multiple algorithms implemented,
and it's inconsistent in that it isn't applied to kernel dumps.

Reviewed by:	cem
Differential Revision:	https://reviews.freebsd.org/D13632
2018-01-08 21:27:41 +00:00
Ian Lepore
ac579135b0 Use EVENTHANDLER_DIRECT_INVOKE for [un]mount events, for better performance. 2018-01-07 18:07:22 +00:00
Ian Lepore
f031a3b25f Use EVENTHANDLER_DIRECT_INVOKE() for device events, for better performance. 2018-01-07 18:06:30 +00:00
Kristof Provost
fd91e076c1 Introduce mallocarray() in the kernel
Similar to calloc() the mallocarray() function checks for integer
overflows before allocating memory.
It does not zero memory, unless the M_ZERO flag is set.

Reviewed by:	pfg, vangyzen (previous version), imp (previous version)
Obtained from:	OpenBSD
Differential Revision:	https://reviews.freebsd.org/D13766
2018-01-07 13:21:01 +00:00
Gleb Smirnoff
b4f55763ce In sendfile_iodone() both pru_abort and sorele need to be executed
with proper VNET context set.

Reported by:	sbruno
MFC after:	2 weeks
2018-01-05 20:21:46 +00:00
John Baldwin
2da93c21ec Always use atomic_fetchadd() when updating per-user accounting values.
This avoids re-reading a variable after it has been updated via an
atomic op.  It is just a cosmetic cleanup as the read value was only
used to control a diagnostic printf that should rarely occur (if ever).
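
The pattern can be illustrated with C11 atomics in userland. This is a
sketch under assumptions, not the kernel code: ex_charge() is a
hypothetical function, and atomic_fetch_add() from <stdatomic.h> stands in
for the kernel's atomic_fetchadd family.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical accounting helper: add diff to *counter and report whether
 * the new total exceeds max.  The total is derived from the value returned
 * by the atomic op itself rather than from a second read of *counter,
 * which could already reflect another thread's update.
 */
static bool
ex_charge(atomic_long *counter, long diff, long max)
{
	long old;

	old = atomic_fetch_add(counter, diff);
	return (old + diff > max);
}
```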

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D13768
2018-01-04 22:07:58 +00:00
John Baldwin
3160862437 Report offset relative to the backing object for kinfo_vmentry structures.
For the pathname reported in kinfo_vmentry structures (kve_path), the
sysctl handlers walk the object chain to find the bottom-most VM object.
This permits a COW mapping of a file with dirty pages to report the
pathname of the originally mapped file.  Do the same for the object
offset (kve_offset) computing a cumulative offset during the same object
walk so that the reported offset is relative to the reported pathname.
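
The cumulative-offset walk can be sketched as below. The struct is a
hypothetical stand-in for a VM object (only the two fields the walk needs,
plus a path for illustration), and ex_bottom_object() is an assumed name,
not the sysctl handler's code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a VM object in a shadow chain. */
struct ex_object {
	struct ex_object *backing_object;	/* NULL at the bottom */
	uint64_t backing_object_offset;		/* offset into backing object */
	const char *path;			/* set only for the file object */
};

/*
 * Walk to the bottom-most object, accumulating each level's offset so the
 * result in *offsetp is relative to the object whose path is reported.
 */
static const struct ex_object *
ex_bottom_object(const struct ex_object *obj, uint64_t entry_offset,
    uint64_t *offsetp)
{
	uint64_t off = entry_offset;

	while (obj->backing_object != NULL) {
		off += obj->backing_object_offset;
		obj = obj->backing_object;
	}
	*offsetp = off;
	return (obj);
}
```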

Note that ptrace(PT_VM_ENTRY) already returns a cumulative offset
rather than the raw offset of the VM map entry.

Note also that this does not affect procstat -v output (even structured
output) since that output does not include the kve_offset field.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	DARPA / AFRL
Differential Revision:	https://reviews.freebsd.org/D13767
2018-01-04 21:59:34 +00:00
Mike Karels
d626b50b9d make SW_WATCHDOG dynamic
Enable the hardclock-based watchdog previously conditional on the
SW_WATCHDOG option whenever hardware watchdogs are not found, and
watchdogd attempts to enable the watchdog. The SW_WATCHDOG option
still causes the software watchdog to be enabled even if there is a
hardware watchdog. This does not change the other software-based
watchdog enabled by the --softtimeout option to watchdogd.

Note that the code to reprime the watchdog during kernel core dumps is
no longer conditional on SW_WATCHDOG. I think this was previously a bug.

Reviewed by:	imp alfred bjk
MFC after:	1 week
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D13713
2018-01-03 00:56:30 +00:00
Antoine Brodin
1b25176cbc sysctl_kern_proc_args: do not take the fast path if p_args is NULL
In this case it falls back to reading ps_strings
2018-01-01 21:25:01 +00:00
Colin Percival
d5d7606c0c Use the TSLOG framework to record entry/exit timestamps for DELAY and
_vprintf; these functions are called in many places and can contribute
meaningfully to the total time spent booting.
2017-12-31 09:24:41 +00:00
Colin Percival
49a4e3b4b4 Instrument thread creations for the benefit of the TSLOG framework.
This assists in tracking time spent while the boot is being "held" waiting
for something to happen.
2017-12-31 09:24:11 +00:00
Colin Percival
8b8a7c43a9 Instrument "boot holds" for the benefit of the TSLOG framework. These
are places where the "main thread" of the booting kernel (either the
thread which later becomes swapper or the thread which later becomes
init) has to stop and wait for action to take place in another thread
before continuing.

There are currently three such holds:
1. The intr_config_hooks SYSINIT waits for hooks registered via the
config_intrhook_establish function; this allows (typically) devices
which need interrupts enabled to complete their initialization to do
so before root is mounted.

2. The g_waitidle function waits for the GEOM event queue to be empty;
this ensures that all of the disks which have been attached have been
tasted before we attempt to mount root.

3. The vfs_mountroot_wait function (in addition to calling g_waitidle)
waits for holds registered via root_mount_hold; among other things, this
is used by the USB subsystem to ensure that we don't fail to mount root
if it's located on a USB disk which takes a while to probe.
2017-12-31 09:23:52 +00:00
Colin Percival
a21a2da599 Teach makeobjops.awk to accept PROLOG and EPILOG blocks before
METHOD and STATICMETHOD declarations; that code will be inserted
into the dispatch function before and after the method call.

Use this functionality and the TSLOG framework to record DEVICE_ATTACH
and DEVICE_PROBE entry/exit timestamps.
2017-12-31 09:23:19 +00:00
Colin Percival
6032e08810 Use the TSLOG framework to record entry/exit timestamps for machine
independent functions with important roles in the early boot process:
mi_startup (with the "exit" recorded when it becomes swapper),
start_init (with the "exit" recorded when the thread is about to
"return" into the newly created init process), vfs_mountroot, and
vfs_mountroot_wait.
2017-12-31 09:22:31 +00:00
Colin Percival
e31e71991a Code for recording timestamps of events, especially function entries/exits.
This is a very primitive system, intended for use in measuring performance
during the early system boot, before more sophisticated tools like DTrace
or infrastructure like kernel memory allocation and mutexes are available.

Because this code records pointers to strings rather than copying strings
(in order to keep the memory usage more manageable), if a kernel module is
unloaded after logging an event, Bad Things can happen.  Users are advised
to not do that.
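
To make the pointer-recording hazard concrete, here is a deliberately tiny
sketch of such a log. It is not the TSLOG implementation: the record
layout, the buffer size, and the names ex_record and ex_tslog() are all
assumptions, and the cycle count is passed in by the caller.

```c
#include <stddef.h>
#include <stdint.h>

#define EX_NRECORDS	4	/* hypothetical, tiny for illustration */

struct ex_record {
	const char *name;	/* pointer only; the string is NOT copied */
	int is_exit;		/* 0 = function entry, 1 = function exit */
	uint64_t cycles;	/* timestamp supplied by the caller */
};

static struct ex_record ex_log[EX_NRECORDS];
static size_t ex_next;

/* Record one event in the static buffer; returns -1 once it is full. */
static int
ex_tslog(const char *name, int is_exit, uint64_t cycles)
{
	if (ex_next >= EX_NRECORDS)
		return (-1);
	ex_log[ex_next].name = name;
	ex_log[ex_next].is_exit = is_exit;
	ex_log[ex_next].cycles = cycles;
	ex_next++;
	return (0);
}
```

Because only the pointer is stored, a record whose name lived in an
unloaded module would dangle, which is the "Bad Things" case described
above.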

Since cycle counts from the early kernel boot are used as an initial entropy
source, publishing this information to userland could result in inadequate
entropy being kept private to the kernel RNG.  Users are advised to not
enable this on systems with untrusted users.

Discussed on:	freebsd-current
2017-12-31 09:21:01 +00:00
Pedro F. Giffuni
0879ca728a sysv_{ipc|shm}: update the NetBSD VCS tags to match nearer our files.
Both files originated in NetBSD:

sysv_ipc.c CVS 1.9:
Most of their changes don't apply to us as we already have similar
changes. This is a better reference for future merges.

sysv_shm.c CVS 1.39:
Most of their changes don't apply to our code but interestingly this
revision merged our changes and is a better point for reference.

Move the VCS tags to the position recommended in our committers guide
(section 8).

No functional change.
2017-12-31 03:34:00 +00:00
Mateusz Guzik
efa9f177f5 locks: adjust loop limit check when waiting for readers
The check was for the exact value, but since the counter started being
incremented by the number of readers it could have jumped over.
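
The failure mode is easy to reproduce in isolation. In this sketch (the
function name, the iteration cap, and the example numbers are all
hypothetical), a counter that advances by more than one per iteration can
step over an exact-match limit, while a greater-or-equal check still
fires.

```c
#include <stdbool.h>

#define EX_CAP	1000u	/* hypothetical safety cap on iterations */

/*
 * Returns true if the limit check fires before EX_CAP iterations.
 * 'step' models the counter being advanced by the number of readers.
 */
static bool
ex_loop_exits(unsigned limit, unsigned step, bool exact)
{
	unsigned i, n;

	for (i = 0, n = 0; n < EX_CAP; n++) {
		if (exact ? (i == limit) : (i >= limit))
			return (true);
		i += step;
	}
	return (false);
}
```

With a limit of 10 and a step of 3 the counter takes the values 0, 3, 6, 9,
12 and so on, so the exact comparison never matches.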
2017-12-31 02:31:01 +00:00
Mateusz Guzik
cde25ed4cd sx: fix up non-smp compilation after r327397 2017-12-31 01:59:56 +00:00
Mateusz Guzik
28f1a9e3ff locks: re-check the reason to go to sleep after locking sleepq/turnstile
In both rw and sx locks we always go to sleep if the lock owner is not
running.

We do spin for some time if the lock is read-locked.

However, if we decide to go to sleep due to the lock owner being off cpu
and after sleepq/turnstile gets acquired the lock is read-locked, we should
fall back to the aforementioned wait.
2017-12-31 00:47:04 +00:00
Mateusz Guzik
fb10612355 sx: read the SX_NOADAPTIVE flag and Giant ownership only once
These used to be read multiple times when waiting for the lock to become
free, which had the potential to issue completely avoidable traffic.
2017-12-31 00:37:50 +00:00
Mateusz Guzik
15140a8ade mtx: deduplicate indefinite wait check in spinlocks and thread lock 2017-12-31 00:34:29 +00:00
Mateusz Guzik
1f4d28c7ea mtx: pre-read the lock value in thread_lock_flags_
Since this function is effectively slow path, if we get here the lock is most
likely already taken in which case it is cheaper to not blindly attempt the
atomic op.
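
The pre-read idea (often called test-and-test-and-set) can be sketched with
C11 atomics. This is an illustration only: ex_try_lock() and EX_UNOWNED
are hypothetical names, and the code does not reproduce
thread_lock_flags_.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define EX_UNOWNED	((uintptr_t)0)	/* hypothetical "lock is free" value */

/*
 * Hypothetical slow-path attempt: read the lock word first and only issue
 * the atomic compare-and-swap when the lock looks free.  A plain load is
 * cheaper than a failed atomic op on a contended cache line.
 */
static bool
ex_try_lock(_Atomic uintptr_t *lock, uintptr_t tid)
{
	uintptr_t v;

	v = atomic_load_explicit(lock, memory_order_relaxed);
	if (v != EX_UNOWNED)
		return (false);		/* taken: skip the atomic op */
	return (atomic_compare_exchange_strong_explicit(lock, &v, tid,
	    memory_order_acquire, memory_order_relaxed));
}
```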

While here move hwpmc probe out of the loop to match other primitives.
2017-12-31 00:33:28 +00:00
Mateusz Guzik
80c39f6c37 rwlock: tidy up __rw_runlock_hard similarly to r325921 2017-12-31 00:31:14 +00:00