15915 Commits

Author SHA1 Message Date
jeff
ba27b5187b Make v_wire_count a per-cpu counter(9) counter. This eliminates a
significant source of cache line contention from vm_page_alloc().  Use
accessors and vm_page_unwire_noq() so that the mechanism can be easily
changed in the future.

Reviewed by:	markj
Discussed with:	kib, glebius
Tested by:	pho (earlier version)
Sponsored by:	Netflix, Dell/EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14273
2018-02-12 22:53:00 +00:00
ian
e1340a314d Add a new sysctl, debug.clock_do_io, to allow manully triggering a one-shot
read or write of all registered realtime clocks.  In the read case, the
values read are simply discarded.  For writes, there's no alternative but
to actually write the current system time to the device.
2018-02-12 17:41:11 +00:00
ian
e5909c170b Add a set of convenience routines for RTC drivers to use for debug output,
and a debug.clock_show_io sysctl to control debugging output.
2018-02-12 17:33:14 +00:00
ian
19e2991d59 Replace the existing print_ct() private debugging function with a set of
three public functions to format and print the three major data structures
used by realtime clock drivers (clocktime, bcd_clocktime, and timespec).
2018-02-12 16:25:56 +00:00
mckusick
83f7902028 Merge biodone_finish() back into biodone(). The primary purpose is
to make the order of operations clearer to avoid the race condition
that was fixed in r328914. In particular, this commit corrects a
similar race that existed in the soft updates callback.

Doing some sleuthing through the SVN repository, it appears that
bufdone_finish() was added to support XFS:

------------------------------------------------------------------------
r153192 | rodrigc | 2005-12-06 19:39:08 -0800 (Tue, 06 Dec 2005) | 13 lines

Changes imported from XFS for FreeBSD project:
- add fields to struct buf (needed by XFS)
    - 3 private fields: b_fsprivate1, b_fsprivate2, b_fsprivate3
    - b_pin_count, count of pinned buffer

- add new B_MANAGED flag
- add breada() function to initiate asynchronous I/O on read-ahead blocks.
- add bufdone_finish(), bpin(), bunpin_wait() functions

Patches provided by:    kan
Reviewed by:            phk
Silence on:             arch@

------------------------------------------------------------------------

It does not appear to ever have been used for anything else.  XFS was
disconnected in r241607:

------------------------------------------------------------------------
r241607 | attilio | 2012-10-16 03:04:00 -0700 (Tue, 16 Oct 2012) | 5 lines

Disconnect non-MPSAFE XFS from the build in preparation for dropping
GIANT from VFS.

This is not targeted for MFC.

------------------------------------------------------------------------

and removed entirely in r247631:

------------------------------------------------------------------------
r247631 | attilio | 2013-03-02 07:33:54 -0800 (Sat, 02 Mar 2013) | 5 lines

Garbage collect XFS bits which are now already completely disconnected
from the tree since few months.

This is not targeted for MFC.

------------------------------------------------------------------------

Since XFS support is gone, there is no reason to retain biodone_finish().

Suggested by: Warner Losh (imp)
Discussed with: cem, kib
Tested by: Peter Holm (pho)
2018-02-09 19:50:47 +00:00
glebius
2c573b6df7 Fix boot_pages exhaustion on machines with many domains and cores, where
size of UMA zone allocation is greater than page size. In this case zone
of zones can not use UMA_MD_SMALL_ALLOC, and we  need to postpone switch
off of this zone from startup_alloc() until full launch of VM.

o Always supply number of VM zones to uma_startup_count(). On machines
  with UMA_MD_SMALL_ALLOC ignore it completely, unless zsize goes over
  a page. In the latter case account VM zones for number of allocations
  from the zone of zones.
o Rewrite startup_alloc() so that it will immediately switch off from
  itself any zone that is already capable of running real alloc.
  In worst case scenario we may leak a single page here. See comment
  in uma_startup_count().
o Hardcode call to uma_startup2() into vm_mem_init(). Otherwise some
  extra SYSINITs, e.g. vm_page_init() may sneak in before.
o While here, remove uma_boot_pages_mtx. With recent changes to boot
  pages calculation, we are guaranteed to use all of the boot_pages
  in the early single threaded stage.

Reported & tested by:	mav
2018-02-09 04:45:39 +00:00
avg
75f02086d1 exec_map_first_page: fix an inverse condition introduced in r254138
While the bug itself was serious, as we could either pass a non-busied
page to vm_pager_get_pages() or leak a busy page, it could only be
triggered under a very rare condition where the page is already inserted
into the object, but it is not valid yet.

Reviewed by:	kib
MFC after:	2 weeks
2018-02-07 21:51:59 +00:00
glebius
ed8d237f4f Fix three miscalculations in amount of boot pages:
o Most of startup zones have struct uma_slab embedded into the slab,
  so provide macro UMA_SLAB_SPACE and use it instead of UMA_SLAB_SIZE,
  when calculating how many pages would certain kind of allocations
  require. Some zones are offpage, so we might have a positive inaccuracy.
o The keg for the zone of zones is allocated "dynamically", so we
  need +1 when calculating amount of pages for kegs. [1]
o The zones of zones and zones of kegs have arbitrary alignment of 32,
  and this also needs to be accounted for. [2]

While here, spread more comments and improve diagnostic messages.

Reported by:	pho [1], jtl [2]
2018-02-07 18:32:51 +00:00
markj
d241ba38fa Dequeue wired pages lazily.
Previously, wiring a page would cause it to be removed from its page
queue. In the common case, unwiring causes it to be enqueued at the tail
of that page queue. This change modifies vm_page_wire() to not dequeue
the page, thus avoiding the highly contended page queue locks. Instead,
vm_page_unwire() takes care of requeuing the page as a single operation,
and the page daemon dequeues wired pages as they are encountered during
a queue scan to avoid needlessly revisiting them later. For pages in
PQ_ACTIVE we do even better, since a requeue is unnecessary.

The change improves scalability for some common workloads. For instance,
threads wiring pages into the buffer cache no longer need to modify
global page queues, and unwiring is usually done by the bufspace thread,
so concurrency is not as much of an issue. As another example, many
sysctl handlers wire the output buffer to avoid faults on copyout, and
since the buffer is likely to be in PQ_ACTIVE, we now entirely avoid
modifying the page queue in this case.

The change also adds a block comment describing some properties of
struct vm_page's reference counters, and the busy lock.

Reviewed by:	jeff
Discussed with:	alc, kib
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D11943
2018-02-07 16:57:10 +00:00
ian
7f4e81c493 Use const pointers for input data not modified by clock utility functions. 2018-02-06 22:17:01 +00:00
jeff
e67ec0d694 Use per-domain locks for vm page queue free. Move paging control from
global to per-domain state.  Protect reservations with the free lock
from the domain that they belong to.  Refactor to make vm domains more
of a first class object.

Reviewed by:    markj, kib, gallatin
Tested by:      pho
Sponsored by:   Netflix, Dell/EMC Isilon
Differential Revision:  https://reviews.freebsd.org/D14000
2018-02-06 22:10:07 +00:00
glebius
59ed8012f8 Fix boot_pages calculation for machines that don't have UMA_MD_SMALL_ALLOC.
o Call uma_startup1() after initializing kmem, vmem and domains.
o Include 8 eight VM startup pages into uma_startup_count() calculation.
o Account for vmem_startup() and vm_map_startup() preallocating pages.
o Account for extra two allocations done by kmem_init() and vmem_create().
o Hardcode the place of execution of vm_radix_reserve_kva(). Using SYSINIT
  allowed several other SYSINITs to sneak in before it, thus bumping
  requirement for amount of boot pages.
2018-02-06 22:06:59 +00:00
glebius
baecd1a4b3 Followup on r302393 by cperciva, improving calculation of boot pages required
for UMA startup.

o Introduce another stage of UMA startup, which is entered after
  vm_page_startup() finishes. After this stage we don't yet enable buckets,
  but we can ask VM for pages. Rename stages to meaningful names while here.
  New list of stages: BOOT_COLD, BOOT_STRAPPED, BOOT_PAGEALLOC, BOOT_BUCKETS,
  BOOT_RUNNING.
  Enabling page alloc earlier allows us to dramatically reduce number of
  boot pages required. What is more important number of zones becomes
  consistent across different machines, as no MD allocations are done before
  the BOOT_PAGEALLOC stage. Now only UMA internal zones actually need to use
  startup_alloc(), however that may change, so vm_page_startup() provides
  its need for early zones as argument.
o Introduce uma_startup_count() function, to avoid code duplication. The
  functions calculates sizes of zones zone and kegs zone, and calculates how
  many pages UMA will need to bootstrap.
  It counts not only of zone structures, but also of kegs, slabs and hashes.
o Hide uma_startup_foo() declarations from public file.
o Provide several DIAGNOSTIC printfs on boot_pages usage.
o Bugfix: when calculating zone of zones size use (mp_maxid + 1) instead of
  mp_ncpus. Use resulting number not only in the size argument to zone_ctor()
  but also as args.size.

Reviewed by:		imp, gallatin (earlier version)
Differential Revision:	https://reviews.freebsd.org/D14054
2018-02-06 04:16:00 +00:00
mckusick
a7ac77d8a7 Occasional cylinder-group check-hash errors were being reported on
systems running with a heavy filesystem load. Tracking down this
bug was elusive because there were actually two problems. Sometimes
the in-memory check hash was wrong and sometimes the check hash
computed when doing the read was wrong. The occurrence of either
error caused a check-hash mismatch to be reported.

The first error was that the check hash in the in-memory cylinder
group was incorrect. This error was caused by the following
sequence of events:

- We read a cylinder-group buffer and the check hash is valid.
- We update its cg_time and cg_old_time which makes the in-memory
  check-hash value invalid but we do not mark the cylinder group dirty.
- We do not make any other changes to the cylinder group, so we
  never mark it dirty, thus do not write it out, and hence never
  update the incorrect check hash for the in-memory buffer.
- Later, the buffer gets freed, but the page with the old incorrect
  check hash is still in the VM cache.
- Later, we read the cylinder group again, and the first page with
  the old check hash is still in the VM cache, but some other pages
  are not, so we have to do a read.
- The read does not actually get the first page from disk, but rather
  from the VM cache, resulting in the old check hash in the buffer.
- The value computed after doing the read does not match causing the
  error to be printed.

The fix for this problem is to only set cg_time and cg_old_time as
the cylinder group is being written to disk. This keeps the in-memory
check-hash valid unless the cylinder group has had other modifications
which will require it to be written with a new check hash calculated.
It also requires that the check hash be recalculated in the in-memory
cylinder group when it is marked clean after doing a background write.

The second problem was that the check hash computed at the end of the
read was incorrect because the calculation of the check hash on
completion of the read was being done too soon.

- When a read completes we had the following sequence:

  - bufdone()
  -- b_ckhashcalc (calculates check hash)
  -- bufdone_finish()
  --- vfs_vmio_iodone() (replaces bogus pages with the cached ones)

- When we are reading a buffer where one or more pages are already
  in memory (but not all pages, or we wouldn't be doing the read),
  the I/O is done with bogus_page mapped in for the pages that exist
  in the VM cache. This mapping is done to avoid corrupting the
  cached pages if there is any I/O overrun. The vfs_vmio_iodone()
  function is responsible for replacing the bogus_page(s) with the
  cached ones. But we were calculating the check hash before the
  bogus_page(s) were replaced. Hence, when we were calculating the
  check hash, we were partly reading from bogus_page, which means
  we calculated a bad check hash (e.g., because multiple pages have
  been mapped to bogus_page, so its contents are indeterminate).

The second fix is to move the check-hash calculation from bufdone()
to bufdone_finish() after the call to vfs_vmio_iodone() so that it
computes the check hash over the correct set of pages.

With these two changes, the occasional cylinder-group check-hash
errors are gone.

Submitted by: David Pfitzner <dpfitzner@netflix.com>
Reviewed by: kib
Tested by: David Pfitzner
2018-02-06 00:19:46 +00:00
jhb
6145bc74c1 Ignore relocation tables for non-memory-resident sections.
As a followup to r328101, ignore relocation tables for ELF object
sections that are not memory resident.  For modules loaded by the
loader, ignore relocation tables whose associated section was not
loaded by the loader (sh_addr is zero).  For modules loaded at runtime
via kldload(2), ignore relocation tables whose associated section is
not marked with SHF_ALLOC.

Reported by:	Mori Hiroki <yamori813@yahoo.co.jp>, adrian
Tested on:	mips, mips64
MFC after:	1 month
Sponsored by:	DARPA / AFRL
2018-02-05 23:35:33 +00:00
jhb
3f3863ea7c Always give ELF brands a chance to veto a match.
If a brand provides a header_supported hook, check it when trying to
find a brand based on a matching interpreter as well as in the final
loop for the fallback brand. Previously a brand might reject a binary
via a header_supported hook in one of the earlier loops, but still be
chosen by one of these later loops.

Reviewed by:	kib
Obtained from:	CheriBSD
MFC after:	2 weeks
Sponsored by:	DARPA / AFRL
Differential Revision:	https://reviews.freebsd.org/D13945
2018-02-05 23:27:42 +00:00
brooks
dc50858dea Reduce duplication in extattr_*_(file|link) syscalls.
Reviewed by:	rwatson
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14173
2018-02-05 19:06:34 +00:00
brooks
44d1111709 ANSIfy syscall implementations.
Reviewed by:	rwatson
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14172
2018-02-05 18:58:55 +00:00
brooks
59fcf4b241 Add kern.ipc.{msqids,semsegs,sema} sysctls for FreeBSD32.
Stop leaking kernel pointers though theses sysctls and make sure that the
padding in the structures is zeroed on allocation to avoid other leaks.

Reviewed by:	gordon, kib
Obtained from:	CheriBSD
MFC after:	1 week
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D13459
2018-02-02 18:03:12 +00:00
hselasky
91b534f7d7 Slightly bump the maximum OID path for loading tunable SYSCTLs.
Coming updates to the mlx5en(4) driver will require this.

MFC after:	1 week
Sponsored by:	Mellanox Technologies
2018-02-02 12:42:46 +00:00
mckusick
dca0cf2869 One of the vnode fields listed by vn_printf is the union of pointers
whose type depends on the type of vnode. Correct vn_printf so that
it correctly identifies the name of the pointer that it is printing.

Submitted by: Andreas Longwitz <longwitz at incore.de>
MFC after: 1 week
2018-01-31 22:49:50 +00:00
emaste
187eb493d5 makesyscalls: permit a range of syscall numbers for UNIMPL
Some ABIs have large gaps in syscall numbers.  Allow gaps to be filled
as ranges of UNIMPL, with an entry like:

248-1023	AUE_NULL	UNIMPL	unimplemented

Reviewed by:	jhb, gnn
Sponsored by:	Turing Robotic Industries Inc.
Differential Revision:	https://reviews.freebsd.org/D14122
2018-01-30 18:29:38 +00:00
bdrewery
59f27534cc Don't use an .OBJDIR for 'make sysent'.
Reported by:	emaste, jhb
Sponsored by:	Dell EMC
2018-01-29 19:14:15 +00:00
lwhsu
520dbdaa53 Fix LINT build after r328508, add forgotten part in format string
Reviewed by:	delphij
Differential Revision:	https://reviews.freebsd.org/D14089
2018-01-29 02:29:08 +00:00
imp
bfc21ccd43 Create deprecation management functions.
gone_in(majar, msg);	If we're running in FreeBSD major, tell
			the user this code may be deleted soon.
			If we're running in FreeBSD major - 1,
			the the user is deprecated and will
			be gone in major.
			Otherwise say nothing.

gone_in_dev(dev, major, msg) Just like gone_in, except use device_printf.

New tunable / sysctl debug.oboslete_panic: 0 - don't panic,
	1 - panic in major or newer , 2 - panic in major - 1 or newer
	default: 0

if NO_OBSOLETE_CODE is defined, then both of these turn into compile
time errors when building for major. Add options NO_OBSOLETE_CODE to
kernel build system.

This lets us tag code that's going away so users know it will be gone,
as well as automatically manage things.

Differential Review: https://reviews.freebsd.org/D13818
2018-01-29 00:14:39 +00:00
imp
3506c70696 Add the DF_SUSPENDED flag to flags that are printed. 2018-01-28 05:13:17 +00:00
mckusick
b60c21e66e For many years the message "fsync: giving up on dirty" has occationally
appeared on UFS/FFS filesystems. In some cases it was promptly followed
by a panic of "softdep_deallocate_dependencies: dangling deps". This fix
should eliminate both of these occurences.

Submitted by: Andreas Longwitz <longwitz at incore.de>
Reviewed by:  kib
Tested by:    Peter Holm (pho)
PR:           225423
MFC after:    1 week
2018-01-26 18:17:11 +00:00
lwhsu
344189d5c0 Fix build for architectures where size_t is not unsigned long
Reviewed by:	cem
Differential Revision:	https://reviews.freebsd.org/D14045
2018-01-25 06:37:14 +00:00
cem
2b1bc6707d malloc(9): Change nominal size to size_t to match standard C
No functional change -- size_t matches unsigned long on all platforms.

Reported by:	bde
Discussed with:	jhb
Sponsored by:	Dell EMC Isilon
2018-01-24 19:37:18 +00:00
wma
86a31a6bb4 Reverting r328320 2018-01-24 13:57:01 +00:00
wma
d8d083c4f2 ULE: provide defaults to ts_cpu
Fix a bug when the system has no CPU 0. When created, threads were implicitly assigned to CPU 0.
This had no practical effect since a real CPU was chosen immediately by the scheduler. However,
on systems without a CPU 0, sched_ule attempted to access the scheduler queue of the "old" CPU
when assigned the initial choice of the old one. This caused an attempt to use illegal memory
and a crash (or, more usually, a deadlock). Fix this by assigned new threads to the BSP
explicitly and add some asserts to see that this problem does not recur.

Authored by:           Nathan Whitehorn <nwhitehorn@freebsd.org>
Submitted by:          Wojciech Macek <wma@semihalf.com>
Obtained from:         Semihalf
Differential revision: https://reviews.freebsd.org/D13932
2018-01-24 07:54:05 +00:00
pfg
6f66677652 Forgot to sort here in r328238. 2018-01-22 02:26:10 +00:00
pfg
f0c6025eb6 Unsign some values related to allocation.
When allocating memory through malloc(9), we always expect the amount of
memory requested to be unsigned as a negative value would either stand for
an error or an overflow.
Unsign some values, found when considering the use of mallocarray(9), to
avoid unnecessary casting. Also consider that indexes should be of
at least the same size/type as the upper limit they pretend to index.

MFC after:	3 weeks
2018-01-22 02:08:10 +00:00
pfg
ced875130d Revert r327828, r327949, r327953, r328016-r328026, r328041:
Uses of mallocarray(9).

The use of mallocarray(9) has rocketed the required swap to build FreeBSD.
This is likely caused by the allocation size attributes which put extra pressure
on the compiler.

Given that most of these checks are superfluous we have to choose better
where to use mallocarray(9). We still have more uses of mallocarray(9) but
hopefully this is enough to bring swap usage to a reasonable level.

Reported by:	wosch
PR:		225197
2018-01-21 15:42:36 +00:00
nwhitehorn
e79f2b9178 Remove SFBUF_OPTIONAL_DIRECT_MAP and such hacks, replacing them across the
kernel by PHYS_TO_DMAP() as previously present on amd64, arm64, riscv, and
powerpc64. This introduces a new MI macro (PMAP_HAS_DMAP) that can be
evaluated at runtime to determine if the architecture has a direct map;
if it does not (or does) unconditionally and PMAP_HAS_DMAP is either 0 or
1, the compiler can remove the conditional logic.

As part of this, implement PHYS_TO_DMAP() on sparc64 and mips64, which had
similar things but spelled differently. 32-bit MIPS has a partial direct-map
that maps poorly to this concept and is unchanged.

Reviewed by:		kib
Suggestions from:	marius, alc, kib
Runtime tested on:	amd64, powerpc64, powerpc, mips64
2018-01-19 17:46:31 +00:00
avg
d57a913b89 correct read-ahead calculations in vfs_bio_getpages
Previously the calculations were done as if the requested region
ended at the start of the last requested page, not its end.
The problem as actually quite minor as it affected only stats and
page prefaulting, not the actual page data, and only with specific
parameters.

Reviewed by:	kib (previous version)
MFC after:	2 weeks
2018-01-18 12:59:04 +00:00
wma
e93d0fe0cf KDB: restart only CPUs stopped by KDB
There is a case when not all CPUs went online. In that situation,
restart only APs which were operational before entering KDB.

Created by:            Wojciech Macek <wma@semihalf.com>
Obtained from:         Semihalf
Reviewed by:           nwhitehorn
Differential revision: https://reviews.freebsd.org/D13949
Sponsored by:          QCM Technologies
2018-01-18 07:38:54 +00:00
jhb
1248834695 Require the SHF_ALLOC flag for program sections from kernel object modules.
ELF object files can contain program sections which are not supposed
to be loaded into memory (e.g. .comment).  Normally the static linker
uses these flags to decide which sections are allocated to loadable
program segments in ELF binaries and shared objects (including kernels
on all architectures and kernel modules on architectures other than
amd64).

Mapping ELF object files (such as amd64 kernel modules) into memory
directly is a bit of a grey area.  ELF object files are intended to be
used as inputs to the static linker.  As a result, there is not a
standardized definition for what the memory layout of an ELF object
should be (none of the section headers have valid virtual memory
addresses for example).

The kernel and loader were not checking the SHF_ALLOC flag but loading
any program sections with certain types such as SHT_PROGBITS.  As a
result, the kernel and loader would load into RAM some sections that
weren't marked with SHF_ALLOC such as .comment that are not loaded
into RAM for kernel modules on other architectures (which are
implemented as ELF shared objects).  Aside from possibly requiring
slightly more RAM to hold a kernel module this does not affect runtime
correctness as the kernel relocates symbols based on the layout it
uses.

Debuggers such as gdb and lldb do not extract symbol tables from a
running process or kernel.  Instead, they replicate the memory layout
of ELF executables and shared objects and use that to construct their
own symbol tables.  For executables and shared objects this works
fine.  For ELF objects the current logic in kgdb (and probably lldb
based on a simple reading) assumes that only sections with SHF_ALLOC
are memory resident when constructing a memory layout.  If the
debugger constructs a different memory layout than the kernel, then it
will compute different addresses for symbols causing symbols in the
debugger to appear to have the wrong values (though the kernel itself
is working fine).  The current port of mdb does not check SHF_ALLOC as
it replicates the kernel's logic in its existing kernel support.

The bfd linker sorts the sections in ELF object files such that all of
the allocated sections (sections with SHF_ALLOCATED) are placed first
followed by unallocated sections.  As a result, when kgdb composed a
memory layout using only the allocated sections, this layout happened
to match the layout used by the kernel and loader.  The lld linker
does not sort the sections in ELF object files and mixed allocated and
unallocated sections.  This resulted in kgdb composing a different
memory layout than the kernel and loader.

We could either patch kgdb (and possibly in the future lldb) to use
custom handling when generating memory layouts for kernel modules that
are ELF objects, or we could change the kernel and loader to check
SHF_ALLOCATED.  I chose the latter as I feel we shouldn't be loading
things into RAM that the module won't use.  This should mostly be a
NOP when linking with bfd but will allow the existing kgdb to work
with amd64 kernel modules linked with lld.

Note that we only require SHF_ALLOC for "program" sections for types
like SHT_PROGBITS and SHT_NOBITS.  Other section types such as symbol
tables, string tables, and relocations must also be loaded and are not
marked with SHF_ALLOC.

Reported by:	np
Reviewed by:	kib, emaste
MFC after:	1 month
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D13926
2018-01-17 22:51:59 +00:00
jhb
0aaf564f94 Use long for the last argument to VOP_PATHCONF rather than a register_t.
pathconf(2) and fpathconf(2) both return a long.  The kern_[f]pathconf()
functions now accept a pointer to a long value rather than modifying
td_retval directly.  Instead, the system calls explicitly store the
returned long value in td_retval[0].

Requested by:	bde
Reviewed by:	kib
Sponsored by:	Chelsio Communications
2018-01-17 22:36:58 +00:00
pfg
50d0301ca5 kern: make some use of mallocarray(9).
Focus on code where we are doing multiplications within malloc(9). None of
these ire likely to overflow, however the change is still useful as some
static checkers can benefit from the allocation attributes we use for
mallocarray.

This initial sweep only covers malloc(9) calls with M_NOWAIT. No good
reason but I started doing the changes before r327796 and at that time it
was convenient to make sure the sorrounding code could handle NULL values.

X-Differential revision: https://reviews.freebsd.org/D13837
2018-01-15 21:18:04 +00:00
ian
9c7cac637a Add RTC clock conversions for BCD values, with non-panic validation.
RTC clock hardware frequently uses BCD numbers.  Currently the low-level
bcd2bin() and bin2bcd() functions will KASSERT if given out-of-range BCD
values.  Every RTC driver must implement its own code for validating the
unreliable data coming from the hardware to avoid a potential kernel panic.

This change introduces two new functions, clock_bcd_to_ts() and
clock_ts_to_bcd().  The former validates its inputs and returns EINVAL if any
values are out of range. The latter guarantees the returned data will be
valid BCD in a known format (4-digit years, etc).

A new bcd_clocktime structure is used with the new functions.  It is similar
to the original clocktime structure, but defines the fields holding BCD
values as uint8_t (uint16_t for year), and adds a PM flag for handling hours
using AM/PM mode.

PR:		224813
Differential Revision:	https://reviews.freebsd.org/D13730 (no reviewers)
2018-01-14 17:01:37 +00:00
bz
8d499a89c5 Remove trailing whitespace.
No functional change.
2018-01-14 15:01:25 +00:00
kib
78904c70bb Add sysctl debug.kdb.stack_overflow to conveniently test kernel
handling of the kstack overflow.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-01-13 11:59:49 +00:00
mjg
00f02d857e sx: retry hard shared unlock just like in r327905 for rwlocks 2018-01-13 09:26:24 +00:00
mjg
6f29c630d9 rwlock: try regular read unlock even in the hard path
Saves on turnstile trips if the lock got more readers.
2018-01-13 00:05:31 +00:00
jeff
f375b4dd66 Implement NUMA support in uma(9) and malloc(9). Allocations from specific
domains can be done by the _domain() API variants.  UMA also supports a
first-touch policy via the NUMA zone flag.

The slab layer is now segregated by VM domains and is precise.  It handles
iteration for round-robin directly.  The per-cpu cache layer remains
a mix of domains according to where memory is allocated and freed.  Well
behaved clients can achieve perfect locality with no performance penalty.

The direct domain allocation functions have to visit the slab layer and
so require per-zone locks which come at some expense.

Reviewed by:	Attilio (a slightly older version)
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
2018-01-12 23:25:05 +00:00
jeff
e7c9f84113 Implement NUMA policy for kmem_*(9). This maintains compatibility with
reservations by giving each memory domain its own KVA space in vmem that
is naturally aligned on superpage boundaries.

Reviewed by:	alc, markj, kib  (some objections)
Sponsored by:	Netflix, Dell/EMC Isilon
Tested by;	pho
Differential Revision:	https://reviews.freebsd.org/D13289
2018-01-12 23:13:55 +00:00
jeff
058d03378a Regenerate auto-generated files 2018-01-12 23:06:35 +00:00
jeff
94c7af8ca2 Implement 'domainset', a cpuset based NUMA policy mechanism. This allows
userspace to control NUMA policy administratively and programmatically.

Implement domainset based iterators in the page layer.

Remove the now legacy numa_* syscalls.

Cleanup some header polution created by having seq.h in proc.h.

Reviewed by:	markj, kib
Discussed with:	alc
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D13403
2018-01-12 22:48:23 +00:00
mjg
ddfd5797d3 mtx: use fcmpset to cover setting MTX_CONTESTED 2018-01-12 13:40:50 +00:00