significant source of cache line contention from vm_page_alloc(). Use
accessors and vm_page_unwire_noq() so that the mechanism can be easily
changed in the future.
Reviewed by: markj
Discussed with: kib, glebius
Tested by: pho (earlier version)
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14273
read or write of all registered realtime clocks. In the read case, the
values read are simply discarded. For writes, there's no alternative but
to actually write the current system time to the device.
to make the order of operations clearer to avoid the race condition
that was fixed in r328914. In particular, this commit corrects a
similar race that existed in the soft updates callback.
Some sleuthing through the SVN repository suggests that
bufdone_finish() was added to support XFS:
------------------------------------------------------------------------
r153192 | rodrigc | 2005-12-06 19:39:08 -0800 (Tue, 06 Dec 2005) | 13 lines
Changes imported from XFS for FreeBSD project:
- add fields to struct buf (needed by XFS)
- 3 private fields: b_fsprivate1, b_fsprivate2, b_fsprivate3
- b_pin_count, count of pinned buffer
- add new B_MANAGED flag
- add breada() function to initiate asynchronous I/O on read-ahead blocks.
- add bufdone_finish(), bpin(), bunpin_wait() functions
Patches provided by: kan
Reviewed by: phk
Silence on: arch@
------------------------------------------------------------------------
It does not appear to ever have been used for anything else. XFS was
disconnected in r241607:
------------------------------------------------------------------------
r241607 | attilio | 2012-10-16 03:04:00 -0700 (Tue, 16 Oct 2012) | 5 lines
Disconnect non-MPSAFE XFS from the build in preparation for dropping
GIANT from VFS.
This is not targeted for MFC.
------------------------------------------------------------------------
and removed entirely in r247631:
------------------------------------------------------------------------
r247631 | attilio | 2013-03-02 07:33:54 -0800 (Sat, 02 Mar 2013) | 5 lines
Garbage collect XFS bits which are now already completely disconnected
from the tree since few months.
This is not targeted for MFC.
------------------------------------------------------------------------
Since XFS support is gone, there is no reason to retain bufdone_finish().
Suggested by: Warner Losh (imp)
Discussed with: cem, kib
Tested by: Peter Holm (pho)
size of UMA zone allocation is greater than page size. In this case the
zone of zones cannot use UMA_MD_SMALL_ALLOC, and we need to postpone
switching this zone off of startup_alloc() until the VM is fully launched.
o Always supply the number of VM zones to uma_startup_count(). On machines
with UMA_MD_SMALL_ALLOC ignore it completely, unless zsize goes over a
page. In the latter case, account the VM zones in the number of
allocations from the zone of zones.
o Rewrite startup_alloc() so that it immediately switches any zone that is
already capable of doing real allocations away from itself.
In the worst-case scenario we may leak a single page here. See the comment
in uma_startup_count().
o Hardcode the call to uma_startup2() into vm_mem_init(). Otherwise some
extra SYSINITs, e.g. vm_page_init(), may sneak in before it.
o While here, remove uma_boot_pages_mtx. With the recent changes to the
boot pages calculation, we are guaranteed to use all of the boot_pages
in the early single-threaded stage.
Reported & tested by: mav
While the bug itself was serious, as we could either pass a non-busied
page to vm_pager_get_pages() or leak a busy page, it could only be
triggered under a very rare condition where the page is already inserted
into the object, but it is not valid yet.
Reviewed by: kib
MFC after: 2 weeks
o Most of the startup zones have struct uma_slab embedded into the slab,
so provide the macro UMA_SLAB_SPACE and use it instead of UMA_SLAB_SIZE
when calculating how many pages a certain kind of allocation would
require. Some zones are offpage, so we might have a positive inaccuracy.
o The keg for the zone of zones is allocated "dynamically", so we
need +1 when calculating the number of pages for kegs. [1]
o The zone of zones and the zone of kegs have an arbitrary alignment of 32,
and this also needs to be accounted for. [2]
While here, spread more comments and improve diagnostic messages.
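For reference, the new macro presumably reduces to the space left in a
one-page slab after the embedded header, and the page estimates follow
from it. A sketch of the assumed definition and usage (the real code
lives in the UMA headers and startup path):

    /* Assumed definition: usable bytes in a one-page slab once the
     * embedded struct uma_slab header is subtracted. */
    #define UMA_SLAB_SPACE  (UMA_SLAB_SIZE - sizeof(struct uma_slab))

    /* Pages needed to hold "count" items of size "size", one slab per page. */
    pages = howmany(count, UMA_SLAB_SPACE / size);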
Reported by: pho [1], jtl [2]
Previously, wiring a page would cause it to be removed from its page
queue. In the common case, unwiring causes it to be enqueued at the tail
of that page queue. This change modifies vm_page_wire() to not dequeue
the page, thus avoiding the highly contended page queue locks. Instead,
vm_page_unwire() takes care of requeuing the page as a single operation,
and the page daemon dequeues wired pages as they are encountered during
a queue scan to avoid needlessly revisiting them later. For pages in
PQ_ACTIVE we do even better, since a requeue is unnecessary.
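From a caller's point of view the wire/unwire pattern is unchanged; it
simply no longer touches the page queues on the wire side. A simplified
sketch using the vm_page interfaces of that era (not verbatim kernel
code):

    vm_page_lock(m);
    vm_page_wire(m);                /* page stays on its queue, e.g. PQ_ACTIVE */
    vm_page_unlock(m);

    /* ... use the wired page; no page queue lock is taken here ... */

    vm_page_lock(m);
    vm_page_unwire(m, PQ_ACTIVE);   /* single requeue; cheap if already in PQ_ACTIVE */
    vm_page_unlock(m);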
The change improves scalability for some common workloads. For instance,
threads wiring pages into the buffer cache no longer need to modify
global page queues, and unwiring is usually done by the bufspace thread,
so concurrency is not as much of an issue. As another example, many
sysctl handlers wire the output buffer to avoid faults on copyout, and
since the buffer is likely to be in PQ_ACTIVE, we now entirely avoid
modifying the page queue in this case.
The change also adds a block comment describing some properties of
struct vm_page's reference counters, and the busy lock.
Reviewed by: jeff
Discussed with: alc, kib
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D11943
global to per-domain state. Protect reservations with the free lock
from the domain that they belong to. Refactor to make vm domains more
of a first class object.
Reviewed by: markj, kib, gallatin
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14000
o Call uma_startup1() after initializing kmem, vmem and domains.
o Include the eight VM startup pages in the uma_startup_count() calculation.
o Account for vmem_startup() and vm_map_startup() preallocating pages.
o Account for the two extra allocations done by kmem_init() and vmem_create().
o Hardcode the place of execution of vm_radix_reserve_kva(). Using SYSINIT
allowed several other SYSINITs to sneak in before it, thus bumping the
boot page requirement.
for UMA startup.
o Introduce another stage of UMA startup, which is entered after
vm_page_startup() finishes. After this stage we don't yet enable buckets,
but we can ask VM for pages. Rename stages to meaningful names while here.
New list of stages: BOOT_COLD, BOOT_STRAPPED, BOOT_PAGEALLOC, BOOT_BUCKETS,
BOOT_RUNNING.
Enabling page alloc earlier allows us to dramatically reduce the number of
boot pages required. More importantly, the number of zones becomes
consistent across different machines, as no MD allocations are done before
the BOOT_PAGEALLOC stage. Now only UMA internal zones actually need to use
startup_alloc(); however, that may change, so vm_page_startup() passes its
need for early zones as an argument.
o Introduce the uma_startup_count() function to avoid code duplication. The
function calculates the sizes of the zone of zones and the zone of kegs,
and how many pages UMA will need to bootstrap.
It counts not only zone structures, but also kegs, slabs and hashes.
o Hide the uma_startup_foo() declarations from the public header file.
o Provide several DIAGNOSTIC printfs on boot_pages usage.
o Bugfix: when calculating the zone of zones size, use (mp_maxid + 1)
instead of mp_ncpus. Use the resulting number not only in the size
argument to zone_ctor() but also as args.size (see the sketch below).
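The corrected sizing is roughly the following (a sketch; uma_cache is the
per-CPU cache array embedded at the end of struct uma_zone):

    zsize = sizeof(struct uma_zone) +
        (mp_maxid + 1) * sizeof(struct uma_cache);
    args.size = zsize;          /* the same value is later given to zone_ctor() */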
Reviewed by: imp, gallatin (earlier version)
Differential Revision: https://reviews.freebsd.org/D14054
systems running with a heavy filesystem load. This bug was elusive to
track down because there were actually two problems. Sometimes
the in-memory check hash was wrong and sometimes the check hash
computed when doing the read was wrong. The occurrence of either
error caused a check-hash mismatch to be reported.
The first error was that the check hash in the in-memory cylinder
group was incorrect. This error was caused by the following
sequence of events:
- We read a cylinder-group buffer and the check hash is valid.
- We update its cg_time and cg_old_time which makes the in-memory
check-hash value invalid but we do not mark the cylinder group dirty.
- We do not make any other changes to the cylinder group, so we
never mark it dirty, thus do not write it out, and hence never
update the incorrect check hash for the in-memory buffer.
- Later, the buffer gets freed, but the page with the old incorrect
check hash is still in the VM cache.
- Later, we read the cylinder group again, and the first page with
the old check hash is still in the VM cache, but some other pages
are not, so we have to do a read.
- The read does not actually get the first page from disk, but rather
from the VM cache, resulting in the old check hash in the buffer.
- The value computed after doing the read does not match, causing the
error to be printed.
The fix for this problem is to only set cg_time and cg_old_time as
the cylinder group is being written to disk. This keeps the in-memory
check-hash valid unless the cylinder group has had other modifications
which will require it to be written with a new check hash calculated.
It also requires that the check hash be recalculated in the in-memory
cylinder group when it is marked clean after doing a background write.
The second problem was that the check hash computed at the end of the
read was incorrect because it was being calculated too soon in the I/O
completion path.
- When a read completes, we had the following sequence:
- bufdone()
-- b_ckhashcalc (calculates check hash)
-- bufdone_finish()
--- vfs_vmio_iodone() (replaces bogus pages with the cached ones)
- When we are reading a buffer where one or more pages are already
in memory (but not all pages, or we wouldn't be doing the read),
the I/O is done with bogus_page mapped in for the pages that exist
in the VM cache. This mapping is done to avoid corrupting the
cached pages if there is any I/O overrun. The vfs_vmio_iodone()
function is responsible for replacing the bogus_page(s) with the
cached ones. But we were calculating the check hash before the
bogus_page(s) were replaced. Hence, when we were calculating the
check hash, we were partly reading from bogus_page, which means
we calculated a bad check hash (e.g., because multiple pages have
been mapped to bogus_page, so its contents are indeterminate).
The second fix is to move the check-hash calculation from bufdone()
to bufdone_finish() after the call to vfs_vmio_iodone() so that it
computes the check hash over the correct set of pages.
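In outline, the completion path now looks like the following (a sketch
rather than the verbatim vfs_bio.c code; the check-hash hook is the
b_ckhashcalc pointer mentioned above):

    void
    bufdone_finish(struct buf *bp)
    {
            if (bp->b_flags & B_VMIO)
                    vfs_vmio_iodone(bp);     /* bogus_page entries replaced first */
            if (bp->b_ckhashcalc != NULL)
                    (*bp->b_ckhashcalc)(bp); /* check hash now covers the real pages */
            /* ... remaining completion work (wakeups, brelse/bqrelse) ... */
    }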
With these two changes, the occasional cylinder-group check-hash
errors are gone.
Submitted by: David Pfitzner <dpfitzner@netflix.com>
Reviewed by: kib
Tested by: David Pfitzner
As a followup to r328101, ignore relocation tables for ELF object
sections that are not memory resident. For modules loaded by the
loader, ignore relocation tables whose associated section was not
loaded by the loader (sh_addr is zero). For modules loaded at runtime
via kldload(2), ignore relocation tables whose associated section is
not marked with SHF_ALLOC.
Reported by: Mori Hiroki <yamori813@yahoo.co.jp>, adrian
Tested on: mips, mips64
MFC after: 1 month
Sponsored by: DARPA / AFRL
If a brand provides a header_supported hook, check it when trying to
find a brand based on a matching interpreter as well as in the final
loop for the fallback brand. Previously a brand might reject a binary
via a header_supported hook in one of the earlier loops, but still be
chosen by one of these later loops.
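The added check looks roughly like this in both of the later loops (an
illustrative sketch; "bi" is a candidate Elf_Brandinfo and "imgp" the
image being activated, as in sys/kern/imgact_elf.c, and the hook's exact
signature may differ between branches):

    if (bi->header_supported != NULL && !bi->header_supported(imgp))
            continue;       /* the brand explicitly rejects this binary */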
Reviewed by: kib
Obtained from: CheriBSD
MFC after: 2 weeks
Sponsored by: DARPA / AFRL
Differential Revision: https://reviews.freebsd.org/D13945
Stop leaking kernel pointers through these sysctls and make sure that the
padding in the structures is zeroed on allocation to avoid other leaks.
Reviewed by: gordon, kib
Obtained from: CheriBSD
MFC after: 1 week
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D13459
whose type depends on the type of vnode. Fix vn_printf() so that it
correctly identifies the name of the pointer that it is printing.
Submitted by: Andreas Longwitz <longwitz at incore.de>
MFC after: 1 week
Some ABIs have large gaps in syscall numbers. Allow gaps to be filled
as ranges of UNIMPL, with an entry like:
248-1023 AUE_NULL UNIMPL unimplemented
Reviewed by: jhb, gnn
Sponsored by: Turing Robotic Industries Inc.
Differential Revision: https://reviews.freebsd.org/D14122
gone_in(major, msg): If we're running in FreeBSD major, tell the user
this code may be deleted soon. If we're running in FreeBSD major - 1,
tell the user the code is deprecated and will be gone in major.
Otherwise say nothing.
gone_in_dev(dev, major, msg): Just like gone_in(), except use
device_printf().
New tunable / sysctl debug.obsolete_panic: 0 - don't panic, 1 - panic in
major or newer, 2 - panic in major - 1 or newer. Default: 0.
If NO_OBSOLETE_CODE is defined, then both of these turn into compile-time
errors when building for major. Add 'options NO_OBSOLETE_CODE' to the
kernel build system.
This lets us tag code that's going away so users know it will be gone,
as well as automatically manage things.
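For example, a driver or subsystem slated for removal can announce itself
like this (hypothetical driver and release numbers):

    /* From a device driver's attach routine: */
    gone_in_dev(dev, 13, "superseded by the newfoo(4) driver");

    /* From non-device code: */
    gone_in(13, "oldfs is obsolete, migrate your data");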
Differential Revision: https://reviews.freebsd.org/D13818
appeared on UFS/FFS filesystems. In some cases it was promptly followed
by a panic of "softdep_deallocate_dependencies: dangling deps". This fix
should eliminate both of these occurrences.
Submitted by: Andreas Longwitz <longwitz at incore.de>
Reviewed by: kib
Tested by: Peter Holm (pho)
PR: 225423
MFC after: 1 week
Fix a bug when the system has no CPU 0. When created, threads were
implicitly assigned to CPU 0. This had no practical effect since a real
CPU was chosen immediately by the scheduler. However, on systems without
a CPU 0, sched_ule attempted to access the scheduler queue of the
nonexistent "old" CPU when making its initial placement choice. This
caused an attempt to use illegal memory and a crash (or, more usually, a
deadlock). Fix this by assigning new threads to the BSP explicitly and by
adding some asserts to verify that this problem does not recur.
Authored by: Nathan Whitehorn <nwhitehorn@freebsd.org>
Submitted by: Wojciech Macek <wma@semihalf.com>
Obtained from: Semihalf
Differential revision: https://reviews.freebsd.org/D13932
When allocating memory through malloc(9), we always expect the amount of
memory requested to be unsigned, as a negative value would stand for
either an error or an overflow.
Unsign some values, found when considering the use of mallocarray(9), to
avoid unnecessary casting. Also consider that indexes should be of at
least the same size/type as the upper limit they are meant to index.
MFC after: 3 weeks
Uses of mallocarray(9).
The use of mallocarray(9) has dramatically increased the swap required to
build FreeBSD. This is likely caused by the allocation-size attributes,
which put extra pressure on the compiler.
Given that most of these checks are superfluous, we have to choose better
where to use mallocarray(9). We still have many uses of mallocarray(9),
but hopefully this is enough to bring swap usage back to a reasonable
level.
Reported by: wosch
PR: 225197
kernel by PHYS_TO_DMAP() as previously present on amd64, arm64, riscv, and
powerpc64. This introduces a new MI macro (PMAP_HAS_DMAP) that can be
evaluated at runtime to determine whether the architecture has a direct
map; on architectures that unconditionally lack (or have) one,
PMAP_HAS_DMAP is the constant 0 (or 1) and the compiler can remove the
conditional logic.
As part of this, implement PHYS_TO_DMAP() on sparc64 and mips64, which had
similar facilities spelled differently. 32-bit MIPS has a partial direct
map that maps poorly onto this concept and is left unchanged.
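A typical consumer then reads as follows (a sketch; map_page_transiently()
is a hypothetical stand-in for whatever fallback mapping the caller would
otherwise use):

    vm_offset_t va;

    if (PMAP_HAS_DMAP)
            va = PHYS_TO_DMAP(VM_PAGE_TO_PHYS(m));  /* free 1:1 mapping */
    else
            va = map_page_transiently(m);           /* hypothetical fallback */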
Reviewed by: kib
Suggestions from: marius, alc, kib
Runtime tested on: amd64, powerpc64, powerpc, mips64
Previously the calculations were done as if the requested region
ended at the start of the last requested page, not its end.
The problem was actually quite minor as it affected only stats and
page prefaulting, not the actual page data, and only with specific
parameters.
Reviewed by: kib (previous version)
MFC after: 2 weeks
There is a case where not all CPUs have come online. In that situation,
restart only the APs which were operational before entering KDB.
Created by: Wojciech Macek <wma@semihalf.com>
Obtained from: Semihalf
Reviewed by: nwhitehorn
Differential revision: https://reviews.freebsd.org/D13949
Sponsored by: QCM Technologies
ELF object files can contain program sections which are not supposed
to be loaded into memory (e.g. .comment). Normally the static linker
uses section flags such as SHF_ALLOC to decide which sections are
allocated to loadable
program segments in ELF binaries and shared objects (including kernels
on all architectures and kernel modules on architectures other than
amd64).
Mapping ELF object files (such as amd64 kernel modules) into memory
directly is a bit of a grey area. ELF object files are intended to be
used as inputs to the static linker. As a result, there is not a
standardized definition for what the memory layout of an ELF object
should be (none of the section headers have valid virtual memory
addresses for example).
The kernel and loader were not checking the SHF_ALLOC flag but loading
any program sections with certain types such as SHT_PROGBITS. As a
result, the kernel and loader would load into RAM some sections that
weren't marked with SHF_ALLOC such as .comment that are not loaded
into RAM for kernel modules on other architectures (which are
implemented as ELF shared objects). Aside from possibly requiring
slightly more RAM to hold a kernel module this does not affect runtime
correctness as the kernel relocates symbols based on the layout it
uses.
Debuggers such as gdb and lldb do not extract symbol tables from a
running process or kernel. Instead, they replicate the memory layout
of ELF executables and shared objects and use that to construct their
own symbol tables. For executables and shared objects this works
fine. For ELF objects the current logic in kgdb (and probably lldb
based on a simple reading) assumes that only sections with SHF_ALLOC
are memory resident when constructing a memory layout. If the
debugger constructs a different memory layout than the kernel, then it
will compute different addresses for symbols causing symbols in the
debugger to appear to have the wrong values (though the kernel itself
is working fine). The current port of mdb does not check SHF_ALLOC as
it replicates the kernel's logic in its existing kernel support.
The bfd linker sorts the sections in ELF object files such that all of
the allocated sections (sections with SHF_ALLOC) are placed first
followed by unallocated sections. As a result, when kgdb composed a
memory layout using only the allocated sections, this layout happened
to match the layout used by the kernel and loader. The lld linker
does not sort the sections in ELF object files and mixes allocated and
unallocated sections. This resulted in kgdb composing a different
memory layout than the kernel and loader.
We could either patch kgdb (and possibly in the future lldb) to use
custom handling when generating memory layouts for kernel modules that
are ELF objects, or we could change the kernel and loader to check
SHF_ALLOC. I chose the latter as I feel we shouldn't be loading
things into RAM that the module won't use. This should mostly be a
NOP when linking with bfd but will allow the existing kgdb to work
with amd64 kernel modules linked with lld.
Note that we only require SHF_ALLOC for "program" sections for types
like SHT_PROGBITS and SHT_NOBITS. Other section types such as symbol
tables, string tables, and relocations must also be loaded and are not
marked with SHF_ALLOC.
Reported by: np
Reviewed by: kib, emaste
MFC after: 1 month
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D13926
pathconf(2) and fpathconf(2) both return a long. The kern_[f]pathconf()
functions now accept a pointer to a long value rather than modifying
td_retval directly. Instead, the system calls explicitly store the
returned long value in td_retval[0].
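The resulting syscall-side pattern is roughly the following (a sketch;
argument names follow the usual fpathconf_args layout):

    int
    sys_fpathconf(struct thread *td, struct fpathconf_args *uap)
    {
            long value;
            int error;

            error = kern_fpathconf(td, uap->fd, uap->name, &value);
            if (error == 0)
                    td->td_retval[0] = value;   /* return the long to userspace */
            return (error);
    }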
Requested by: bde
Reviewed by: kib
Sponsored by: Chelsio Communications
Focus on code where we are doing multiplications within malloc(9). None of
these are likely to overflow; however, the change is still useful as some
static checkers can benefit from the allocation attributes we use for
mallocarray.
This initial sweep only covers malloc(9) calls with M_NOWAIT. There is no
good reason for this, but I started doing the changes before r327796, and
at that time it was convenient to make sure the surrounding code could
handle NULL values.
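A typical conversion from this sweep looks like the following (hypothetical
softc and count names; mallocarray(9) takes nmemb, size, type, flags):

    /* Before: open-coded multiplication inside malloc(9). */
    sc->slots = malloc(sc->nslots * sizeof(*sc->slots), M_DEVBUF,
        M_NOWAIT | M_ZERO);

    /* After: mallocarray(9) fails cleanly if the multiplication would overflow. */
    sc->slots = mallocarray(sc->nslots, sizeof(*sc->slots), M_DEVBUF,
        M_NOWAIT | M_ZERO);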
X-Differential revision: https://reviews.freebsd.org/D13837
RTC clock hardware frequently uses BCD numbers. Currently the low-level
bcd2bin() and bin2bcd() functions will KASSERT if given out-of-range BCD
values. Every RTC driver must implement its own code for validating the
unreliable data coming from the hardware to avoid a potential kernel panic.
This change introduces two new functions, clock_bcd_to_ts() and
clock_ts_to_bcd(). The former validates its inputs and returns EINVAL if any
values are out of range. The latter guarantees the returned data will be
valid BCD in a known format (4-digit years, etc).
A new bcd_clocktime structure is used with the new functions. It is similar
to the original clocktime structure, but defines the fields holding BCD
values as uint8_t (uint16_t for year), and adds a PM flag for handling hours
using AM/PM mode.
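An RTC driver's gettime path can then be reduced to something like the
following sketch (the exact prototypes live in sys/sys/clock.h; the final
bool argument, selecting AM/PM interpretation, and the register-reading
helper are assumptions here):

    static int
    foo_rtc_gettime(device_t dev, struct timespec *ts)
    {
            struct bcd_clocktime bct;

            /* Fill bct with the raw BCD register values from the chip
             * (foo_read_bcd_regs() is a hypothetical driver helper). */
            foo_read_bcd_regs(dev, &bct);

            /* Validation happens here: bad BCD from the hardware comes
             * back as EINVAL instead of tripping a KASSERT. */
            return (clock_bcd_to_ts(&bct, ts, false));
    }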
PR: 224813
Differential Revision: https://reviews.freebsd.org/D13730 (no reviewers)
domains can be done by the _domain() API variants. UMA also supports a
first-touch policy via the NUMA zone flag.
The slab layer is now segregated by VM domains and is precise. It handles
iteration for round-robin directly. The per-CPU cache layer remains a mix
of domains according to where memory is allocated and freed. Well-behaved
clients can achieve perfect locality with no performance penalty.
The direct domain allocation functions have to visit the slab layer and
so require per-zone locks, which come at some expense.
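Callers that want explicit placement use the new _domain() variants, for
example (a sketch; udata is passed as NULL as with the plain variants):

    /* Allocate and free an item from/to a specific VM domain. */
    item = uma_zalloc_domain(zone, NULL, domain, M_WAITOK);
    /* ... use the item ... */
    uma_zfree_domain(zone, item, NULL);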
Reviewed by: Attilio (a slightly older version)
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
reservations by giving each memory domain its own KVA space in vmem that
is naturally aligned on superpage boundaries.
Reviewed by: alc, markj, kib (some objections)
Sponsored by: Netflix, Dell/EMC Isilon
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D13289
userspace to control NUMA policy administratively and programmatically.
Implement domainset-based iterators in the page layer.
Remove the now-legacy numa_* syscalls.
Clean up some header pollution created by having seq.h in proc.h.
Reviewed by: markj, kib
Discussed with: alc
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13403