This is a minor cosmetic change - the users are more likely to want to
increase (rather than decrease) default kernel stack size,
which is already 4 pages on amd64.
MFC after: 4 days
explicit process at fork trampoline path instead of eventhadler(schedtail)
invocation for each child process.
Remove eventhandler(schedtail) code and change linux ABI to use newly added
sysvec method.
While here replace explicit comparing of module sysentvec structure with the
newly created process sysentvec to detect the linux ABI.
Discussed with: kib
MFC after: 2 Week
on processors that support 1 GB pages. Specifically, if the end of physical
memory is not aligned to a 1 GB page boundary, then map the residual
physical memory with multiple 2 MB page mappings rather than a single 1 GB
page mapping. When a 1 GB page mapping is used for this residual memory,
access to the memory is slower than when multiple 2 MB page mappings are
used. (I suspect that the reason for this slowdown is that the TLB is
actually being loaded with 4 KB page mappings for the residual memory.)
X-MFC after: r214425
White list sysarch calls allowed in capability mode; arguably, there
should be some link between the capability mode model and the privilege
model here. Sysarch is a morass similar to ioctl, in many senses.
Submitted by: anderson
Discussed with: benl, kris, pjd
Sponsored by: Google, Inc.
Obtained from: Capsicum Project
MFC after: 3 months
MI ucontext_t and x86 MD parts.
Kernel allocates the structures on the stack, and not clearing
reserved fields and paddings causes leakage.
Noted and discussed with: bde
MFC after: 2 weeks
should_yield(). Use this in various places. Encapsulate the common
case of check-and-yield into a new function maybe_yield().
Change several checks for a magic number of iterations to use
should_yield() instead.
MFC after: 1 week
be used by linuxolator itself.
Move linux_wait4() to MD path as it requires native struct
rusage translation to struct l_rusage on linux32/amd64.
MFC after: 1 Month.
members, thus making a signed extension of 32 bit register
context. If the register is not touched in usermode between
return from signal and next syscall entry, the sign-extension
part of 64bit register is not cleared, causing
linux32_fetch_syscall_args() to read wrong values.
Use unsigned type for the registers in the linux sigcontext.
Reported by: Jacob Frelinger <jacob.frelinger duke edu>, arundel
In collaboration with: dchagin
MFC after: 1 week
- Only check largs->num against max_ldt_segment on amd64 for I386_SET_LDT
when descriptors are provided. Specifically, allow the 'start == 0'
and 'num == 0' special case used to free all LDT entries that previously
failed with EINVAL.
Submitted by: clang via rdivacky (some of 1)
Reviewed by: kib
Compile sys/dev/mem/memutil.c for all supported platforms and remove now
unnecessary dev_mem_md_init(). Consistently define mem_range_softc from
mem.c for all platforms. Add missing #include guards for machine/memdev.h
and sys/memrange.h. Clean up some nearby style(9) nits.
MFC after: 1 month
started to execute, it seems that the corresponding ISR bit in the "old"
local APIC can be cleared. This causes the local APIC interrupt routine
to fail to find an interrupt to service. Rather than panic'ing in this
case, simply return from the interrupt without sending an EOI to the
local APIC. If there are any other pending interrupts in other ISR
registers, the local APIC will assert a new interrupt.
Tested by: steve
setting SV_SHP flag and providing pointer to the vm object and mapping
address. Provide simple allocator to carve space in the page, tailored
to put the code with alignment restrictions.
Enable shared page use for amd64, both native and 32bit FreeBSD
binaries. Page is private mapped at the top of the user address
space, moving a start of the stack one page down. Move signal
trampoline code from the top of the stack to the shared page.
Reviewed by: alc
architecture macros (__mips_n64, __powerpc64__) when 64 bit types (and
corresponding macros) are different from 32 bit. [1]
Correct the type of INT64_MIN, INT64_MAX and UINT64_MAX.
Define (U)INTMAX_C as an alias for (U)INT64_C matching the type definition
for (u)intmax_t. Do this on all architectures for consistency.
Suggested by: bde [1]
Approved by: kib (mentor)
On some architectures UCHAR_MAX and USHRT_MAX had type unsigned int.
However, lacking integer suffixes for types smaller than int, their type
should correspond to that of an object of type unsigned char (or short)
when used in an expression with objects of type int. In that case unsigned
char (short) are promoted to int (i.e. signed) so the type of UCHAR_MAX and
USHRT_MAX should also be int.
Where MIN/MAX constants implicitly have the correct type the suffix has
been removed.
While here, correct some comments.
Reviewed by: bde
Approved by: kib (mentor)
for manipulating pcb_flags. These inline functions are very similar to
atomic_set_char(9) and atomic_clear_char(9) but without unnecessary LOCK
prefix for SMP. Add comments about the rationale[1]. Use these functions
wherever possible. Although there are some places where it is not strictly
necessary (e.g., a PCB is copied to create a new PCB), it is done across
the board for sake of consistency. Turn pcb_full_iret into a PCB flag as
it is safe now. Move rarely used fields before pcb_flags and reduce size
of pcb_flags to one byte. Fix some style(9) nits in pcb.h while I am in
the neighborhood.
Reviewed by: kib
Submitted by: kib[1]
MFC after: 2 months
the original amd64 and i386 headers with stubs.
Rename (AMD64|I386)_BUS_SPACE_* to X86_BUS_SPACE_* everywhere.
Reviewed by: imp (previous version), jhb
Approved by: kib (mentor)
function always returned the nominal frequency instead of current frequency
because we use RDTSC instruction to calculate difference in CPU ticks, which
is supposedly constant for the case. Now we support cpu_get_nominal_mhz()
for the case, instead. Note it should be just enough for most usage cases
because cpu_est_clockrate() is often times abused to find maximum frequency
of the processor.
its similar disabling of adaptive mutexes and rwlocks. The existing
comment on why this is the case also applies to sx locks.
MFC after: 3 days
Discussed with: attilio
mark user FPU context initialized, if current context is user context.
It was reversed in r215865, by inadequate change of this code fragment
to a call to fpuuserinited()/npxuserinited().
The issue is only relevant for in-kernel users of FPU.
Reported by: Jan Henrik Sylvester <me janh de>, Mike Tancsa <mike sentex net>
Tested by: Mike Tancsa
MFC after: 3 days
to support PV drivers (such as xenpci), and non-adptive locking (along
with a comment about why).
This change eliminates the synchronisation problem between GENERIC and
XENHVM, which had become severely rotted in HEAD, and in 8-STABLE
included non-production kernel debugging features such as WITNESS.
However, it comes at the cost of enabling devices and options that may
not be present under Xen (such as random ethernet cards). For now, opt
for a simpler kernel configuration file rather than using nooptions/
nodevice to enumerate and eliminate them. This leads to a somewhat
larger XENHVM kernel.
This is an MFC candidate for 8-STABLE before 8.2, in order to provide
a production-worthy XENHVM kernel configuration for amd64.
Discussed with: gibbs, cperciva
Reported by: Piete Brooks <Piete.Brooks at cl.cam.ac.uk>
Sponsored by: DARPA, AFRL
MFC after: 3 days
while on i386 we have MAX_BPAGES=512. Implement this difference via
'#ifdef __i386__'.
With this commit, the i386 and amd64 busdma_machdep.c files become
identical; they will soon be replaced by a single file under sys/x86.
no-op currently, since FreeBSD/amd64 doesn't have (paravirtualized) Xen
support, but if/when that support is ever added we'll want this, and
until then it's harmless.
In exec_linux_setregs(), use locally cached pointer to pcb to set
pcb_full_iret.
In set_regs(), note that full return is needed when code that sets
segment registers is enabled.
MFC after: 1 week
execve(2). Note that ia32 binaries already handle this properly,
since ia32_setregs() resets td_retval[1], but not exec_setregs().
We still do not conform to the amd64 ABI specification, since %rsp
on the image startup is not aligned to 16 bytes.
PR: amd64/124134
Discussed with: Petr Salinger <Petr.Salinger seznam cz>
(who convinced me that there is indeed several bugs)
MFC after: 1 week
Passing a count of zero on i386 and amd64 for [I386|AMD64]_BUS_SPACE_MEM
causes a crash/hang since the 'loop' instruction decrements the counter
before checking if it's zero.
PR: kern/80980
Discussed with: jhb
functions, they are unused. Remove 'user' from npxgetuserregs()
etc. names.
For {npx,fpu}{get,set}regs(), always use pcb->pcb_user_save for FPU
context storage. This eliminates the need for ugly copying with
overwrite of the newly added and reserved fields in ucontext on i386
to satisfy alignment requirements for fpusave() and fpurstor().
pc98 version was copied from i386.
Suggested and reviewed by: bde
Tested by: pho (i386 and amd64)
MFC after: 1 week
assembly instruction "movw %rcx,2(%rax)" to "movw %cx,2(%rax)", since
the intent was to move 16 bits of data, in this case.
Found by: clang
Reviewed by: kib
Flushing TLBs is required to ensure cache coherency according to the AMD64
architecture manual. Flushing caches is only required when changing from a
cacheable memory type (WB, WP, or WT) to an uncacheable type (WC, UC, or
UC-). Since this function is only used once per processor during startup,
there is no need to take any shortcuts.
- Leave PAT indices 0-3 at the default of WB, WT, UC-, and UC. Program 5 as
WP (from default WT) and 6 as WC (from default UC-). Leave 4 and 7 at the
default of WB and UC. This is to avoid transition from a cacheable memory
type to an uncacheable type to minimize possible cache incoherency. Since
we perform flushing caches and TLBs now, this change may not be necessary
any more but we do not want to take any chances.
- Remove Apple hardware specific quirks. With the above changes, it seems
this hack is no longer needed.
- Improve pmap_cache_bits() with an array to map PAT memory type to index.
This array is initialized early from pmap_init_pat(), so that we do not need
to handle special cases in the function any more. Now this function is
identical on both amd64 and i386.
Reviewed by: jhb
Tested by: RM (reuf_m at hotmail dot com)
Ryszard Czekaj (rychoo at freeshell dot net)
army.of.root (army dot of dot root at googlemail dot com)
MFC after: 3 days
These MSRs can be used to determine actual (average) performance as
compared to a maximum defined performance.
Availability of these MSRs is indicated by bit0 in CPUID.6.ECX on both
Intel and AMD processors.
MFC after: 5 days
It seems that this MSR has been available in a range of AMD processors
families for quite a while now.
Note1: not all AMD MSRs that are found in amd64 specialreg.h are also in
the i386 version.
Note2: perhaps some additional name component is needed to distinguish
AMD-specific MSRs.
MFC after: 5 days
no noticeable change because we enable caches before we enter here for both
BSP and AP cases. Remove another pointless optimization for CR4.PGE bit
while I am here.
code but probably it only worked by chance because modifying CR4.PGE bit
causes invlidation of entire TLBs. Since these are very rare events, this
micro-optimization seems useless.
Reviewed by: jhb
The ports/Mk/bsd.port.mk uses sys/param.h to fetch osrel, and cannot
grok several constants with the prefix.
Reported and tested by: swell.k gmail com
MFC after: 1 week
After KVA space was increased to 512GB on amd64 it became impractical
to use PTEs as entries in the minidump map of dumped pages, because size
of that map alone would already be 1GB.
Instead, we now use PDEs as page map entries and employ two stage lookup
in libkvm: virtual address -> PDE -> PTE -> physical address. PTEs are
now dumped as regular pages. Fixed page map size now is 2MB.
libkvm keeps support for accessing amd64 minidumps of version 1.
Support for 1GB pages is added.
Many thanks to Alan Cox for his guidance, numerous reviews, suggestions,
enhancments and corrections.
Reviewed by: alc [kernel part]
MFC after: 15 days
contents of the ones that were not empty were stale and unused.
- Now that <machine/mutex.h> no longer exists, there is no need to allow it
to override various helper macros in <sys/mutex.h>.
- Rename various helper macros for low-level operations on mutexes to live
in the _mtx_* or __mtx_* namespaces. While here, change the names to more
closely match the real API functions they are backing.
- Drop support for including <sys/mutex.h> in assembly source files.
Suggested by: bde (1, 2)
work properly with single-stepping in a kernel debugger. Specifically,
these routines have always disabled interrupts before increasing the nesting
count and restored the prior state of interrupts after decreasing the nesting
count to avoid problems with a nested interrupt not disabling interrupts
when acquiring a spin lock. However, trap interrupts for single-stepping
can still occur even when interrupts are disabled. Now the saved state of
interrupts is not saved in the thread until after interrupts have been
disabled and the nesting count has been increased. Similarly, the saved
state from the thread cannot be read once the nesting count has been
decreased to zero. To fix this, use temporary variables to store interrupt
state and shuffle it between the thread's MD area and the appropriate
registers.
In cooperation with: bde
MFC after: 1 month
This could lead to a division by zero if hardware is multi-core and/or
multi-threaded, but for some (quite unusual) reason FreeBSD sees only
one logical processor. This could happen, for example, if neither MADT
nor MP Table are presented by BIOS.
Also:
- assert in topo_probe_0x4 that BSP is accounted for
- neither cpu_cores nor cpu_logical should be zero after successful
probing, so either being zero is an indication of failed probing
Reported by: vwe, Dan Allen <danallen46@airwired.net>
Tested by: Dan Allen <danallen46@airwired.net>
MFC after: 3 days
when routing interrupts instead of cpu_apic_ids[0] since cpu_apic_ids[]
is only populated for multiple-CPU machines. This also matches what the
code does when SMP is not enabled.
PR: bin/151616
Tested by: "Damian S. Kolodziejczyk" damkol | gmail
Submitted by: avg
MFC after: 1 week
physical page mapping should span two or more MTRRs of different types.
Add a pmap function, pmap_demote_DMAP(), by which the MTRR module can
ensure that the direct map region doesn't have such a mapping.
[2] Fix a couple of nearby style errors in amd64_mrset().
[3] Re-enable the use of 1GB page mappings for implementing the direct
map. (See also r197580 and r213897.)
Tested by: kib@ on a Westmere-family processor [3]
MFC after: 3 weeks
use pmap_extract() rather than pmap_kextract() on direct map addresses.
Thus, pmap_extract() needs to be able to deal with 1GB page mappings if
we are to use 1GB page mappings for the direct map. (See r197580.)
kernel of exactly the same __FreeBSD_version as the headers module was
compiled against.
Mark our in-tree ABI emulators with DECLARE_MODULE_TIED. The modules
use kernel interfaces that the Release Engineering Team feel are not
stable enough to guarantee they will not change during the life cycle
of a STABLE branch. In particular, the layout of struct sysentvec is
declared to be not part of the STABLE KBI.
Discussed with: bz, rwatson
Approved by: re (bz, kensmith)
MFC after: 2 weeks
trap frame when trap initiated kdb entry, incorrectly calculated the
value of %rsp for trapped thread.
According to Intel(R) 64 and IA-32 Architectures Software Developer's Manual
Volume 3A: System Programming Guide, Part 1, rev. 035, 6.14.2 64-Bit Mode
Stack Frame, "64-bit mode ... pushes SS:RSP unconditionally, rather than
only on a CPL change."
Even assuming the conditional push of the %ss:%rsp, the calculation
was still wrong because sizeof(tf_ss) + sizeof(tf_rsp) == 16 on amd64.
Always use the tf_rsp from trap frame. The change supposedly fixes
stepping when using kgdb backend for kdb.
Submitted by: Zhouyi Zhou <zhouzhouyi gmail com>
PR: amd64/151167
Reviewed by: avg
MFC after: 1 week
This patch is significantly based on previous work by jkim.
List of changes:
- added comments that describe topology uniformity assumption
- added reference to Intel Processor Topology Enumeration article
- documented a few global variables that describe topology
- retired weirdly set and used logical_cpus variable
- changed fallback code for mp_ncpus > 0 case, so that CPUs are treated
as being different packages rather than cores in a single package
- moved AMD-specific code to topo_probe_amd [jkim]
- in topo_probe_0x4() follow Intel-prescribed procedure of deriving SMT
and core masks and match APIC IDs against those masks [started by
jkim]
- in topo_probe_0x4() drop code for double-checking topology parameters
by looking at L1 cache properties [jkim]
- in topo_probe_0xb() add fallback path to topo_probe_0x4() as
prescribed by Intel [jkim]
Still to do:
- prepare for upcoming AMD CPUs by using new mechanism of uniform
topology description [pointed by jkim]
- probe cache topology in addition to CPU topology and probably use that
for scheduler affinity topology; e.g. Core2 Duo and Athlon II X2 have
the same CPU topology, but Athlon cores do not share L2 cache while
Core2's do (no L3 cache in both cases)
- think of supporting non-uniform topologies if they are ever
implemented for platforms in question
- think how to better described old HTT vs new HTT distinction, HTT vs
SMT can be confusing as SMT is a generic term
- more robust code for marking CPUs as "logical" and/or "hyperthreaded",
use HTT mask instead of modulo operation
- correct support for halting logical and/or hyperthreaded CPUs, let
scheduler know that it shouldn't schedule any threads on those CPUs
PR: kern/145385 (related)
In collaboration with: jkim
Tested by: Sergey Kandaurov <pluknet@gmail.com>,
Jeremy Chadwick <freebsd@jdc.parodius.com>,
Chip Camden <sterling@camdensoftware.com>,
Steve Wills <steve@mouf.net>,
Olivier Smedts <olivier@gid0.org>,
Florian Smeets <flo@smeets.im>
MFC after: 1 month
The check for alignment should be made against the physical address and not
the virtual address that maps it.
Sponsored by: NetApp
Submitted by: Will McGovern (will at netapp dot com)
Reviewed by: mjacob, jhb
KVA space is abundant on amd64, so there is no reason to limit kernel
map size to a fraction of available physical memory. In fact, it could
be larger than physical memory.
This should help with memory auto-tuning for ZFS and shouldn't affect
other workloads.
This should reduce number of circumstances for "kmem_map too small"
panics, but probably won't eliminate them entirely due to potential kmem
fragmentation.
In fact, you might want/need to limit maximum ARC size after this commit
if you need to resrve more memory for applications.
This change was discussed on arch@ and nobody said "don't do it".
MFC after: 6 weeks
The main goal of this is to generate timer interrupts only when there is
some work to do. When CPU is busy interrupts are generating at full rate
of hz + stathz to fullfill scheduler and timekeeping requirements. But
when CPU is idle, only minimum set of interrupts (down to 8 interrupts per
second per CPU now), needed to handle scheduled callouts is executed.
This allows significantly increase idle CPU sleep time, increasing effect
of static power-saving technologies. Also it should reduce host CPU load
on virtualized systems, when guest system is idle.
There is set of tunables, also available as writable sysctls, allowing to
control wanted event timer subsystem behavior:
kern.eventtimer.timer - allows to choose event timer hardware to use.
On x86 there is up to 4 different kinds of timers. Depending on whether
chosen timer is per-CPU, behavior of other options slightly differs.
kern.eventtimer.periodic - allows to choose periodic and one-shot
operation mode. In periodic mode, current timer hardware taken as the only
source of time for time events. This mode is quite alike to previous kernel
behavior. One-shot mode instead uses currently selected time counter
hardware to schedule all needed events one by one and program timer to
generate interrupt exactly in specified time. Default value depends of
chosen timer capabilities, but one-shot mode is preferred, until other is
forced by user or hardware.
kern.eventtimer.singlemul - in periodic mode specifies how much times
higher timer frequency should be, to not strictly alias hardclock() and
statclock() events. Default values are 2 and 4, but could be reduced to 1
if extra interrupts are unwanted.
kern.eventtimer.idletick - makes each CPU to receive every timer interrupt
independently of whether they busy or not. By default this options is
disabled. If chosen timer is per-CPU and runs in periodic mode, this option
has no effect - all interrupts are generating.
As soon as this patch modifies cpu_idle() on some platforms, I have also
refactored one on x86. Now it makes use of MONITOR/MWAIT instrunctions
(if supported) under high sleep/wakeup rate, as fast alternative to other
methods. It allows SMP scheduler to wake up sleeping CPUs much faster
without using IPI, significantly increasing performance on some highly
task-switching loads.
Tested by: many (on i386, amd64, sparc64 and powerc)
H/W donated by: Gheorghe Ardelean
Sponsored by: iXsystems, Inc.
Bring in a driver for the LSI Logic MPT2 6Gb SAS controllers.
This driver supports basic I/O, and works with SAS and SATA drives and
expanders.
Basic error recovery works (i.e. timeouts and aborts) as well.
Integrated RAID isn't supported yet, and there are some known bugs.
So this isn't ready for production use, but is certainly ready for
testing and additional development. For the moment, new commits to this
driver should go into the FreeBSD Perforce repository first
(//depot/projects/mps/...) and then get merged into -current once
they've been vetted.
This has only been added to the amd64 GENERIC, since that is the only
architecture I have tested this driver with.
Submitted by: scottl
Discussed with: imp, gibbs, will
Sponsored by: Yahoo, Spectra Logic Corporation
This reflects actual type used to store and compare child device orders.
Change is mostly done via a Coccinelle (soon to be devel/coccinelle)
semantic patch.
Verified by LINT+modules kernel builds.
Followup to: r212213
MFC after: 10 days
In particular, provide pagesize and pagesizes array, the canary value
for SSP use, number of host CPUs and osreldate.
Tested by: marius (sparc64)
MFC after: 1 month
It is more appropriate in this context because TSC MSR is reset to zero
when the CPU is restarted from S3 and above. Move acpi_resync_clock() back
to where it was before r211202. It does not make a difference any more.
As long as interrupts are disabled and there is not explicit call to
sched_add() there can't be any preemption there, thus the calls may be
consistent.
Reported by: kib, jhb
are served via an interrupt gate.
However, that doesn't explicitly prevent preemption and thread
migration thus scheduler pinning may be necessary in some handlers.
Fix that.
Tested by: gianni
MFC after: 1 month
While there, also fix some places assuming cpu type is 'int' while
u_int is really meant.
Note: this will also fix some possible races in per-cpu data accessings
to be addressed in further commits.
In collabouration with: Yahoo! Incorporated (via sbruno and peter)
Tested by: gianni
MFC after: 1 month
IPI to a specific CPU by its cpuid. Replace calls to ipi_selected() that
constructed a mask for a single CPU with calls to ipi_cpu() instead. This
will matter more in the future when we transition from cpumask_t to
cpuset_t for CPU masks in which case building a CPU mask is more expensive.
Submitted by: peter, sbruno
Reviewed by: rookie
Obtained from: Yahoo! (x86)
MFC after: 1 month
savectx() is only used for panic dump (dumppcb) and kdb (stoppcbs). Thus,
saving additional information does not hurt and it may be even beneficial.
Unfortunately, struct pcb has grown larger to accommodate more data.
Move 512-byte long pcb_user_save to the end of struct pcb while I am here.
- savectx() now saves FPU state unconditionally and copy it to the PCB of
FPU thread if necessary. This gives panic dump and kdb a chance to take
a look at the current FPU state even if the FPU is "supposedly" not used.
- Resuming CPU now unconditionally reinitializes FPU. If the saved FPU
state was irrelevant, it could be in an unknown state.
Suggested by: bde [1]
Xeon 5500/5600 series:
- Utilize IA32_TEMPERATURE_TARGET, a.k.a. Tj(target) in place
of Tj(max) when a sane value is available, as documented
in Intel whitepaper "CPU Monitoring With DTS/PECI"; (By sane
value we mean 70C - 100C for now);
- Print the probe results when booting verbose;
- Replace cpu_mask with cpu_stepping;
- Use CPUID_* macros instead of rolling our own.
Approved by: rpaulo
MFC after: 1 month
from the inline assembly. This allows the compiler to cache invocations of
curthread since it's value does not change within a thread context.
Submitted by: zec (i386)
MFC after: 1 week
for crash dump (dumppcb) and kdb (stoppcbs). For both cases, there cannot
have a valid pointer in pcb_save. This should restore the previous
behaviour.
zones for each malloc bucket size. The purpose is to isolate
different malloc types into hash classes, so that any buffer overruns
or use-after-free will usually only affect memory from malloc types in
that hash class. This is purely a debugging tool; by varying the hash
function and tracking which hash class was corrupted, the intersection
of the hash classes from each instance will point to a single malloc
type that is being misused. At this point inspection or memguard(9)
can be used to catch the offending code.
Add MALLOC_DEBUG_MAXZONES=8 to -current GENERIC configuration files.
The suggestion to have this on by default came from Kostik Belousov on
-arch.
This code is based on work by Ron Steinke at Isilon Systems.
Reviewed by: -arch (mostly silence)
Reviewed by: zml
Approved by: zml (mentor)
now it uses a very dumb first-touch allocation policy. This will change in
the future.
- Each architecture indicates the maximum number of supported memory domains
via a new VM_NDOMAIN parameter in <machine/vmparam.h>.
- Each cpu now has a PCPU_GET(domain) member to indicate the memory domain
a CPU belongs to. Domain values are dense and numbered from 0.
- When a platform supports multiple domains, the default freelist
(VM_FREELIST_DEFAULT) is split up into N freelists, one for each domain.
The MD code is required to populate an array of mem_affinity structures.
Each entry in the array defines a range of memory (start and end) and a
domain for the range. Multiple entries may be present for a single
domain. The list is terminated by an entry where all fields are zero.
This array of structures is used to split up phys_avail[] regions that
fall in VM_FREELIST_DEFAULT into per-domain freelists.
- Each memory domain has a separate lookup-array of freelists that is
used when fulfulling a physical memory allocation. Right now the
per-domain freelists are listed in a round-robin order for each domain.
In the future a table such as the ACPI SLIT table may be used to order
the per-domain lookup lists based on the penalty for each memory domain
relative to a specific domain. The lookup lists may be examined via a
new vm.phys.lookup_lists sysctl.
- The first-touch policy is implemented by using PCPU_GET(domain) to
pick a lookup list when allocating memory.
Reviewed by: alc
name of 32bit sibling architecture instead of the host one. Do the
same for hw.machine on amd64.
Add a safety belt debug.adaptive_machine_arch sysctl, to turn the
substitution off.
Reviewed by: jhb, nwhitehorn
MFC after: 2 weeks
systems with PnP/ACPI not reporting i8254 timer. In some cases it can be
fatal, as i8254 can be the only available time counter hardware. From other
side we are now heavily depend on i8254 timer and till the last time it's
init/usage was completely hardcoded. So this change just restores previous
behavior in more regular fashion.
instead of calling pmap_invalidate_page() for each PG_G mapping, call
pmap_invalidate_range() for each range of PG_G mappings. In addition,
eliminate a redundant call to pmap_invalidate_page(). Both
pmap_remove_pte() and pmap_remove_page() called pmap_invalidate_page()
when the mapping had the PG_G attribute. Now, only pmap_remove_page()
calls pmap_invalidate_page(). Altogether, these changes eliminate 53%
of the TLB shootdowns for a "buildworld" on a ZFS file system. On
FFS, the reduction is 3%.
MFC after: 6 weeks
into the pcb before disabling watchpoints. Otherwise, when the
thread is restored on a processor, watchpoints are still disabled.
Submitted by: Tijl Coosemans <tijl coosemans org>
(I would be much happier if Tijl commited this himself)
MFC after: 1 week
Specifically, teach pmap_qenter() to recognize the case when it is being
asked to replace a mapping with the very same mapping and not generate
a shootdown. Unfortunately, the buffer cache commonly passes an entire
buffer to pmap_qenter() when only a subset of the mappings are changing.
For the extension of buffers in allocbuf() this was resulting in
unnecessary shootdowns. The addition of new pages to the end of the
buffer need not and did not trigger a shootdown, but overwriting the
initial mappings with the very same mappings was seen as a change that
necessitated a shootdown. With this change, that is no longer so.
For a "buildworld" on amd64, this change eliminates 14-15% of the
pmap_invalidate_range() shootdowns, and about 4% of the overall
shootdowns.
MFC after: 3 weeks
- change the type of pm_active to cpumask_t, which it is;
- in pmap_remove_pages(), compare with PCPU(curpmap), instead of
dereferencing the long chain of pointers [1].
For amd64 pmap, remove the unneeded checks for validity of curpmap
in pmap_activate(), since curpmap should be always valid after
r209789.
Submitted by: alc [1]
Reviewed by: alc
MFC after: 3 weeks
do on i386. The consequences of not doing so on amd64 became apparent
with the introduction of the COUNT_IPIS and COUNT_XINVLTLB_HITS
options. Specifically, single-threaded applications were generating
unnecessary IPIs to shoot-down the TLB on other processors. However,
this is clearly nonsensical because a single-threaded application is
only running on the current processor. The reason that this happens
is that pmap_activate() is unable to properly update the old pmap's
field "pm_active" without the correct "curpmap". So, in effect, stale
bits in "pm_active" were leading pmap_protect(), pmap_remove(),
pmap_remove_pages(), etc. to flush the TLB contents on some arbitrary
processor that wasn't even running the same application.
Reviewed by: kib
MFC after: 3 weeks
ABI specifies the DF should be zero, and newer compilers do not clear
DF before using DF-sensitive instructions.
The DF clearing for signal handlers was done some time ago.
MFC after: 1 week
get_fpcontext(), and npxsetuserregs() for set_fpcontext). Also,
note that usercontext is not initialized anymore in fpstate_drop().
Systematically replace references to npxgetregs() and npxsetregs()
by npxgetuserregs() and npxsetuserregs() in comments.
Noted by: bde
writing event timer drivers, for choosing best possible drivers by machine
independent code and for operating them to supply kernel with hardclock(),
statclock() and profclock() events in unified fashion on various hardware.
Infrastructure provides support for both per-CPU (independent for every CPU
core) and global timers in periodic and one-shot modes. MI management code
at this moment uses only periodic mode, but one-shot mode use planned for
later, as part of tickless kernel project.
For this moment infrastructure used on i386 and amd64 architectures. Other
archs are welcome to follow, while their current operation should not be
affected.
This patch updates existing drivers (i8254, RTC and LAPIC) for the new
order, and adds event timers support into the HPET driver. These drivers
have different capabilities:
LAPIC - per-CPU timer, supports periodic and one-shot operation, may
freeze in C3 state, calibrated on first use, so may be not exactly precise.
HPET - depending on hardware can work as per-CPU or global, supports
periodic and one-shot operation, usually provides several event timers.
i8254 - global, limited to periodic mode, because same hardware used also
as time counter.
RTC - global, supports only periodic mode, set of frequencies in Hz
limited by powers of 2.
Depending on hardware capabilities, drivers preferred in following orders,
either LAPIC, HPETs, i8254, RTC or HPETs, LAPIC, i8254, RTC.
User may explicitly specify wanted timers via loader tunables or sysctls:
kern.eventtimer.timer1 and kern.eventtimer.timer2.
If requested driver is unavailable or unoperational, system will try to
replace it. If no more timers available or "NONE" specified for second,
system will operate using only one timer, multiplying it's frequency by few
times and uing respective dividers to honor hz, stathz and profhz values,
set during initial setup.
This information can be very valuable for CPU sleep-time (and respectively
idle power consumption) optimization.
Add counters for timer-related IPIs.
Reviewed by: jhb@ (previous version)
the consistency between PCPU fpcurthread and the state of the FPU.
Explicitely assert that the calling conventions for fpudrop() are
adhered too. In cpu_thread_exit(), add missed critical section entrance.
Reviewed by: bde
Tested by: pho
MFC after: 1 month
allow pmap_enter() to be performed on an unmanaged page that doesn't have
VPO_BUSY set. Having VPO_BUSY set really only matters for managed pages.
(See, for example, pmap_remove_write().)
in Linux emulation layer. Linux seems to only require that pos is
page-aligned, but otherwise ignores it. Default FreeBSD mmap parameter
checking is too strict to allow some Linux binaries to run. tsMuxeR is
one example of such a binary.
Discussed with: jhb
MFC after: 1 week
PG_REFERENCED changes in vm_pageout_object_deactivate_pages().
Simplify this function's inner loop using TAILQ_FOREACH(), and shorten
some of its overly long lines. Update a stale comment.
Assert that PG_REFERENCED may be cleared only if the object containing
the page is locked. Add a comment documenting this.
Assert that a caller to vm_page_requeue() holds the page queues lock,
and assert that the page is on a page queue.
Push down the page queues lock into pmap_ts_referenced() and
pmap_page_exists_quick(). (As of now, there are no longer any pmap
functions that expect to be called with the page queues lock held.)
Neither pmap_ts_referenced() nor pmap_page_exists_quick() should ever
be passed an unmanaged page. Assert this rather than returning "0"
and "FALSE" respectively.
ARM:
Simplify pmap_page_exists_quick() by switching to TAILQ_FOREACH().
Push down the page queues lock inside of pmap_clearbit(), simplifying
pmap_clear_modify(), pmap_clear_reference(), and pmap_remove_write().
Additionally, this allows for avoiding the acquisition of the page
queues lock in some cases.
PowerPC/AIM:
moea*_page_exits_quick() and moea*_page_wired_mappings() will never be
called before pmap initialization is complete. Therefore, the check
for moea_initialized can be eliminated.
Push down the page queues lock inside of moea*_clear_bit(),
simplifying moea*_clear_modify() and moea*_clear_reference().
The last parameter to moea*_clear_bit() is never used. Eliminate it.
PowerPC/BookE:
Simplify mmu_booke_page_exists_quick()'s control flow.
Reviewed by: kib@
the interrupt followed by a brief delay if it is not currently masked
before moving the interrupt.
- Move the icu_lock out of ioapic_program_intpin() and into callers. This
closes a race where ioapic_program_intpin() could use a stale value of
the masked state to compute the masked bit in the register.
Reviewed by: mav
MFC after: 2 weeks
FPU/SSE hardware. Caller should provide a save area that is chained
into the stack of the areas; pcb save_area for usermode FPU state is
on top. The pcb now contains a pointer to the current FPU saved area,
used during FPUDNA handling and context switches. There is also a
facility to allow the kernel thread to use pcb save_area.
Change the dreaded warnings "npxdna in kernel mode!" into the panics
when FPU usage is not registered.
KPI discussed with: fabient
Tested by: pho, fabient
Hardware provided by: Sentex Communications
MFC after: 1 month
an ordering dependence: A pmap operation that clears PG_WRITEABLE and calls
vm_page_dirty() must perform the call first. Otherwise, pmap_is_modified()
could return FALSE without acquiring the page queues lock because the page
is not (currently) writeable, and the caller to pmap_is_modified() might
believe that the page's dirty field is clear because it has not seen the
effect of the vm_page_dirty() call.
When I pushed down the page queues lock into pmap_is_modified(), I
overlooked one place where this ordering dependence is violated:
pmap_enter(). In a rare situation pmap_enter() can be called to replace a
dirty mapping to one page with a mapping to another page. (I say rare
because replacements generally occur as a result of a copy-on-write fault,
and so the old page is not dirty.) This change delays clearing PG_WRITEABLE
until after vm_page_dirty() has been called.
Fixing the ordering dependency also makes it easy to introduce a small
optimization: When pmap_enter() used to replace a mapping to one page with a
mapping to another page, it freed the pv entry for the first mapping and
later called the pv entry allocator for the new mapping. Now, pmap_enter()
attempts to recycle the old pv entry, saving two calls to the pv entry
allocator.
There is no point in setting PG_WRITEABLE on unmanaged pages, so don't.
Update a comment to reflect this.
Tidy up the variable declarations at the start of pmap_enter().
pmap_is_referenced(). Eliminate the corresponding page queues lock
acquisitions from vm_map_pmap_enter() and mincore(), respectively. In
mincore(), this allows some additional cases to complete without ever
acquiring the page queues lock.
Assert that the page is managed in pmap_is_referenced().
On powerpc/aim, push down the page queues lock acquisition from
moea*_is_modified() and moea*_is_referenced() into moea*_query_bit().
Again, this will allow some additional cases to complete without ever
acquiring the page queues lock.
Reorder a few statements in vm_page_dontneed() so that a race can't lead
to an old reference persisting. This scenario is described in detail by a
comment.
Correct a spelling error in vm_page_dontneed().
Assert that the object is locked in vm_page_clear_dirty(), and restrict the
page queues lock assertion to just those cases in which the page is
currently writeable.
Add object locking to vnode_pager_generic_putpages(). This was the one
and only place where vm_page_clear_dirty() was being called without the
object being locked.
Eliminate an unnecessary vm_page_lock() around vnode_pager_setsize()'s call
to vm_page_clear_dirty().
Change vnode_pager_generic_putpages() to the modern-style of function
definition. Also, change the name of one of the parameters to follow
virtual memory system naming conventions.
Reviewed by: kib
APIC interrupt that fires when a threshold of corrected machine check
events is reached. CMCI also includes a count of events when reporting
corrected errors in the bank's status register. Note that individual
banks may or may not support CMCI. If they do, each bank includes its own
threshold register that determines when the interrupt fires. Currently
the code uses a very simple strategy where it doubles the threshold on
each interrupt until it succeeds in throttling the interrupt to occur
only once a minute (this interval can be tuned via sysctl). The threshold
is also adjusted on each hourly poll which will lower the threshold once
events stop occurring.
Tested by: Sailaja Bangaru sbappana at yahoo com
MFC after: 1 month
independent code. Move this code into mincore(), and eliminate the
page queues lock from pmap_mincore().
Push down the page queues lock into pmap_clear_modify(),
pmap_clear_reference(), and pmap_is_modified(). Assert that these
functions are never passed an unmanaged page.
Eliminate an inaccurate comment from powerpc/powerpc/mmu_if.m:
Contrary to what the comment says, pmap_mincore() is not simply an
optimization. Without a complete pmap_mincore() implementation,
mincore() cannot return either MINCORE_MODIFIED or MINCORE_REFERENCED
because only the pmap can provide this information.
Eliminate the page queues lock from vfs_setdirty_locked_object(),
vm_pageout_clean(), vm_object_page_collect_flush(), and
vm_object_page_clean(). Generally speaking, these are all accesses
to the page's dirty field, which are synchronized by the containing
vm object's lock.
Reduce the scope of the page queues lock in vm_object_madvise() and
vm_page_dontneed().
Reviewed by: kib (an earlier version)
arbitrary frequencies into hardclock(), statclock() and profclock() calls.
Same code with minor variations duplicated several times over the tree for
different timer drivers and architectures.
- Switch all x86 archs to new functions, simplifying the code and removing
extra logic from timer drivers. Other archs are also welcome.
Extend struct sysvec with three new elements:
sv_fetch_syscall_args - the method to fetch syscall arguments from
usermode into struct syscall_args. The structure is machine-depended
(this might be reconsidered after all architectures are converted).
sv_set_syscall_retval - the method to set a return value for usermode
from the syscall. It is a generalization of
cpu_set_syscall_retval(9) to allow ABIs to override the way to set a
return value.
sv_syscallnames - the table of syscall names.
Use sv_set_syscall_retval in kern_sigsuspend() instead of hardcoding
the call to cpu_set_syscall_retval().
The new functions syscallenter(9) and syscallret(9) are provided that
use sv_*syscall* pointers and contain the common repeated code from
the syscall() implementations for the architecture-specific syscall
trap handlers.
Syscallenter() fetches arguments, calls syscall implementation from
ABI sysent table, and set up return frame. The end of syscall
bookkeeping is done by syscallret().
Take advantage of single place for MI syscall handling code and
implement ptrace_lwpinfo pl_flags PL_FLAG_SCE, PL_FLAG_SCX and
PL_FLAG_EXEC. The SCE and SCX flags notify the debugger that the
thread is stopped at syscall entry or return point respectively. The
EXEC flag augments SCX and notifies debugger that the process address
space was changed by one of exec(2)-family syscalls.
The i386, amd64, sparc64, sun4v, powerpc and ia64 syscall()s are
changed to use syscallenter()/syscallret(). MIPS and arm are not
converted and use the mostly unchanged syscall() implementation.
Reviewed by: jhb, marcel, marius, nwhitehorn, stas
Tested by: marcel (ia64), marius (sparc64), nwhitehorn (powerpc),
stas (mips)
MFC after: 1 month
DDB so that all the fields line up.
- Print out the tid of the per-CPU idlethread instead of the pid since
the idle process is now shared across all idle threads.
MFC after: 1 month
here, make the style of assertion used by pmap_enter() consistent
across all architectures.
On entry to pmap_remove_write(), assert that the page is neither
unmanaged nor fictitious, since we cannot remove write access to
either kind of page.
With the push down of the page queues lock, pmap_remove_write() cannot
condition its behavior on the state of the PG_WRITEABLE flag if the
page is busy. Assert that the object containing the page is locked.
This allows us to know that the page will neither become busy nor will
PG_WRITEABLE be set on it while pmap_remove_write() is running.
Correct a long-standing bug in vm_page_cowsetup(). We cannot possibly
do copy-on-write-based zero-copy transmit on unmanaged or fictitious
pages, so don't even try. Previously, the call to pmap_remove_write()
would have failed silently.
labeled iretq instruction.
Suppose that multithreaded process executes two threads, currently
scheduled on different processors. Let assume that thread A executes
using %cs or %ss pointing into the descriptor from LDT. If IPI comes
which handler does not return by jump to doreti, and meantime thread B
invalidates descriptor pointed to by %cs or %ss, then iretq from IPI
handler could fault.
Routing the return by doreti_iret allows kernel to catch the situation
and recover from it by sending signal to the usermode.
Tested by: pho
MFC after: 1 week
frame upon segment register load fault. The doreti procedure does not
load segment registers when returning to the kernel frame, and current
values in the segment descriptor cache already allow the kernel mode
to run, not modified by faulted loaded.
Suggested by: bde
Tested by: pho
MFC after: 1 week
vm_page_try_to_free(). Consequently, push down the page queues lock into
pmap_enter_quick(), pmap_page_wired_mapped(), pmap_remove_all(), and
pmap_remove_write().
Push down the page queues lock into Xen's pmap_page_is_mapped(). (I
overlooked the Xen pmap in r207702.)
Switch to a per-processor counter for the total number of pages cached.
pmap_page_is_mapped() in preparation for removing page queues locking
around calls to vm_page_free(). Setting aside the assertion that calls
pmap_page_is_mapped(), vm_page_free_toq() now acquires and holds the page
queues lock just long enough to actually add or remove the page from the
paging queues.
Update vm_page_unhold() to reflect the above change.
architecture from page queue lock to a hashed array of page locks
(based on a patch by Jeff Roberson), I've implemented page lock
support in the MI code and have only moved vm_page's hold_count
out from under page queue mutex to page lock. This changes
pmap_extract_and_hold on all pmaps.
Supported by: Bitgravity Inc.
Discussed with: alc, jeffr, and kib
In the end, it does help fixing /dev/io usage from multithreaded
processes.
- On i386 and amd64 the old behaviour is kept but multithreaded
processes must use the new interface in order to work well.
- Support for the other architectures is greatly improved, where
necessary, by the necessity to define very small things now.
Manpage update will happen shortly.
Sponsored by: Sandvine Incorporated
PR: threads/116181
Reviewed by: emaste, marcel
MFC after: 3 weeks