KVA space is abundant on amd64, so there is no reason to limit kernel
map size to a fraction of available physical memory. In fact, it could
be larger than physical memory.
This should help with memory auto-tuning for ZFS and shouldn't affect
other workloads.
This should reduce number of circumstances for "kmem_map too small"
panics, but probably won't eliminate them entirely due to potential kmem
fragmentation.
In fact, you might want/need to limit maximum ARC size after this commit
if you need to resrve more memory for applications.
This change was discussed on arch@ and nobody said "don't do it".
MFC after: 6 weeks
The main goal of this is to generate timer interrupts only when there is
some work to do. When CPU is busy interrupts are generating at full rate
of hz + stathz to fullfill scheduler and timekeeping requirements. But
when CPU is idle, only minimum set of interrupts (down to 8 interrupts per
second per CPU now), needed to handle scheduled callouts is executed.
This allows significantly increase idle CPU sleep time, increasing effect
of static power-saving technologies. Also it should reduce host CPU load
on virtualized systems, when guest system is idle.
There is set of tunables, also available as writable sysctls, allowing to
control wanted event timer subsystem behavior:
kern.eventtimer.timer - allows to choose event timer hardware to use.
On x86 there is up to 4 different kinds of timers. Depending on whether
chosen timer is per-CPU, behavior of other options slightly differs.
kern.eventtimer.periodic - allows to choose periodic and one-shot
operation mode. In periodic mode, current timer hardware taken as the only
source of time for time events. This mode is quite alike to previous kernel
behavior. One-shot mode instead uses currently selected time counter
hardware to schedule all needed events one by one and program timer to
generate interrupt exactly in specified time. Default value depends of
chosen timer capabilities, but one-shot mode is preferred, until other is
forced by user or hardware.
kern.eventtimer.singlemul - in periodic mode specifies how much times
higher timer frequency should be, to not strictly alias hardclock() and
statclock() events. Default values are 2 and 4, but could be reduced to 1
if extra interrupts are unwanted.
kern.eventtimer.idletick - makes each CPU to receive every timer interrupt
independently of whether they busy or not. By default this options is
disabled. If chosen timer is per-CPU and runs in periodic mode, this option
has no effect - all interrupts are generating.
As soon as this patch modifies cpu_idle() on some platforms, I have also
refactored one on x86. Now it makes use of MONITOR/MWAIT instrunctions
(if supported) under high sleep/wakeup rate, as fast alternative to other
methods. It allows SMP scheduler to wake up sleeping CPUs much faster
without using IPI, significantly increasing performance on some highly
task-switching loads.
Tested by: many (on i386, amd64, sparc64 and powerc)
H/W donated by: Gheorghe Ardelean
Sponsored by: iXsystems, Inc.
Bring in a driver for the LSI Logic MPT2 6Gb SAS controllers.
This driver supports basic I/O, and works with SAS and SATA drives and
expanders.
Basic error recovery works (i.e. timeouts and aborts) as well.
Integrated RAID isn't supported yet, and there are some known bugs.
So this isn't ready for production use, but is certainly ready for
testing and additional development. For the moment, new commits to this
driver should go into the FreeBSD Perforce repository first
(//depot/projects/mps/...) and then get merged into -current once
they've been vetted.
This has only been added to the amd64 GENERIC, since that is the only
architecture I have tested this driver with.
Submitted by: scottl
Discussed with: imp, gibbs, will
Sponsored by: Yahoo, Spectra Logic Corporation
This reflects actual type used to store and compare child device orders.
Change is mostly done via a Coccinelle (soon to be devel/coccinelle)
semantic patch.
Verified by LINT+modules kernel builds.
Followup to: r212213
MFC after: 10 days
In particular, provide pagesize and pagesizes array, the canary value
for SSP use, number of host CPUs and osreldate.
Tested by: marius (sparc64)
MFC after: 1 month
It is more appropriate in this context because TSC MSR is reset to zero
when the CPU is restarted from S3 and above. Move acpi_resync_clock() back
to where it was before r211202. It does not make a difference any more.
As long as interrupts are disabled and there is not explicit call to
sched_add() there can't be any preemption there, thus the calls may be
consistent.
Reported by: kib, jhb
are served via an interrupt gate.
However, that doesn't explicitly prevent preemption and thread
migration thus scheduler pinning may be necessary in some handlers.
Fix that.
Tested by: gianni
MFC after: 1 month
While there, also fix some places assuming cpu type is 'int' while
u_int is really meant.
Note: this will also fix some possible races in per-cpu data accessings
to be addressed in further commits.
In collabouration with: Yahoo! Incorporated (via sbruno and peter)
Tested by: gianni
MFC after: 1 month
IPI to a specific CPU by its cpuid. Replace calls to ipi_selected() that
constructed a mask for a single CPU with calls to ipi_cpu() instead. This
will matter more in the future when we transition from cpumask_t to
cpuset_t for CPU masks in which case building a CPU mask is more expensive.
Submitted by: peter, sbruno
Reviewed by: rookie
Obtained from: Yahoo! (x86)
MFC after: 1 month
savectx() is only used for panic dump (dumppcb) and kdb (stoppcbs). Thus,
saving additional information does not hurt and it may be even beneficial.
Unfortunately, struct pcb has grown larger to accommodate more data.
Move 512-byte long pcb_user_save to the end of struct pcb while I am here.
- savectx() now saves FPU state unconditionally and copy it to the PCB of
FPU thread if necessary. This gives panic dump and kdb a chance to take
a look at the current FPU state even if the FPU is "supposedly" not used.
- Resuming CPU now unconditionally reinitializes FPU. If the saved FPU
state was irrelevant, it could be in an unknown state.
Suggested by: bde [1]
Xeon 5500/5600 series:
- Utilize IA32_TEMPERATURE_TARGET, a.k.a. Tj(target) in place
of Tj(max) when a sane value is available, as documented
in Intel whitepaper "CPU Monitoring With DTS/PECI"; (By sane
value we mean 70C - 100C for now);
- Print the probe results when booting verbose;
- Replace cpu_mask with cpu_stepping;
- Use CPUID_* macros instead of rolling our own.
Approved by: rpaulo
MFC after: 1 month
from the inline assembly. This allows the compiler to cache invocations of
curthread since it's value does not change within a thread context.
Submitted by: zec (i386)
MFC after: 1 week
for crash dump (dumppcb) and kdb (stoppcbs). For both cases, there cannot
have a valid pointer in pcb_save. This should restore the previous
behaviour.
zones for each malloc bucket size. The purpose is to isolate
different malloc types into hash classes, so that any buffer overruns
or use-after-free will usually only affect memory from malloc types in
that hash class. This is purely a debugging tool; by varying the hash
function and tracking which hash class was corrupted, the intersection
of the hash classes from each instance will point to a single malloc
type that is being misused. At this point inspection or memguard(9)
can be used to catch the offending code.
Add MALLOC_DEBUG_MAXZONES=8 to -current GENERIC configuration files.
The suggestion to have this on by default came from Kostik Belousov on
-arch.
This code is based on work by Ron Steinke at Isilon Systems.
Reviewed by: -arch (mostly silence)
Reviewed by: zml
Approved by: zml (mentor)
now it uses a very dumb first-touch allocation policy. This will change in
the future.
- Each architecture indicates the maximum number of supported memory domains
via a new VM_NDOMAIN parameter in <machine/vmparam.h>.
- Each cpu now has a PCPU_GET(domain) member to indicate the memory domain
a CPU belongs to. Domain values are dense and numbered from 0.
- When a platform supports multiple domains, the default freelist
(VM_FREELIST_DEFAULT) is split up into N freelists, one for each domain.
The MD code is required to populate an array of mem_affinity structures.
Each entry in the array defines a range of memory (start and end) and a
domain for the range. Multiple entries may be present for a single
domain. The list is terminated by an entry where all fields are zero.
This array of structures is used to split up phys_avail[] regions that
fall in VM_FREELIST_DEFAULT into per-domain freelists.
- Each memory domain has a separate lookup-array of freelists that is
used when fulfulling a physical memory allocation. Right now the
per-domain freelists are listed in a round-robin order for each domain.
In the future a table such as the ACPI SLIT table may be used to order
the per-domain lookup lists based on the penalty for each memory domain
relative to a specific domain. The lookup lists may be examined via a
new vm.phys.lookup_lists sysctl.
- The first-touch policy is implemented by using PCPU_GET(domain) to
pick a lookup list when allocating memory.
Reviewed by: alc