o Setting td_intr_frame to the XIVs trap frame because it's referenced
by the ET event handler.
o Signal EOI to the CPU before calling the registered XIV handlers.
This prevents lost ITC interrupts, which cause starvation in one-shot
mode.
o Adding support for IPI_HARDCLOCK with corresponding per-CPU counters.
o Have the APs call cpu_initclocks() so as to limited the scattering of
clock related initialization. cpu_initclocks() calls the <self>_bsp()
or <self>_ap() version accordingly.
o Uncomment the ET clock handling in cpu_idle().
o Update the DDB 'show pcpu' output for the new MD fields.
o Entirely rewritten ia64_ih_clock(). Note that we don't create as many
clock XIVs as we have CPUs, as is done on PowerPC. It doesn't scale.
We can only have 240 XIVs and we can have more CPUs than that. There's
a single intrcnt index for the cumulative clock ticks and we keep per
CPU counts in the PCPU stats structure.
o Register the ITC by hooking SI_SUB_CONFIGURE (2nd order).
Open issues:
o Clock interrupts can still be lost. Some tweaking is still necessary.
Thanks to: mav@ for his support, feedback and explanations.
ET stats while committing:
eris% sysctl machdep.cpu | grep nclks
machdep.cpu.0.nclks: 24007
machdep.cpu.1.nclks: 22895
machdep.cpu.2.nclks: 13523
machdep.cpu.3.nclks: 9342
machdep.cpu.4.nclks: 9103
machdep.cpu.5.nclks: 9298
machdep.cpu.6.nclks: 10039
machdep.cpu.7.nclks: 9479
eris% vmstat -i | grep clock
clock 108599 50
stream of the local processor. Also explicitly invalidate
the ALAT. This is done on the other CPUs in the coherence
domain by virtue of the ptc.ga instruction, but does not
apply to the local CPU.
architectures (i386, for example) the virtual memory space may be
constrained enough that 2MB is a large chunk. Use 64K for arches
other than amd64 and ia64, with special handling for sparc64 due to
differing hardware.
Also commit the comment changes to kmem_init_zero_region() that I
missed due to not saving the file. (Darn the unfamiliar development
environment).
Arch maintainers, please feel free to adjust ZERO_REGION_SIZE as you
see fit.
Requested by: alc
MFC after: 1 week
MFC with: r221853
cpuset_t objects.
That is going to offer the underlying support for a simple bump of
MAXCPU and then support for number of cpus > 32 (as it is today).
Right now, cpumask_t is an int, 32 bits on all our supported architecture.
cpumask_t on the other side is implemented as an array of longs, and
easilly extendible by definition.
The architectures touched by this commit are the following:
- amd64
- i386
- pc98
- arm
- ia64
- XEN
while the others are still missing.
Userland is believed to be fully converted with the changes contained
here.
Some technical notes:
- This commit may be considered an ABI nop for all the architectures
different from amd64 and ia64 (and sparc64 in the future)
- per-cpu members, which are now converted to cpuset_t, needs to be
accessed avoiding migration, because the size of cpuset_t should be
considered unknown
- size of cpuset_t objects is different from kernel and userland (this is
primirally done in order to leave some more space in userland to cope
with KBI extensions). If you need to access kernel cpuset_t from the
userland please refer to example in this patch on how to do that
correctly (kgdb may be a good source, for example).
- Support for other architectures is going to be added soon
- Only MAXCPU for amd64 is bumped now
The patch has been tested by sbruno and Nicholas Esborn on opteron
4 x 12 pack CPUs. More testing on big SMP is expected to came soon.
pluknet tested the patch with his 8-ways on both amd64 and i386.
Tested by: pluknet, sbruno, gianni, Nicholas Esborn
Reviewed by: jeff, jhb, sbruno
of the 61 bits available within the region for virtual addressing. Since
there's no good way for us to map out the gap in the virtual address space,
limit KVA to the architectural minimum implemented address bits. This still
gives us 1 petabyte of KVA, so no worries.
use the PBVM. This eliminates the implied hardcoding of the
physical address at which the kernel needs to be loaded. Using the
PBVM makes it possible to load the kernel irrespective of the
physical memory organization and allows us to replicate kernel text
on NUMA machines.
While here, reduce the direct-mapped page size to the kernel's
page size so that we can support memory attributes better.
boundaries. For good measure, align all other objects to cache
lines boundaries.
Use the new arch_loadseg I/F to keep track of kernel text and
data so that we can wire as much of it as is possible. It is
the responsibility of the kernel to link critical (read IVT
related) code and data at the front of the respective segment
so that it's covered by TRs before the kernel has a chance to
add more translations.
Use a better way of determining whether we're loading a legacy
kernel or not. We can't check for the presence of the PBVM page
table, because we may have unloaded that kernel and loaded an
older (legacy) kernel after that. Simply use the latest load
address for it.
services or PAL procedures. The new implementation is based on
specific functions that are known to be called in certain scenarios
only. This in particular fixes the PAL call to obtain information
about translation registers. In general, the new implementation does
not bank on virtual addresses being direct-mapped and will work when
the kernel uses PBVM.
When new scenarios need to be supported, new functions are added if
the existing functions cannot be changed to handle the new scenario.
If a single generic implementation is possible, it will become clear
in due time.
While here, change bootinfo to a pointer type in anticipation of
future development.
1. The PBVM is in region 4, so if we want to make use of it, we
need region 4 freed up.
2. Region 4 and above cannot be represented by an off_t by virtue
of that type being signed. This is problematic for truss(1),
ktrace(1) and other such programs.
Add support for Pre-Boot Virtual Memory (PBVM) to the loader.
PBVM allows us to link the kernel at a fixed virtual address without
having to make any assumptions about the physical memory layout. On
the SGI Altix 350 for example, there's no usuable physical memory
below 192GB. Also, the PBVM allows us to control better where we're
going to physically load the kernel and its modules so that we can
make sure we load the kernel in memory that's close to the BSP.
The PBVM is managed by a simple page table. The minimum size of the
page table is 4KB (EFI page size) and the maximum is currently set
to 1MB. A page in the PBVM is 64KB, as that's the maximum alignment
one can specify in a linker script. The bottom line is that PBVM is
between 64KB and 8GB in size.
The loader maps the PBVM page table at a fixed virtual address and
using a single translations. The PBVM itself is also mapped using a
single translation for a maximum of 32MB.
While here, increase the heap in the EFI loader from 512KB to 2MB
and set the stage for supporting relocatable modules.
The compiler seems to assume it's a 32-bit integral and rounding to the
page size using the standard expression (((u_long)(x) + mask) & ~mask),
results in a 32-bit value. Dropping the 'U' suffix is enough to have the
compiler treat the expression as a 64-bit integral.
handlers.
o Put the IVT in its own section and keep the supporting code close.
o Make sure the VHPT is sized so that it can be mapped using a single
translation.
o Map the PAL code and VHPT with a translation that has the right size.
Assume the platform has a PAL code size that can be mapped with a
single translations.
o Pass the pointer to the bootinfo structure as an argument to ia64_init().
o Get rid of LOG2_ID_PAGE_SIZE and IA64_ID_PAGE_SIZE. It was used to map
the regions 6 & 7 and was as large as possible. The problem is that we
can't support memory attributes easily if the granuratity is not a page.
We need to support memory attributes because the new USB stack violates
the BUS_DMA(9) interface.
o Update some comments...
NOTE: this is broken for SMP kernels, because the AP startup code hasn't
been updated yet.
o The bootinfo structure is now a virtual pointer.
o Replace VM_MAX_ADDRESS with VM_MAXUSER_ADDRESS and redefine
VM_MAX_ADDRESS as the maximum address possible (~0UL).
o Since we're not using direct-mapped translations, switching
to physical addressing is less trivial. Reserve the boot stack
for running in physical mode and special-case the EFI call,
as we're still on the boot stack.
o Region 4 belongs to the kernel now, not process space.
o Move the backing store in the top half of region 0 now that
region 4 is re-assigned to be part of the kernel.
o De-emphasize VM_MAX_ADDRESS. It's really not used anywhere and probably
means something different than the limit for process address space (we
have VM_MAXUSER_ADDRESS for that).
o Exclude the gateway page from VM_MAXUSER_ADDRESS (i.e. make it the same
as VM_MAX_ADDRESS).
Compile sys/dev/mem/memutil.c for all supported platforms and remove now
unnecessary dev_mem_md_init(). Consistently define mem_range_softc from
mem.c for all platforms. Add missing #include guards for machine/memdev.h
and sys/memrange.h. Clean up some nearby style(9) nits.
MFC after: 1 month
architecture macros (__mips_n64, __powerpc64__) when 64 bit types (and
corresponding macros) are different from 32 bit. [1]
Correct the type of INT64_MIN, INT64_MAX and UINT64_MAX.
Define (U)INTMAX_C as an alias for (U)INT64_C matching the type definition
for (u)intmax_t. Do this on all architectures for consistency.
Suggested by: bde [1]
Approved by: kib (mentor)
On some architectures UCHAR_MAX and USHRT_MAX had type unsigned int.
However, lacking integer suffixes for types smaller than int, their type
should correspond to that of an object of type unsigned char (or short)
when used in an expression with objects of type int. In that case unsigned
char (short) are promoted to int (i.e. signed) so the type of UCHAR_MAX and
USHRT_MAX should also be int.
Where MIN/MAX constants implicitly have the correct type the suffix has
been removed.
While here, correct some comments.
Reviewed by: bde
Approved by: kib (mentor)
Passing a count of zero on i386 and amd64 for [I386|AMD64]_BUS_SPACE_MEM
causes a crash/hang since the 'loop' instruction decrements the counter
before checking if it's zero.
PR: kern/80980
Discussed with: jhb
contents of the ones that were not empty were stale and unused.
- Now that <machine/mutex.h> no longer exists, there is no need to allow it
to override various helper macros in <sys/mutex.h>.
- Rename various helper macros for low-level operations on mutexes to live
in the _mtx_* or __mtx_* namespaces. While here, change the names to more
closely match the real API functions they are backing.
- Drop support for including <sys/mutex.h> in assembly source files.
Suggested by: bde (1, 2)
In particular, provide pagesize and pagesizes array, the canary value
for SSP use, number of host CPUs and osreldate.
Tested by: marius (sparc64)
MFC after: 1 month
IPI to a specific CPU by its cpuid. Replace calls to ipi_selected() that
constructed a mask for a single CPU with calls to ipi_cpu() instead. This
will matter more in the future when we transition from cpumask_t to
cpuset_t for CPU masks in which case building a CPU mask is more expensive.
Submitted by: peter, sbruno
Reviewed by: rookie
Obtained from: Yahoo! (x86)
MFC after: 1 month
from the inline assembly. This allows the compiler to cache invocations of
curthread since it's value does not change within a thread context.
Submitted by: zec (i386)
MFC after: 1 week
now it uses a very dumb first-touch allocation policy. This will change in
the future.
- Each architecture indicates the maximum number of supported memory domains
via a new VM_NDOMAIN parameter in <machine/vmparam.h>.
- Each cpu now has a PCPU_GET(domain) member to indicate the memory domain
a CPU belongs to. Domain values are dense and numbered from 0.
- When a platform supports multiple domains, the default freelist
(VM_FREELIST_DEFAULT) is split up into N freelists, one for each domain.
The MD code is required to populate an array of mem_affinity structures.
Each entry in the array defines a range of memory (start and end) and a
domain for the range. Multiple entries may be present for a single
domain. The list is terminated by an entry where all fields are zero.
This array of structures is used to split up phys_avail[] regions that
fall in VM_FREELIST_DEFAULT into per-domain freelists.
- Each memory domain has a separate lookup-array of freelists that is
used when fulfulling a physical memory allocation. Right now the
per-domain freelists are listed in a round-robin order for each domain.
In the future a table such as the ACPI SLIT table may be used to order
the per-domain lookup lists based on the penalty for each memory domain
relative to a specific domain. The lookup lists may be examined via a
new vm.phys.lookup_lists sysctl.
- The first-touch policy is implemented by using PCPU_GET(domain) to
pick a lookup list when allocating memory.
Reviewed by: alc
name of 32bit sibling architecture instead of the host one. Do the
same for hw.machine on amd64.
Add a safety belt debug.adaptive_machine_arch sysctl, to turn the
substitution off.
Reviewed by: jhb, nwhitehorn
MFC after: 2 weeks