atomic operations behave as if the were followed by a memory barrier so
there's no need to include ones in the acquire variants of atomic(9).
Removing these results a small performance improvement, specifically this
is sufficient to compensate the performance loss seen in the worldstone
benchmark seen when using SCHED_ULE instead of SCHED_4BSD.
This change is inspired by Linux even more radically doing the equivalent
thing some time ago.
Thanks go to Peter Jeremy for additional testing.
This patch is going to help in cases like mips flavours where you
want a more granular support on MAXCPU.
No MFC is previewed for this patch.
Tested by: pluknet
Approved by: re (kib)
the TLBs in order to get rid of the user mappings but instead traverse
them an flush only the latter like we also do for the Spitfire-class.
Also flushing the unlocked kernel entries can cause instant faults which
when called from within cpu_switch() are handled with the scheduler lock
held which in turn can cause timeouts on the acquisition of the lock by
other CPUs. This was easily seen with a 16-core V890 but occasionally
also happened with 2-way machines.
While at it, move the SPARC64-V support code entirely to zeus.c. This
causes a little bit of duplication but is less confusing than partially
using Cheetah-class bits for these.
- For SPARC64-V ensure that 4-Mbyte page entries are stored in the 1024-
entry, 2-way set associative TLB.
- In {d,i}tlb_get_data_sun4u() turn off the interrupts in order to ensure
that ASI_{D,I}TLB_DATA_ACCESS_REG actually are read twice back-to-back.
Tested by: Peter Jeremy (16-core US-IV), Michael Moll (2-way SPARC64-V)
have to ignore it when sending the IPI anyway. Actually I can't think of
a good reason why this ever was done that way in the first place as it's
not even usefull for debugging.
While at it replace the use of pc_other_cpus as it's slated for deorbit.
more than three temporary register in several places CATR() is used so
this code trades instructions in for registers. Actually, this still isn't
sufficient and CATR() has the side-effect of clobbering %y. Luckily, with
the current uses of CATR() this either doesn't matter or we are able to
(save and) restore it.
Now that there's only one use of AND() and TEST() left inline these.
This introduce all the underlying support for making this possible (via
the function cpusetobj_strscan() and keeps ktr_cpumask exported. sparc64
implements its own assembly primitives for tracing events and needs to
properly check it. Anyway the sparc64 logic is not implemented yet due
to lack of knowledge (by me) and time (by marius), but it is just a
matter of using ktr_cpumask when possible.
Tested and fixed by: pluknet
Reviewed by: marius
architectures (i386, for example) the virtual memory space may be
constrained enough that 2MB is a large chunk. Use 64K for arches
other than amd64 and ia64, with special handling for sparc64 due to
differing hardware.
Also commit the comment changes to kmem_init_zero_region() that I
missed due to not saving the file. (Darn the unfamiliar development
environment).
Arch maintainers, please feel free to adjust ZERO_REGION_SIZE as you
see fit.
Requested by: alc
MFC after: 1 week
MFC with: r221853
must not, otherwise we tell the CPU to IPI itself, which the sun4u CPUs don't
support. For reasons unknown so far MD and MI IPI use actually still triggers
that assertion though.
Compile sys/dev/mem/memutil.c for all supported platforms and remove now
unnecessary dev_mem_md_init(). Consistently define mem_range_softc from
mem.c for all platforms. Add missing #include guards for machine/memdev.h
and sys/memrange.h. Clean up some nearby style(9) nits.
MFC after: 1 month
architecture macros (__mips_n64, __powerpc64__) when 64 bit types (and
corresponding macros) are different from 32 bit. [1]
Correct the type of INT64_MIN, INT64_MAX and UINT64_MAX.
Define (U)INTMAX_C as an alias for (U)INT64_C matching the type definition
for (u)intmax_t. Do this on all architectures for consistency.
Suggested by: bde [1]
Approved by: kib (mentor)
On some architectures UCHAR_MAX and USHRT_MAX had type unsigned int.
However, lacking integer suffixes for types smaller than int, their type
should correspond to that of an object of type unsigned char (or short)
when used in an expression with objects of type int. In that case unsigned
char (short) are promoted to int (i.e. signed) so the type of UCHAR_MAX and
USHRT_MAX should also be int.
Where MIN/MAX constants implicitly have the correct type the suffix has
been removed.
While here, correct some comments.
Reviewed by: bde
Approved by: kib (mentor)
and switch sparc64 to use the first one for bus error filter handlers of
bridge drivers instead of (ab)using INTR_FAST for that so we eventually
can get rid of the latter.
Reviewed by: jhb
MFC after: 1 month
which takes an physical address instead of an virtual one, for loading TTEs
of the kernel TSB so we no longer need to lock the kernel TSB into the dTLB,
which only has a very limited number of lockable dTLB slots. The net result
is that we now basically can handle a kernel TSB of any size and no longer
need to limit the kernel address space based on the number of dTLB slots
available for locked entries. Consequently, other parts of the trap handlers
now also only access the the kernel TSB via its physical address in order
to avoid nested traps, as does the PMAP bootstrap code as we haven't taken
over the trap table at that point, yet. Apart from that the kernel TSB now
is accessed via a direct mapping when we are otherwise taking advantage of
ASI_ATOMIC_QUAD_LDD_PHYS so no further code changes are needed. Most of this
is implemented by extending the patching of the TSB addresses and mask as
well as the ASIs used to load it into the trap table so the runtime overhead
of this change is rather low. Currently the use of ASI_ATOMIC_QUAD_LDD_PHYS
is not yet enabled on SPARC64 CPUs due to lack of testing and due to the
fact it might require minor adjustments there.
Theoretically it should be possible to use the same approach also for the
user TSB, which already is not locked into the dTLB, avoiding nested traps.
However, for reasons I don't understand yet OpenSolaris only does that with
SPARC64 CPUs. On the other hand I think that also addressing the user TSB
physically and thus avoiding nested traps would get us closer to sharing
this code with sun4v, which only supports trap level 0 and 1, so eventually
we could have a single kernel which runs on both sun4u and sun4v (as does
Linux and OpenBSD).
Developed at and committed from: 27C3
STICK/STICK_COMPARE independently of the selected instruction set by
TICK_COMPARE so tick.c as of r214358 once again can be compiled with
gcc -mcpu=v9 for reference purposes.
kernel address space in order to leave space for the buffer cache, pipes,
thread stacks, etc on machines with more physical memory until we take
advantage of ASI_ATOMIC_QUAD_LDD_PHYS on CPUs providing it so we don't need
to lock the kernel TSB pages into the dTLB, basically making the entire
64-bit kernel address space available on relevant machines.
Submitted by: alc
Passing a count of zero on i386 and amd64 for [I386|AMD64]_BUS_SPACE_MEM
causes a crash/hang since the 'loop' instruction decrements the counter
before checking if it's zero.
PR: kern/80980
Discussed with: jhb
DEBUG_MEMGUARD panics early in kmeminit() with the message
"kmem_suballoc: bad status return of 1" because of zero "size" argument
passed to kmem_suballoc() due to "vm_kmem_size_max" being zero.
The problem also exists on ia64.
creation of large page mappings in the pmap, it can provide modest
performance benefits. In particular, for a "buildworld" on a 2x 1GHz
Ultrasparc IIIi it reduced the wall clock time by 2.2% and the system
time by 12.6%.
Tested by: marius@
contents of the ones that were not empty were stale and unused.
- Now that <machine/mutex.h> no longer exists, there is no need to allow it
to override various helper macros in <sys/mutex.h>.
- Rename various helper macros for low-level operations on mutexes to live
in the _mtx_* or __mtx_* namespaces. While here, change the names to more
closely match the real API functions they are backing.
- Drop support for including <sys/mutex.h> in assembly source files.
Suggested by: bde (1, 2)
a critical section as apparently required by both. I don't think either
belongs in the event timer front-ends but the callback should handle
this as necessary instead just like for example intr_event_handle()
does but this is how the other architectures currently handle it, either
explicitly or implicitly.
- Further rename and reword references to hardclock as this front-end no
longer has a notion of actually calling it.
to the expected type so they work like the corresponding __bswapN_var()
functions and the compiler doesn't complain when arguments of different
width are passed.
additionally takes advantage of the prefetch cache of these CPUs.
Unlike the uncommitted US-III version, which provide no measurable
speedup or even resulted in a slight slowdown on certain CPUs models
compared to using the US-I version with these, the SPARC64 version
actually results in a slight improvement.
The main goal of this is to generate timer interrupts only when there is
some work to do. When CPU is busy interrupts are generating at full rate
of hz + stathz to fullfill scheduler and timekeeping requirements. But
when CPU is idle, only minimum set of interrupts (down to 8 interrupts per
second per CPU now), needed to handle scheduled callouts is executed.
This allows significantly increase idle CPU sleep time, increasing effect
of static power-saving technologies. Also it should reduce host CPU load
on virtualized systems, when guest system is idle.
There is set of tunables, also available as writable sysctls, allowing to
control wanted event timer subsystem behavior:
kern.eventtimer.timer - allows to choose event timer hardware to use.
On x86 there is up to 4 different kinds of timers. Depending on whether
chosen timer is per-CPU, behavior of other options slightly differs.
kern.eventtimer.periodic - allows to choose periodic and one-shot
operation mode. In periodic mode, current timer hardware taken as the only
source of time for time events. This mode is quite alike to previous kernel
behavior. One-shot mode instead uses currently selected time counter
hardware to schedule all needed events one by one and program timer to
generate interrupt exactly in specified time. Default value depends of
chosen timer capabilities, but one-shot mode is preferred, until other is
forced by user or hardware.
kern.eventtimer.singlemul - in periodic mode specifies how much times
higher timer frequency should be, to not strictly alias hardclock() and
statclock() events. Default values are 2 and 4, but could be reduced to 1
if extra interrupts are unwanted.
kern.eventtimer.idletick - makes each CPU to receive every timer interrupt
independently of whether they busy or not. By default this options is
disabled. If chosen timer is per-CPU and runs in periodic mode, this option
has no effect - all interrupts are generating.
As soon as this patch modifies cpu_idle() on some platforms, I have also
refactored one on x86. Now it makes use of MONITOR/MWAIT instrunctions
(if supported) under high sleep/wakeup rate, as fast alternative to other
methods. It allows SMP scheduler to wake up sleeping CPUs much faster
without using IPI, significantly increasing performance on some highly
task-switching loads.
Tested by: many (on i386, amd64, sparc64 and powerc)
H/W donated by: Gheorghe Ardelean
Sponsored by: iXsystems, Inc.
In particular, provide pagesize and pagesizes array, the canary value
for SSP use, number of host CPUs and osreldate.
Tested by: marius (sparc64)
MFC after: 1 month
td_critnest > 1 when not already running on the desired CPU read the
TICK counter of the BSP via a direct cross trap request in that case
instead.
- Treat the STICK based timecounter the same way as the TICK based one
regarding its quality and obtaining the counter value from the BSP.
Like the TICK timers the STICK ones also are only synchronized during
their startup (which might not result in good synchronicity in the
first place) but not afterwards and might drift over time, causing
problems when the time is read from different CPUs (see r135972).
to single CPUs more efficiently with Cheetah(-class) and Jalapeno CPUs.
Besides being used to implement the ipi_cpu() introduced in r210939,
cpu_ipi_single() will also be used internally by the sparc64 MD code.
- Factor out the Jalapeno support from the Cheetah IPI send functions
in order to be able to more easily and efficiently implement support
for more than 32 target CPUs as well as a workaround for Cheetah+
erratum 25 for the latter.
IPI to a specific CPU by its cpuid. Replace calls to ipi_selected() that
constructed a mask for a single CPU with calls to ipi_cpu() instead. This
will matter more in the future when we transition from cpumask_t to
cpuset_t for CPU masks in which case building a CPU mask is more expensive.
Submitted by: peter, sbruno
Reviewed by: rookie
Obtained from: Yahoo! (x86)
MFC after: 1 month
now it uses a very dumb first-touch allocation policy. This will change in
the future.
- Each architecture indicates the maximum number of supported memory domains
via a new VM_NDOMAIN parameter in <machine/vmparam.h>.
- Each cpu now has a PCPU_GET(domain) member to indicate the memory domain
a CPU belongs to. Domain values are dense and numbered from 0.
- When a platform supports multiple domains, the default freelist
(VM_FREELIST_DEFAULT) is split up into N freelists, one for each domain.
The MD code is required to populate an array of mem_affinity structures.
Each entry in the array defines a range of memory (start and end) and a
domain for the range. Multiple entries may be present for a single
domain. The list is terminated by an entry where all fields are zero.
This array of structures is used to split up phys_avail[] regions that
fall in VM_FREELIST_DEFAULT into per-domain freelists.
- Each memory domain has a separate lookup-array of freelists that is
used when fulfulling a physical memory allocation. Right now the
per-domain freelists are listed in a round-robin order for each domain.
In the future a table such as the ACPI SLIT table may be used to order
the per-domain lookup lists based on the penalty for each memory domain
relative to a specific domain. The lookup lists may be examined via a
new vm.phys.lookup_lists sysctl.
- The first-touch policy is implemented by using PCPU_GET(domain) to
pick a lookup list when allocating memory.
Reviewed by: alc
between determining the other CPUs and calling cpu_ipi_selected(), which
apart from generally doing the wrong thing can lead to a panic when a
CPU is told to IPI itself (which sun4u doesn't support).
Reported and tested by: Nathaniel W Filardo
- Add __unused where appropriate.
MFC after: 3 days
Extend struct sysvec with three new elements:
sv_fetch_syscall_args - the method to fetch syscall arguments from
usermode into struct syscall_args. The structure is machine-depended
(this might be reconsidered after all architectures are converted).
sv_set_syscall_retval - the method to set a return value for usermode
from the syscall. It is a generalization of
cpu_set_syscall_retval(9) to allow ABIs to override the way to set a
return value.
sv_syscallnames - the table of syscall names.
Use sv_set_syscall_retval in kern_sigsuspend() instead of hardcoding
the call to cpu_set_syscall_retval().
The new functions syscallenter(9) and syscallret(9) are provided that
use sv_*syscall* pointers and contain the common repeated code from
the syscall() implementations for the architecture-specific syscall
trap handlers.
Syscallenter() fetches arguments, calls syscall implementation from
ABI sysent table, and set up return frame. The end of syscall
bookkeeping is done by syscallret().
Take advantage of single place for MI syscall handling code and
implement ptrace_lwpinfo pl_flags PL_FLAG_SCE, PL_FLAG_SCX and
PL_FLAG_EXEC. The SCE and SCX flags notify the debugger that the
thread is stopped at syscall entry or return point respectively. The
EXEC flag augments SCX and notifies debugger that the process address
space was changed by one of exec(2)-family syscalls.
The i386, amd64, sparc64, sun4v, powerpc and ia64 syscall()s are
changed to use syscallenter()/syscallret(). MIPS and arm are not
converted and use the mostly unchanged syscall() implementation.
Reviewed by: jhb, marcel, marius, nwhitehorn, stas
Tested by: marcel (ia64), marius (sparc64), nwhitehorn (powerpc),
stas (mips)
MFC after: 1 month
hook it up to ada(4) also. While at it, rename *ad_firmware_geom_adjust()
to *ata_disk_firmware_geom_adjust() etc now that these are no longer
limited to ad(4).
Reviewed by: mav
MFC after: 3 days
HAL/Fujitsu) CPUs. For the most part this consists of fleshing out the
MMU and cache handling, it doesn't add pmap optimizations possible with
these CPU, yet, though.
With these changes FreeBSD runs stable on Fujitsu Siemens PRIMEPOWER 250
and likely also other models based on SPARC64 V like 450, 650 and 850.
Thanks go to Michael Moll for providing access to a PRIMEPOWER 250.
architecture from page queue lock to a hashed array of page locks
(based on a patch by Jeff Roberson), I've implemented page lock
support in the MI code and have only moved vm_page's hold_count
out from under page queue mutex to page lock. This changes
pmap_extract_and_hold on all pmaps.
Supported by: Bitgravity Inc.
Discussed with: alc, jeffr, and kib
7 which corresponds to WSTATE_KMIX in OpenSolaris whenever calling into
it which totally screws us even when restoring %wstate afterwards as
spill/fill traps can happen while in OFW. The rather hackish OpenBSD
approach of just setting the equivalent of WSTATE_KERNEL to 7 also is
no option as we treat %wstate as a bit field. So in order to deal with
this problem actually implement spill/fill handlers for %wstate 7 which
just act as the WSTATE_KERNEL ones except of theoretically also handling
32-bit, turn off interrupts completely so we don't even take IPIs while
in OFW which should ensure we only take spill/fill traps at most and
restore %wstate after calling into OFW once we have taken over the trap
table. While at it, actually set WSTATE_{,PROM}_KMIX before calling into
OFW just like OpenSolaris does, which should at least help testing this
change on non-V1280.
- Remove comments referring to the %wstate usage in BSD/OS.
- Remove the no longer used RSF_ALIGN_RETRY macro.
- Correct some trap table addresses in comments.
- Ensure %wstate is set to WSTATE_KERNEL when taking over the trap table.
- Ensure PSTATE_AM is off when entering or exiting to OFW as well as that
interrupts are also completely off when exiting to OFW as the firmware
trap table shouldn't be used to handle our interrupts.
- Swap the configuration of the first and second large dTLB as with
US-IV+ these can only hold entries of certain page sizes each, which
we happened to chose the non-working way around.
- Additionally ensure that the large iTLB is set up to hold 8k pages
(currently this happens to be a NOP though).
- Add a workaround for US-IV+ erratum #2.
- Turn off dTLB parity error reporting as otherwise we get seemingly
false positives when copying in the user window by simulating a
fill trap on return to usermode. Given that these parity errors can
be avoided by disabling multi issue mode and the problem could be
reproduced with a second machine this appears to be a silicon bug of
some sort.
- Add a membar #Sync also before the stores to ASI_DCACHE_TAG. While
at it, turn of interrupts across the whole cheetah_cache_flush() for
simplicity instead of around every flush. This should have next to no
impact as for cheetah-class machines we typically only need to flush
the caches a few times during boot when recovering from peeking/poking
non-existent PCI devices, if at all.
- Just use KERNBASE for FLUSH as we also do elsewhere as the US-IV+
documentation doesn't seem to mention that these CPUs also ignore the
address like previous cheetah-class CPUs do. Again the code changing
LSU_IC is executed seldom enough that the negligible optimization of
using %g0 instead should have no real impact.
With these changes FreeBSD runs stable on V890 equipped with US-IV+
and -j128 buildworlds in a loop for days are no problem. Unfortunately,
the performance isn't were it should be as a buildworld on a 4x1.5GHz
US-IV+ V890 takes nearly 3h while on a V440 with (theoretically) less
powerfull 4x1.5GHz US-IIIi it takes just over 1h. It's unclear whether
this is related to the supposed silicon bug mentioned above or due to
another issue. The documentation (which contains a sever bug in the
description of the bits added to the context registers though) at least
doesn't mention any requirements for changes in the CPU handling besides
those implemented and the cache as well as the TLB configurations and
handling look fine.
o Re-arrange cheetah_init() so it's easier to add support for SPARC64
V up to VIIIfx CPUs, which only require parts of this initialization.
by UltraSparc-IV and -IV+ as well as SPARC64 V, VI, VII and VIIIfx CPUs.
- Replace TLB_PCXR_PGSZ_MASK and TLB_SCXR_PGSZ_MASK with TLB_CXR_PGSZ_MASK
which just is the complement of TLB_CXR_CTX_MASK instead of trying to
assemble it from the page size bits which vary across CPUs.
- Add macros for the remainder of the SFSR bits, which are useful for at
least debugging purposes.
but also of different types, f.e. Sun Fire V890 can be equipped with a
mix of UltraSPARC IV and IV+ CPUs, requiring different MMU initialization
and different workarounds for model specific errata. Therefore move the
CPU implementation number from a global variable to the per-CPU data.
Functions which are called before the latter is available are passed the
implementation number as a parameter now.
This file was missed in r204152.
but also of different types, f.e. Sun Fire V890 can be equipped with a
mix of UltraSPARC IV and IV+ CPUs, requiring different MMU initialization
and different workarounds for model specific errata. Therefore move the
CPU implementation number from a global variable to the per-CPU data.
Functions which are called before the latter is available are passed the
implementation number as a parameter now.
to the exclusion lists as the CPU nodes aren't handled as regular devices
either. Also add the pseudo-devices found in Sun Fire V1280.
- Allow nexus_attach() and nexus_alloc_resource() to be used by drivers
derived from nexus(4) for subordinate busses.
- Don't add the zero-sized memory resources of glue devices to the resource
lists.
root nexus device for the CPUs as starting with UltraSPARC IV the 'cpu'
nodes hang off of from 'cmp' (chip multi-threading processor) or 'core'
or combinations thereof. Also in large UltraSPARC III based machines
the 'cpu' nodes hang off of 'ssm' (scalable shared memory) nodes which
group snooping-coherency domains together instead of directly from the
nexus.
It would be great if we could use newbus to deal with the different ways
the 'cpu' devices can hang off of pseudo ones but unfortunately both
cpu_mp_setmaxid() and sparc64_init() have to work prior to regular device
probing.
- Add support for UltraSPARC IV and IV+ CPUs. Due to the fact that these
are multi-core each CPU has two Fireplane config registers and thus the
module/target ID has to be determined differently so the one specific
to a certain core is used. Similarly, starting with UltraSPARC IV the
individual cores use a different property in the OFW device tree to
indicate the CPU/core ID as it no longer is in coincidence with the
shared slot/socket ID.
This involves changing the MD KTR code to not directly read the UPA
module ID either. We use the MID stored in the per-CPU data instead of
calling cpu_get_mid() as a replacement in order prevent clobbering any
registers as side-effect in the assembler version. This requires CATR()
invocations from mp_startup() prior to mapping the per-CPU pages to be
removed though.
While at it additionally distinguish between CPUs with Fireplane and
JBus interconnects as these also use slightly different sizes for the
JBus/agent/module/target IDs.
- Make sparc64_shutdown_final() static as it's not used outside of
machdep.c.
of Sun Fire V1280 doesn't round up the size itself but instead lets
claiming of non page-sized amounts of memory fail.
- Change parameters and variables related to the TLB slots to unsigned
which is more appropriate.
- Search the whole OFW device tree instead of only the children of the
root nexus device for the BSP as starting with UltraSPARC IV the 'cpu'
nodes hang off of from 'cmp' (chip multi-threading processor) or 'core'
or combinations thereof. Also in large UltraSPARC III based machines
the 'cpu' nodes hang off of 'ssm' (scalable shared memory) nodes which
group snooping-coherency domains together instead of directly from the
nexus.
- Add support for UltraSPARC IV and IV+ BSPs. Due to the fact that these
are multi-core each CPU has two Fireplane config registers and thus the
module/target ID has to be determined differently so the one specific
to a certain core is used. Similarly, starting with UltraSPARC IV the
individual cores use a different property in the OFW device tree to
indicate the CPU/core ID as it no longer is in coincidence with the
shared slot/socket ID.
While at it additionally distinguish between CPUs with Fireplane and
JBus interconnects as these also use slightly different sizes for the
JBus/agent/module/target IDs.
- Check the return value of init_heap(). This requires moving it after
cons_probe() so we can panic when appropriate. This should be fine as
the PowerPC OFW loader uses that order for quite some time now.
to PCIe bridges.
- Add support for talking the PROM mappings over to the kernel IOTSB
just like we do with the kernel TSB in order to allow OFW drivers
to continue to work.
- Change some members, parameters and variables to unsigned where
more appropriate.
- Change INTMAP_VEC() to take an INO as its second argument rather
than an INR. The former is what I actually intended with this
macro and how it's currently used.
compiled to use the Medium/Low code model, which we currently default
to for the userland. GNU/Linux has moved their default to Medium/Middle
some time ago, which probably explains why the current GNU ld(1) uses
a base in the range between 32 and 44 bits instead.
Submitted by: kib
by looking at the bases used for non-relocatable executables by gnu ld(1),
and adjusting it slightly.
Discussed with: bz
Reviewed by: kan
Tested by: bz (i386, amd64), bsam (linux)
MFC after: some time
has proven to have a good effect when entering KDB by using a NMI,
but it completely violates all the good rules about interrupts
disabled while holding a spinlock in other occasions. This can be the
cause of deadlocks on events where a normal IPI_STOP is expected.
* Adds an new IPI called IPI_STOP_HARD on all the supported architectures.
This IPI is responsible for sending a stop message among CPUs using a
privileged channel when disponible. In other cases it just does match a
normal IPI_STOP.
Right now the IPI_STOP_HARD functionality uses a NMI on ia32 and amd64
architectures, while on the other has a normal IPI_STOP effect. It is
responsibility of maintainers to eventually implement an hard stop
when necessary and possible.
* Use the new IPI facility in order to implement a new userend SMP kernel
function called stop_cpus_hard(). That is specular to stop_cpu() but
it does use the privileged channel for the stopping facility.
* Let KDB use the newly introduced function stop_cpus_hard() and leave
stop_cpus() for all the other cases
* Disable interrupts on CPU0 when starting the process of APs suspension.
* Style cleanup and comments adding
This patch should fix the reboot/shutdown deadlocks many users are
constantly reporting on mailing lists.
Please don't forget to update your config file with the STOP_NMI
option removal
Reviewed by: jhb
Tested by: pho, bz, rink
Approved by: re (kib)
actually specify valid bases that should be treated just as normal.
The PCI specifications have no indication that 0 would be a magic value
indicating a disabled BAR as commonly used on at least amd64 and i386
but not sparc64. It's unclear what to do in pci_delete_resource()
instead of writing 0 to a BAR though as there's no (other) way do
disable individual BARs so its decoding is left enabled in case of
__PCI_BAR_ZERO_VALID for now.
Approved by: re (kib), jhb
MFC after: 1 week
dependent memory attributes:
Rename vm_cache_mode_t to vm_memattr_t. The new name reflects the
fact that there are machine-dependent memory attributes that have
nothing to do with controlling the cache's behavior.
Introduce vm_object_set_memattr() for setting the default memory
attributes that will be given to an object's pages.
Introduce and use pmap_page_{get,set}_memattr() for getting and
setting a page's machine-dependent memory attributes. Add full
support for these functions on amd64 and i386 and stubs for them on
the other architectures. The function pmap_page_set_memattr() is also
responsible for any other machine-dependent aspects of changing a
page's memory attributes, such as flushing the cache or updating the
direct map. The uses include kmem_alloc_contig(), vm_page_alloc(),
and the device pager:
kmem_alloc_contig() can now be used to allocate kernel memory with
non-default memory attributes on amd64 and i386.
vm_page_alloc() and the device pager will set the memory attributes
for the real or fictitious page according to the object's default
memory attributes.
Update the various pmap functions on amd64 and i386 that map pages to
incorporate each page's memory attributes in the mapping.
Notes: (1) Inherent to this design are safety features that prevent
the specification of inconsistent memory attributes by different
mappings on amd64 and i386. In addition, the device pager provides a
warning when a device driver creates a fictitious page with memory
attributes that are inconsistent with the real page that the
fictitious page is an alias for. (2) Storing the machine-dependent
memory attributes for amd64 and i386 as a dedicated "int" in "struct
md_page" represents a compromise between space efficiency and the ease
of MFCing these changes to RELENG_7.
In collaboration with: jhb
Approved by: re (kib)
o add to platforms where it was missing (arm, i386, powerpc, sparc64, sun4v)
o define as "1" on amd64 and i386 where there is no restriction
o make the type returned consistent with ALIGN
o remove _ALIGNED_POINTER
o make associated comments consistent
Reviewed by: bde, imp, marcel
Approved by: re (kensmith)
used kernel TLB slots when unloading the kernel or modules, which
results in havoc when loading a kernel and modules which take up
less TLB slots afterwards as the unused but locked ones aren't
accounted for in virtual_avail. Eventually this should be fixed
in the loader which isn't straight forward though and the kernel
should be robust against this anyway. [1]
- Ensure that the addresses allocated directly from phys_avail[] by
pmap_bootstrap_alloc() are always colored properly. This implicit
assumption was broken in r194784 as unlike the other consumers the
DPCPU area allocated for the BSP isn't a multiple of PAGE_SIZE *
DCACHE_COLORS. [2]
- Remove the no longer used global msgbuf_phys.
- Remove the redundant ekva parameter of pmap_bootstrap_alloc().
- Correct some outdated function names in ktr(9) invocations.
Requested by: jhb [1]
Reported by: gavin [2]
Approved by: re (kib)
MFC after: 2 weeks
required by video card drivers. Specifically, this change introduces
vm_cache_mode_t with an appropriate VM_CACHE_DEFAULT definition on all
architectures. In addition, this changes adds a vm_cache_mode_t parameter
to kmem_alloc_contig() and vm_phys_alloc_contig(). These will be the
interfaces for allocating mapped kernel memory and physical memory,
respectively, with non-default cache modes.
In collaboration with: jhb
- Modules and kernel code alike may use DPCPU_DEFINE(),
DPCPU_GET(), DPCPU_SET(), etc. akin to the statically defined
PCPU_*. Requires only one extra instruction more than PCPU_* and is
virtually the same as __thread for builtin and much faster for shared
objects. DPCPU variables can be initialized when defined.
- Modules are supported by relocating the module's per-cpu linker set
over space reserved in the kernel. Modules may fail to load if there
is insufficient space available.
- Track space available for modules with a one-off extent allocator.
Free may block for memory to allocate space for an extent.
Reviewed by: jhb, rwatson, kan, sam, grehan, marius, marcel, stas
a fair number of static data structures, making this an unlikely
option to try to change without also changing source code. [1]
Change default cache line size on ia64, sparc64, and sun4v to 128
bytes, as this was what rtld-elf was already using on those
platforms. [2]
Suggested by: bde [1], jhb [2]
MFC after: 2 weeks
CACHE_LINE_SIZE constant. These constants are intended to
over-estimate the cache line size, and be used at compile-time
when a run-time tuning alternative isn't appropriate or
available.
Defaults for all architectures are 64 bytes, except powerpc
where it is 128 bytes (used on G5 systems).
MFC after: 2 weeks
Discussed on: arch@
to the full path of the image that is being executed.
Increase AT_COUNT.
Remove no longer true comment about types used in Linux ELF binaries,
listed types contain FreeBSD-specific entries.
Reviewed by: kan
kernel one as the non-faulting flush address in the loader so
we can can change KERNBASE and VM_MIN_KERNEL_ADDRESS if we
ever want to without needing to worry about using a compatible
loader.
- Correctly check for LOADER_DEBUG.
- Add a missing const for page_sizes[].
whole KVA space using one locked 4MB dTLB entry per GB of physical
memory. On Cheetah-class machines only the dt16 can hold locked
entries though, which would be completely consumed for the kernel
TSB on machines with >= 16GB. Therefore limit the KVA space to use
no more than half of the lockable dTLB slots, given that we need
them also for other things.
- Add sanity checks which ensure that we don't exhaust the (lockable)
TLB slots.
of OFW access semantics, in order to allow future support for real-mode
OF access and flattened device frees. OF client interface modules are
implemented using KOBJ, in a similar way to the PPC PMAP modules.
Because we need Open Firmware to be available before mutexes can be used on
sparc64, changes are also included to allow KOBJ to be used very early in
the boot process by only using the mutex once we know it has been initialized.
Reviewed by: marius, grehan
the code for parsing interrupt maps) to PowerPC and reflect their new MI
status by moving them to the shared dev/ofw directory.
This commit also modifies the OFW PCI enumeration procedure on PowerPC to
allow the bus to find non-firmware-enumerated devices that Apple likes to add,
and adds some useful Open Firmware properties (compat and name) to the pnpinfo
string of children on OFW SBus, EBus, PCI, and MacIO links. Because of the
change to PCI enumeration on PowerPC, X has started working again on PPC
machines with Grackle hostbridges.
Reviewed by: marius
Obtained from: sparc64
and ifnet functions
- add memory barriers to <machine/atomic.h>
- update drivers to only conditionally define their own
- add lockless producer / consumer ring buffer
- remove ring buffer implementation from cxgb and update its callers
- add if_transmit(struct ifnet *ifp, struct mbuf *m) to ifnet to
allow drivers to efficiently manage multiple hardware queues
(i.e. not serialize all packets through one ifq)
- expose if_qflush to allow drivers to flush any driver managed queues
This work was supported by Bitgravity Inc. and Chelsio Inc.
filters instead of PIL_FAST and allow special filters and handlers
for interrupts which need to be able to interrupt even filters, f.e.
bus error interrupts, to be registered with the revived INTR_FAST
at PIL_FAST.
rerun of the streaming cache for silicon bug workarounds.
- Announce the presence of a streaming cache on attach for
informational purposes.
- For performance reasons don't do unnecessary flushes of the
streaming cache when coherent mappings are synced.
- Fix some minor style issues.
consists of CPUs running at different speeds, for driving hardclock as
these timers in turn are driven at frequencies as low as 5MHz, resulting
in bad granularity compared to the TICK timers. However, don't employ
the workaround for the BlackBird erratum #1 when using the TICK timer
on machines with cheetah-class CPUs for performance reasons.
Reported by: Florian Smeets
disable interrupts and loop forever with these.
- Hide all MP-related bits in <machine/smp.h> underneath #ifdef SMP.
- Inline ipi_all_but_self(9) and ipi_selected(9). We don't expose any
additional bits but save a few cycles by doing so.
- Remove ipi_all(9), which actually only called panic(9). It can't be
implemented natively anyway and having it removed at least causes
MI users to fail already fail when linking.
for all three contexts and configure the dt512_1 to hold 4MB pages for
them (e.g. for direct mappings).
This might allow for additional optimization by using the faulting
page sizes provided by AA_DMMU_TAG_ACCESS_EXT for bypassing the page
size walker for the dt512 in the superpage support code.
Submitted by: nwhitehorn (initial patch)
table. This is required in order to set obp-control-relinquished
within the PROM, allowing to safely read the OFW translations node.
Without this, f.e. a `ofwdump -ap` triggers a fatal reset error or
worse things on machines based on USIII and beyond.
In theory this should allow to remove touching %tba in cpu_setregs(),
in practice we seem to currently face a chicken and egg problem when
doing so however.
to 43 bits so update TD_PA_BITS accordingly. For the most part this
increase is transparent to the existing code except for when reading
the physical address from ASI_{D,I}TLB_DATA_ACCESS_REG, which we
only do in the loader and which was already adjusted in r182478, or
from the OFW translations node.
While at it, ensure we are only taking valid OFW mapping entries
into account.
frequencies (and having different cache sizes) so use the STICK
(System TICK) timer, which was introduced due to this and is
driven by the same frequency across all CPUs, instead of the
TICK timer, whose frequency varies with the CPU clock, to drive
hardclock. We try to use the STICK counter with all CPUs that are
USIII or beyond, even when not necessary due to identical CPUs,
as we can can also avoid the workaround for the BlackBird erratum
#1 there. Unfortunately, using the STICK counter currently causes
a hang with USIIIi MP machines for reasons unknown, so we still
use the TICK timer there (which is okay as they can only consist
of identical CPUs).
- Given that we only (try to) synchronize the (S)TICK timers of APs
with the BSP during startup, we could end up spinning forever in
DELAY(9) if that function is migrated to another CPU while we're
spinning due to clock drift afterwards, so pin to the CPU in order
to avoid migration. Unfortunately, pinning doesn't work at the
point DELAY(9) is required by the low-level console drivers, yet,
so switch to a function pointer, which is updated accordingly, for
implementing DELAY(9). For USIII and beyond, this would also allow
to easily use the STICK counter instead of the TICK one here,
there's no benefit in doing so however.
While at it, use cpu_spinwait(9) for spinning in the delay-
functions. This currently is a NOP though.
- Don't set the TICK timer of the BSP to 0 during at startup as
there's no need to do so.
- Implement cpu_est_clockrate().
- Unfortunately, USIIIi-based machines don't provide a timecounter
device besides the STICK and TICK counters (well, in theory the
Tomatillo bridges have a performance counter that can be (ab)used
as timecounter by configuring it to count bus cycles, though unlike
the performance counter of Schizo bridges, the Tomatillo one is
broken and counts Sun knows what in this mode). This means that
we've to use a (S)TICK counter for timecounting, which has the old
problem of not being in sync across CPUs, so provide an additional
timecounter function which binds itself to the BSP but has an
adequate low priority.
sizes (and running at different frequencies) so move the cacheinfo
to the PCPU data. While at it, remove some redundant and/or unused
members from struct cacheinfo.
- In sparc64_init don't assume the first CPU node we find in the OFW
device tree is the BSP.
no particular reason for them to be implemented in assembler and
having them in C allows easier extension as well as using more C
macros and {d,i}tlb_slot_max rather than hard-coding magic (and
actually spitfire-only) values.
- Fix the compilation of pmap_print_tte().
- Change pmap_print_tlb() to use ldxa() rather than re-rolling it
inline as well as TLB_DAR_SLOT and {d,i}tlb_slot_max rather than
hardcoding magic (and actually spitfire-only) values.
- While at it, suffix the above mentioned functions with "_sun4u" to
underline they're architecture-specific.
- Use __FBSDID and macros instead of magic values in locore.S.
- Remove unused includes and smp_stack in locore.S.
that modify condition codes (the carry bit, in this case). Without
"__volatile", the compiler might add the inline assembler instructions
between unrelated code which also uses condition codes, modifying the
latter.
This prevents the TCP pseudo header checksum calculation done in
tcp_output() from having effects on other conditions when compiled
with GCC 4.2.1 at "-O2" and "options INET6" left out. [1]
Reported & tested by: Boris Kochergin [1]
MFC after: 3 days
Now that st_rdev is being automatically generated by the kernel, there
is no need to define static major/minor numbers for the iodev and
memdev. We still need the minor numbers for the memdev, however, to
distinguish between /dev/mem and /dev/kmem.
Approved by: philip (mentor)
for UPA it should have fulfilled its purpose by now and Fireplane-
and JBus-based machines are way to messy in organization to implement
something equivalent.
- Fix a bunch of style(9) bugs.
counter-timer timecounter so the associated SYSCTL nodes don't clash on
machines having multiple U2P and U2S bridges as well as establishing a
clear mapping between these bridges and their timecounter device.
- Don't bother setting up a "nice" name for the IOMMU, just use the name
returned by device_get_nameunit(9), too.
- Fix some minor style(9) bugs.
- Use __FBSDID in counter.c
MFC after: 1 week
don't send and EOI which works like on amd64/i386 and blocks all
interrupts on the relevant interrupt controller.
o Replace the post_filter and post_inthread hooks registered when
creating the interrupt events with just ic_clear as on sparc64 we
don't need to do any disable->EOI->enable dance to unblock all but
the relevant interrupt while running the filter or handler; just
not clearing the interrupt already has the same effect.
o Merge from amd64/i386:
- Split the intr_table_lock into an sx lock used for most things,
and a spin lock to protect intrcnt_index.
- Add support for binding interrupts to CPUs, including for the
bus_bind_intr(9) interface, a assign_cpu hook and initially
shuffling interrupts arround in a round-robin fashion.
Reviewed by: jhb
MFC after: 1 month
these days, so de-generalize the acquire_timer/release_timer api
to just deal with speakers.
The new (optional) MD functions are:
timer_spkr_acquire()
timer_spkr_release()
and
timer_spkr_setfreq()
the last of which configures the timer to generate a tone of a given
frequency, in Hz instead of 1/1193182th of seconds.
Drop entirely timer2 on pc98, it is not used anywhere at all.
Move sysbeep() to kern/tty_cons.c and use the timer_spkr*() if
they exist, and do nothing otherwise.
Remove prototypes and empty acquire-/release-timer() and sysbeep()
functions from the non-beeping archs.
This eliminate the need for the speaker driver to know about
i8254frequency at all. In theory this makes the speaker driver MI,
contingent on the timer_spkr_*() functions existing but the driver
does not know this yet and still attaches to the ISA bus.
Syscons is more tricky, in one function, sc_tone(), it knows the hz
and things are just fine.
In the other function, sc_bell() it seems to get the period from
the KDMKTONE ioctl in terms if 1/1193182th second, so we hardcode
the 1193182 and leave it at that. It's probably not important.
Change a few other sysbeep() uses which obviously knew that the
argument was in terms of i8254 frequency, and leave alone those
that look like people thought sysbeep() took frequency in hertz.
This eliminates the knowledge of i8254_freq from all but the actual
clock.c code and the prof_machdep.c on amd64 and i386, where I think
it would be smart to ask for help from the timecounters anyway [TBD].
sectors so the geometry of large IDE disks has to be adjusted. This
corresponds to what the OpenSolaris dad(7D) driver does except that
the latter only tweaks sectors and effectively limits the mediasize
to 128GB so the cylinders and heads fields won't ever overflow. Not
limiting the mediasize is a compromise between allowing to use Sun
disk label as far as possible and being able to use the entire disk
with another disk label.
This allows to use the full capacity of large IDE disks if they were
not labeled under (Open)Solaris (in both ways of the meaning).
MFC after: 2 weeks
- Introduce per-architecture stack_machdep.c to hold stack_save(9).
- Introduce per-architecture machine/stack.h to capture any common
definitions required between db_trace.c and stack_machdep.c.
- Add new kernel option "options STACK"; we will build in stack(9) if it is
defined, or also if "options DDB" is defined to provide compatibility
with existing users of stack(9).
Add new stack_save_td(9) function, which allows the capture of a stacktrace
of another thread rather than the current thread, which the existing
stack_save(9) was limited to. It requires that the thread be neither
swapped out nor running, which is the responsibility of the consumer to
enforce.
Update stack(9) man page.
Build tested: amd64, arm, i386, ia64, powerpc, sparc64, sun4v
Runtime tested: amd64 (rwatson), arm (cognet), i386 (rwatson)
ways:
(1) Cached pages are no longer kept in the object's resident page
splay tree and memq. Instead, they are kept in a separate per-object
splay tree of cached pages. However, access to this new per-object
splay tree is synchronized by the _free_ page queues lock, not to be
confused with the heavily contended page queues lock. Consequently, a
cached page can be reclaimed by vm_page_alloc(9) without acquiring the
object's lock or the page queues lock.
This solves a problem independently reported by tegge@ and Isilon.
Specifically, they observed the page daemon consuming a great deal of
CPU time because of pages bouncing back and forth between the cache
queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE). The source of
this problem turned out to be a deadlock avoidance strategy employed
when selecting a cached page to reclaim in vm_page_select_cache().
However, the root cause was really that reclaiming a cached page
required the acquisition of an object lock while the page queues lock
was already held. Thus, this change addresses the problem at its
root, by eliminating the need to acquire the object's lock.
Moreover, keeping cached pages in the object's primary splay tree and
memq was, in effect, optimizing for the uncommon case. Cached pages
are reclaimed far, far more often than they are reactivated. Instead,
this change makes reclamation cheaper, especially in terms of
synchronization overhead, and reactivation more expensive, because
reactivated pages will have to be reentered into the object's primary
splay tree and memq.
(2) Cached pages are now stored alongside free pages in the physical
memory allocator's buddy queues, increasing the likelihood that large
allocations of contiguous physical memory (i.e., superpages) will
succeed.
Finally, as a result of this change long-standing restrictions on when
and where a cached page can be reclaimed and returned by
vm_page_alloc(9) are eliminated. Specifically, calls to
vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and
return a formerly cached page. Consequently, a call to malloc(9)
specifying M_NOWAIT is less likely to fail.
Discussed with: many over the course of the summer, including jeff@,
Justin Husted @ Isilon, peter@, tegge@
Tested by: an earlier version by kris@
Approved by: re (kensmith)
with the INTR_FILTER-enabled MI code. Basically this consists of
registering an interrupt controller (of which there can be multiple
and optionally different ones either per host-to-foo bridge or shared
amongst host-to-foo bridges in any one machine) along with an interrupt
vector as specific argument for all the interrupt vectors used by a
given host-to-foo bridge (roughly similar to registering interrupt
sources on amd64 and i386), providing functions to enable, clear and
disable the interrupts of the children beneath the bridge.
This also includes:
- No longer entering a critical section in tl0_intr() and tl1_intr()
for executing interrupt handlers but rather let the handlers enter
it themselves so in the case of intr_event_handle() we don't enter
a nested critical section.
- Adding infrastructure for binding delivery of interrupt vectors to
specific CPUs which later on can be interfaced with the code from
amd64/i386 for binding interrupts to specific CPUs.
- Getting rid of the wrapper hack introduced along the lines of the
API changes for INTR_FILTER which as a side-effect caused interrupts
associated with ithread handlers only to get the elevated priority
of those associated with filters ("fast handlers") (this removes the
hack also in the non-INTR_FILTER case).
- Disabling (by not clearing) an interrupt in the interrupt controller
until all associated handlers have been executed, which is crucial
for the typical locking strategy of NIC drivers in order to work
correctly in case of shared interrupts. This was a more or less
theoretical problem on sparc64 though, as shared interrupts are
rather uncommon there except for the on-board SCCs and UARTs.
Note that due to the behavior of at least of some of the interrupt
controllers used on sparc64 an enable+EOI instead of a disable+EOI
approach (as implied by the INTR_FILTER MI code and implemented on
other architectures) is used as the latter can cause lost interrupts
or in the worst case interrupt starvation.
o Correct a typo in sbus_alloc_resource() which caused (pass-through)
allocations to only work down to the grandchildren of the bus, which
wasn't a real problem so far as we don't support any devices which are
great-grandchildren or greater of a U2S bridge, yet.
o In fhc(4) use bus_{read,write}_4() instead of bus_space_{read,write}_4()
in order to get rid of sc_bh and sc_bt in the fhc_softc. Also get rid
of some other unneeded members in fhc_softc.
Reviewed by: marcel (earlier version)
Approved by: re (kensmith)