zero on slower machines, which make the fenced get_timecount methods
not used despite needed. Remove the (shift > 0) condition when
selecting the get_timecount() implementation.
Rename smp_tsc_shift to tsc_shift, and apply it for the UP case too.
Allow shift to reach value of 31 instead of 30, as it was previously
(should be nop).
Reorganize the tc quality calculation to remove the conditionally
compiled block. Rename test_smp_tsc() to test_tsc() and provide
separate versions for SMP and UP builds. The check for virtialized
hardware is more natural to perform in the smp version of the
test_tsc(), since it is only done for smp case.
Noted and reviewed by: bde (previous version)
MFC after: 12 days
timecounter to 1, and correspondingly increase the precision of the
gettimeofday(2) and related functions in the default configuration.
The motivation for the TSC-low timecounter, as described in the
r222866, seems to provide a workaround for the non-serializing
behaviour of the RDTSC on some Intel hardware. Tests demonstrate that
even with the pre-shift of 8, the cross-core non-monotonicity of the
RDTSC is still observed reliably, e.g. on the Nehalems. The r238755
and r238973 implemented the proper fix for the issue.
The pre-shift of 1 is applied to keep TSC not overflowing for the
frequency of hardclock down to 2 sec/intr. The pre-shift is made a
tunable to allow the easy debugging of the issues users could see with
the shift being too low.
Reviewed by: bde
MFC after: 2 weeks
CPUs exhibit bad behavior if this is done (Intel Errata AAJ3, hangs on
Pentium-M, and trashing of the local APIC registers on a VIA C7). The
local APIC is implicitly mapped UC already via MTRRs, so the clflush isn't
necessary anyway.
MFC after: 2 weeks
Rather than trying to KASSERT for callers that invoke this on
IO tags, either do nothing (for write_8) or return ~0 (for read_8).
Using KASSERT here just makes bus.h too messy from both
polluting bus.h with systm.h (for any number of drivers that include
bus.h without first including systm.h) or ports that use bus.h
directly (i.e. libpciaccess) as reported by zeising@.
Also don't try to implement all of the other bus_space functions for
8 byte access since realistically only these two are needed for some
devices that expose 64-bit memory-mapped registers.
Put the amd64-specific functions here rather than sys/amd64/include/bus.h
so that we can keep this header unified for x86, as requested by mdf@
and tijl@.
Submitted by: Carl Delsey <carl.r.delsey@intel.com>
MFC after: 3 days
Programming the low bits has a side-effect if unmasking the pin if it is
not disabled. So if an interrupt was pending then it would be delivered
with the correct new vector but to the incorrect old LAPIC.
This fix could be made clearer by preserving the mask bit while
programming the low bits and then explicitly resetting the mask bit
after all the programming is done.
Probability to trip over the fixed bug could be increased by bootverbose
because printing of the interrupt information in ioapic_assign_cpu
lengthened the time window during which an interrupt could arrive while
a pin is masked.
Reported by: Andreas Longwitz <longwitz@incore.de>
Tested by: Andreas Longwitz <longwitz@incore.de>
MFC after: 12 days
introduced with the IvyBridge CPUs. Provide the definitions for new
bits in CR3 and CR4 registers.
Tested by: avg, Michael Moll <kvedulv@kvedulv.de>
MFC after: 2 weeks
instruction loads/stores at its will.
The macro __compiler_membar() is currently supported for both gcc and
clang, but kernel compilation will fail otherwise.
Reviewed by: bde, kib
Discussed with: dim, theraven
MFC after: 2 weeks
r234247.
Use, instead, the static intializer introduced in r239923 for x86 and
sparc64 intr_cpus, unwinding the code to the initial version.
Reviewed by: marius
bits under #ifdef _KERNEL but leave definitions for various structures
defined by standards ($PIR table, SMAP entries, etc.) available to
userland.
- Consolidate duplicate SMBIOS table structure definitions in ipmi(4)
and smbios(4) in <machine/pc/bios.h> and make them available to
userland.
MFC after: 2 weeks
segments for the entire allocation to use kmem_alloc_attr() to allocate
KVM rather than using kmem_alloc_contig(). This avoids requiring
a single physically contiguous chunk in this case.
Submitted by: Peter Jeremy (original version)
MFC after: 1 month
protect against 32-bit TSC overflow while the sync test is running.
On dual-socket Xeon E5-2600 (SNB) systems with up to 32 threads, there
is non-trivial chance (2-3%) that TSC synchronization test fails due to
32-bit TSC overflow while the synchronization test is running.
Sponsored by: Intel
Reviewed by: jkim
Discussed with: jkim, kib
programming using earlier cached values. This makes respective routines to
disappear from PMC top and reduces total number of active CPU cycles on idle
24-core system by 10%.
attributes (currently just BUS_DMA_NOCACHE):
- Don't call pmap_change_attr() on the returned address, instead use
kmem_alloc_contig() to ask the VM system for memory with the requested
attribute.
- As a result, always use kmem_alloc_contig() for non-default memory
attributes, even for sub-page allocations. This requires adjusting
bus_dmamem_free()'s logic for determining which free routine to use.
- For x86, add a new dummy bus_dmamap that is used for static DMA
buffers allocated via kmem_alloc_contig(). bus_dmamem_free() can then
use the map pointer to determine which free routine to use.
- For powerpc, add a new flag to the allocated map (bus_dmamem_alloc()
always creates a real map on powerpc) to indicate which free routine
should be used.
Note that the BUS_DMA_NOCACHE handling in powerpc is currently #ifdef'd out.
I have left it disabled but updated it to match x86.
Reviewed by: scottl
MFC after: 1 month
message for r238973:
Rdtsc instruction is not synchronized, it seems on some Intel cores it
can bypass even the locked instructions. As a result, rdtsc executed
on different cores may return unordered TSC values even when the rdtsc
appearance in the instruction sequences is provably ordered.
Similarly to what has been done in r238755 for TSC synchronization
test, add explicit fences right before rdtsc in the timecounters 'get'
functions. Intel recommends to use LFENCE, while AMD refers to
MFENCE. For VIA follow what Linux does and use LFENCE. With this
change, I see no reordered reads of TSC on Nehalem.
Change the rmb() to inlined CPUID in the SMP TSC synchronization test.
On i386, locked instruction is used for rmb(), and as noted earlier,
it is not enough. Since i386 machine may not support SSE2, do simplest
possible synchronization with CPUID.
MFC after: 1 week
Discussed with: avg, bde, jkim
Intel Architecture Manual specifies that rdtsc instruction is not serialized,
so without this change, TSC synchronization test would periodically fail,
resulting in use of HPET timecounter instead of TSC-low. This caused
severe performance degradation (40-50%) when running high IO/s workloads due to
HPET MMIO reads and GEOM stat collection.
Tests on Xeon E5-2600 (Sandy Bridge) 8C systems were seeing TSC synchronization
fail approximately 20% of the time.
Sponsored by: Intel
Reviewed by: kib
MFC after: 3 days
mostly meets the guidelines set by the Intel SDM:
1. We use XRSTOR and XSAVE from the same CPL using the same linear
address for the store area
2. Contrary to the recommendations, we cannot zero the FPU save area
for a new thread, since fork semantic requires the copy of the
previous state. This advice seemingly contradicts to the advice
from the item 6.
3. We do use XSAVEOPT in the context switch code only, and the area
for XSAVEOPT already always contains the data saved by XSAVE.
4. We do not modify the save area between XRSTOR, when the area is
loaded into FPU context, and XSAVE. We always spit the fpu context
into save area and start emulation when directly writing into FPU
context.
5. We do not use segmented addressing to access save area, or rather,
always address it using %ds basing.
6. XSAVEOPT can be only executed in the area which was previously
loaded with XRSTOR, since context switch code checks for FPU use by
outgoing thread before saving, and thread which stopped emulation
forcibly get context loaded with XRSTOR.
7. The PCB cannot be paged out while FPU emulation is turned off, since
stack of the executing thread is never swapped out.
The context switch code is patched to issue XSAVEOPT instead of XSAVE
if supported. This approach eliminates one conditional in the context
switch code, which would be needed otherwise.
For user-visible machine context to have proper data, fpugetregs()
checks for unsaved extension blocks and manually copies pristine FPU
state into them, according to the description provided by CPUID leaf
0xd.
MFC after: 1 month
This is required for ARM EABI. Section 7.1.1 of the Procedure Call for the
ARM Architecture (AAPCS) defines wchar_t as either an unsigned int or an
unsigned short with the former preferred.
Because of this requirement we need to move the definition of __wchar_t to
a machine dependent header. It also cleans up the macros defining the limits
of wchar_t by defining __WCHAR_MIN and __WCHAR_MAX in the same machine
dependent header then using them to define WCHAR_MIN and WCHAR_MAX
respectively.
Discussed with: bde
usermode, using shared page. The structures and functions have vdso
prefix, to indicate the intended location of the code in some future.
The versioned per-algorithm data is exported in the format of struct
vdso_timehands, which mostly repeats the content of in-kernel struct
timehands. Usermode reading of the structure can be lockless.
Compatibility export for 32bit processes on 64bit host is also
provided. Kernel also provides usermode with indication about
currently used timecounter, so that libc can fall back to syscall if
configured timecounter is unknown to usermode code.
The shared data updates are initiated both from the tc_windup(), where
a fast task is queued to do the update, and from sysctl handlers which
change timecounter. A manual override switch
kern.timecounter.fast_gettime allows to turn off the mechanism.
Only x86 architectures export the real algorithm data, and there, only
for tsc timecounter. HPET counters page could be exported as well, but
I prefer to not further glue the kernel and libc ABI there until
proper vdso-based solution is developed.
Minimal stubs neccessary for non-x86 architectures to still compile
are provided.
Discussed with: bde
Reviewed by: jhb
Tested by: flo
MFC after: 1 month
suspend/resume procedures are minimized among them.
common:
- Add global cpuset suspended_cpus to indicate APs are suspended/resumed.
- Remove acpi_waketag and acpi_wakemap from acpivar.h (no longer used).
- Add some variables in acpi_wakecode.S in order to minimize the difference
among amd64 and i386.
- Disable load_cr3() because now CR3 is restored in resumectx().
amd64:
- Add suspend/resume related members (such as MSR) in PCB.
- Modify savectx() for above new PCB members.
- Merge acpi_switch.S into cpu_switch.S as resumectx().
i386:
- Merge(and remove) suspendctx() into savectx() in order to match with
amd64 code.
Reviewed by: attilio@, acpi@
intr_bind() on x86.
This has been requested by jhb and I strongly disagree with this,
but as long as he is the x86 and interrupt subsystem maintainer I will
follow his directives.
The disagreement cames from what we should really consider as a
public KPI. IMHO, if we really need a selection between the kernel
functions, we may need an explicit protection like _KERNEL_KPI, which
defines which subset of the kernel function might really be considered
as part of the KPI (for thirdy part modules) and which not.
As long as we don't have this mechanism I just consider any possible
function as usable by thirdy part code, thus intr_bind() included.
MFC after: 1 week
discrepancy between modules and kernel, but deal with SMP differences
within the functions themselves.
As an added bonus this also helps in terms of code readability.
Requested by: gibbs
Reviewed by: jhb, marius
MFC after: 1 week
222813, that left all un-pinned interrupts assigned to CPU 0.
sys/x86/x86/intr_machdep.c:
In intr_shuffle_irqs(), remove CPU_SETOF() call that initialized
the "intr_cpus" cpuset to only contain CPU0.
This initialization is too late and nullifies the results of calls
the intr_add_cpu() that occur much earlier in the boot process.
Since "intr_cpus" is statically initialized to the empty set, and
all processors, including the BSP, already add themselves to
"intr_cpus" no special initialization for the BSP is necessary.
MFC after: 3 days
sleeping from a swi handler (even though in this case it would be ok), so
switch the refill and scanning SWI handlers to being tasks on a fast
taskqueue. Also, only schedule the refill task for a CMCI as an MC# can
fire at any time, so it should do the minimal amount of work needed and
avoid opportunities to deadlock before it panics (such as scheduling a
task it won't ever need in practice). To handle the case of an MC# only
finding recoverable errors (which should never happen), always try to
refill the event free list when the periodic scan executes.
MFC after: 2 weeks
an uncorrected ECC error tends to fire on all CPUs in a package
simultaneously and the current printf hacks are not sufficient to make
the messages legible. Instead, use the existing mca_lock spinlock to
serialize calls to mca_log() and change the machine check code to panic
directly when an unrecoverable error is encoutered rather than falling
back to a trap_fatal() call in trap() (which adds nearly a screen-full of
logging messages that aren't useful for machine checks).
MFC after: 2 weeks
- Don't malloc() new MCA records for machine checks logged due to a
CMCI or MC# exception. Instead, use a pre-allocated pool of records.
When a CMCI or MC# exception fires, schedule a swi to refill the pool.
The pool is sized to hold at least one record per available machine
bank, and one record per CPU. This should handle the case of all CPUs
triggering a single bank at once as well as the case a single CPU
triggering all of its banks. The periodic scans still use malloc()
since they are run from a safe context.
- Since we have to create an swi to handle refills, make the periodic scan
a second swi for the same thread instead of having a separate taskqueue
thread for the scans.
Suggested by: mdf (avoiding malloc())
MFC after: 2 weeks
that revision, the bswapXX_const() macros were renamed to bswapXX_gen().
Also, bswap64_gen() was implemented as two calls to bswap32(), and
similarly, bswap32_gen() as two calls to bswap16(). This mainly helps
our base gcc to produce more efficient assembly.
However, the arguments are not properly masked, which results in the
wrong value being calculated in some instances. For example,
bswap32(0x12345678) returns 0x7c563412, and bswap64(0x123456789abcdef0)
returns 0xfcdefc9a7c563412.
Fix this by appropriately masking the arguments to bswap16() in
bswap32_gen(), and to bswap32() in bswap64_gen(). This should also
silence warnings from clang.
Submitted by: jh
revision has two problems:
- It can produce worse code with both clang and gcc.
- It doesn't fix the actual issue introduced in r232721, which will be
fixed in the next commit.
Submitted by: bde, tijl and jh
Pointy hat to: dim
bridges. Rather than blindly enabling the windows on all of them, only
enable the window when an MSI interrupt is enabled for a device behind
the bridge, similar to what already happens for HT PCI-PCI bridges.
To implement this, each x86 Host-PCI bridge driver has to be able to
locate it's actual backing device on bus 0. For ACPI, use the _ADR
method to find the slot and function of the device. For the non-ACPI
case, the legacy(4) driver already scans bus 0 looking for Host-PCI
bridge devices. Now it saves the slot and function of each bridge that
it finds as ivars that the Host-PCI bridge driver can then use in its
pcib_map_msi() method.
This fixes machines where non-MSI interrupts were broken by the previous
round of HT MSI changes.
Tested by: bapt
MFC after: 1 week
added, the call to pmap_kextract() was moved up, and as a result the
code never updated the physical address to use for DMA if a bounce
buffer was used. Restore the earlier location of pmap_kextract() so
it takes bounce buffers into account.
Tested by: kargl
MFC after: 1 week
recent changes in sys/x86/include/endian.h:
sys/dev/dcons/dcons.c:190:15: error: implicit conversion from '__uint32_t' (aka 'unsigned int') to '__uint16_t' (aka 'unsigned short') changes value from 1684238190 to 28526 [-Werror,-Wconstant-conversion]
buf->magic = ntohl(DCONS_MAGIC);
^~~~~~~~~~~~~~~~~~
sys/sys/param.h:306:18: note: expanded from:
#define ntohl(x) __ntohl(x)
^
./x86/endian.h:128:20: note: expanded from:
#define __ntohl(x) __bswap32(x)
^
./x86/endian.h:78:20: note: expanded from:
__bswap32_gen((__uint32_t)(x)) : __bswap32_var(x))
^
./x86/endian.h:68:26: note: expanded from:
(((__uint32_t)__bswap16(x) << 16) | __bswap16((x) >> 16))
^
./x86/endian.h:75:53: note: expanded from:
__bswap16_gen((__uint16_t)(x)) : __bswap16_var(x)))
~~~~~~~~~~~~~ ^
This is because the __bswapXX_gen() macros (for x86) call the regular
__bswapXX() macros. Since the __bswapXX_gen() variants are only called
when their arguments are constant, there is no need to do that constancy
check recursively. Also, it causes the above error with clang.
Fix it by calling __bswap16_gen() from __bswap32_gen(), and similarly,
__bswap32_gen() from __bswap64_gen().
While here, add extra parentheses around the __bswap16_gen() macro
expansion, to prevent unexpected side effects.
private to this file. The 'lapics' array was actually shadowing a
completely different 'lapics' array that is private to local_apic.c.
Reported by: bde
MFC after: 2 weeks
segments.h to a new x86 segments.h.
Add __packed attribute to some structs (just to be sure).
Also make it clear that i386 GDT and LDT entries are used in ia64 code.
reg.h with stubs.
The tREGISTER macros are only made visible on i386. These macros are
deprecated and should not be available on amd64.
The i386 and amd64 versions of struct reg have been renamed to struct
__reg32 and struct __reg64. During compilation either __reg32 or __reg64
is defined as reg depending on the machine architecture. On amd64 the i386
struct is also available as struct reg32 which is used in COMPAT_FREEBSD32
code.
Most of compat/ia32/ia32_reg.h is now IA64 only.
Reviewed by: kib (previous version)
Remove FPU types from compat/ia32/ia32_reg.h that are no longer needed.
Create machine/npx.h on amd64 to allow compiling i386 code that uses
this header.
The original npx.h and fpu.h define struct envxmm differently. Both
definitions have been included in the new x86 header as struct __envxmm32
and struct __envxmm64. During compilation either __envxmm32 or __envxmm64
is defined as envxmm depending on machine architecture. On amd64 the i386
struct is also available as struct envxmm32.
Reviewed by: kib
- Merge r232744 changes to pc98.
(Allow a kernel to be built with 'nodevice atpic'.)
- Move ICU related defines from x86/isa/atpic.c to x86/isa/icu.h and
use them in x86/x86/intr_machdep.c.
Reviewed by: jhb
didn't already have them. This is because the ternary expression will
return int, due to the Usual Arithmetic Conversions. Such casts are not
needed for the 32 and 64 bit variants.
While here, add additional parentheses around the x86 variant, to
protect against unintended consequences.
MFC after: 2 weeks
- Remove extern "C". There are no functions with external linkage here. [1]
- Rename bswapNN_const(x) to bswapNN_gen(x) to indicate that these macros
are generic implementations that can take non-constant arguments. [1]
- Split up __GNUCLIKE_ASM && __GNUCLIKE_BUILTIN_CONSTANT_P and deal with
each separately.
- Replace _LP64 with __amd64__ because asm instructions are machine
dependent, not ABI dependent.
Submitted by: bde [1]
Reviewed by: bde
amd64/i386/pc98 ptrace.h with stubs.
For amd64 PT_GETXSTATE and PT_SETXSTATE have been redefined to match the
i386 values. The old values are still supported but should no longer be
used.
Reviewed by: kib
long for specifying a boundary constraint.
- Change bus_dma tags to use bus_addr_t instead of bus_size_t for boundary
constraints.
These allow boundary constraints to be fully expressed for cases where
sizeof(bus_addr_t) != sizeof(bus_size_t). Specifically, it allows a
driver to properly specify a 4GB boundary in a PAE kernel.
Note that this cannot be safely MFC'd without a lot of compat shims due
to KBI changes, so I do not intend to merge it.
Reviewed by: scottl
Enforce a boundary of no more than 4GB - transfers crossing a 4GB
boundary can lead to data corruption due to PCIe limitations. This
change is a less-intrusive workaround that can be quickly merged back
to older branches; a cleaner implementation will arrive in HEAD later
but may require KPI changes.
This change is based on a suggestion by jhb@.
Reviewed by: scottl, jhb
Sponsored by: Sandvine Incorporated
MFC after: 3 days
amd64/i386/pc98 endian.h with stubs.
In __bswap64_const(x) the conflict between 0xffUL and 0xffULL has been
resolved by reimplementing the macro in terms of __bswap32(x). As a side
effect __bswap64_var(x) is now implemented using two bswap instructions on
i386 and should be much faster. __bswap32_const(x) has been reimplemented
in terms of __bswap16(x) for consistency.
APIC is not found.
- Don't panic if lapic_enable_cmc() is called and the APIC is not enabled.
This can happen due to booting a kernel with APIC disabled on a CPU that
supports CMCI.
- Wrap a long line.
- Actually increment ndomain when building our list of known domains
so that we can properly renumber them to be 0-based and dense.
- If the number of domains exceeds the configured maximum (VM_NDOMAIN),
bail out of processing the SRAT and disable NUMA rather than hitting an
obscure panic later.
- Don't bother parsing the SRAT at all if VM_NDOMAIN is set to 1 to
disable NUMA (the default).
Reported by: phk (2)
MFC after: 1 week
Where i386/bios/apm.c requires no per-descriptor state, the ACPI version
of these device do. Instead of using hackish clone lists that leave
stale device nodes lying around, use the cdevpriv API.
one. Interestingly, these are actually the default for quite some time
(bus_generic_driver_added(9) since r52045 and bus_generic_print_child(9)
since r52045) but even recently added device drivers do this unnecessarily.
Discussed with: jhb, marcel
- While at it, use DEVMETHOD_END.
Discussed with: jhb
- Also while at it, use __FBSDID.
The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.
existing phys_avail[] table. If a hw.physmem setting causes a memory
domain to not be present in phys_avail[], the SRAT table will now be
ignored rather than triggering a panic when a CPU in the missing domain
tries to allocate a page.
MFC after: 1 week
If a selinfo object is recorded (via selrecord()) and then it is
quickly destroyed, with the waiters missing the opportunity to awake,
at the next iteration they will find the selinfo object destroyed,
causing a PF#.
That happens because the selinfo interface has no way to drain the
waiters before to destroy the registered selinfo object. Also this
race is quite rare to get in practice, because it would require a
selrecord(), a poll request by another thread and a quick destruction
of the selrecord()'ed selinfo object.
Fix this by adding the seldrain() routine which should be called
before to destroy the selinfo objects (in order to avoid such case),
and fix the present cases where it might have already been called.
Sometimes, the context is safe enough to prevent this type of race,
like it happens in device drivers which installs selinfo objects on
poll callbacks. There, the destruction of the selinfo object happens
at driver detach time, when all the filedescriptors should be already
closed, thus there cannot be a race.
For this case, mfi(4) device driver can be set as an example, as it
implements a full correct logic for preventing this from happening.
Sponsored by: Sandvine Incorporated
Reported by: rstone
Tested by: pluknet
Reviewed by: jhb, kib
Approved by: re (bz)
MFC after: 3 weeks
environment with a core i5-2500K, operation in this mode causes timeouts
from the mpt driver. Switching to the ACPI-fast timer resolves this issue.
Switching the VM back to single CPU mode also works, which is why I have
not disabled the TSC in that mode.
I did not test with KVM or other VM environments, but I am being cautious
and assuming that the TSC is not reliable in SMP mode there as well.
Reviewed by: kib
Approved by: re (kib)
MFC after: Not applicable, the timecounter code is new for 9.x
resource allocation on x86 platforms:
- Add a new helper API that Host-PCI bridge drivers can use to restrict
resource allocation requests to a set of address ranges for different
resource types.
- For the ACPI Host-PCI bridge driver, use Producer address range resources
in _CRS to enumerate valid address ranges for a given Host-PCI bridge.
This can be disabled by including "hostres" in the debug.acpi.disabled
tunable.
- For the MPTable Host-PCI bridge driver, use entries in the extended
MPTable to determine the valid address ranges for a given Host-PCI
bridge. This required adding code to parse extended table entries.
Similar to the new PCI-PCI bridge driver, these changes are only enabled
if the NEW_PCIB kernel option is enabled (which is enabled by default on
amd64 and i386).
Approved by: re (kib)
processors unless the invariant TSC bit of CPUID is set. Intel processors
may stop incrementing TSC when DPSLP# pin is asserted, according to Intel
processor manuals, i. e., TSC timecounter is useless if the processor can
enter deep sleep state (C3/C4). This problem was accidentally uncovered by
r222869, which increased timecounter quality of P-state invariant TSC, e.g.,
for Core2 Duo T5870 (Family 6, Model f) and Atom N270 (Family 6, Model 1c).
Reported by: Fabian Keil (freebsd-listen at fabiankeil dot de)
Ian FREISLICH (ianf at clue dot co dot za)
Tested by: Fabian Keil (freebsd-listen at fabiankeil dot de)
- Core2 Duo T5870 (C3 state available/enabled)
jkim - Xeon X5150 (C3 state unavailable)
some times compiler inserts redundant instructions to preserve unused upper
32 bits even when it is casted to a 32-bit value. Unfortunately, it seems
the problem becomes more serious when it is shifted, especially on amd64.
- Re-add accidentally removed atomic op. for sysctl(9) handler.
- Remove a period(`.') at the end of a debugging message.
- Consistently spell "low" for "TSC-low" timecounter throughout.
Pointed out by: bde
invariant. For SMP case (TSC-low), it also has to pass SMP synchronization
test and the CPU vendor/model has to be white-listed explicitly. Currently,
all Intel CPUs and single-socket AMD Family 15h processors are listed here.
Discussed with: hackers
TSC timecounter if TSC frequency is higher than ~4.29 MHz (or 2^32-1 Hz) or
multiple CPUs are present. The "TSC-low" frequency is always lower than a
preset maximum value and derived from TSC frequency (by being halved until
it becomes lower than the maximum). Note the maximum value for SMP case is
significantly lower than UP case because we want to reduce (rare but known)
"temporal anomalies" caused by non-serialized RDTSC instruction. Normally,
it is still higher than "ACPI-fast" timecounter frequency (which was default
timecounter hardware for long time until r222222) to be useful.
when the user has indicated that the system has synchronized TSCs or it has
P-state invariant TSCs. For the former case, we may clear the tunable if it
fails the test to prevent accidental foot-shooting. For the latter case, we
may set it if it passes the test to notify the user that it may be usable.
versions instead. They were never needed as bus_generic_intr() and
bus_teardown_intr() had been changed to pass the original child device up
in 42734, but the ISA bus was not converted to new-bus until 45720.
This fixes heavy interrupt storm and resulting system freeze when using
LAPIC timer in one-shot mode under Xen HVM. There, unlike real hardware,
programming timer with zero period almost immediately causes interrupt.
cpuset_t objects.
That is going to offer the underlying support for a simple bump of
MAXCPU and then support for number of cpus > 32 (as it is today).
Right now, cpumask_t is an int, 32 bits on all our supported architecture.
cpumask_t on the other side is implemented as an array of longs, and
easilly extendible by definition.
The architectures touched by this commit are the following:
- amd64
- i386
- pc98
- arm
- ia64
- XEN
while the others are still missing.
Userland is believed to be fully converted with the changes contained
here.
Some technical notes:
- This commit may be considered an ABI nop for all the architectures
different from amd64 and ia64 (and sparc64 in the future)
- per-cpu members, which are now converted to cpuset_t, needs to be
accessed avoiding migration, because the size of cpuset_t should be
considered unknown
- size of cpuset_t objects is different from kernel and userland (this is
primirally done in order to leave some more space in userland to cope
with KBI extensions). If you need to access kernel cpuset_t from the
userland please refer to example in this patch on how to do that
correctly (kgdb may be a good source, for example).
- Support for other architectures is going to be added soon
- Only MAXCPU for amd64 is bumped now
The patch has been tested by sbruno and Nicholas Esborn on opteron
4 x 12 pack CPUs. More testing on big SMP is expected to came soon.
pluknet tested the patch with his 8-ways on both amd64 and i386.
Tested by: pluknet, sbruno, gianni, Nicholas Esborn
Reviewed by: jeff, jhb, sbruno
driver would verify that requests for child devices were confined to any
existing I/O windows, but the driver relied on the firmware to initialize
the windows and would never grow the windows for new requests. Now the
driver actively manages the I/O windows.
This is implemented by allocating a bus resource for each I/O window from
the parent PCI bus and suballocating that resource to child devices. The
suballocations are managed by creating an rman for each I/O window. The
suballocated resources are mapped by passing the bus_activate_resource()
call up to the parent PCI bus. Windows are grown when needed by using
bus_adjust_resource() to adjust the resource allocated from the parent PCI
bus. If the adjust request succeeds, the window is adjusted and the
suballocation request for the child device is retried.
When growing a window, the rman_first_free_region() and
rman_last_free_region() routines are used to determine if the front or
end of the existing I/O window is free. From using that, the smallest
ranges that need to be added to either the front or back of the window
are computed. The driver will first try to grow the window in whichever
direction requires the smallest growth first followed by the other
direction if that fails.
Subtractive bridges will first attempt to satisfy requests for child
resources from I/O windows (including attempts to grow the windows). If
that fails, the request is passed up to the parent PCI bus directly
however.
The PCI-PCI bridge driver will try to use firmware-assigned ranges for
child BARs first and only allocate a "fresh" range if that specific range
cannot be accommodated in the I/O window. This allows systems where the
firmware assigns resources during boot but later wipes the I/O windows
(some ACPI BIOSen are known to do this) to "rediscover" the original I/O
window ranges.
The ACPI Host-PCI bridge driver has been adjusted to correctly honor
hw.acpi.host_mem_start and the I/O port equivalent when a PCI-PCI bridge
makes a wildcard request for an I/O window range.
The new PCI-PCI bridge driver is only enabled if the NEW_PCIB kernel option
is enabled. This is a transition aide to allow platforms that do not
yet support bus_activate_resource() and bus_adjust_resource() in their
Host-PCI bridge drivers (and possibly other drivers as needed) to use the
old driver for now. Once all platforms support the new driver, the
kernel option and old driver will be removed.
PR: kern/143874 kern/149306
Tested by: mav
constraints on the rman and reject attempts to manage a region that is out
of range.
- Fix various places that set rm_end incorrectly (to ~0 or ~0u instead of
~0ul).
- To preserve existing behavior, change rman_init() to set rm_start and
rm_end to allow managing the full range (0 to ~0ul) if they are not set by
the caller when rman_init() is called.
VMware products virtualize TSC and it run at fixed frequency in so-called
"apparent time". Although virtualized i8254 also runs in apparent time, TSC
calibration always gives slightly off frequency because of the complicated
timer emulation and lost-tick correction mechanism.
disk dumping.
With the option SW_WATCHDOG on, these operations are doomed to let
watchdog fire, fi they take too long.
I implemented the stubs this way because I really want wdog_kern_*
KPI to not be dependant by SW_WATCHDOG being on (and really, the option
only enables watchdog activation in hardclock) and also avoid to
call them when not necessary (avoiding not-volountary watchdog
activations).
Sponsored by: Sandvine Incorporated
Discussed with: emaste, des
MFC after: 2 weeks
invariant and APERF/MPERF MSRs exist but these MSRs never tick. When we
calculate effective frequency from cpu_est_clockrate(), it caused panic of
division-by-zero. Now we test whether these MSRs actually increase to avoid
such foot-shooting.
Reported by: dim
Tested by: dim
safer for i386 because it can be easily over 4 GHz now. More worse, it can
be easily changed by user with 'machdep.tsc_freq' tunable (directly) or
cpufreq(4) (indirectly). Note it is intentionally not used in performance
critical paths to avoid performance regression (but we should, in theory).
Alternatively, we may add "virtual TSC" with lower frequency if maximum
frequency overflows 32 bits (and ignore possible incoherency as we do now).
started to execute, it seems that the corresponding ISR bit in the "old"
local APIC can be cleared. This causes the local APIC interrupt routine
to fail to find an interrupt to service. Rather than panic'ing in this
case, simply return from the interrupt without sending an EOI to the
local APIC. If there are any other pending interrupts in other ISR
registers, the local APIC will assert a new interrupt.
Tested by: steve
bus_size_t may be 32 or 64 bits. Change the bounce_zone alignment field
to explicitly be 32 bits, as I can't really imagine a DMA device that
needs anything close to 2GB alignment of data.
interrupt in the I/O APIC before moving it to a different CPU. If the
interrupt had been triggered by the I/O APIC after locking icu_lock but
before we masked the pin in the I/O APIC, then this could cause the
interrupt to be pending on the "old" CPU and it would finally trigger
after we had moved the interrupt to the new CPU. This could cause us to
panic as there was no interrupt source associated with the old IDT vector
on the old CPU. Dropping the lock after the interrupt is masked but
before it is moved allows the interrupt to fire and be handled in this
case before it is moved.
Tested by: Daniel Braniss danny of cs huji ac il
MFC after: 1 week
the original amd64 and i386 headers with stubs.
Rename (AMD64|I386)_BUS_SPACE_* to X86_BUS_SPACE_* everywhere.
Reviewed by: imp (previous version), jhb
Approved by: kib (mentor)
- Avoid side-effect assignments in if statements when possible.
- Don't use ! to check for NULL pointers, explicitly check against NULL.
- Explicitly check error return values against 0.
- Don't use INTR_MPSAFE for interrupt handlers with only filters as it is
meaningless.
- Remove unneeded function casts.
Also, add a comment mentioning _PSD - on some systems it's enough to
put one logical CPU into a particular P-state to make other CPUs in
the same domain to enter that P-state.
Also, call sched_unbind() after the loop - sched_bind() automatically
rebinds from previous CPU to a new one, and the new arrangement of code
is safer against early loop exit.
Plus one minor style nit.
MFC after: 10 days