Commit Graph

112 Commits

Author SHA1 Message Date
Andriy Gapon
ff08349df5 ioapic_program_intpin: program high bits before low bits
Programming the low bits has a side-effect if unmasking the pin if it is
not disabled.  So if an interrupt was pending then it would be delivered
with the correct new vector but to the incorrect old LAPIC.

This fix could be made clearer by preserving the mask bit while
programming the low bits and then explicitly resetting the mask bit
after all the programming is done.

Probability to trip over the fixed bug could be increased by bootverbose
because printing of the interrupt information in ioapic_assign_cpu
lengthened the time window during which an interrupt could arrive while
a pin is masked.

Reported by:	Andreas Longwitz <longwitz@incore.de>
Tested by:	Andreas Longwitz <longwitz@incore.de>
MFC after:	12 days
2012-12-01 18:16:14 +00:00
Attilio Rao
af2bdacafb Reverts r234074,234105,234564,234723,234989,235231-235232 and part of
r234247.
Use, instead, the static intializer introduced in r239923 for x86 and
sparc64 intr_cpus, unwinding the code to the initial version.

Reviewed by:	marius
2012-10-09 12:22:43 +00:00
John Baldwin
2f36da87cb Allow static DMA allocations that allow for enough segments to do page-sized
segments for the entire allocation to use kmem_alloc_attr() to allocate
KVM rather than using kmem_alloc_contig().  This avoids requiring
a single physically contiguous chunk in this case.

Submitted by:	Peter Jeremy (original version)
MFC after:	1 month
2012-08-17 14:14:25 +00:00
Jim Harris
7bfcb3bb9b During TSC synchronization test, use rdtsc() rather than rdtsc32(), to
protect against 32-bit TSC overflow while the sync test is running.

On dual-socket Xeon E5-2600 (SNB) systems with up to 32 threads, there
is non-trivial chance (2-3%) that TSC synchronization test fails due to
32-bit TSC overflow while the synchronization test is running.

Sponsored by: Intel
Reviewed by: jkim
Discussed with: jkim, kib
2012-08-07 23:16:11 +00:00
John Baldwin
0046805a58 Correct function name in comment.
Submitted by:	alc
2012-08-03 18:40:44 +00:00
Alexander Motin
b19ee1c6ef Microoptimize LAPIC timer routines to avoid reading from hardware during
programming using earlier cached values. This makes respective routines to
disappear from PMC top and reduces total number of active CPU cycles on idle
24-core system by 10%.
2012-08-03 15:19:59 +00:00
John Baldwin
2db99100a4 Improve the handling of static DMA buffers that use non-default memory
attributes (currently just BUS_DMA_NOCACHE):
- Don't call pmap_change_attr() on the returned address, instead use
  kmem_alloc_contig() to ask the VM system for memory with the requested
  attribute.
- As a result, always use kmem_alloc_contig() for non-default memory
  attributes, even for sub-page allocations.  This requires adjusting
  bus_dmamem_free()'s logic for determining which free routine to use.
- For x86, add a new dummy bus_dmamap that is used for static DMA
  buffers allocated via kmem_alloc_contig().  bus_dmamem_free() can then
  use the map pointer to determine which free routine to use.
- For powerpc, add a new flag to the allocated map (bus_dmamem_alloc()
  always creates a real map on powerpc) to indicate which free routine
  should be used.

Note that the BUS_DMA_NOCACHE handling in powerpc is currently #ifdef'd out.
I have left it disabled but updated it to match x86.

Reviewed by:	scottl
MFC after:	1 month
2012-08-03 13:50:29 +00:00
Konstantin Belousov
e1a18e46e1 Do a trivial reformatting of the comment, to record the proper commit
message for r238973:

Rdtsc instruction is not synchronized, it seems on some Intel cores it
can bypass even the locked instructions.  As a result, rdtsc executed
on different cores may return unordered TSC values even when the rdtsc
appearance in the instruction sequences is provably ordered.

Similarly to what has been done in r238755 for TSC synchronization
test, add explicit fences right before rdtsc in the timecounters 'get'
functions.  Intel recommends to use LFENCE, while AMD refers to
MFENCE. For VIA follow what Linux does and use LFENCE.  With this
change, I see no reordered reads of TSC on Nehalem.

Change the rmb() to inlined CPUID in the SMP TSC synchronization test.
On i386, locked instruction is used for rmb(), and as noted earlier,
it is not enough. Since i386 machine may not support SSE2, do simplest
possible synchronization with CPUID.

MFC after:	  1 week
Discussed with:	  avg, bde, jkim
2012-08-01 17:34:43 +00:00
Konstantin Belousov
814124c33e diff --git a/sys/x86/x86/tsc.c b/sys/x86/x86/tsc.c
index c253a96..3d8bd30 100644
--- a/sys/x86/x86/tsc.c
+++ b/sys/x86/x86/tsc.c
@@ -82,7 +82,11 @@ static void tsc_freq_changed(void *arg, const struct cf_level *level,
 static void tsc_freq_changing(void *arg, const struct cf_level *level,
     int *status);
 static unsigned tsc_get_timecount(struct timecounter *tc);
-static unsigned tsc_get_timecount_low(struct timecounter *tc);
+static inline unsigned tsc_get_timecount_low(struct timecounter *tc);
+static unsigned tsc_get_timecount_lfence(struct timecounter *tc);
+static unsigned tsc_get_timecount_low_lfence(struct timecounter *tc);
+static unsigned tsc_get_timecount_mfence(struct timecounter *tc);
+static unsigned tsc_get_timecount_low_mfence(struct timecounter *tc);
 static void tsc_levels_changed(void *arg, int unit);

 static struct timecounter tsc_timecounter = {
@@ -262,6 +266,10 @@ probe_tsc_freq(void)
 		    (vm_guest == VM_GUEST_NO &&
 		    CPUID_TO_FAMILY(cpu_id) >= 0x10))
 			tsc_is_invariant = 1;
+		if (cpu_feature & CPUID_SSE2) {
+			tsc_timecounter.tc_get_timecount =
+			    tsc_get_timecount_mfence;
+		}
 		break;
 	case CPU_VENDOR_INTEL:
 		if ((amd_pminfo & AMDPM_TSC_INVARIANT) != 0 ||
@@ -271,6 +279,10 @@ probe_tsc_freq(void)
 		    (CPUID_TO_FAMILY(cpu_id) == 0xf &&
 		    CPUID_TO_MODEL(cpu_id) >= 0x3))))
 			tsc_is_invariant = 1;
+		if (cpu_feature & CPUID_SSE2) {
+			tsc_timecounter.tc_get_timecount =
+			    tsc_get_timecount_lfence;
+		}
 		break;
 	case CPU_VENDOR_CENTAUR:
 		if (vm_guest == VM_GUEST_NO &&
@@ -278,6 +290,10 @@ probe_tsc_freq(void)
 		    CPUID_TO_MODEL(cpu_id) >= 0xf &&
 		    (rdmsr(0x1203) & 0x100000000ULL) == 0)
 			tsc_is_invariant = 1;
+		if (cpu_feature & CPUID_SSE2) {
+			tsc_timecounter.tc_get_timecount =
+			    tsc_get_timecount_lfence;
+		}
 		break;
 	}

@@ -328,16 +344,31 @@ init_TSC(void)

 #ifdef SMP

-/* rmb is required here because rdtsc is not a serializing instruction. */
-#define	TSC_READ(x)			\
-static void				\
-tsc_read_##x(void *arg)			\
-{					\
-	uint32_t *tsc = arg;		\
-	u_int cpu = PCPU_GET(cpuid);	\
-					\
-	rmb();				\
-	tsc[cpu * 3 + x] = rdtsc32();	\
+/*
+ * RDTSC is not a serializing instruction, and does not drain
+ * instruction stream, so we need to drain the stream before executing
+ * it.  It could be fixed by use of RDTSCP, except the instruction is
+ * not available everywhere.
+ *
+ * Use CPUID for draining in the boot-time SMP constistency test.  The
+ * timecounters use MFENCE for AMD CPUs, and LFENCE for others (Intel
+ * and VIA) when SSE2 is present, and nothing on older machines which
+ * also do not issue RDTSC prematurely.  There, testing for SSE2 and
+ * vendor is too cumbersome, and we learn about TSC presence from
+ * CPUID.
+ *
+ * Do not use do_cpuid(), since we do not need CPUID results, which
+ * have to be written into memory with do_cpuid().
+ */
+#define	TSC_READ(x)							\
+static void								\
+tsc_read_##x(void *arg)							\
+{									\
+	uint32_t *tsc = arg;						\
+	u_int cpu = PCPU_GET(cpuid);					\
+									\
+	__asm __volatile("cpuid" : : : "eax", "ebx", "ecx", "edx");	\
+	tsc[cpu * 3 + x] = rdtsc32();					\
 }
 TSC_READ(0)
 TSC_READ(1)
@@ -487,7 +518,16 @@ init:
 	for (shift = 0; shift < 31 && (tsc_freq >> shift) > max_freq; shift++)
 		;
 	if (shift > 0) {
-		tsc_timecounter.tc_get_timecount = tsc_get_timecount_low;
+		if (cpu_feature & CPUID_SSE2) {
+			if (cpu_vendor_id == CPU_VENDOR_AMD) {
+				tsc_timecounter.tc_get_timecount =
+				    tsc_get_timecount_low_mfence;
+			} else {
+				tsc_timecounter.tc_get_timecount =
+				    tsc_get_timecount_low_lfence;
+			}
+		} else
+			tsc_timecounter.tc_get_timecount = tsc_get_timecount_low;
 		tsc_timecounter.tc_name = "TSC-low";
 		if (bootverbose)
 			printf("TSC timecounter discards lower %d bit(s)\n",
@@ -599,16 +639,48 @@ tsc_get_timecount(struct timecounter *tc __unused)
 	return (rdtsc32());
 }

-static u_int
+static inline u_int
 tsc_get_timecount_low(struct timecounter *tc)
 {
 	uint32_t rv;

 	__asm __volatile("rdtsc; shrd %%cl, %%edx, %0"
-	: "=a" (rv) : "c" ((int)(intptr_t)tc->tc_priv) : "edx");
+	    : "=a" (rv) : "c" ((int)(intptr_t)tc->tc_priv) : "edx");
 	return (rv);
 }

+static u_int
+tsc_get_timecount_lfence(struct timecounter *tc __unused)
+{
+
+	lfence();
+	return (rdtsc32());
+}
+
+static u_int
+tsc_get_timecount_low_lfence(struct timecounter *tc)
+{
+
+	lfence();
+	return (tsc_get_timecount_low(tc));
+}
+
+static u_int
+tsc_get_timecount_mfence(struct timecounter *tc __unused)
+{
+
+	mfence();
+	return (rdtsc32());
+}
+
+static u_int
+tsc_get_timecount_low_mfence(struct timecounter *tc)
+{
+
+	mfence();
+	return (tsc_get_timecount_low(tc));
+}
+
 uint32_t
 cpu_fill_vdso_timehands(struct vdso_timehands *vdso_th)
 {
2012-08-01 17:26:22 +00:00
Jim Harris
3f6e7b9b11 Add rmb() to tsc_read_##x to enforce serialization of rdtsc captures.
Intel Architecture Manual specifies that rdtsc instruction is not serialized,
so without this change, TSC synchronization test would periodically fail,
resulting in use of HPET timecounter instead of TSC-low.  This caused
severe performance degradation (40-50%) when running high IO/s workloads due to
HPET MMIO reads and GEOM stat collection.

Tests on Xeon E5-2600 (Sandy Bridge) 8C systems were seeing TSC synchronization
fail approximately 20% of the time.

Sponsored by: Intel
Reviewed by: kib
MFC after: 3 days
2012-07-24 22:10:11 +00:00
Konstantin Belousov
aea810386d Implement mechanism to export some kernel timekeeping data to
usermode, using shared page.  The structures and functions have vdso
prefix, to indicate the intended location of the code in some future.

The versioned per-algorithm data is exported in the format of struct
vdso_timehands, which mostly repeats the content of in-kernel struct
timehands. Usermode reading of the structure can be lockless.
Compatibility export for 32bit processes on 64bit host is also
provided. Kernel also provides usermode with indication about
currently used timecounter, so that libc can fall back to syscall if
configured timecounter is unknown to usermode code.

The shared data updates are initiated both from the tc_windup(), where
a fast task is queued to do the update, and from sysctl handlers which
change timecounter. A manual override switch
kern.timecounter.fast_gettime allows to turn off the mechanism.

Only x86 architectures export the real algorithm data, and there, only
for tsc timecounter. HPET counters page could be exported as well, but
I prefer to not further glue the kernel and libc ABI there until
proper vdso-based solution is developed.

Minimal stubs neccessary for non-x86 architectures to still compile
are provided.

Discussed with:	bde
Reviewed by:	jhb
Tested by:	flo
MFC after:	1 month
2012-06-22 07:06:40 +00:00
Andriy Gapon
7adc598a15 free wdog_kern_pat calls in post-panic paths from under SW_WATCHDOG
Those calls are useful with hardware watchdog drivers too.

MFC after:	3 weeks
2012-06-03 08:01:12 +00:00
Attilio Rao
b8be27bf29 Revert part of r234723 by re-enabling the SMP protection for
intr_bind() on x86.
This has been requested by jhb and I strongly disagree with this,
but as long as he is the x86 and interrupt subsystem maintainer I will
follow his directives.

The disagreement cames from what we should really consider as a
public KPI. IMHO, if we really need a selection between the kernel
functions, we may need an explicit protection like _KERNEL_KPI, which
defines which subset of the kernel function might really be considered
as part of the KPI (for thirdy part modules) and which not.
As long as we don't have this mechanism I just consider any possible
function as usable by thirdy part code, thus intr_bind() included.

MFC after:	1 week
2012-05-03 21:44:01 +00:00
Attilio Rao
70dbd1604c Clean up the intr* MD KPI from the SMP dependency, removing a cause of
discrepancy between modules and kernel, but deal with SMP differences
within the functions themselves.

As an added bonus this also helps in terms of code readability.

Requested by:	gibbs
Reviewed by:	jhb, marius
MFC after:	1 week
2012-04-26 20:24:25 +00:00
Justin T. Gibbs
47c77b2265 Fix interrupt load balancing regression, introduced in revision
222813, that left all un-pinned interrupts assigned to CPU 0.

sys/x86/x86/intr_machdep.c:
	In intr_shuffle_irqs(), remove CPU_SETOF() call that initialized
	the "intr_cpus" cpuset to only contain CPU0.

	This initialization is too late and nullifies the results of calls
	the intr_add_cpu() that occur much earlier in the boot process.
	Since "intr_cpus" is statically initialized to the empty set, and
	all processors, including the BSP, already add themselves to
	"intr_cpus" no special initialization for the BSP is necessary.

MFC after:	3 days
2012-04-06 21:19:28 +00:00
John Baldwin
b867b16dc9 Further tweak the changes made in r233709. The kernel doesn't permit
sleeping from a swi handler (even though in this case it would be ok), so
switch the refill and scanning SWI handlers to being tasks on a fast
taskqueue.  Also, only schedule the refill task for a CMCI as an MC# can
fire at any time, so it should do the minimal amount of work needed and
avoid opportunities to deadlock before it panics (such as scheduling a
task it won't ever need in practice).  To handle the case of an MC# only
finding recoverable errors (which should never happen), always try to
refill the event free list when the periodic scan executes.

MFC after:	2 weeks
2012-04-02 17:26:21 +00:00
John Baldwin
f2e3bfc074 Make machine check exception logging more readable. On newer Intel systems,
an uncorrected ECC error tends to fire on all CPUs in a package
simultaneously and the current printf hacks are not sufficient to make
the messages legible.  Instead, use the existing mca_lock spinlock to
serialize calls to mca_log() and change the machine check code to panic
directly when an unrecoverable error is encoutered rather than falling
back to a trap_fatal() call in trap() (which adds nearly a screen-full of
logging messages that aren't useful for machine checks).

MFC after:	2 weeks
2012-04-02 15:07:22 +00:00
John Baldwin
8b9e9831bf Attempt to make machine check handling a bit more robust:
- Don't malloc() new MCA records for machine checks logged due to a
  CMCI or MC# exception.  Instead, use a pre-allocated pool of records.
  When a CMCI or MC# exception fires, schedule a swi to refill the pool.
  The pool is sized to hold at least one record per available machine
  bank, and one record per CPU. This should handle the case of all CPUs
  triggering a single bank at once as well as the case a single CPU
  triggering all of its banks.  The periodic scans still use malloc()
  since they are run from a safe context.
- Since we have to create an swi to handle refills, make the periodic scan
  a second swi for the same thread instead of having a separate taskqueue
  thread for the scans.

Suggested by:	mdf (avoiding malloc())
MFC after:	2 weeks
2012-03-30 20:17:39 +00:00
John Baldwin
435803f3c7 Move the legacy(4) driver to x86. 2012-03-30 19:10:14 +00:00
John Baldwin
0d95597ca9 Use a more proper fix for enabling HT MSI mapping windows on Host-PCI
bridges.  Rather than blindly enabling the windows on all of them, only
enable the window when an MSI interrupt is enabled for a device behind
the bridge, similar to what already happens for HT PCI-PCI bridges.

To implement this, each x86 Host-PCI bridge driver has to be able to
locate it's actual backing device on bus 0.  For ACPI, use the _ADR
method to find the slot and function of the device.  For the non-ACPI
case, the legacy(4) driver already scans bus 0 looking for Host-PCI
bridge devices.  Now it saves the slot and function of each bridge that
it finds as ivars that the Host-PCI bridge driver can then use in its
pcib_map_msi() method.

This fixes machines where non-MSI interrupts were broken by the previous
round of HT MSI changes.

Tested by:	bapt
MFC after:	1 week
2012-03-29 19:03:22 +00:00
John Baldwin
3b22825af7 Revert the PCIe 4GB boundary issue workaround now that the proper fix is
in HEAD.

Ok'd by:	scottl
2012-03-16 16:12:10 +00:00
Yoshihiro Takahashi
dff207f860 - Fix to build a native i386 kernel without the SMP and atpic.
- Merge r232744 changes to pc98.
  (Allow a kernel to be built with 'nodevice atpic'.)
- Move ICU related defines from x86/isa/atpic.c to x86/isa/icu.h and
  use them in x86/x86/intr_machdep.c.

Reviewed by:	jhb
2012-03-16 12:13:44 +00:00
John Baldwin
646af7c6af Move i386's intr_machdep.c to the x86 tree and share it with amd64. 2012-03-09 20:43:29 +00:00
John Baldwin
831ce4cb3d - Change contigmalloc() to use the vm_paddr_t type instead of an unsigned
long for specifying a boundary constraint.
- Change bus_dma tags to use bus_addr_t instead of bus_size_t for boundary
  constraints.

These allow boundary constraints to be fully expressed for cases where
sizeof(bus_addr_t) != sizeof(bus_size_t).  Specifically, it allows a
driver to properly specify a 4GB boundary in a PAE kernel.

Note that this cannot be safely MFC'd without a lot of compat shims due
to KBI changes, so I do not intend to merge it.

Reviewed by:	scottl
2012-03-01 19:58:34 +00:00
Ed Maste
3f8e262e8c Workaround for PCIe 4GB boundary issue
Enforce a boundary of no more than 4GB - transfers crossing a 4GB
boundary can lead to data corruption due to PCIe limitations.  This
change is a less-intrusive workaround that can be quickly merged back
to older branches; a cleaner implementation will arrive in HEAD later
but may require KPI changes.

This change is based on a suggestion by jhb@.

Reviewed by:    scottl, jhb
Sponsored by:   Sandvine Incorporated
MFC after:      3 days
2012-02-28 19:42:40 +00:00
John Baldwin
8fef42c511 - Panic up front if a kernel does not include 'device atpic' and an
APIC is not found.
- Don't panic if lapic_enable_cmc() is called and the APIC is not enabled.
  This can happen due to booting a kernel with APIC disabled on a CPU that
  supports CMCI.
- Wrap a long line.
2012-02-27 17:33:16 +00:00
Marius Strobl
4b7ec27007 - There's no need to overwrite the default device method with the default
one. Interestingly, these are actually the default for quite some time
  (bus_generic_driver_added(9) since r52045 and bus_generic_print_child(9)
  since r52045) but even recently added device drivers do this unnecessarily.
  Discussed with: jhb, marcel
- While at it, use DEVMETHOD_END.
  Discussed with: jhb
- Also while at it, use __FBSDID.
2011-11-22 21:28:20 +00:00
Ed Schouten
6472ac3d8a Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs.
The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.
2011-11-07 15:43:11 +00:00
Mike Silbersack
5cf8ac1bc2 Disable TSC usage inside SMP VM environments. On my VMware ESXi 4.1
environment with a core i5-2500K, operation in this mode causes timeouts
from the mpt driver.  Switching to the ACPI-fast timer resolves this issue.
Switching the VM back to single CPU mode also works, which is why I have
not disabled the TSC in that mode.

I did not test with KVM or other VM environments, but I am being cautious
and assuming that the TSC is not reliable in SMP mode there as well.

Reviewed by:	kib
Approved by:	re (kib)
MFC after:	Not applicable, the timecounter code is new for 9.x
2011-08-22 03:10:29 +00:00
John Baldwin
869e878c19 Fix build when NEW_PCIB is not defined.
Submitted by:	gcooper (partially)
Pointy hat to:	jhb
2011-07-16 14:05:34 +00:00
John Baldwin
34ff71eecd Respect the BIOS/firmware's notion of acceptable address ranges for PCI
resource allocation on x86 platforms:
- Add a new helper API that Host-PCI bridge drivers can use to restrict
  resource allocation requests to a set of address ranges for different
  resource types.
- For the ACPI Host-PCI bridge driver, use Producer address range resources
  in _CRS to enumerate valid address ranges for a given Host-PCI bridge.
  This can be disabled by including "hostres" in the debug.acpi.disabled
  tunable.
- For the MPTable Host-PCI bridge driver, use entries in the extended
  MPTable to determine the valid address ranges for a given Host-PCI
  bridge.  This required adding code to parse extended table entries.

Similar to the new PCI-PCI bridge driver, these changes are only enabled
if the NEW_PCIB kernel option is enabled (which is enabled by default on
amd64 and i386).

Approved by:	re (kib)
2011-07-15 21:08:58 +00:00
Jung-uk Kim
08e1b4f4a9 If TSC stops ticking in C3, disable deep sleep when the user forcefully
select TSC as timecounter hardware.

Tested by:	Fabian Keil (freebsd-listen at fabiankeil dot de)
2011-07-14 21:00:26 +00:00
Jung-uk Kim
a49399a903 Set negative quality to TSC timecounter when C3 state is enabled for Intel
processors unless the invariant TSC bit of CPUID is set.  Intel processors
may stop incrementing TSC when DPSLP# pin is asserted, according to Intel
processor manuals, i. e., TSC timecounter is useless if the processor can
enter deep sleep state (C3/C4).  This problem was accidentally uncovered by
r222869, which increased timecounter quality of P-state invariant TSC, e.g.,
for Core2 Duo T5870 (Family 6, Model f) and Atom N270 (Family 6, Model 1c).

Reported by:	Fabian Keil (freebsd-listen at fabiankeil dot de)
		Ian FREISLICH (ianf at clue dot co dot za)
Tested by:	Fabian Keil (freebsd-listen at fabiankeil dot de)
		- Core2 Duo T5870 (C3 state available/enabled)
		jkim - Xeon X5150 (C3 state unavailable)
2011-06-22 16:40:45 +00:00
Jung-uk Kim
5df88f46bb Teach the compiler how to shift TSC value efficiently. As noted in r220631,
some times compiler inserts redundant instructions to preserve unused upper
32 bits even when it is casted to a 32-bit value.  Unfortunately, it seems
the problem becomes more serious when it is shifted, especially on amd64.
2011-06-17 21:41:06 +00:00
Jung-uk Kim
bc8e4ad2ef Tidy up r222866.
- Re-add accidentally removed atomic op. for sysctl(9) handler.
- Remove a period(`.') at the end of a debugging message.
- Consistently spell "low" for "TSC-low" timecounter throughout.

Pointed out by:	bde
2011-06-08 23:44:59 +00:00
Jung-uk Kim
26e6537a73 Increase quality of TSC (or TSC-low) timecounter to 1000 if it is P-state
invariant.  For SMP case (TSC-low), it also has to pass SMP synchronization
test and the CPU vendor/model has to be white-listed explicitly.  Currently,
all Intel CPUs and single-socket AMD Family 15h processors are listed here.

Discussed with:	hackers
2011-06-08 20:08:06 +00:00
Jung-uk Kim
95f2f0985b Introduce low-resolution TSC timecounter "TSC-low". It replaces the normal
TSC timecounter if TSC frequency is higher than ~4.29 MHz (or 2^32-1 Hz) or
multiple CPUs are present.  The "TSC-low" frequency is always lower than a
preset maximum value and derived from TSC frequency (by being halved until
it becomes lower than the maximum).  Note the maximum value for SMP case is
significantly lower than UP case because we want to reduce (rare but known)
"temporal anomalies" caused by non-serialized RDTSC instruction.  Normally,
it is still higher than "ACPI-fast" timecounter frequency (which was default
timecounter hardware for long time until r222222) to be useful.
2011-06-08 19:38:31 +00:00
Jung-uk Kim
75aa1914d5 Remove a redundant assignment since r221703. 2011-06-08 18:52:42 +00:00
Attilio Rao
bd55ede060 MFC 2011-05-09 18:53:13 +00:00
Jung-uk Kim
65e7d70b09 Implement boot-time TSC synchronization test for SMP. This test is executed
when the user has indicated that the system has synchronized TSCs or it has
P-state invariant TSCs.  For the former case, we may clear the tunable if it
fails the test to prevent accidental foot-shooting.  For the latter case, we
may set it if it passes the test to notify the user that it may be usable.
2011-05-09 17:34:00 +00:00
Attilio Rao
aa8b9e0706 MFC 2011-05-06 22:45:33 +00:00
Alexander Motin
00aa5aab1e Some changes around LAPIC timer programming.
This fixes heavy interrupt storm and resulting system freeze when using
LAPIC timer in one-shot mode under Xen HVM. There, unlike real hardware,
programming timer with zero period almost immediately causes interrupt.
2011-05-05 18:56:48 +00:00
Attilio Rao
71a19bdc64 Commit the support for removing cpumask_t and replacing it directly with
cpuset_t objects.
That is going to offer the underlying support for a simple bump of
MAXCPU and then support for number of cpus > 32 (as it is today).

Right now, cpumask_t is an int, 32 bits on all our supported architecture.
cpumask_t on the other side is implemented as an array of longs, and
easilly extendible by definition.

The architectures touched by this commit are the following:
- amd64
- i386
- pc98
- arm
- ia64
- XEN

while the others are still missing.
Userland is believed to be fully converted with the changes contained
here.

Some technical notes:
- This commit may be considered an ABI nop for all the architectures
  different from amd64 and ia64 (and sparc64 in the future)
- per-cpu members, which are now converted to cpuset_t, needs to be
  accessed avoiding migration, because the size of cpuset_t should be
  considered unknown
- size of cpuset_t objects is different from kernel and userland (this is
  primirally done in order to leave some more space in userland to cope
  with KBI extensions). If you need to access kernel cpuset_t from the
  userland please refer to example in this patch on how to do that
  correctly (kgdb may be a good source, for example).
- Support for other architectures is going to be added soon
- Only MAXCPU for amd64 is bumped now

The patch has been tested by sbruno and Nicholas Esborn on opteron
4 x 12 pack CPUs. More testing on big SMP is expected to came soon.
pluknet tested the patch with his 8-ways on both amd64 and i386.

Tested by:	pluknet, sbruno, gianni, Nicholas Esborn
Reviewed by:	jeff, jhb, sbruno
2011-05-05 14:39:14 +00:00
John Baldwin
83c41143ca Reimplement how PCI-PCI bridges manage their I/O windows. Previously the
driver would verify that requests for child devices were confined to any
existing I/O windows, but the driver relied on the firmware to initialize
the windows and would never grow the windows for new requests.  Now the
driver actively manages the I/O windows.

This is implemented by allocating a bus resource for each I/O window from
the parent PCI bus and suballocating that resource to child devices.  The
suballocations are managed by creating an rman for each I/O window.  The
suballocated resources are mapped by passing the bus_activate_resource()
call up to the parent PCI bus.  Windows are grown when needed by using
bus_adjust_resource() to adjust the resource allocated from the parent PCI
bus.  If the adjust request succeeds, the window is adjusted and the
suballocation request for the child device is retried.

When growing a window, the rman_first_free_region() and
rman_last_free_region() routines are used to determine if the front or
end of the existing I/O window is free.  From using that, the smallest
ranges that need to be added to either the front or back of the window
are computed.  The driver will first try to grow the window in whichever
direction requires the smallest growth first followed by the other
direction if that fails.

Subtractive bridges will first attempt to satisfy requests for child
resources from I/O windows (including attempts to grow the windows).  If
that fails, the request is passed up to the parent PCI bus directly
however.

The PCI-PCI bridge driver will try to use firmware-assigned ranges for
child BARs first and only allocate a "fresh" range if that specific range
cannot be accommodated in the I/O window.  This allows systems where the
firmware assigns resources during boot but later wipes the I/O windows
(some ACPI BIOSen are known to do this) to "rediscover" the original I/O
window ranges.

The ACPI Host-PCI bridge driver has been adjusted to correctly honor
hw.acpi.host_mem_start and the I/O port equivalent when a PCI-PCI bridge
makes a wildcard request for an I/O window range.

The new PCI-PCI bridge driver is only enabled if the NEW_PCIB kernel option
is enabled.  This is a transition aide to allow platforms that do not
yet support bus_activate_resource() and bus_adjust_resource() in their
Host-PCI bridge drivers (and possibly other drivers as needed) to use the
old driver for now.  Once all platforms support the new driver, the
kernel option and old driver will be removed.

PR:		kern/143874 kern/149306
Tested by:	mav
2011-05-03 17:37:24 +00:00
Jung-uk Kim
a990fbf972 Fix build with clang. Please note there is an LLVM/Clang PR:
http://llvm.org/bugs/show_bug.cgi?id=9379

Reported by:	rpaulo, dim
2011-05-02 17:08:36 +00:00
John Baldwin
d2c9344ff9 Add implementations of BUS_ADJUST_RESOURCE() to the PCI bus driver,
generic PCI-PCI bridge driver, x86 nexus driver, and x86 Host to PCI bridge
drivers.
2011-05-02 14:13:12 +00:00
John Baldwin
b67d11bbcc Change rman_manage_region() to actually honor the rm_start and rm_end
constraints on the rman and reject attempts to manage a region that is out
of range.
- Fix various places that set rm_end incorrectly (to ~0 or ~0u instead of
  ~0ul).
- To preserve existing behavior, change rman_init() to set rm_start and
  rm_end to allow managing the full range (0 to ~0ul) if they are not set by
  the caller when rman_init() is called.
2011-04-29 18:41:21 +00:00
Jung-uk Kim
5da5812ba7 Detect VMware guest and set the TSC frequency as reported by the hypervisor.
VMware products virtualize TSC and it run at fixed frequency in so-called
"apparent time".  Although virtualized i8254 also runs in apparent time, TSC
calibration always gives slightly off frequency because of the complicated
timer emulation and lost-tick correction mechanism.
2011-04-29 18:20:12 +00:00
Jung-uk Kim
5ac44f727f Turn off periodic recalibration of CPU ticker frequency if it is invariant. 2011-04-28 17:56:02 +00:00
Attilio Rao
2be767e069 Add the watchdogs patting during the (shutdown time) disk syncing and
disk dumping.
With the option SW_WATCHDOG on, these operations are doomed to let
watchdog fire, fi they take too long.

I implemented the stubs this way because I really want wdog_kern_*
KPI to not be dependant by SW_WATCHDOG being on (and really, the option
only enables watchdog activation in hardclock) and also avoid to
call them when not necessary (avoiding not-volountary watchdog
activations).

Sponsored by:	Sandvine Incorporated
Discussed with:	emaste, des
MFC after:	2 weeks
2011-04-28 16:02:05 +00:00