Commit Graph

235 Commits

Author SHA1 Message Date
Konstantin Belousov
aea810386d Implement a mechanism to export some kernel timekeeping data to
usermode, using a shared page.  The structures and functions have the vdso
prefix, to indicate the intended future location of the code.

The versioned per-algorithm data is exported in the format of struct
vdso_timehands, which mostly repeats the content of in-kernel struct
timehands. Usermode reading of the structure can be lockless.
Compatibility export for 32-bit processes on a 64-bit host is also
provided. The kernel also provides usermode with an indication of the
currently used timecounter, so that libc can fall back to a syscall if
the configured timecounter is unknown to usermode code.
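
A lockless read of this kind is commonly built on a generation counter. The following is a minimal sketch of that general pattern, not the code from this commit: struct demo_timehands, its fields, and both functions are hypothetical names, and the real struct vdso_timehands layout is not reproduced here. A reader that gets false would retry or fall back to the syscall.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the exported data; not struct vdso_timehands. */
struct demo_timehands {
	_Atomic uint32_t gen;		/* 0 = never published or update in progress */
	_Atomic uint64_t offset_count;
	_Atomic uint64_t scale;
};

/* Writer: invalidate, update the payload, then publish a new generation. */
static void
demo_publish(struct demo_timehands *th, uint64_t count, uint64_t scale)
{
	uint32_t next = atomic_load_explicit(&th->gen, memory_order_relaxed) + 1;

	if (next == 0)
		next = 1;	/* skip the reserved value on wrap-around */
	atomic_store_explicit(&th->gen, 0, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&th->offset_count, count, memory_order_relaxed);
	atomic_store_explicit(&th->scale, scale, memory_order_relaxed);
	atomic_store_explicit(&th->gen, next, memory_order_release);
}

/* Reader: succeeds only if the generation was valid and did not change. */
static bool
demo_try_read(struct demo_timehands *th, uint64_t *count, uint64_t *scale)
{
	uint32_t gen = atomic_load_explicit(&th->gen, memory_order_acquire);

	*count = atomic_load_explicit(&th->offset_count, memory_order_relaxed);
	*scale = atomic_load_explicit(&th->scale, memory_order_relaxed);
	atomic_thread_fence(memory_order_acquire);
	return (gen != 0 &&
	    gen == atomic_load_explicit(&th->gen, memory_order_relaxed));
}
```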

The shared data updates are initiated both from tc_windup(), where
a fast task is queued to do the update, and from sysctl handlers which
change the timecounter. A manual override switch,
kern.timecounter.fast_gettime, allows turning off the mechanism.

Only x86 architectures export the real algorithm data, and there, only
for the tsc timecounter. The HPET counters page could be exported as well,
but I prefer not to further glue the kernel and libc ABI there until a
proper vdso-based solution is developed.

Minimal stubs necessary for non-x86 architectures to still compile
are provided.

Discussed with:	bde
Reviewed by:	jhb
Tested by:	flo
MFC after:	1 month
2012-06-22 07:06:40 +00:00
Jung-uk Kim
6ad799103d - Remove unused code for CR3 and CR4.
- Fix a few style(9) nits while I am here.
2012-06-13 22:53:56 +00:00
Mitsuru IWASAKI
77c80e2e5b Share IPI init and startup code of mp_machdep.c with acpi_wakeup.c
as ipi_startup().
2012-06-12 00:14:54 +00:00
Mitsuru IWASAKI
fb864578af Add x86/acpica/acpi_wakeup.c for amd64 and i386. Differences in the
suspend/resume procedures between them are minimized.

common:
- Add global cpuset suspended_cpus to indicate APs are suspended/resumed.
- Remove acpi_waketag and acpi_wakemap from acpivar.h (no longer used).
- Add some variables in acpi_wakecode.S in order to minimize the difference
  between amd64 and i386.
- Disable load_cr3() because now CR3 is restored in resumectx().

amd64:
- Add suspend/resume related members (such as MSR) in PCB.
- Modify savectx() for above new PCB members.
- Merge acpi_switch.S into cpu_switch.S as resumectx().

i386:
- Merge (and remove) suspendctx() into savectx() in order to match the
  amd64 code.

Reviewed by:	attilio@, acpi@
2012-06-09 00:37:26 +00:00
Andriy Gapon
7adc598a15 free wdog_kern_pat calls in post-panic paths from under SW_WATCHDOG
Those calls are useful with hardware watchdog drivers too.

MFC after:	3 weeks
2012-06-03 08:01:12 +00:00
David E. O'Brien
8bed40c9fe Consistently use "__LP64__".
[there are 33 __LP64__'s in the kernel (minus cddl/ and contrib/),
and 11 _LP64's]
2012-05-24 21:44:46 +00:00
John Baldwin
da65bface2 Don't expose i386-only ptrace constants on amd64. This broke gdb with
libthread_db on amd64.

Reported by:	avg
2012-05-17 20:21:55 +00:00
Attilio Rao
b8be27bf29 Revert part of r234723 by re-enabling the SMP protection for
intr_bind() on x86.
This has been requested by jhb and I strongly disagree with this,
but as long as he is the x86 and interrupt subsystem maintainer I will
follow his directives.

The disagreement comes from what we should really consider a
public KPI. IMHO, if we really need to select among the kernel
functions, we may need an explicit protection like _KERNEL_KPI, which
defines which subset of the kernel functions might really be considered
part of the KPI (for third-party modules) and which not.
As long as we don't have this mechanism I just consider any possible
function as usable by third-party code, intr_bind() included.

MFC after:	1 week
2012-05-03 21:44:01 +00:00
Attilio Rao
70dbd1604c Clean up the intr* MD KPI from the SMP dependency, removing a cause of
discrepancy between modules and kernel, but deal with SMP differences
within the functions themselves.

As an added bonus this also helps in terms of code readability.

Requested by:	gibbs
Reviewed by:	jhb, marius
MFC after:	1 week
2012-04-26 20:24:25 +00:00
Peter Grehan
26b1d645e0 Add x2apic MSR definitions
Reviewed by:	jhb
Obtained from:	bhyve via Neel via NetApp
2012-04-17 00:54:38 +00:00
John Baldwin
45b516f642 Trim stray blank line.
2012-04-11 21:00:33 +00:00
John Baldwin
bcd6068179 Recognize the RDRAND instruction feature.
Submitted by:	Michael Fuckner  michael fuckner net
MFC after:	3 days
2012-04-09 15:20:16 +00:00
Justin T. Gibbs
47c77b2265 Fix interrupt load balancing regression, introduced in revision
222813, that left all un-pinned interrupts assigned to CPU 0.

sys/x86/x86/intr_machdep.c:
	In intr_shuffle_irqs(), remove CPU_SETOF() call that initialized
	the "intr_cpus" cpuset to only contain CPU0.

	This initialization is too late and nullifies the results of calls
	to intr_add_cpu() that occur much earlier in the boot process.
	Since "intr_cpus" is statically initialized to the empty set, and
	all processors, including the BSP, already add themselves to
	"intr_cpus" no special initialization for the BSP is necessary.

MFC after:	3 days
2012-04-06 21:19:28 +00:00
John Baldwin
b867b16dc9 Further tweak the changes made in r233709. The kernel doesn't permit
sleeping from a swi handler (even though in this case it would be ok), so
switch the refill and scanning SWI handlers to being tasks on a fast
taskqueue.  Also, only schedule the refill task for a CMCI as an MC# can
fire at any time, so it should do the minimal amount of work needed and
avoid opportunities to deadlock before it panics (such as scheduling a
task it won't ever need in practice).  To handle the case of an MC# only
finding recoverable errors (which should never happen), always try to
refill the event free list when the periodic scan executes.

MFC after:	2 weeks
2012-04-02 17:26:21 +00:00
John Baldwin
f2e3bfc074 Make machine check exception logging more readable. On newer Intel systems,
an uncorrected ECC error tends to fire on all CPUs in a package
simultaneously and the current printf hacks are not sufficient to make
the messages legible.  Instead, use the existing mca_lock spinlock to
serialize calls to mca_log() and change the machine check code to panic
directly when an unrecoverable error is encountered rather than falling
back to a trap_fatal() call in trap() (which adds nearly a screen-full of
logging messages that aren't useful for machine checks).

MFC after:	2 weeks
2012-04-02 15:07:22 +00:00
John Baldwin
8b9e9831bf Attempt to make machine check handling a bit more robust:
- Don't malloc() new MCA records for machine checks logged due to a
  CMCI or MC# exception.  Instead, use a pre-allocated pool of records.
  When a CMCI or MC# exception fires, schedule a swi to refill the pool.
  The pool is sized to hold at least one record per available machine
  bank, and one record per CPU. This should handle the case of all CPUs
  triggering a single bank at once as well as the case of a single CPU
  triggering all of its banks.  The periodic scans still use malloc()
  since they are run from a safe context.
- Since we have to create an swi to handle refills, make the periodic scan
  a second swi for the same thread instead of having a separate taskqueue
  thread for the scans.
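
The pre-allocated pool idea can be sketched as a fixed array threaded onto a free list, so that taking a record never calls an allocator. This is an illustrative sketch only: every demo_* name and DEMO_POOL_SIZE are hypothetical, the sizing rule from the commit (banks and CPUs) is replaced by a constant, and the locking that a real exception path would need is left out.

```c
#include <stddef.h>

#define DEMO_POOL_SIZE	4	/* hypothetical; the commit sizes it from banks and CPUs */

struct demo_record {
	struct demo_record *next;
	unsigned long status;
};

struct demo_pool {
	struct demo_record slots[DEMO_POOL_SIZE];
	struct demo_record *free_list;
};

/* Thread every slot onto the free list once, from a safe context. */
static void
demo_pool_init(struct demo_pool *p)
{
	p->free_list = NULL;
	for (size_t i = 0; i < DEMO_POOL_SIZE; i++) {
		p->slots[i].next = p->free_list;
		p->free_list = &p->slots[i];
	}
}

/* Take a record without allocating; returns NULL when the pool is empty. */
static struct demo_record *
demo_pool_get(struct demo_pool *p)
{
	struct demo_record *r = p->free_list;

	if (r != NULL)
		p->free_list = r->next;
	return (r);
}

/* Return a record to the pool (the refill step, done from a safe context). */
static void
demo_pool_put(struct demo_pool *p, struct demo_record *r)
{
	r->next = p->free_list;
	p->free_list = r;
}
```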

Suggested by:	mdf (avoiding malloc())
MFC after:	2 weeks
2012-03-30 20:17:39 +00:00
John Baldwin
435803f3c7 Move the legacy(4) driver to x86.
2012-03-30 19:10:14 +00:00
Dimitry Andric
a80f8859c4 Fix an issue introduced in sys/x86/include/endian.h with r232721. In
that revision, the bswapXX_const() macros were renamed to bswapXX_gen().

Also, bswap64_gen() was implemented as two calls to bswap32(), and
similarly, bswap32_gen() as two calls to bswap16().  This mainly helps
our base gcc to produce more efficient assembly.

However, the arguments are not properly masked, which results in the
wrong value being calculated in some instances.  For example,
bswap32(0x12345678) returns 0x7c563412, and bswap64(0x123456789abcdef0)
returns 0xfcdefc9a7c563412.

Fix this by appropriately masking the arguments to bswap16() in
bswap32_gen(), and to bswap32() in bswap64_gen().  This should also
silence warnings from clang.
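
The masking problem can be reproduced outside the kernel headers. The sketch below uses hypothetical demo_* functions rather than the real __bswapXX_gen() macros: the "buggy" variant feeds unmasked bits into the 16-bit swap and yields the wrong value quoted above for 0x12345678, while the "fixed" variant masks the argument first.

```c
#include <stdint.h>

/* 16-bit swap applied to a wider, unmasked value: stray upper bits leak in. */
static uint16_t
demo_bswap16_unmasked(uint32_t x)
{
	return ((uint16_t)((x << 8) | (x >> 8)));
}

/* Reproduces the miscalculation described in the commit message. */
static uint32_t
demo_bswap32_buggy(uint32_t x)
{
	return (((uint32_t)demo_bswap16_unmasked(x) << 16) |
	    demo_bswap16_unmasked(x >> 16));
}

/* 16-bit swap on a value that is already limited to 16 bits. */
static uint16_t
demo_bswap16(uint16_t x)
{
	return ((uint16_t)((x << 8) | (x >> 8)));
}

/* Mask the low half before swapping it. */
static uint32_t
demo_bswap32_fixed(uint32_t x)
{
	return (((uint32_t)demo_bswap16((uint16_t)(x & 0xffff)) << 16) |
	    demo_bswap16((uint16_t)(x >> 16)));
}

/* Same construction one level up, built from two 32-bit swaps. */
static uint64_t
demo_bswap64_fixed(uint64_t x)
{
	return (((uint64_t)demo_bswap32_fixed((uint32_t)(x & 0xffffffff)) << 32) |
	    demo_bswap32_fixed((uint32_t)(x >> 32)));
}
```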

Submitted by:	jh
2012-03-29 23:31:48 +00:00
Dimitry Andric
4715a95fb4 Revert sys/x86/include/endian.h to what it was before r233419, as that
revision has two problems:
- It can produce worse code with both clang and gcc.
- It doesn't fix the actual issue introduced in r232721, which will be
  fixed in the next commit.

Submitted by:	bde, tijl and jh
Pointy hat to:	dim
2012-03-29 23:30:17 +00:00
John Baldwin
0d95597ca9 Use a more proper fix for enabling HT MSI mapping windows on Host-PCI
bridges.  Rather than blindly enabling the windows on all of them, only
enable the window when an MSI interrupt is enabled for a device behind
the bridge, similar to what already happens for HT PCI-PCI bridges.

To implement this, each x86 Host-PCI bridge driver has to be able to
locate its actual backing device on bus 0.  For ACPI, use the _ADR
method to find the slot and function of the device.  For the non-ACPI
case, the legacy(4) driver already scans bus 0 looking for Host-PCI
bridge devices.  Now it saves the slot and function of each bridge that
it finds as ivars that the Host-PCI bridge driver can then use in its
pcib_map_msi() method.

This fixes machines where non-MSI interrupts were broken by the previous
round of HT MSI changes.

Tested by:	bapt
MFC after:	1 week
2012-03-29 19:03:22 +00:00
John Baldwin
46092aeec0 Restore proper use of bounce buffers for ISA DMA. When locking was
added, the call to pmap_kextract() was moved up, and as a result the
code never updated the physical address to use for DMA if a bounce
buffer was used.  Restore the earlier location of pmap_kextract() so
it takes bounce buffers into account.

Tested by:	kargl
MFC after:	1 week
2012-03-29 18:58:02 +00:00
John Baldwin
45a225844f Allocate the ioapics[] array dynamically since it is only needed for the
duration of madt_setup_io().  This avoids having the array take up
permanent space in the BSS.

Inspired by:	bde
MFC after:	2 weeks
2012-03-28 18:53:48 +00:00
John Baldwin
5dba6ec3b3 Move the DTrace return IDT vector back up from 0x20 to 0x92. The 0x20
vector is currently dedicated to servicing IRQ 0 from the 8259A's, so
it shouldn't be overloaded for DTrace.

Tested by:	rstone
MFC after:	1 week
2012-03-28 16:32:17 +00:00
Dimitry Andric
d4ddb330c9 Fix the following clang warning in sys/dev/dcons/dcons.c, caused by the
recent changes in sys/x86/include/endian.h:

  sys/dev/dcons/dcons.c:190:15: error: implicit conversion from '__uint32_t' (aka 'unsigned int') to '__uint16_t' (aka 'unsigned short') changes value from 1684238190 to 28526 [-Werror,-Wconstant-conversion]
	  buf->magic = ntohl(DCONS_MAGIC);
		       ^~~~~~~~~~~~~~~~~~
  sys/sys/param.h:306:18: note: expanded from:
  #define ntohl(x)        __ntohl(x)
			  ^
  ./x86/endian.h:128:20: note: expanded from:
  #define __ntohl(x)      __bswap32(x)
			  ^
  ./x86/endian.h:78:20: note: expanded from:
	      __bswap32_gen((__uint32_t)(x)) : __bswap32_var(x))
			    ^
  ./x86/endian.h:68:26: note: expanded from:
	  (((__uint32_t)__bswap16(x) << 16) | __bswap16((x) >> 16))
				  ^
  ./x86/endian.h:75:53: note: expanded from:
	      __bswap16_gen((__uint16_t)(x)) : __bswap16_var(x)))
					       ~~~~~~~~~~~~~ ^

This is because the __bswapXX_gen() macros (for x86) call the regular
__bswapXX() macros.  Since the __bswapXX_gen() variants are only called
when their arguments are constant, there is no need to do that constancy
check recursively.  Also, it causes the above error with clang.

Fix it by calling __bswap16_gen() from __bswap32_gen(), and similarly,
__bswap32_gen() from  __bswap64_gen().

While here, add extra parentheses around the __bswap16_gen() macro
expansion, to prevent unexpected side effects.
2012-03-24 10:07:21 +00:00
John Baldwin
d8c827012c Mark the 'lapics' and 'ioapics' arrays here static since they are
private to this file.  The 'lapics' array was actually shadowing a
completely different 'lapics' array that is private to local_apic.c.

Reported by:	bde
MFC after:	2 weeks
2012-03-22 12:23:32 +00:00
Tijl Coosemans
dfb1c11345 Copy amd64 sysarch.h to x86 and merge with i386 sysarch.h. Replace
amd64/i386/pc98 sysarch.h with stubs.
2012-03-19 21:57:31 +00:00
Tijl Coosemans
2c7879ea84 Copy i386 specialreg.h to x86 and merge with amd64 specialreg.h. Replace
amd64/i386/pc98 specialreg.h with stubs.
2012-03-19 21:34:11 +00:00
Tijl Coosemans
68156ad982 Copy i386 psl.h to x86 and replace amd64/i386/pc98 psl.h with stubs.
2012-03-19 21:29:57 +00:00
Tijl Coosemans
bcde3b9f67 Move userland bits (and some common kernel bits) from amd64 and i386
segments.h to a new x86 segments.h.

Add __packed attribute to some structs (just to be sure).
Also make it clear that i386 GDT and LDT entries are used in ia64 code.
2012-03-19 21:24:50 +00:00
Tijl Coosemans
6e310b206f Eliminate ia32_reg.h by moving its contents to x86 and ia64 reg.h.
Reviewed by:	kib
2012-03-18 19:12:11 +00:00
Tijl Coosemans
01cd19680d Copy i386 reg.h to x86 and merge with amd64 reg.h. Replace i386/amd64/pc98
reg.h with stubs.

The tREGISTER macros are only made visible on i386. These macros are
deprecated and should not be available on amd64.

The i386 and amd64 versions of struct reg have been renamed to struct
__reg32 and struct __reg64. During compilation either __reg32 or __reg64
is defined as reg depending on the machine architecture. On amd64 the i386
struct is also available as struct reg32 which is used in COMPAT_FREEBSD32
code.

Most of compat/ia32/ia32_reg.h is now IA64 only.

Reviewed by:	kib (previous version)
2012-03-18 19:06:38 +00:00
Tijl Coosemans
786645078b Move userland bits of i386 npx.h and amd64 fpu.h to x86 fpu.h.
Remove FPU types from compat/ia32/ia32_reg.h that are no longer needed.
Create machine/npx.h on amd64 to allow compiling i386 code that uses
this header.

The original npx.h and fpu.h define struct envxmm differently. Both
definitions have been included in the new x86 header as struct __envxmm32
and struct __envxmm64. During compilation either __envxmm32 or __envxmm64
is defined as envxmm depending on machine architecture. On amd64 the i386
struct is also available as struct envxmm32.

Reviewed by:	kib
2012-03-16 20:24:30 +00:00
John Baldwin
3b22825af7 Revert the PCIe 4GB boundary issue workaround now that the proper fix is
in HEAD.

Ok'd by:	scottl
2012-03-16 16:12:10 +00:00
Yoshihiro Takahashi
dff207f860 - Fix to build a native i386 kernel without the SMP and atpic.
- Merge r232744 changes to pc98.
  (Allow a kernel to be built with 'nodevice atpic'.)
- Move ICU related defines from x86/isa/atpic.c to x86/isa/icu.h and
  use them in x86/x86/intr_machdep.c.

Reviewed by:	jhb
2012-03-16 12:13:44 +00:00
John Baldwin
646af7c6af Move i386's intr_machdep.c to the x86 tree and share it with amd64.
2012-03-09 20:43:29 +00:00
Dimitry Andric
63d094a7e2 Add casts to __uint16_t to the __bswap16() macros on all arches which
didn't already have them.  This is because the ternary expression will
return int, due to the Usual Arithmetic Conversions.  Such casts are not
needed for the 32 and 64 bit variants.
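
The promotion is easy to observe with sizeof. In this small sketch the two DEMO_PICK_* macros are hypothetical stand-ins for a ternary-based macro like __bswap16(); they only select between two values and do not swap bytes. It assumes a platform where int is wider than 16 bits.

```c
#include <stdint.h>

/* Ternary on two uint16_t operands: both are promoted, so the result is int. */
#define DEMO_PICK_NOCAST(c, a, b)	((c) ? (a) : (b))

/* The explicit cast restores the 16-bit result type. */
#define DEMO_PICK_CAST(c, a, b)		((uint16_t)((c) ? (a) : (b)))

/* Size of the uncast ternary's result type. */
static unsigned
demo_nocast_size(void)
{
	uint16_t a = 1, b = 2;

	return ((unsigned)sizeof(DEMO_PICK_NOCAST(1, a, b)));
}

/* Size of the cast ternary's result type. */
static unsigned
demo_cast_size(void)
{
	uint16_t a = 1, b = 2;

	return ((unsigned)sizeof(DEMO_PICK_CAST(1, a, b)));
}
```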

While here, add additional parentheses around the x86 variant, to
protect against unintended consequences.

MFC after:	2 weeks
2012-03-09 20:34:31 +00:00
Tijl Coosemans
ced8176236 Cast the expression in __bswap16(x) to __uint16_t because it is promoted
to int.

Reviewed by:	dim
2012-03-09 16:39:34 +00:00
Tijl Coosemans
0502467707 Clean up x86 endian.h:
- Remove extern "C". There are no functions with external linkage here. [1]
- Rename bswapNN_const(x) to bswapNN_gen(x) to indicate that these macros
  are generic implementations that can take non-constant arguments. [1]
- Split up __GNUCLIKE_ASM && __GNUCLIKE_BUILTIN_CONSTANT_P and deal with
  each separately.
- Replace _LP64 with __amd64__ because asm instructions are machine
  dependent, not ABI dependent.

Submitted by:	bde [1]
Reviewed by:	bde
2012-03-09 11:48:56 +00:00
Tijl Coosemans
d8a023328d Copy amd64 ptrace.h to x86 and merge with i386 ptrace.h. Replace
amd64/i386/pc98 ptrace.h with stubs.

For amd64 PT_GETXSTATE and PT_SETXSTATE have been redefined to match the
i386 values. The old values are still supported but should no longer be
used.

Reviewed by:	kib
2012-03-04 20:24:28 +00:00
Tijl Coosemans
21d0ce7868 Do not use INT64_C and UINT64_C to define 64 bit integer limits. They
aren't defined for C++ code unless __STDC_CONSTANT_MACROS is defined.
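
One way to avoid the INT64_C and UINT64_C dependency is to spell the limits with plain literal suffixes. The definitions below are a sketch under that assumption, with hypothetical DEMO_* names; the commit message does not show how the actual headers were changed.

```c
#include <stdint.h>

/* Limits written with literal suffixes instead of INT64_C()/UINT64_C(). */
#define DEMO_INT64_MAX	0x7fffffffffffffffLL
#define DEMO_INT64_MIN	(-0x7fffffffffffffffLL - 1)
#define DEMO_UINT64_MAX	0xffffffffffffffffULL

/* Returns 1 when the sketch agrees with the limits from <stdint.h>. */
static int
demo_limits_match(void)
{
	return (DEMO_INT64_MAX == INT64_MAX && DEMO_INT64_MIN == INT64_MIN &&
	    DEMO_UINT64_MAX == UINT64_MAX);
}
```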

Reported by:	jhb
2012-03-04 20:02:20 +00:00
Tijl Coosemans
8b4a1ed0de Copy amd64 trap.h to x86 and replace amd64/i386/pc98 trap.h with stubs.
2012-03-04 14:12:57 +00:00
Tijl Coosemans
ee0d5ab989 Copy amd64 float.h to x86 and merge with i386 float.h. Replace
amd64/i386/pc98 float.h with stubs.
2012-03-04 14:00:32 +00:00
John Baldwin
831ce4cb3d - Change contigmalloc() to use the vm_paddr_t type instead of an unsigned
long for specifying a boundary constraint.
- Change bus_dma tags to use bus_addr_t instead of bus_size_t for boundary
  constraints.

These allow boundary constraints to be fully expressed for cases where
sizeof(bus_addr_t) != sizeof(bus_size_t).  Specifically, it allows a
driver to properly specify a 4GB boundary in a PAE kernel.
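
The type-width problem can be illustrated in isolation: a 4GB boundary is 2^32, which truncates to zero in a 32-bit type. The sketch below uses plain fixed-width integers rather than the kernel's bus_addr_t and bus_size_t, and demo_crosses_boundary() is a hypothetical helper that assumes a power-of-two boundary and a nonzero length.

```c
#include <stdbool.h>
#include <stdint.h>

/* 4GB needs 33 bits, so it cannot be stored in a 32-bit size type. */
#define DEMO_BOUNDARY_4GB	(UINT64_C(1) << 32)

/* What a 32-bit type would hold if handed the 4GB boundary. */
static uint32_t
demo_truncated_boundary(void)
{
	return ((uint32_t)DEMO_BOUNDARY_4GB);
}

/*
 * True if [addr, addr + len) spans a multiple of 'boundary'.
 * Assumes 'boundary' is a power of two and 'len' is nonzero.
 */
static bool
demo_crosses_boundary(uint64_t addr, uint64_t len, uint64_t boundary)
{
	uint64_t mask = ~(boundary - 1);

	return ((addr & mask) != ((addr + len - 1) & mask));
}
```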

Note that this cannot be safely MFC'd without a lot of compat shims due
to KBI changes, so I do not intend to merge it.

Reviewed by:	scottl
2012-03-01 19:58:34 +00:00
Tijl Coosemans
5b2a5decd1 Copy amd64 stdarg.h to x86 and replace amd64/i386/pc98 stdarg.h with stubs.
2012-02-28 22:30:58 +00:00
Tijl Coosemans
f85ac30a3d Copy amd64 setjmp.h to x86 and replace amd64/i386/pc98 setjmp.h with stubs.
2012-02-28 22:17:52 +00:00
Ed Maste
3f8e262e8c Workaround for PCIe 4GB boundary issue
Enforce a boundary of no more than 4GB - transfers crossing a 4GB
boundary can lead to data corruption due to PCIe limitations.  This
change is a less-intrusive workaround that can be quickly merged back
to older branches; a cleaner implementation will arrive in HEAD later
but may require KPI changes.

This change is based on a suggestion by jhb@.

Reviewed by:    scottl, jhb
Sponsored by:   Sandvine Incorporated
MFC after:      3 days
2012-02-28 19:42:40 +00:00
Tijl Coosemans
95b1d16df5 Copy amd64 endian.h to x86 and merge with i386 endian.h. Replace
amd64/i386/pc98 endian.h with stubs.

In __bswap64_const(x) the conflict between 0xffUL and 0xffULL has been
resolved by reimplementing the macro in terms of __bswap32(x). As a side
effect __bswap64_var(x) is now implemented using two bswap instructions on
i386 and should be much faster. __bswap32_const(x) has been reimplemented
in terms of __bswap16(x) for consistency.
2012-02-28 19:39:54 +00:00
Tijl Coosemans
8770e9db97 Copy amd64 _stdint.h to x86 and merge with i386 _stdint.h. Replace
amd64/i386/pc98 _stdint.h with stubs.
2012-02-28 18:38:33 +00:00
Tijl Coosemans
8cfa93e4be Copy amd64 _limits.h to x86 and merge with i386 _limits.h. Replace
amd64/i386/pc98 _limits.h with stubs.
2012-02-28 18:24:28 +00:00
Tijl Coosemans
8f77be2b4c Copy amd64 _types.h to x86 and merge with i386 _types.h. Replace existing
amd64/i386/pc98 _types.h with stubs.
2012-02-28 18:15:28 +00:00
John Baldwin
8fef42c511 - Panic up front if a kernel does not include 'device atpic' and an
APIC is not found.
- Don't panic if lapic_enable_cmc() is called and the APIC is not enabled.
  This can happen due to booting a kernel with APIC disabled on a CPU that
  supports CMCI.
- Wrap a long line.
2012-02-27 17:33:16 +00:00
Alexander Kabaev
2f42a9bf0d Fix apparent logic reversal in setting the 'auto_mode' flag.
MFC after: 2 weeks
2012-02-26 21:24:27 +00:00
John Baldwin
289908743e Fix a few bugs in the SRAT parsing code:
- Actually increment ndomain when building our list of known domains
  so that we can properly renumber them to be 0-based and dense.
- If the number of domains exceeds the configured maximum (VM_NDOMAIN),
  bail out of processing the SRAT and disable NUMA rather than hitting an
  obscure panic later.
- Don't bother parsing the SRAT at all if VM_NDOMAIN is set to 1 to
  disable NUMA (the default).

Reported by:	phk (2)
MFC after:	1 week
2012-01-03 20:53:58 +00:00
Ed Schouten
b66c0c3405 Get rid of kludgy per-descriptor state handling in acpi_apm.
Where i386/bios/apm.c requires no per-descriptor state, the ACPI version
of these device do. Instead of using hackish clone lists that leave
stale device nodes lying around, use the cdevpriv API.
2011-12-05 16:08:18 +00:00
Marius Strobl
4b7ec27007 - There's no need to overwrite the default device method with the default
one. Interestingly, these are actually the default for quite some time
  (bus_generic_driver_added(9) since r52045 and bus_generic_print_child(9)
  since r52045) but even recently added device drivers do this unnecessarily.
  Discussed with: jhb, marcel
- While at it, use DEVMETHOD_END.
  Discussed with: jhb
- Also while at it, use __FBSDID.
2011-11-22 21:28:20 +00:00
Ed Schouten
6472ac3d8a Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs.
The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.
2011-11-07 15:43:11 +00:00
Ed Schouten
d745c852be Mark MALLOC_DEFINEs static that have no corresponding MALLOC_DECLAREs.
This means that their use is restricted to a single C file.
2011-11-07 06:44:47 +00:00
John Baldwin
4d99cfb313 Ignore SRAT memory entries if the memory range does not overlap with an
existing phys_avail[] table.  If a hw.physmem setting causes a memory
domain to not be present in phys_avail[], the SRAT table will now be
ignored rather than triggering a panic when a CPU in the missing domain
tries to allocate a page.

MFC after:	1 week
2011-10-05 16:03:47 +00:00
Attilio Rao
6aba400a70 Fix a deficiency in the selinfo interface:
If a selinfo object is recorded (via selrecord()) and then it is
quickly destroyed, with the waiters missing the opportunity to awake,
at the next iteration they will find the selinfo object destroyed,
causing a PF#.

That happens because the selinfo interface has no way to drain the
waiters before destroying the registered selinfo object. This
race is quite rare in practice, because it requires a
selrecord(), a poll request by another thread and a quick destruction
of the selrecord()'ed selinfo object.

Fix this by adding the seldrain() routine, which should be called
before destroying selinfo objects (in order to avoid such a case),
and fix the present cases where it might have already been called.
Sometimes, the context is safe enough to prevent this type of race,
as happens in device drivers which install selinfo objects on
poll callbacks. There, the destruction of the selinfo object happens
at driver detach time, when all the file descriptors should already be
closed, thus there cannot be a race.
For this case, the mfi(4) device driver can serve as an example, as it
implements fully correct logic for preventing this from happening.

Sponsored by:	Sandvine Incorporated
Reported by:	rstone
Tested by:	pluknet
Reviewed by:	jhb, kib
Approved by:	re (bz)
MFC after:	3 weeks
2011-08-25 15:51:54 +00:00
Mike Silbersack
5cf8ac1bc2 Disable TSC usage inside SMP VM environments. On my VMware ESXi 4.1
environment with a core i5-2500K, operation in this mode causes timeouts
from the mpt driver.  Switching to the ACPI-fast timer resolves this issue.
Switching the VM back to single CPU mode also works, which is why I have
not disabled the TSC in that mode.

I did not test with KVM or other VM environments, but I am being cautious
and assuming that the TSC is not reliable in SMP mode there as well.

Reviewed by:	kib
Approved by:	re (kib)
MFC after:	Not applicable, the timecounter code is new for 9.x
2011-08-22 03:10:29 +00:00
John Baldwin
869e878c19 Fix build when NEW_PCIB is not defined.
Submitted by:	gcooper (partially)
Pointy hat to:	jhb
2011-07-16 14:05:34 +00:00
John Baldwin
34ff71eecd Respect the BIOS/firmware's notion of acceptable address ranges for PCI
resource allocation on x86 platforms:
- Add a new helper API that Host-PCI bridge drivers can use to restrict
  resource allocation requests to a set of address ranges for different
  resource types.
- For the ACPI Host-PCI bridge driver, use Producer address range resources
  in _CRS to enumerate valid address ranges for a given Host-PCI bridge.
  This can be disabled by including "hostres" in the debug.acpi.disabled
  tunable.
- For the MPTable Host-PCI bridge driver, use entries in the extended
  MPTable to determine the valid address ranges for a given Host-PCI
  bridge.  This required adding code to parse extended table entries.

Similar to the new PCI-PCI bridge driver, these changes are only enabled
if the NEW_PCIB kernel option is enabled (which is enabled by default on
amd64 and i386).

Approved by:	re (kib)
2011-07-15 21:08:58 +00:00
Jung-uk Kim
08e1b4f4a9 If TSC stops ticking in C3, disable deep sleep when the user forcefully
selects TSC as timecounter hardware.

Tested by:	Fabian Keil (freebsd-listen at fabiankeil dot de)
2011-07-14 21:00:26 +00:00
John Baldwin
1368987ae4 Move {amd64,i386}/pci/pci_bus.c and {amd64,i386}/include/pci_cfgreg.h to
the x86 tree.  The $PIR code is still only enabled on i386 and not amd64.
While here, make the qpi(4) driver conditional on 'device pci'.
2011-06-22 21:04:13 +00:00
Jung-uk Kim
a49399a903 Set negative quality to TSC timecounter when C3 state is enabled for Intel
processors unless the invariant TSC bit of CPUID is set.  Intel processors
may stop incrementing TSC when DPSLP# pin is asserted, according to Intel
processor manuals, i. e., TSC timecounter is useless if the processor can
enter deep sleep state (C3/C4).  This problem was accidentally uncovered by
r222869, which increased timecounter quality of P-state invariant TSC, e.g.,
for Core2 Duo T5870 (Family 6, Model f) and Atom N270 (Family 6, Model 1c).

Reported by:	Fabian Keil (freebsd-listen at fabiankeil dot de)
		Ian FREISLICH (ianf at clue dot co dot za)
Tested by:	Fabian Keil (freebsd-listen at fabiankeil dot de)
		- Core2 Duo T5870 (C3 state available/enabled)
		jkim - Xeon X5150 (C3 state unavailable)
2011-06-22 16:40:45 +00:00
Jung-uk Kim
5df88f46bb Teach the compiler how to shift the TSC value efficiently. As noted in
r220631, the compiler sometimes inserts redundant instructions to preserve the
unused upper 32 bits even when the value is cast to a 32-bit value.
Unfortunately, it seems the problem becomes more serious when it is shifted,
especially on amd64.
2011-06-17 21:41:06 +00:00
Jung-uk Kim
bc8e4ad2ef Tidy up r222866.
- Re-add accidentally removed atomic op. for sysctl(9) handler.
- Remove a period(`.') at the end of a debugging message.
- Consistently spell "low" for "TSC-low" timecounter throughout.

Pointed out by:	bde
2011-06-08 23:44:59 +00:00
Jung-uk Kim
26e6537a73 Increase quality of TSC (or TSC-low) timecounter to 1000 if it is P-state
invariant.  For SMP case (TSC-low), it also has to pass SMP synchronization
test and the CPU vendor/model has to be white-listed explicitly.  Currently,
all Intel CPUs and single-socket AMD Family 15h processors are listed here.

Discussed with:	hackers
2011-06-08 20:08:06 +00:00
Jung-uk Kim
95f2f0985b Introduce low-resolution TSC timecounter "TSC-low". It replaces the normal
TSC timecounter if TSC frequency is higher than ~4.29 GHz (or 2^32-1 Hz) or
multiple CPUs are present.  The "TSC-low" frequency is always lower than a
preset maximum value and derived from TSC frequency (by being halved until
it becomes lower than the maximum).  Note the maximum value for the SMP case is
significantly lower than for the UP case because we want to reduce (rare but
known) "temporal anomalies" caused by non-serialized RDTSC instruction.
Normally, it is still higher than "ACPI-fast" timecounter frequency (which was
the default timecounter hardware for a long time until r222222) to be useful.
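
The halving rule can be expressed as a shift count. This is a minimal sketch of the description above, not the kernel's implementation: demo_tsc_low_shift() is a hypothetical name and the maximum frequencies used by the real code are not given in the message.

```c
#include <stdint.h>

/*
 * Number of halvings needed to bring tsc_freq below max_freq.
 * The derived "low" frequency is then tsc_freq >> shift.
 */
static unsigned
demo_tsc_low_shift(uint64_t tsc_freq, uint64_t max_freq)
{
	unsigned shift = 0;

	while (shift < 63 && (tsc_freq >> shift) >= max_freq)
		shift++;
	return (shift);
}
```

With an assumed 3 GHz TSC and an assumed 100 MHz ceiling, the loop halves the frequency five times, giving a derived frequency of 93.75 MHz.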
2011-06-08 19:38:31 +00:00
Jung-uk Kim
75aa1914d5 Remove a redundant assignment since r221703.
2011-06-08 18:52:42 +00:00
Attilio Rao
bd55ede060 MFC
2011-05-09 18:53:13 +00:00
Jung-uk Kim
65e7d70b09 Implement boot-time TSC synchronization test for SMP. This test is executed
when the user has indicated that the system has synchronized TSCs or it has
P-state invariant TSCs.  For the former case, we may clear the tunable if it
fails the test to prevent accidental foot-shooting.  For the latter case, we
may set it if it passes the test to notify the user that it may be usable.
2011-05-09 17:34:00 +00:00
Attilio Rao
aa8b9e0706 MFC
2011-05-06 22:45:33 +00:00
John Baldwin
f9a9473702 Retire isa_setup_intr() and isa_teardown_intr() and use the generic bus
versions instead.  They were never needed as bus_generic_intr() and
bus_teardown_intr() had been changed to pass the original child device up
in 42734, but the ISA bus was not converted to new-bus until 45720.
2011-05-06 13:48:53 +00:00
Alexander Motin
00aa5aab1e Some changes around LAPIC timer programming.
This fixes heavy interrupt storm and resulting system freeze when using
LAPIC timer in one-shot mode under Xen HVM. There, unlike real hardware,
programming timer with zero period almost immediately causes interrupt.
2011-05-05 18:56:48 +00:00
Attilio Rao
71a19bdc64 Commit the support for removing cpumask_t and replacing it directly with
cpuset_t objects.
That is going to offer the underlying support for a simple bump of
MAXCPU and then support for number of cpus > 32 (as it is today).

Right now, cpumask_t is an int, 32 bits on all our supported architectures.
cpuset_t on the other side is implemented as an array of longs, and
easily extensible by definition.

The architectures touched by this commit are the following:
- amd64
- i386
- pc98
- arm
- ia64
- XEN

while the others are still missing.
Userland is believed to be fully converted with the changes contained
here.

Some technical notes:
- This commit may be considered an ABI nop for all the architectures
  different from amd64 and ia64 (and sparc64 in the future)
- per-cpu members, which are now converted to cpuset_t, need to be
  accessed avoiding migration, because the size of cpuset_t should be
  considered unknown
- size of cpuset_t objects is different from kernel and userland (this is
  primarily done in order to leave some more space in userland to cope
  with KBI extensions). If you need to access kernel cpuset_t from the
  userland please refer to example in this patch on how to do that
  correctly (kgdb may be a good source, for example).
- Support for other architectures is going to be added soon
- Only MAXCPU for amd64 is bumped now

The patch has been tested by sbruno and Nicholas Esborn on opteron
4 x 12 pack CPUs. More testing on big SMP is expected to come soon.
pluknet tested the patch with his 8-ways on both amd64 and i386.

Tested by:	pluknet, sbruno, gianni, Nicholas Esborn
Reviewed by:	jeff, jhb, sbruno
2011-05-05 14:39:14 +00:00
John Baldwin
83c41143ca Reimplement how PCI-PCI bridges manage their I/O windows. Previously the
driver would verify that requests for child devices were confined to any
existing I/O windows, but the driver relied on the firmware to initialize
the windows and would never grow the windows for new requests.  Now the
driver actively manages the I/O windows.

This is implemented by allocating a bus resource for each I/O window from
the parent PCI bus and suballocating that resource to child devices.  The
suballocations are managed by creating an rman for each I/O window.  The
suballocated resources are mapped by passing the bus_activate_resource()
call up to the parent PCI bus.  Windows are grown when needed by using
bus_adjust_resource() to adjust the resource allocated from the parent PCI
bus.  If the adjust request succeeds, the window is adjusted and the
suballocation request for the child device is retried.

When growing a window, the rman_first_free_region() and
rman_last_free_region() routines are used to determine if the front or
end of the existing I/O window is free.  Using that information, the
smallest ranges that need to be added to either the front or back of the
window are computed.  The driver will first try to grow the window in
whichever direction requires the smallest growth, followed by the other
direction if that fails.
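
The direction choice described above can be sketched as follows.  This is
a minimal illustration, not the driver's code: struct io_window, its
fields, and both helper functions are hypothetical, alignment is ignored,
and the free-byte counts are assumed to have been derived already from
the first and last free regions of the window.

```c
#include <stdint.h>

/* Hypothetical, simplified model of an I/O window (not the driver's struct). */
struct io_window {
	uint64_t base;		/* first address decoded by the window */
	uint64_t limit;		/* last address decoded (inclusive) */
	uint64_t front_free;	/* bytes free at the start of the window */
	uint64_t back_free;	/* bytes free at the end of the window */
};

enum grow_dir { GROW_FRONT, GROW_BACK };

/* Bytes that must be added on one side to fit a request of 'count' bytes. */
static uint64_t
growth_needed(uint64_t free_bytes, uint64_t count)
{
	return (count > free_bytes ? count - free_bytes : 0);
}

/*
 * Pick the side to try first: the one that needs the smaller growth.
 * Ties go to the back of the window; that tie-break is an assumption
 * of this sketch.
 */
static enum grow_dir
first_grow_dir(const struct io_window *w, uint64_t count)
{
	uint64_t front = growth_needed(w->front_free, count);
	uint64_t back = growth_needed(w->back_free, count);

	return (front < back ? GROW_FRONT : GROW_BACK);
}
```

A caller would try the returned direction first and fall back to the
other one only if the adjust request fails.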

Subtractive bridges will first attempt to satisfy requests for child
resources from I/O windows (including attempts to grow the windows).  If
that fails, however, the request is passed directly up to the parent PCI
bus.

The PCI-PCI bridge driver will try to use firmware-assigned ranges for
child BARs first and only allocate a "fresh" range if that specific range
cannot be accommodated in the I/O window.  This allows systems where the
firmware assigns resources during boot but later wipes the I/O windows
(some ACPI BIOSen are known to do this) to "rediscover" the original I/O
window ranges.

The ACPI Host-PCI bridge driver has been adjusted to correctly honor
hw.acpi.host_mem_start and the I/O port equivalent when a PCI-PCI bridge
makes a wildcard request for an I/O window range.

The new PCI-PCI bridge driver is only enabled if the NEW_PCIB kernel option
is enabled.  This is a transition aid to allow platforms that do not
yet support bus_activate_resource() and bus_adjust_resource() in their
Host-PCI bridge drivers (and possibly other drivers as needed) to use the
old driver for now.  Once all platforms support the new driver, the
kernel option and old driver will be removed.

PR:		kern/143874 kern/149306
Tested by:	mav
2011-05-03 17:37:24 +00:00
Jung-uk Kim
a990fbf972 Fix build with clang. Please note there is an LLVM/Clang PR:
http://llvm.org/bugs/show_bug.cgi?id=9379

Reported by:	rpaulo, dim
2011-05-02 17:08:36 +00:00
John Baldwin
d2c9344ff9 Add implementations of BUS_ADJUST_RESOURCE() to the PCI bus driver,
generic PCI-PCI bridge driver, x86 nexus driver, and x86 Host to PCI bridge
drivers.
2011-05-02 14:13:12 +00:00
John Baldwin
b67d11bbcc Change rman_manage_region() to actually honor the rm_start and rm_end
constraints on the rman and reject attempts to manage a region that is out
of range.
- Fix various places that set rm_end incorrectly (to ~0 or ~0u instead of
  ~0ul).
- To preserve existing behavior, change rman_init() to set rm_start and
  rm_end to allow managing the full range (0 to ~0ul) if they are not set by
  the caller when rman_init() is called.
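
The range check described above amounts to something like the sketch
below.  The struct and function are hypothetical stand-ins (only the
rm_start and rm_end names come from the message), and the real routine
reports failure through its own return convention rather than a bool.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two relevant fields of the rman. */
struct range_mgr {
	unsigned long rm_start;	/* lowest value the rman may manage */
	unsigned long rm_end;	/* highest value the rman may manage */
};

/* True only if [start, end] is well formed and inside the rman's limits. */
static bool
region_in_range(const struct range_mgr *rm, unsigned long start,
    unsigned long end)
{
	return (start <= end && start >= rm->rm_start && end <= rm->rm_end);
}
```

This also shows why the width of the rm_end constant matters: where
unsigned long is 64 bits wide, ~0u converts to 0xffffffff rather than
the full-range ~0ul, so such an rman would reject regions above 4 GB
once the limits are enforced.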
2011-04-29 18:41:21 +00:00
Jung-uk Kim
5da5812ba7 Detect VMware guest and set the TSC frequency as reported by the hypervisor.
VMware products virtualize the TSC and it runs at a fixed frequency in
so-called "apparent time".  Although the virtualized i8254 also runs in
apparent time, TSC calibration always gives a slightly off frequency because
of the complicated timer emulation and lost-tick correction mechanism.
2011-04-29 18:20:12 +00:00
Jung-uk Kim
5ac44f727f Turn off periodic recalibration of CPU ticker frequency if it is invariant.
2011-04-28 17:56:02 +00:00
Attilio Rao
2be767e069 Add watchdog patting during (shutdown-time) disk syncing and
disk dumping.
With the SW_WATCHDOG option on, these operations are doomed to let the
watchdog fire if they take too long.

I implemented the stubs this way because I really want the wdog_kern_*
KPI not to depend on SW_WATCHDOG being on (and really, the option
only enables watchdog activation in hardclock) and also to avoid
calling them when not necessary (avoiding involuntary watchdog
activations).

Sponsored by:	Sandvine Incorporated
Discussed with:	emaste, des
MFC after:	2 weeks
2011-04-28 16:02:05 +00:00
Jung-uk Kim
43d645f96b Use ACPI-supplied CPU frequencies instead of estimated ones as we are about
to use other values from the same table anyway.

MFC after:	3 days
2011-04-27 00:32:35 +00:00
Jung-uk Kim
8143750196 Use newly added rdtsc32() for DELAY(9) as well.
2011-04-14 19:11:45 +00:00
Jung-uk Kim
0e78005e5c Work around an emulator problem where a virtual CPU advertises
that the TSC is P-state invariant and that the APERF/MPERF MSRs exist, but
these MSRs never tick.  When we calculated the effective frequency from
cpu_est_clockrate(), this caused a division-by-zero panic.  Now we test
whether these MSRs actually increase to avoid such foot-shooting.
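
The failure mode can be illustrated with a small sketch.  The function
below is hypothetical and is not the kernel's code; it only assumes that
the effective frequency is the nominal frequency scaled by the APERF/MPERF
ratio over a sampling interval, with the counter deltas passed in by the
caller.

```c
#include <stdint.h>

/*
 * Scale a nominal frequency by the APERF/MPERF ratio measured over an
 * interval.  Returns 0 if either counter failed to advance, which is
 * the case a non-ticking emulated MSR produces and which would
 * otherwise divide by zero.
 *
 * The multiplication is done in 64 bits; this sketch assumes the
 * product nominal_hz * aperf_delta does not overflow.
 */
static uint64_t
effective_freq(uint64_t nominal_hz, uint64_t aperf_delta,
    uint64_t mperf_delta)
{
	if (aperf_delta == 0 || mperf_delta == 0)
		return (0);
	return (nominal_hz * aperf_delta / mperf_delta);
}
```

A caller would treat a return value of 0 as "ratio unavailable" and fall
back to another estimate.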

Reported by:	dim
Tested by:	dim
2011-04-14 17:50:26 +00:00
Jung-uk Kim
727c7b2d66 Use newly added rdtsc32() for the timecounter_get_t method.
2011-04-14 17:08:23 +00:00
Jung-uk Kim
5331d61da4 Add some tunable descriptions about x86 timers.
Requested by:	arundel
2011-04-14 00:07:08 +00:00
Jung-uk Kim
e94d5ad227 Do not use TSC for DELAY(9) if it is not P-state invariant to avoid
possible foot-shooting.  DELAY() becomes unreliable when TSC frequency varies
wildly, especially when cpufreq(4) and powerd(8) are used at the same time.
2011-04-12 22:41:52 +00:00
Jung-uk Kim
155094d77a Probe capability to find effective frequency. When the TSC is P-state
invariant, APERF/MPERF ratio can be used to find effective frequency.
2011-04-12 22:15:46 +00:00
Jung-uk Kim
a4e4127f42 Add a new tunable 'machdep.disable_tsc_calibration' to allow skipping TSC
frequency calibration.  For Intel processors, if brand string from CPUID
contains its nominal frequency, this frequency is used instead.
2011-04-12 21:08:34 +00:00
Jung-uk Kim
57d7a7fb0a Merge two similar functions to reduce duplication.
2011-04-11 19:27:44 +00:00
Jung-uk Kim
80c2cdcffe Refactor DELAYDEBUG as it is only useful for correcting i8254 frequency.
2011-04-08 19:54:29 +00:00
Jung-uk Kim
3453537fa5 Use atomic load & store for TSC frequency. It may be overkill for amd64 but
safer for i386 because it can be easily over 4 GHz now.  Worse, it can
be easily changed by the user with the 'machdep.tsc_freq' tunable (directly)
or cpufreq(4) (indirectly).  Note it is intentionally not used in performance
critical paths to avoid performance regression (but we should, in theory).
Alternatively, we may add a "virtual TSC" with lower frequency if the maximum
frequency overflows 32 bits (and ignore possible incoherency as we do now).
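
As a userland analogy only (the kernel uses its own atomic primitives,
not C11 atomics), the sketch below shows the idea.  A frequency above
4 GHz needs more than 32 bits, and on a 32-bit CPU a plain 64-bit load
or store may be split into two 32-bit accesses, so a concurrent reader
could observe a torn value.  The variable and function names are
hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical userland stand-in for the kernel's TSC frequency variable. */
static _Atomic uint64_t tsc_freq_sketch;

/* Publish a new frequency as one indivisible 64-bit store. */
static void
set_tsc_freq(uint64_t hz)
{
	atomic_store(&tsc_freq_sketch, hz);
}

/* Read the frequency as one indivisible 64-bit load. */
static uint64_t
get_tsc_freq(void)
{
	return (atomic_load(&tsc_freq_sketch));
}
```

On some 32-bit targets a C11 toolchain implements 64-bit atomics through
a support library, which hints at the cost the message alludes to when it
keeps these accesses out of performance-critical paths.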
2011-04-07 23:28:28 +00:00
Jung-uk Kim
7ebbcb21ba Revert r219676.
Requested by:	jhb, bde
2011-03-16 16:44:08 +00:00
Jung-uk Kim
a8f8643e3a Do not let machdep.tsc_freq modify tsc_freq itself. It is bad for i386 as
it does not operate atomically.  Actually, it serves no purpose.

Noticed by:	bde
2011-03-15 19:47:20 +00:00
Jung-uk Kim
38b8542ca9 Deprecate tsc_present as the last of its real consumers finally disappeared.
2011-03-15 17:19:52 +00:00
Jung-uk Kim
856e88c1f5 When TSC is unavailable, broken or disabled and the current timecounter has
better quality than i8254 timer, use it for DELAY(9).
2011-03-14 22:05:59 +00:00
Jung-uk Kim
79422085d4 Add a tunable "machdep.disable_tsc" to turn off TSC. Specifically, it turns
off boot-time CPU frequency calibration, DELAY(9) with TSC, and using TSC as
a CPU ticker.  Note tsc_present is not changed by this tunable.
2011-03-11 00:44:32 +00:00
Jung-uk Kim
a106a27c6a Turn off pointless P-state invariant TSC detection based on CPU model
on a virtual machine.
2011-03-10 23:06:13 +00:00