Commit Graph

1739 Commits

Author SHA1 Message Date
marius
a5b338a967 - In oneshot-mode it doesn't make sense to try to compensate the clock
drift in order to achieve a more stable clock as the tick intervals may
  vary in the first place. In fact I haven't seen this code kick in when
  in oneshot-mode so just skip it in that case.
- There's no need to explicitly stop the (S)TICK counter in oneshot-mode
  with every tick as it just won't trigger again with the (S)TICK compare
  register set to a value in the past (with a wrap-around once every ~195
  years of uptime at 1.5 GHz this isn't something we have to worry about
  in practice).
- Given that we'll disable interrupts completely anyway there's no
  need to enter critical sections.
2010-10-17 16:46:54 +00:00
marius
1917efc627 Explicitly lower the PIL to 0 as part of enabling interrupts, similar to
what is done on other platforms. Unlike as with the sched_throw(NULL)
called on BSPs during their startup apparently there's nothing which will
reliably lower it on APs. I'm unsure why this only came up on V215 though,
breaking these with r207248. My best guess is that these are the only
supported ones so far fast enough to loose some race.

PR:		151404
MFC after:	3 days
2010-10-14 21:46:53 +00:00
marius
d4dfd1db59 - In the spirit of r212559 add a comment describing what will eventually
lower the PIL.
- Just as with the AP ensure that the (S)TICK timer(s) are in a known
  state when starting BSPs.
2010-10-14 21:34:53 +00:00
marius
6074b9919b In the replacement text of the __bswapN_const() macros cast the argument
to the expected type so they work like the corresponding __bswapN_var()
functions and the compiler doesn't complain when arguments of different
width are passed.
2010-10-08 14:59:45 +00:00
neel
11129fcf49 Fix bogus error message from bus_dmamem_alloc() about incorrect alignment.
The check for alignment should be made against the physical address and not
the virtual address that maps it.

Sponsored by:	NetApp
Submitted by:	Will McGovern (will at netapp dot com)
Reviewed by:	mjacob, jhb
2010-09-29 21:53:11 +00:00
marius
aed96e11cd minor simplifications and cosmetics 2010-09-24 15:12:18 +00:00
davidxu
b9eeaa21c2 Now userland POSIX semaphore is based on umtx. The kernel module
is only used to support binary compatible, if want to run old
binary, you need to kldload the module.
2010-09-24 09:04:16 +00:00
kib
0702b407d6 For sparc64 relocations that directly put bits of the symbol value into
the location, apply elf_relocaddr to the symbol value to have right
values for the symbols from dpcpu segment.

PR:	kern/147769
Discussed with:	avg
Tested by:	marius
MFC after:	2 weeks
2010-09-22 12:52:12 +00:00
marius
af982c87f7 Remove accidentally committed test code which effectively prevented the
use of the SPARC64 V VIS-based block copy function added in r212709.
Reported by:	Michael Moll
2010-09-16 12:05:46 +00:00
marius
6c236115f4 Add a VIS-based block copy function for SPARC64 V and later, which
additionally takes advantage of the prefetch cache of these CPUs.
Unlike the uncommitted US-III version, which provide no measurable
speedup or even resulted in a slight slowdown on certain CPUs models
compared to using the US-I version with these, the SPARC64 version
actually results in a slight improvement.
2010-09-15 21:44:31 +00:00
marius
3ae351af19 Add macros for alternate entry points. 2010-09-15 21:11:29 +00:00
marius
2e2ae916fe Sync with other platforms:
- make dflt_lock() always panic,
- add kludge to use contigmalloc() when the alignment is larger than the size
  and print a diagnostic when we didn't satisfy the alignment.
2010-09-15 17:11:15 +00:00
marius
3950b71da6 - Update the comment in swi_vm() regarding busdma bounce buffers; it's
unlikely that support for these ever will be implemented on sparc64 as
  the IOMMUs are able to translate to up to the maximum physical address
  supported by the respective machine, bypassing the IOMMU is affected
  by hardware errata and being able to support DMA engines which cannot
  do at least 32-bit DMA does not justify the costs.
- The page zeroing in uma_small_alloc() may use the VIS-based block zero
  function so take advantage of it.
2010-09-15 15:18:41 +00:00
marius
2172eaf479 Remove a KASSERT which will also trigger for perfectly valid combinations
of small maxsize and "large" (including BUS_SPACE_UNRESTRICTED) nsegments
parameters. Generally using a presz of 0 (which indeed might indicate the
use of bogus parameters for DMA tag creation) is not fatal, it just means
that no additional DVMA space will be preallocated.
2010-09-14 20:31:09 +00:00
marius
f5db453abe Remove redundant raising of the PIL to PIL_TICK as the respective locore
code already did that.
2010-09-14 19:35:43 +00:00
mav
eb4931dc6c Refactor timer management code with priority to one-shot operation mode.
The main goal of this is to generate timer interrupts only when there is
some work to do. When CPU is busy interrupts are generating at full rate
of hz + stathz to fullfill scheduler and timekeeping requirements. But
when CPU is idle, only minimum set of interrupts (down to 8 interrupts per
second per CPU now), needed to handle scheduled callouts is executed.
This allows significantly increase idle CPU sleep time, increasing effect
of static power-saving technologies. Also it should reduce host CPU load
on virtualized systems, when guest system is idle.

There is set of tunables, also available as writable sysctls, allowing to
control wanted event timer subsystem behavior:
  kern.eventtimer.timer - allows to choose event timer hardware to use.
On x86 there is up to 4 different kinds of timers. Depending on whether
chosen timer is per-CPU, behavior of other options slightly differs.
  kern.eventtimer.periodic - allows to choose periodic and one-shot
operation mode. In periodic mode, current timer hardware taken as the only
source of time for time events. This mode is quite alike to previous kernel
behavior. One-shot mode instead uses currently selected time counter
hardware to schedule all needed events one by one and program timer to
generate interrupt exactly in specified time. Default value depends of
chosen timer capabilities, but one-shot mode is preferred, until other is
forced by user or hardware.
  kern.eventtimer.singlemul - in periodic mode specifies how much times
higher timer frequency should be, to not strictly alias hardclock() and
statclock() events. Default values are 2 and 4, but could be reduced to 1
if extra interrupts are unwanted.
  kern.eventtimer.idletick - makes each CPU to receive every timer interrupt
independently of whether they busy or not. By default this options is
disabled. If chosen timer is per-CPU and runs in periodic mode, this option
has no effect - all interrupts are generating.

As soon as this patch modifies cpu_idle() on some platforms, I have also
refactored one on x86. Now it makes use of MONITOR/MWAIT instrunctions
(if supported) under high sleep/wakeup rate, as fast alternative to other
methods. It allows SMP scheduler to wake up sleeping CPUs much faster
without using IPI, significantly increasing performance on some highly
task-switching loads.

Tested by:	many (on i386, amd64, sparc64 and powerc)
H/W donated by:	Gheorghe Ardelean
Sponsored by:	iXsystems, Inc.
2010-09-13 07:25:35 +00:00
mav
4687e7f394 Sparc64 uses dummy cpu_idle() method. It's CPUs never sleeping. Tell
scheduler that it doesn't need to use IPI to "wake up" CPU.
2010-09-11 07:24:10 +00:00
avg
c9fe8ad7f0 bus_add_child: change type of order parameter to u_int
This reflects actual type used to store and compare child device orders.
Change is mostly done via a Coccinelle (soon to be devel/coccinelle)
semantic patch.
Verified by LINT+modules kernel builds.

Followup to:	r212213
MFC after:	10 days
2010-09-10 11:19:03 +00:00
jhb
ab70899cde Catch up to rename of the constant for the Master Data Parity Error bit in
the PCI status register.

Pointed out by:	mdf
Pointy hat to:	jhb
2010-09-09 20:26:30 +00:00
yongari
7ecefebc58 Enable sis(4). sis(4) should work on all architectures. 2010-09-02 18:12:54 +00:00
marius
dd1113bf70 Skip a KASSERT which isn't appropriate when not employing page coloring.
Reported by: Michael Moll
2010-08-21 14:28:48 +00:00
jhb
d02cab2556 Remove unused KTRACE includes. 2010-08-19 16:41:27 +00:00
kib
d9f088a03e Supply some useful information to the started image using ELF aux vectors.
In particular, provide pagesize and pagesizes array, the canary value
for SSP use, number of host CPUs and osreldate.

Tested by:	marius (sparc64)
MFC after:	1 month
2010-08-17 08:55:45 +00:00
jhb
1c3734f021 Update various places that store or manipulate CPU masks to use cpumask_t
instead of int or u_int.  Since cpumask_t is currently u_int on all
platforms this should just be a cosmetic change.
2010-08-11 23:22:53 +00:00
marius
48a7f0957b Wrap some sun4u-only symbols. 2010-08-08 14:37:16 +00:00
marius
e052e204a8 - As it is not possible for sched_bind(9) to context switch with
td_critnest > 1 when not already running on the desired CPU read the
  TICK counter of the BSP via a direct cross trap request in that case
  instead.
- Treat the STICK based timecounter the same way as the TICK based one
  regarding its quality and obtaining the counter value from the BSP.
  Like the TICK timers the STICK ones also are only synchronized during
  their startup (which might not result in good synchronicity in the
  first place) but not afterwards and might drift over time, causing
  problems when the time is read from different CPUs (see r135972).
2010-08-08 14:00:21 +00:00
marius
68af74e7b3 - Introduce a cpu_ipi_single() function pointer in order to send IPIs
to single CPUs more efficiently with Cheetah(-class) and Jalapeno CPUs.
  Besides being used to implement the ipi_cpu() introduced in r210939,
  cpu_ipi_single() will also be used internally by the sparc64 MD code.
- Factor out the Jalapeno support from the Cheetah IPI send functions
  in order to be able to more easily and efficiently implement support
  for more than 32 target CPUs as well as a workaround for Cheetah+
  erratum 25 for the latter.
2010-08-08 00:09:22 +00:00
marius
8a9a5f2db2 For CPUs which ignore TD_CV and support hardware unaliasing don't
bother doing page coloring. This results in a small but measurable
performance improvement in buildworld times.
2010-08-08 00:01:08 +00:00
jhb
19ddbf5c38 Add a new ipi_cpu() function to the MI IPI API that can be used to send an
IPI to a specific CPU by its cpuid.  Replace calls to ipi_selected() that
constructed a mask for a single CPU with calls to ipi_cpu() instead.  This
will matter more in the future when we transition from cpumask_t to
cpuset_t for CPU masks in which case building a CPU mask is more expensive.

Submitted by:	peter, sbruno
Reviewed by:	rookie
Obtained from:	Yahoo! (x86)
MFC after:	1 month
2010-08-06 15:36:59 +00:00
mav
4f9a242ba1 Adapt sparc64 and sun4v timer code for the new event timers infrastructure.
Reviewed by:	marius@
2010-07-29 12:08:46 +00:00
mdf
6857471cf3 Add MALLOC_DEBUG_MAXZONES debug malloc(9) option to use multiple uma
zones for each malloc bucket size.  The purpose is to isolate
different malloc types into hash classes, so that any buffer overruns
or use-after-free will usually only affect memory from malloc types in
that hash class.  This is purely a debugging tool; by varying the hash
function and tracking which hash class was corrupted, the intersection
of the hash classes from each instance will point to a single malloc
type that is being misused.  At this point inspection or memguard(9)
can be used to catch the offending code.

Add MALLOC_DEBUG_MAXZONES=8 to -current GENERIC configuration files.
The suggestion to have this on by default came from Kostik Belousov on
-arch.

This code is based on work by Ron Steinke at Isilon Systems.

Reviewed by:    -arch (mostly silence)
Reviewed by:    zml
Approved by:    zml (mentor)
2010-07-28 15:36:12 +00:00
jhb
f27c8b35e2 Very rough first cut at NUMA support for the physical page allocator. For
now it uses a very dumb first-touch allocation policy.  This will change in
the future.
- Each architecture indicates the maximum number of supported memory domains
  via a new VM_NDOMAIN parameter in <machine/vmparam.h>.
- Each cpu now has a PCPU_GET(domain) member to indicate the memory domain
  a CPU belongs to.  Domain values are dense and numbered from 0.
- When a platform supports multiple domains, the default freelist
  (VM_FREELIST_DEFAULT) is split up into N freelists, one for each domain.
  The MD code is required to populate an array of mem_affinity structures.
  Each entry in the array defines a range of memory (start and end) and a
  domain for the range.  Multiple entries may be present for a single
  domain.  The list is terminated by an entry where all fields are zero.
  This array of structures is used to split up phys_avail[] regions that
  fall in VM_FREELIST_DEFAULT into per-domain freelists.
- Each memory domain has a separate lookup-array of freelists that is
  used when fulfulling a physical memory allocation.  Right now the
  per-domain freelists are listed in a round-robin order for each domain.
  In the future a table such as the ACPI SLIT table may be used to order
  the per-domain lookup lists based on the penalty for each memory domain
  relative to a specific domain.  The lookup lists may be examined via a
  new vm.phys.lookup_lists sysctl.
- The first-touch policy is implemented by using PCPU_GET(domain) to
  pick a lookup list when allocating memory.

Reviewed by:	alc
2010-07-27 20:33:50 +00:00
attilio
c3ada57ad6 KTR_CTx are long time aliased by existing classes so they can't serve
their purpose anymore. Axe them out.

Sponsored by:	Sandvine Incorporated
Discussed with:	jhb, emaste
Possible MFC:	TBD
2010-07-21 10:05:07 +00:00
mav
6a650899c5 Allocate proper ammount of memory for interrupt names on sparc64 and
sun4v, same as done on other architectures. This removes garbage from
`vmstat -ia` output.

Reviewed by:	marius@
2010-07-16 22:09:29 +00:00
marius
5758b8c344 - Pin the IPI cache and TLB demap functions in order to prevent migration
between determining the other CPUs and calling cpu_ipi_selected(), which
  apart from generally doing the wrong thing can lead to a panic when a
  CPU is told to IPI itself (which sun4u doesn't support).
  Reported and tested by: Nathaniel W Filardo
- Add __unused where appropriate.

MFC after:	3 days
2010-07-04 12:43:12 +00:00
jhb
de324e256c Move prototypes for kern_sigtimedwait() and kern_sigprocmask() to
<sys/syscallsubr.h> where all other kern_<syscall> prototypes live.
2010-06-30 18:03:42 +00:00
nwhitehorn
c757ee90ae Provide for multiple, cascaded PICs on PowerPC systems, and extend the
OFW interrupt map interface to also return the device's interrupt parent.

MFC after:	8.1-RELEASE
2010-06-18 14:06:27 +00:00
marius
47283486da Update a branch missed in r207537.
MFC after:	3 days
2010-06-13 20:29:55 +00:00
alc
6a3535c3fa Relax one of the new assertions in pmap_enter() a little. Specifically,
allow pmap_enter() to be performed on an unmanaged page that doesn't have
VPO_BUSY set.  Having VPO_BUSY set really only matters for managed pages.
(See, for example, pmap_remove_write().)
2010-06-11 15:49:39 +00:00
alc
7c212e010d Reduce the scope of the page queues lock and the number of
PG_REFERENCED changes in vm_pageout_object_deactivate_pages().
Simplify this function's inner loop using TAILQ_FOREACH(), and shorten
some of its overly long lines.  Update a stale comment.

Assert that PG_REFERENCED may be cleared only if the object containing
the page is locked.  Add a comment documenting this.

Assert that a caller to vm_page_requeue() holds the page queues lock,
and assert that the page is on a page queue.

Push down the page queues lock into pmap_ts_referenced() and
pmap_page_exists_quick().  (As of now, there are no longer any pmap
functions that expect to be called with the page queues lock held.)

Neither pmap_ts_referenced() nor pmap_page_exists_quick() should ever
be passed an unmanaged page.  Assert this rather than returning "0"
and "FALSE" respectively.

ARM:

Simplify pmap_page_exists_quick() by switching to TAILQ_FOREACH().

Push down the page queues lock inside of pmap_clearbit(), simplifying
pmap_clear_modify(), pmap_clear_reference(), and pmap_remove_write().
Additionally, this allows for avoiding the acquisition of the page
queues lock in some cases.

PowerPC/AIM:

moea*_page_exits_quick() and moea*_page_wired_mappings() will never be
called before pmap initialization is complete.  Therefore, the check
for moea_initialized can be eliminated.

Push down the page queues lock inside of moea*_clear_bit(),
simplifying moea*_clear_modify() and moea*_clear_reference().

The last parameter to moea*_clear_bit() is never used.  Eliminate it.

PowerPC/BookE:

Simplify mmu_booke_page_exists_quick()'s control flow.

Reviewed by:	kib@
2010-06-10 16:56:35 +00:00
alc
b11ca4bc20 Don't set PG_WRITEABLE in pmap_enter() unless the page is managed.
Correct a typo in a nearby comment on sparc64.
2010-06-05 18:20:09 +00:00
alc
3f1d4b057c Push down page queues lock acquisition in pmap_enter_object() and
pmap_is_referenced().  Eliminate the corresponding page queues lock
acquisitions from vm_map_pmap_enter() and mincore(), respectively.  In
mincore(), this allows some additional cases to complete without ever
acquiring the page queues lock.

Assert that the page is managed in pmap_is_referenced().

On powerpc/aim, push down the page queues lock acquisition from
moea*_is_modified() and moea*_is_referenced() into moea*_query_bit().
Again, this will allow some additional cases to complete without ever
acquiring the page queues lock.

Reorder a few statements in vm_page_dontneed() so that a race can't lead
to an old reference persisting.  This scenario is described in detail by a
comment.

Correct a spelling error in vm_page_dontneed().

Assert that the object is locked in vm_page_clear_dirty(), and restrict the
page queues lock assertion to just those cases in which the page is
currently writeable.

Add object locking to vnode_pager_generic_putpages().  This was the one
and only place where vm_page_clear_dirty() was being called without the
object being locked.

Eliminate an unnecessary vm_page_lock() around vnode_pager_setsize()'s call
to vm_page_clear_dirty().

Change vnode_pager_generic_putpages() to the modern-style of function
definition.  Also, change the name of one of the parameters to follow
virtual memory system naming conventions.

Reviewed by:	kib
2010-05-26 18:00:44 +00:00
alc
32b13ee957 Roughly half of a typical pmap_mincore() implementation is machine-
independent code.  Move this code into mincore(), and eliminate the
page queues lock from pmap_mincore().

Push down the page queues lock into pmap_clear_modify(),
pmap_clear_reference(), and pmap_is_modified().  Assert that these
functions are never passed an unmanaged page.

Eliminate an inaccurate comment from powerpc/powerpc/mmu_if.m:
Contrary to what the comment says, pmap_mincore() is not simply an
optimization.  Without a complete pmap_mincore() implementation,
mincore() cannot return either MINCORE_MODIFIED or MINCORE_REFERENCED
because only the pmap can provide this information.

Eliminate the page queues lock from vfs_setdirty_locked_object(),
vm_pageout_clean(), vm_object_page_collect_flush(), and
vm_object_page_clean().  Generally speaking, these are all accesses
to the page's dirty field, which are synchronized by the containing
vm object's lock.

Reduce the scope of the page queues lock in vm_object_madvise() and
vm_page_dontneed().

Reviewed by:	kib (an earlier version)
2010-05-24 14:26:57 +00:00
kib
4208ccbe79 Reorganize syscall entry and leave handling.
Extend struct sysvec with three new elements:
sv_fetch_syscall_args - the method to fetch syscall arguments from
  usermode into struct syscall_args. The structure is machine-depended
  (this might be reconsidered after all architectures are converted).
sv_set_syscall_retval - the method to set a return value for usermode
  from the syscall. It is a generalization of
  cpu_set_syscall_retval(9) to allow ABIs to override the way to set a
  return value.
sv_syscallnames - the table of syscall names.

Use sv_set_syscall_retval in kern_sigsuspend() instead of hardcoding
the call to cpu_set_syscall_retval().

The new functions syscallenter(9) and syscallret(9) are provided that
use sv_*syscall* pointers and contain the common repeated code from
the syscall() implementations for the architecture-specific syscall
trap handlers.

Syscallenter() fetches arguments, calls syscall implementation from
ABI sysent table, and set up return frame. The end of syscall
bookkeeping is done by syscallret().

Take advantage of single place for MI syscall handling code and
implement ptrace_lwpinfo pl_flags PL_FLAG_SCE, PL_FLAG_SCX and
PL_FLAG_EXEC. The SCE and SCX flags notify the debugger that the
thread is stopped at syscall entry or return point respectively.  The
EXEC flag augments SCX and notifies debugger that the process address
space was changed by one of exec(2)-family syscalls.

The i386, amd64, sparc64, sun4v, powerpc and ia64 syscall()s are
changed to use syscallenter()/syscallret(). MIPS and arm are not
converted and use the mostly unchanged syscall() implementation.

Reviewed by:	jhb, marcel, marius, nwhitehorn, stas
Tested by:	marcel (ia64), marius (sparc64), nwhitehorn (powerpc),
	stas (mips)
MFC after:	1 month
2010-05-23 18:32:02 +00:00
marius
427e94121a Change ad_firmware_geom_adjust() to operate on a struct disk * only and
hook it up to ada(4) also. While at it, rename *ad_firmware_geom_adjust()
to *ata_disk_firmware_geom_adjust() etc now that these are no longer
limited to ad(4).

Reviewed by:	mav
MFC after:	3 days
2010-05-20 12:46:19 +00:00
alc
f6c07c5b87 On entry to pmap_enter(), assert that the page is busy. While I'm
here, make the style of assertion used by pmap_enter() consistent
across all architectures.

On entry to pmap_remove_write(), assert that the page is neither
unmanaged nor fictitious, since we cannot remove write access to
either kind of page.

With the push down of the page queues lock, pmap_remove_write() cannot
condition its behavior on the state of the PG_WRITEABLE flag if the
page is busy.  Assert that the object containing the page is locked.
This allows us to know that the page will neither become busy nor will
PG_WRITEABLE be set on it while pmap_remove_write() is running.

Correct a long-standing bug in vm_page_cowsetup().  We cannot possibly
do copy-on-write-based zero-copy transmit on unmanaged or fictitious
pages, so don't even try.  Previously, the call to pmap_remove_write()
would have failed silently.
2010-05-16 23:45:10 +00:00
marius
f545ac862c - Enable DMA write parity error interrupts on Schizo with a working
implementation.
- Revert the Sun Fire V890 WAR of r205254. Instead let schizo_pci_bus()
  only panic in case of fatal errors as the interrupt triggered by the
  error the firmware of these and also Sun Fire 280R with version 7
  Schizo caused may happen as late as using the HBA and not only prior
  to touching the PCI bus (in the former case the actual error still is
  fatal but we clear it before touching the PCI bus).
  While at it count and export non-fatal error interrupts via sysctl(9).
- Remove unnecessary locking from schizo_ue().
2010-05-14 20:00:21 +00:00
alc
40b44f9713 Push down the page queues into vm_page_cache(), vm_page_try_to_cache(), and
vm_page_try_to_free().  Consequently, push down the page queues lock into
pmap_enter_quick(), pmap_page_wired_mapped(), pmap_remove_all(), and
pmap_remove_write().

Push down the page queues lock into Xen's pmap_page_is_mapped().  (I
overlooked the Xen pmap in r207702.)

Switch to a per-processor counter for the total number of pages cached.
2010-05-08 20:34:01 +00:00
alc
4b98b3d320 Push down the page queues lock inside of vm_page_free_toq() and
pmap_page_is_mapped() in preparation for removing page queues locking
around calls to vm_page_free().  Setting aside the assertion that calls
pmap_page_is_mapped(), vm_page_free_toq() now acquires and holds the page
queues lock just long enough to actually add or remove the page from the
paging queues.

Update vm_page_unhold() to reflect the above change.
2010-05-06 16:39:43 +00:00
alc
966cbc602a Use an OBJT_PHYS object and thus PG_UNMANAGED pages to implement the TSB.
The TSB is not a pageable structure, so there is no point in using managed
pages.

Reviewed by:	kib
2010-05-05 07:47:40 +00:00