ways:
(1) Cached pages are no longer kept in the object's resident page
splay tree and memq. Instead, they are kept in a separate per-object
splay tree of cached pages. However, access to this new per-object
splay tree is synchronized by the _free_ page queues lock, not to be
confused with the heavily contended page queues lock. Consequently, a
cached page can be reclaimed by vm_page_alloc(9) without acquiring the
object's lock or the page queues lock.
This solves a problem independently reported by tegge@ and Isilon.
Specifically, they observed the page daemon consuming a great deal of
CPU time because of pages bouncing back and forth between the cache
queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE). The source of
this problem turned out to be a deadlock avoidance strategy employed
when selecting a cached page to reclaim in vm_page_select_cache().
However, the root cause was really that reclaiming a cached page
required the acquisition of an object lock while the page queues lock
was already held. Thus, this change addresses the problem at its
root, by eliminating the need to acquire the object's lock.
Moreover, keeping cached pages in the object's primary splay tree and
memq was, in effect, optimizing for the uncommon case. Cached pages
are reclaimed far, far more often than they are reactivated. Instead,
this change makes reclamation cheaper, especially in terms of
synchronization overhead, and reactivation more expensive, because
reactivated pages will have to be reentered into the object's primary
splay tree and memq.
(2) Cached pages are now stored alongside free pages in the physical
memory allocator's buddy queues, increasing the likelihood that large
allocations of contiguous physical memory (i.e., superpages) will
succeed.
Finally, as a result of this change long-standing restrictions on when
and where a cached page can be reclaimed and returned by
vm_page_alloc(9) are eliminated. Specifically, calls to
vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and
return a formerly cached page. Consequently, a call to malloc(9)
specifying M_NOWAIT is less likely to fail.
Discussed with: many over the course of the summer, including jeff@,
Justin Husted @ Isilon, peter@, tegge@
Tested by: an earlier version by kris@
Approved by: re (kensmith)
In particular:
- smp_tlb_mtx is no longer used, so it is axed.
- smp rendezvous lock isn't really a leaf spin-mutex. Its bad placement in
the table, however, has been the source of a false positive LOR reporting
with the dt_lock. However, smp rendezvous lock would have had sched_lock
there for older lock, so it wasn't still a leaf lock.
- allpmaps is only used in ia32 architecture, so it is inserted in the
appropriate stub.
Addictionally:
- kse_zombie_lock is no longer present, so its definition is axed out.
- zombie_lock doesn't need to have an exported symbol, so just let's it be
declared as static.
Tested by: kris
Approved by: jeff (mentor)
Approved by: re
Together with the sys/i386/i386/trap.c rev. 1.306 it fixes the PR.
Submitted by: rdivacky
Suggested by: jhb
Sponsored by: Google Summer of Code 2007
PR: kern/77710
Approved by: re (kensmith)
3 arguments, but we had forgotten the second argument. Also make the
Linux statfs64 struct depend on the architecture because it has an
extra 4 bytes padding on amd64 compared to i386.
The three argument fix is from David Taylor, the struct statfs64
stuff is my fault. With this patch I can install i386 Linux matlab
on an amd64 machine.
Submitted by: David Taylor <davidt_at_yadt.co.uk>
Approved by: re (kensmith)
was used in assembler code in such a way that no unresolved relocation
records were generated, so ld didn't flag the problem. You can see
this with an 'nm' of the kernel. There will be 'U MAXCPU' on SMP systems.
The impact of this is that the intrcount/intrnames arrays do not have
the intended amount of space reserved. This could lead to interesting
problems due to the arrays being present in the middle of kernel code.
An overflow would be rather interesting as executable code would be used
as per-cpu incrementing interrupt counters.
This fixes it for now by exporting MAXCPU to the assembler. A better fix
might be to define these data structures in C - they're only referenced
in the kernel from C code these days anyway.
Approved by: re (kensmith)
- p_sflag was mostly protected by PROC_LOCK rather than the PROC_SLOCK or
previously the sched_lock. These bugs have existed for some time.
- Allow swapout to try each thread in a process individually and then
swapin the whole process if any of these fail. This allows us to move
most scheduler related swap flags into td_flags.
- Keep ki_sflag for backwards compat but change all in source tools to
use the new and more correct location of P_INMEM.
Reported by: pho
Reviewed by: attilio, kib
Approved by: re (kensmith)
of pages don't sum to anywhere near the total number of pages on amd64.
This is for the most part because uma_small_alloc() pages have never been
counted as wired pages, like their kmem_malloc() brethren. They should
be. This changes fixes that.
It is no longer necessary for the page queues lock to be held to free
pages allocated by uma_small_alloc(). I removed the acquisition and
release of the page queues lock from uma_small_free() on amd64 and ia64
weeks ago. This patch updates the other architectures that have
uma_small_alloc() and uma_small_free().
Approved by: re (kensmith)
per-primitive macros like MTX_NOPROFILE, SX_NOPROFILE or RW_NOPROFILE) is
not really honoured. In particular lock_profile_obtain_lock_failure() and
lock_profile_obtain_lock_success() are naked respect this flag.
The bug leads to locks marked with no-profiling to be profiled as well.
In the case of the clock_lock, used by the timer i8254 this leads to
unpredictable behaviour both on amd64 and ia32 (double faults panic,
sudden reboots, etc.). The amd64 clock_lock is also not marked as
not profilable as it should be.
Fix these bugs adding proper checks in the lock profiling code and at
clock_lock initialization time.
i8254 bug pointed out by: kris
Tested by: matteo, Giuseppe Cocomazzi <sbudella at libero dot it>
Approved by: jeff (mentor)
Approved by: re
topology foo functions.
Working at the patch for topology problems in ia32/amd64 evicted some
problems regarding functions ordering in the SI_SUB_CPU family of
SYSINIT'ed subsystems.
In order to avoid problems with new modified to involved functions, a
correct ordering is not semantically specified for SI_SUB_CPU functions
(for a larger view of the issue please visit:
http://lists.freebsd.org/pipermail/freebsd-current/2007-July/075409.html )
Discussed with: peter
Tested by: kris, Rui Paulo <rpaulo@FreeBSD.org>
Approved by: jeff
Approved by: re
with Linux 2.6 emulation. This shall be reimplemented once FreeBSD gets
native scheduler affinity syscalls.
Submitted by: rdivacky
Reviewed by: jkim
Sponsored by: Google Summer of Code 2007
Approved by: re (kensmith)
longer create a pv entry for that mapping. (The two exceptions are
mappings into the kernel's exec and pipe submaps.) Consequently, there is
no reason for get_pv_entry() to dig deep into the free page queues, i.e.,
use VM_ALLOC_SYSTEM, by default. This revision changes get_pv_entry() to
use VM_ALLOC_NORMAL by default, i.e., before calling pmap_collect() to
reclaim pv entries.
Approved by: re (kensmith)
and newer CPUs (including Core 2 and Core / Core 2 based Xeons). The
driver attaches to each cpu device and creates a sysctl node in that
device's sysctl context (dev.cpu.N.temperature). When invoked, the
handler binds to the appropriate CPU to ensure a correct reading.
Submitted by: Rui Paulo <rpaulo@fnop.net>
Sponsored by: Google Summer of Code 2007
Tested by: des, marcus, Constantine A. Murenin, Ian FREISLICH
Approved by: re (kensmith)
MFC after: 3 weeks
cpu_start_mp(). This is after we have read the cpuid registers to
calculate the hyperthreading_cpus value for the sysctl that enables or
disables hyperthread cores. Change mp_topology() to use that information
rather than trying to do it itself.
This solves the problem of ULE being incorrectly told that dual core
Athlon64 X2 or Operton cpus are hyperthreading cores. At the very least,
we now have a single piece of code to identify hyperthreading.
Obtained from: jhb
Approved by: re (kensmith)
value, then we would use a negative index into the trap_msg[] array
resulting in a nested page fault. Make the 'type' variable holding the
trap number unsigned to avoid this.
MFC after: 2 weeks
Approved by: re (rwatson)
print a one line error message. Add some comments on not being able to
trust the day of week field (I'll act on these comments in a follow up
commit).
Approved by: re
MFC after: 3 weeks
require fewer blocking loops.
- Don't use atomic ops with 4BSD or on UP.
- Only use the blocking loop if ULE is compiled in.
- Use the correct memory barrier.
Discussed with: attilio, jhb, ssouhlal
Tested by: current@
Approved by: re
kernels exposed by the recent fixes to resource limits for 32-bit processes
on 64-bit kernels:
- Let ABIs expose their maximum stack size via a new pointer in sysentvec
and use that in preference to maxssiz during exec() rather than always
using maxssiz for all processses.
- Apply the ABI's limit fixup to the previous stack size when adjusting
RLIMIT_STACK to determine if the existing mapping for the stack needs to
be grown or shrunk (as well as how much it should be grown or shrunk).
Approved by: re (kensmith)
the 7.0 timeframe.
This is needed because I4B is not locked and NET_NEEDS_GIANT goes away.
The plan is to lock I4B and bring everything back for 7.1.
Approved by: re (kensmith)
holding the page queues lock. Thus, the page table pages released by
pmap_remove() and pmap_remove_pages() can be freed after the page queues
lock is released.
Approved by: re (kensmith)
114 bytes of cmos ram in the PC clock chip. The big difference between
this and the Linux version is that we do not recalculate the checksums
for bytes 16..31.
We use this at work when cloning identical machines - we can copy the
bios settings as well. Reading /dev/nvram gives 114 bytes of data but
you can seek/read/write whichever bytes you like.
Yes, this is a "foot, gun, fire!" type of device.
more exposure. The current state of SCTP implementation is
considered to be ready for 32-bit platforms, but still need some
work/testing on 64-bit platforms.
Approved by: re (kensmith)
Discussed with: rrs
making the relevant files standard. This avoids duplication and
makes it easier to override/disable unwanted schemes. Since ARM
doesn't have a DEFAULTS configuration file, leave the source
files for the BSD and MBR partitioning schemes in files.arm for
now.
In particular:
- Add an explicative table for locking of struct vmmeter members
- Apply new rules for some of those members
- Remove some unuseful comments
Heavily reviewed by: alc, bde, jeff
Approved by: jeff (mentor)
caches with data caches after writing to memory. This typically
is required to make breakpoints work on ia64 and powerpc. For
those architectures the function is implemented.
oldthread should point at before we return.
- When cpu_switch() is called the td_lock pointer in the old thread may
point at the blocked lock. This prevents other processors from
switching into this thread while we're still switching out. Wait
until we're done deactivating the vmspace before we release the
thread by assigning to td_lock.
- Before we can activate the new vmspace we must make sure that the new
thread is not assigned to the blocked lock. It may be in the process
of switching out on another cpu. Spin until the new thread is
available.
- Use thread_lock() rather than sched_lock for per-thread scheduling
sychronization.
- Use the per-process spinlock rather than the sched_lock for per-process
scheduling synchronization.
Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
- There is no globally visible scheduler lock any longer. For now the
watchdog can only check Giant. This model of checking particular locks
is flawed and should be revisited. Other metrics should be considered.
Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
- Use sched_throw() rather than replicating the same cpu_throw() code for
each architecture. This also allows the scheduler to use any locking it
may want to.
- Use the thread_lock() rather than sched_lock when preempting.
- The scheduler lock is not required to synchronize release_aps.
Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
- Rename PCPU_LAZY_INC into PCPU_INC
- Add the PCPU_ADD interface which just does an add on the pcpu member
given a specific value.
Note that for most architectures PCPU_INC and PCPU_ADD are not safe.
This is a point that needs some discussions/work in the next days.
Reviewed by: alc, bde
Approved by: jeff (mentor)
sysctl_handle_int is not sizeof the int type you want to export.
The type must always be an int or an unsigned int.
Remove the instances where a sizeof(variable) is passed to stop
people accidently cut and pasting these examples.
In a few places this was sysctl_handle_int was being used on 64 bit
types, which would truncate the value to be exported. In these
cases use sysctl_handle_quad to export them and change the format
to Q so that sysctl(1) can still print them.
Probabilly, a general approach is not the better solution here, so we should
solve the sched_lock protection problems separately.
Requested by: alc
Approved by: jeff (mentor)
handler is wrapped in a couple of functions - a filter wrapper and an
ithread wrapper. In this case (and just in this case), the filter
wrapper could ask the system to schedule the ithread and mask the
interrupt source if the wrapped handler is composed of just an ithread
handler: modify the "old" interrupt code to make it support
this situation, while the "new" interrupt code is already ok.
Discussed with: jhb