2488 Commits

Author SHA1 Message Date
kib
2d39ccdd94 Lock the new map in vmspace_fork(). The newly allocated map should not
be accessible outside vmspace_fork() yet, but locking it would satisfy
the protocol of the vm_map_entry_link() and other functions called
from vmspace_fork().

Use trylock that is supposedly cannot fail, to silence WITNESS warning
of the nested acquisition of the sx lock with the same name.

Suggested and reviewed by:	tegge
2009-02-08 19:55:03 +00:00
kib
0125db5772 Assert that vnode is exclusively locked when its vm object is resized.
Reviewed by:	tegge
2009-02-08 19:44:50 +00:00
kib
7a52ad11f4 Do not leak the MAP_ENTRY_IN_TRANSITION flag when copying map entry
on fork. Otherwise, copied entry cannot be removed in the child map.

Reviewed by:	tegge
MFC after:	2 weeks
2009-02-08 19:41:08 +00:00
kib
b798264c6d Style. 2009-02-08 19:37:01 +00:00
jeff
69d1bd8670 - Make the keg abstraction more complete. Permit a zone to have multiple
backend kegs so it may source compatible memory from multiple backends.
   This is useful for cases such as NUMA or different layouts for the same
   memory type.
 - Provide a new api for adding new backend kegs to secondary zones.
 - Provide a new flag for adjusting the layout of zones to stagger
   allocations better across cache lines.

Sponsored by:	Nokia
2009-01-25 09:11:24 +00:00
jhb
a9601871c9 - Mark all standalone INT/LONG/QUAD sysctl's MPSAFE. This is done
inside the SYSCTL() macros and thus does not need to be done for
  all of the nodes scattered across the source tree.
- Mark the name-cache related sysctl's (including debug.hashstat.*) MPSAFE.
- Mark vm.loadavg MPSAFE.
- Remove GIANT_REQUIRED from vmtotal() (everything in this routine already
  has sufficient locking) and mark vm.vmtotal MPSAFE.
- Mark the vm.stats.(sys|vm).* sysctls MPSAFE.
2009-01-23 22:49:23 +00:00
jhb
dc43531891 Now that vfs_markatime() no longer requires an exclusive lock due to
the VOP_MARKATIME() changes, use a shared vnode lock for mmap().

Submitted by:	ups
2009-01-21 14:43:35 +00:00
kib
ac1b596fda Extend the struct vm_page wire_count to u_int to avoid the overflow
of the counter, that may happen when too many sendfile(2) calls are
being executed with this vnode [1].

To keep the size of the struct vm_page and offsets of the fields
accessed by out-of-tree modules, swap the types and locations
of the wire_count and cow fields. Add safety checks to detect cow
overflow and force fallback to the normal copy code for zero-copy
sockets. [2]

Reported by:	Anton Yuzhaninov <citrin citrin ru> [1]
Suggested by:	alc [2]
Reviewed by:	alc
MFC after:	2 weeks
2009-01-03 13:24:08 +00:00
alc
a2ba509efc Resurrect shared map locks allowing greater concurrency during some map
operations, such as page faults.

An earlier version of this change was ...

Reviewed by:	kib
Tested by:	pho
MFC after:	6 weeks
2009-01-01 00:31:46 +00:00
alc
81e94e0a4c Update or eliminate some stale comments. 2008-12-31 05:44:05 +00:00
alc
fa3b7c7db3 Avoid an unnecessary memory dereference in vm_map_entry_splay(). 2008-12-30 21:52:18 +00:00
alc
c1be5ff444 Style change to vm_map_lookup(): Eliminate a macro of dubious value. 2008-12-30 20:51:07 +00:00
alc
96037be899 Move the implementation of the vm map's fast path on address lookup from
vm_map_lookup{,_locked}() to vm_map_lookup_entry().  Having the fast path
in vm_map_lookup{,_locked}() limits its benefits to page faults.  Moving
it to vm_map_lookup_entry() extends its benefits to other operations on
the vm map.
2008-12-30 19:48:03 +00:00
rnoland
63e9bb0efa Fix printing of KASSERT message missed in r163604.
Approved by:	kib
2008-12-21 16:56:13 +00:00
kib
4edc1f9451 Instead of forcing vn_start_write() to reset mp back to NULL for the
failed calls with non-NULL vp, explicitely clear mp after failure.

Tested by:	stass
Reviewed by:	tegge
PR:		123768
MFC after:	1 week
2008-11-16 21:57:54 +00:00
raj
032a270ba5 Support kernel crash mini dumps on ARM architecture.
Obtained from:	Juniper Networks, Semihalf
2008-11-06 16:20:27 +00:00
keramida
4b957256d0 Various comment nits, and typos. 2008-11-02 00:41:26 +00:00
rwatson
3667673537 Update mmap() comment: no more block devices, so no more block device
cache coherency questions.

MFC after:	3 days
2008-10-22 16:50:12 +00:00
attilio
b8bf37e585 Remove the struct thread unuseful argument from bufobj interface.
In particular following functions KPI results modified:
- bufobj_invalbuf()
- bufsync()

and BO_SYNC() "virtual method" of the buffer objects set.
Main consumers of bufobj functions are affected by this change too and,
in particular, functions which changed their KPI are:
- vinvalbuf()
- g_vfs_close()

Due to the KPI breakage, __FreeBSD_version will be bumped in a later
commit.

As a side note, please consider just temporary the 'curthread' argument
passing to VOP_SYNC() (in bufsync()) as it will be axed out ASAP

Reviewed by:	kib
Tested by:	Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
2008-10-10 21:23:50 +00:00
kib
de9e891748 Move the code for doing out-of-memory grass from vm_pageout_scan()
into the separate function vm_pageout_oom(). Supply a parameter for
vm_pageout_oom() describing a reason for the call.

Call vm_pageout_oom() from the swp_pager_meta_build() when swap zone
is exhausted.

Reviewed by:	alc
Tested by:	pho, jhb
MFC after:	2 weeks
2008-09-29 19:45:12 +00:00
emaste
a173e0f84d Move CTASSERT from header file to source file, per implementation note now
in the CTASSERT man page.
2008-09-26 18:44:40 +00:00
kib
496b70bc6a Save previous content of the td_fpop before storing the current
filedescriptor into it. Make sure that td_fpop is NULL when calling
d_mmap from dev_pager_getpages().

Change guards against td_fpop field being non-NULL with private state
for another device, and against sudden clearing the td_fpop. This
could occur when either a driver method calls another driver through
the filedescriptor operation, or a page fault happen while driver is
writing to a memory backed by another driver.

Noted by:	rwatson
Tested by:	rnoland
MFC after:	3 days
2008-09-26 14:50:49 +00:00
alc
409b7f1ce0 Prevent an integer overflow in vm_pageout_page_stats() on machines with a
large number of physical pages.

PR:		126158
Submitted by:	Dmitry Tejblum
MFC after:	3 days
2008-09-21 18:01:34 +00:00
kib
2a1cba02c6 Allow the d_mmap driver methods to use cdevpriv KPI during verification
phase of establishing mapping.

Discussed with:	rwatson, jhb, rnoland
Tested by:	rnoland
MFC after:	3 days
2008-09-20 19:56:02 +00:00
attilio
dbf35e279f Decontextualize the couplet VOP_GETATTR / VOP_SETATTR as the passed thread
was always curthread and totally unuseful.

Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
2008-08-28 15:23:18 +00:00
antoine
d4ceeb8daa Remove unused variable nosleepwithlocks.
PR:		126609
Submitted by:	Mateusz Guzik
MFC after:	1 month
X-MFC:		to stable/7 only, this variable is still used in stable/6
2008-08-23 12:40:07 +00:00
nwhitehorn
077618820d Allow the MD UMA allocator to use VM routines like kmem_*(). Existing code requires MD allocator to be available early in the boot process, before the VM is fully available. This defines a new VM define (UMA_MD_SMALL_ALLOC_NEEDS_VM) that allows an MD UMA small allocator to become available at the same time as the default UMA allocator.
Approved by:	marcel (mentor)
2008-08-23 01:35:36 +00:00
julian
0592958505 A bunch of formatting fixes brough to light by, or created by the Vimage commit
a few days ago.
2008-08-20 01:05:56 +00:00
kmacy
de0a83a794 Work around differences in page allocation for initial page tables on xen
MFC after:	1 month
2008-08-17 23:40:29 +00:00
emaste
b30356b634 Fix REDZONE(9) on amd64 and perhaps other 64 bit targets -- ensure the space
that redzone adds to the allocation for storing its metadata is at least as
large as the metadata that it will store there.

Submitted by:	Nima Misaghian
2008-08-13 17:32:48 +00:00
jhb
8af56fb687 If a thread that is swapped out is made runnable, then the setrunnable()
routine wakes up proc0 so that proc0 can swap the thread back in.
Historically, this has been done by waking up proc0 directly from
setrunnable() itself via a wakeup().  When waking up a sleeping thread
that was swapped out (the usual case when waking proc0 since only sleeping
threads are eligible to be swapped out), this resulted in a bit of
recursion (e.g. wakeup() -> setrunnable() -> wakeup()).

With sleep queues having separate locks in 6.x and later, this caused a
spin lock LOR (sleepq lock -> sched_lock/thread lock -> sleepq lock).
An attempt was made to fix this in 7.0 by making the proc0 wakeup use
the ithread mechanism for doing the wakeup.  However, this required
grabbing proc0's thread lock to perform the wakeup.  If proc0 was asleep
elsewhere in the kernel (e.g. waiting for disk I/O), then this degenerated
into the same LOR since the thread lock would be some other sleepq lock.

Fix this by deferring the wakeup of the swapper until after the sleepq
lock held by the upper layer has been locked.  The setrunnable() routine
now returns a boolean value to indicate whether or not proc0 needs to be
woken up.  The end result is that consumers of the sleepq API such as
*sleep/wakeup, condition variables, sx locks, and lockmgr, have to wakeup
proc0 if they get a non-zero return value from sleepq_abort(),
sleepq_broadcast(), or sleepq_signal().

Discussed with:	jeff
Glanced at by:	sam
Tested by:	Jurgen Weber  jurgen - ish com au
MFC after:	2 weeks
2008-08-05 20:02:31 +00:00
trhodes
f37865f7f0 Fill in a few sysctl descriptions.
Reviewed by:	alc, Matt Dillon <dillon@apollo.backplane.com>
Approved by:	alc
2008-08-03 14:26:15 +00:00
jhb
dccc76958e One more whitespace nit. 2008-07-30 21:23:32 +00:00
jhb
746c7fb6aa A few more whitespace fixes. 2008-07-30 21:18:08 +00:00
jhb
785ebd6040 If the kernel has run out of metadata for swap, then explicitly panic()
instead of emitting a warning before deadlocking.

MFC after:	1 month
2008-07-30 21:12:15 +00:00
kib
85be7d9093 The behaviour of the lockmgr going back at least to the 4.4BSD-Lite2 was
to downgrade the exclusive lock to shared one when exclusive lock owner
requested shared lock. New lockmgr panics instead.

The vnode_pager_lock function requests shared lock on the vnode backing
the OBJT_VNODE, and can be called when the current thread already holds
an exlcusive lock on the vnode. For instance, it happens when handling
page fault from the VOP_WRITE() uiomove that writes to the file, with
the faulted in page fetched from the vm object backed by the same file.
We then get the situation described above.

Verify whether the vnode is already exclusively locked by the curthread
and request recursed exclusive vnode lock instead of shared, if true.

Reported by:	gallatin
Discussed with:	attilio
2008-07-30 18:16:06 +00:00
alc
dba03ac26a Eliminate stale comments from kmem_malloc(). 2008-07-18 17:41:31 +00:00
kib
b341cd2da2 Use the VM_ALLOC_INTERRUPT for the page requests when allocating memory
for the bio for swapout write. It allows the page allocator to drain
free page list deeper. As result, a deadlock where pageout deamon sleeps
waiting for bio to be allocated for swapout is no more reproducable in
practice.

Alan said that M_USE_RESERVE shall be ressurrected and used there, but
until this is implemented, M_NOWAIT does exactly what is needed.

Tested by:	pho, kris
Reviewed by:	alc
No objections from:	phk
MFC after:	2 weeks (RELENG_7 only)
2008-07-11 11:27:42 +00:00
alc
c016906f4e Enable the creation of a kmem map larger than 4GB.
Submitted by: Tz-Huan Huang

Make several variables related to kmem map auto-sizing static.
Found by: CScout
2008-07-05 19:34:33 +00:00
alc
9fe1756df8 Make preparations for increasing the size of the kernel virtual address space
on the amd64 architecture.  The amd64 architecture requires kernel code and
global variables to reside in the highest 2GB of the 64-bit virtual address
space.  Thus, the memory allocated during bootstrap, before the call to
kmem_init(), starts at KERNBASE, which is not necessarily the same as
VM_MIN_KERNEL_ADDRESS on amd64.
2008-06-22 04:54:27 +00:00
alc
bec9e612d2 KERNBASE is not necessarily an address within the kernel map, e.g.,
PowerPC/AIM.  Consequently, it should not be used to determine the maximum
number of kernel map entries.  Intead, use VM_MIN_KERNEL_ADDRESS, which marks
the start of the kernel map on all architectures.

Tested by:	marcel@ (PowerPC/AIM)
2008-06-21 21:02:13 +00:00
ups
16b9649ce4 Fix vm object creation locking to allow SHARED vnode locking for vnode_create_vobject.
(Not currently used)

Noticed by: kib@
2008-06-12 20:46:47 +00:00
alc
25f7299f0f Essentially, neither madvise(..., MADV_DONTNEED) nor madvise(..., MADV_FREE)
work.  (Moreover, I don't believe that they have ever worked as intended.)
The explanation is fairly simple.  Both MADV_DONTNEED and MADV_FREE perform
vm_page_dontneed() on each page within the range given to madvise().  This
function moves the page to the inactive queue.  Specifically, if the page is
clean, it is moved to the head of the inactive queue where it is first in
line for processing by the page daemon.  On the other hand, if it is dirty,
it is placed at the tail.  Let's further examine the case in which the page
is clean.  Recall that the page is at the head of the line for processing by
the page daemon.  The expectation of vm_page_dontneed()'s author was that
the page would be transferred from the inactive queue to the cache queue by
the page daemon.  (Once the page is in the cache queue, it is, in effect,
free, that is, it can be reallocated to a new vm object by vm_page_alloc()
if it isn't reactivated quickly enough by a user of the old vm object.)  The
trouble is that nowhere in the execution of either MADV_DONTNEED or
MADV_FREE is either the machine-independent reference flag (PG_REFERENCED)
or the reference bit in any page table entry (PTE) mapping the page cleared.
Consequently, the immediate reaction of the page daemon is to reactivate the
page because it is referenced.  In effect, the madvise() was for naught.
The case in which the page was dirty is not too different.  Instead of being
laundered, the page is reactivated.

Note: The essential difference between MADV_DONTNEED and MADV_FREE is
that MADV_FREE clears a page's dirty field.  So, MADV_FREE is always
executing the clean case above.

This revision changes vm_page_dontneed() to clear both the machine-
independent reference flag (PG_REFERENCED) and the reference bit in all PTEs
mapping the page.

MFC after:	6 weeks
2008-06-06 18:38:43 +00:00
alc
933221c70b To date, our implementation of munmap(2) has required that the
entirety of the specified range be mapped.  Specifically, it has
returned EINVAL if the entire range is not mapped.  There is not,
however, any basis for this in either SuSv2 or our own man page.
Moreover, neither Linux nor Solaris impose this requirement.  This
revision removes this requirement.

Submitted by: Tijl Coosemans
PR: 118510
MFC after: 6 weeks
2008-05-24 21:57:16 +00:00
ups
fbd329664f Allow VM object creation in ufs_lookup. (If vfs.vmiodirenable is set)
Directory IO without a VM object will store data in 'malloced' buffers
severely limiting caching of the data. Without this  change VM objects for
directories are only created on an open() of the directory.
TODO: Inline test if VM object already exists to avoid locking/function call
overhead.

Tested by: kris@
Reviewed by: jeff@
Reported by: David Filo
2008-05-20 19:05:43 +00:00
alc
a8f81206ad Retire pmap_addr_hint(). It is no longer used. 2008-05-18 04:16:57 +00:00
alc
0a1d595c3a In order to map device memory using superpages, mmap(2) must find a
superpage-aligned virtual address for the mapping.  Revision 1.65
implemented an overly simplistic and generally ineffectual method for
finding a superpage-aligned virtual address.  Specifically, it rounds
the virtual address corresponding to the end of the data segment up to
the next superpage-aligned virtual address.  If this virtual address
is unallocated, then the device will be mapped using superpages.
Unfortunately, in modern times, where applications like the X server
dynamically load much of their code, this virtual address is already
allocated.  In such cases, mmap(2) simply uses the first available
virtual address, which is not necessarily superpage aligned.

This revision changes mmap(2) to use a more robust method,
specifically, the VMFS_ALIGNED_SPACE option that is now implemented by
vm_map_find().
2008-05-17 19:32:48 +00:00
alc
1e4c5769eb Preset a device object's alignment ("pg_color") based upon the
physical address of the device's memory.  This enables
pmap_align_superpage() to propose a virtual address for mapping the
device memory that permits the use of superpage mappings.
2008-05-17 16:26:34 +00:00
alc
47efda9e75 Don't call vm_reserv_alloc_page() on device-backed objects. Otherwise, the
system may panic because there is no reservation structure corresponding to
the physical address of the device memory.

Reported by: Giorgos Keramidas
2008-05-15 18:52:31 +00:00
alc
0f95f2184f Provide the new argument to kmem_suballoc(). 2008-05-10 23:39:27 +00:00