of pcpu locks. This makes uma_zone somewhat smaller (by (LOCKNAME_LEN *
sizeof(char) + sizeof(struct mtx) * maxcpu) bytes, to be exact).
No Objections from jeff.
releasing the lock only if we are about to sleep (e.g., vm_pager_get_pages()
or vm_pager_has_pages()). If we sleep, we have marked the vm object with
the paging-in-progress flag.
Several of the subtypes have an associated vnode which is used for
stuff like the f*() functions.
By giving the vnode a speparate field, a number of checks for the specific
subtype can be replaced simply with a check for f_vnode != NULL, and
we can later free f_data up to subtype specific use.
At this point in time, f_data still points to the vnode, so any code I
might have overlooked will still work.
to the machine-independent parts of the VM. At the same time, this
introduces vm object locking for the non-i386 platforms.
Two details:
1. KSTACK_GUARD has been removed in favor of KSTACK_GUARD_PAGES. The
different machine-dependent implementations used various combinations
of KSTACK_GUARD and KSTACK_GUARD_PAGES. To disable guard page, set
KSTACK_GUARD_PAGES to 0.
2. Remove the (unnecessary) clearing of PG_ZERO in vm_thread_new. In
5.x, (but not 4.x,) PG_ZERO can only be set if VM_ALLOC_ZERO is passed
to vm_page_alloc() or vm_page_grab().
processes in the first pass. Among other things, this will give
us a chance to launder vnode-backed pages before concluding that
we need more swap. This is particularly useful for systems that
have no swap.
While here, update a comment and remove some long-unused code.
Reported by: Lucky Green <shamrock@cypherpunks.to>
Suggested by: dillon
Approved by: re (rwatson)
- Move struct sigacts out of the u-area and malloc() it using the
M_SUBPROC malloc bucket.
- Add a small sigacts_*() API for managing sigacts structures: sigacts_alloc(),
sigacts_free(), sigacts_copy(), sigacts_share(), and sigacts_shared().
- Remove the p_sigignore, p_sigacts, and p_sigcatch macros.
- Add a mutex to struct sigacts that protects all the members of the struct.
- Add sigacts locking.
- Remove Giant from nosys(), kill(), killpg(), and kern_sigaction() now
that sigacts is locked.
- Several in-kernel functions such as psignal(), tdsignal(), trapsignal(),
and thread_stopped() are now MP safe.
Reviewed by: arch@
Approved by: re (rwatson)
trustworthy for vnode-backed objects.
- Restore the old behavior of vm_object_page_remove() when the end
of the given range is zero. Add a comment to vm_object_page_remove()
regarding this behavior.
Reported by: iedowse
- Eliminate an odd, special-case feature:
if start == end == 0 then all pages are removed. Only one caller
used this feature and that caller can trivially pass the object's
size.
- Assert that the vm_object is locked on entry; don't bother testing
for a NULL vm_object.
- Style: Fix lines that are longer than 80 characters.
fork1() and never changes.
- The proc lock is enough to cover reading p_state, so push down sched_lock
into the PRS_NORMAL case of the switch on p_state.
- Remove the Giant required from vm_page_free_toq(). (Any locking
errors will be caught by vm_page_remove().)
This remedies a panic that occurred when kmem_malloc(NOWAIT) performed
without Giant failed to allocate the necessary pages.
Reported by: phk
- Add a parameter to vm_pageout_flush() that tells vm_pageout_flush()
whether its caller has locked the vm_object. (This is a temporary
measure to bootstrap vm_object locking.)
race where a thread could assume that a process was swapped in by
PHOLD() when it actually wasn't fully swapped in yet.
- In faultin(), always msleep() if PS_SWAPPINGIN is set instead of doing
this check after bumping p_lock in the PS_INMEM == 0 case. Also,
sched_lock is only needed for setting and clearning swapping PS_*
flags and the swap thread inhibitor.
- Don't set and clear the thread swap inhibitor in the same loops as the
pmap_swapin/out_thread() since we have to do it under sched_lock.
Instead, mimic the treatment of the PS_INMEM flag and use separate loops
to set the inhibitors when clearing PS_INMEM and clear the inhibitors
when setting PS_INMEM.
- swapout() now returns with the proc lock held as it holds the lock
while adjusting the swapping-related PS_* flags so that the proc lock
can be used to test those flags.
- Only use the proc lock to check the swapping-related PS_* flags in
several places.
- faultin() no longer requires sched_lock to be held by callers.
- Rename PS_SWAPPING to PS_SWAPPINGOUT to be less ambiguous now that we
have PS_SWAPPINGIN.
called without Giant; and obj_alloc() in turn calls vm_page_alloc()
without Giant. This causes an assertion failure in vm_page_alloc().
Fortunately, obj_alloc() is now MPSAFE. So, we need only clean up
some assertions.
- Weaken the assertion in vm_page_lookup() to require Giant only
if the vm_object isn't locked.
- Remove an assertion from vm_page_alloc() that duplicates a check
performed in vm_page_lookup().
In collaboration with: gallatin, jake, jeff
vm_object_pip_add() and vm_object_pip_wakeup().
- Remove GIANT_REQUIRED from vm_object_pip_subtract() and
vm_object_pip_subtract().
- Lock the vm_object when performing vm_object_page_remove().
critical and should not be killed when pageout is looking for more
memory pages in all the wrong places.
Reviewed by: arch@
Sponsored by: St. Bernard Software
where physical addresses larger than virtual addresses, such as i386s
with PAE.
- Use this to represent physical addresses in the MI vm system and in the
i386 pmap code. This also changes the paddr parameter to d_mmap_t.
- Fix printf formats to handle physical addresses >4G in the i386 memory
detection code, and due to kvtop returning vm_paddr_t instead of u_long.
Note that this is a name change only; vm_paddr_t is still the same as
vm_offset_t on all currently supported platforms.
Sponsored by: DARPA, Network Associates Laboratories
Discussed with: re, phk (cdevsw change)
are machine dependent because they are not required to update the tlb when
mappings are added or removed, and doing so is machine dependent.
In addition, an implementation may require that pages mapped with pmap_kenter
have a backing vm_page_t, which is not necessarily true of all physical
pages, and so may choose to pass the vm_page_t to pmap_kenter instead of the
physical address in order to make this requirement clear.
process to kill, don't block on a map lock while holding the
process lock. Instead, skip processes whose map locks are held
and find something else to kill.
- Add vm_map_trylock_read() to support the above.
Reviewed by: alc, mike (mentor)
- On receive, vm_map_lookup() needs to trigger the creation of a shadow
object. To make that happen, call vm_map_lookup() with PROT_WRITE
instead of PROT_READ in vm_pgmoveco().
- On send, a shadow object will be created by the vm_map_lookup() in
vm_fault(), but vm_page_cowfault() will delete the original page from
the backing object rather than simply letting the legacy COW mechanism
take over. In other words, the new page should be added to the shadow
object rather than replacing the old page in the backing object. (i.e.
vm_page_cowfault() should not be called in this case.) We accomplish
this by making sure fs.object == fs.first_object before calling
vm_page_cowfault() in vm_fault().
Submitted by: gallatin, alc
Tested by: ken
modules to authorize disabling of swap against a particular vnode.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
- Remove the buftimelock mutex and acquire the buf's interlock to protect
these fields instead.
- Hold the vnode interlock while locking bufs on the clean/dirty queues.
This reduces some cases from one BUF_LOCK with a LK_NOWAIT and another
BUF_LOCK with a LK_TIMEFAIL to a single lock.
Reviewed by: arch, mckusick
- Get rid of the useless atop() / pmap_phys_address() detour. The
device mmap handlers must now give back the physical address
without atop()'ing it.
- Don't borrow the physical address of the mapping in the returned
int. Now we properly pass a vm_offset_t * and expect it to be
filled by the mmap handler when the mapping was successful. The
mmap handler must now return 0 when successful, any other value
is considered as an error. Previously, returning -1 was the only
way to fail. This change thus accidentally fixes some devices
which were bogusly returning errno constants which would have been
considered as addresses by the device pager.
- Garbage collect the poorly named pmap_phys_address() now that it's
no longer used.
- Convert all the d_mmap_t consumers to the new API.
I'm still not sure wheter we need a __FreeBSD_version bump for this,
since and we didn't guarantee API/ABI stability until 5.1-RELEASE.
Discussed with: alc, phk, jake
Reviewed by: peter
Compile-tested on: LINT (i386), GENERIC (alpha and sparc64)
Runtime-tested on: i386
It's unnecessary for two reasons: (1) Giant is at present already held in
such cases and (2) our various implementations of pmap_growkernel() look to
be MP safe. (For example, for sparc64 the proof of (2) is trivial.)
dereferenced when a process exits due to the vmspace ref-count being
bumped. Change shmexit() and shmexit_myhook() to take a vmspace instead
of a process and call it in vmspace_dofree(). This way if it is missed
in exit1()'s early-resource-free it will still be caught when the zombie is
reaped.
Also fix a potential race in shmexit_myhook() by NULLing out
vmspace->vm_shm prior to calling shm_delete_mapping() and free().
MFC after: 7 days
The objective being to eliminate some cases of page queues locking.
(See, for example, vm/vm_fault.c revision 1.160.)
Reviewed by: tegge
(Also, pointed out by tegge that I changed vm_fault.c before changing
vm_page.c. Oops.)
pointer types, and remove a huge number of casts from code using it.
Change struct xfile xf_data to xun_data (ABI is still compatible).
If we need to add a #define for f_data and xf_data we can, but I don't
think it will be necessary. There are no operational changes in this
commit.
requests when the number of free pages is below the reserved threshold.
Previously, VM_ALLOC_ZERO was only honored when the number of free pages
was above the reserved threshold. Honoring it in all cases generally
makes sense, does no harm, and simplifies the code.
to sort out disk-io from file-io in the vm/buffer/filesystem space.
The intent is to sort VOP_STRATEGY calls into those which operate
on "real" vnodes and those which operate on VCHR vnodes. For
the latter kind, the call will be changed to VOP_SPECSTRATEGY,
possibly conditionally for those places where dual-use happens.
Add a default VOP_SPECSTRATEGY method which will call the normal
VOP_STRATEGY. First time it is called it will print debugging
information. This will only happen if a normal vnode is passed
to VOP_SPECSTRATEGY by mistake.
Add a real VOP_SPECSTRATEGY in specfs, which does what VOP_STRATEGY
does on a VCHR vnode today.
Add a new VOP_STRATEGY method in specfs to catch instances where
the conversion to VOP_SPECSTRATEGY has not yet happened. Handle
the request just like we always did, but first time called print
debugging information.
Apart up to two instances of console messages per boot, this amounts
to a glorified no-op commit.
If you get any of the messages on your console I would very much
like a copy of them mailed to phk@freebsd.org
is now synchronized by a mutex, whereas access to user maps is still
synchronized by a lockmgr()-based lock. Why? No single type of lock,
including sx locks, meets the requirements of both types of vm map.
Sometimes we sleep while holding the lock on a user map. Thus, a
a mutex isn't appropriate. On the other hand, both lockmgr()-based
and sx locks release Giant when a thread/process blocks during
contention for a lock. This could lead to a race condition in a legacy
driver (that relies on Giant for synchronization) if it attempts to
kmem_malloc() and fails to immediately obtain the lock. Fortunately,
we never sleep while holding a system map lock.
- Add a mtx_destroy() to vm_object_collapse(). (This allows a bzero()
to migrate from _vm_object_allocate() to vm_object_zinit(), where it
will be performed less often.)
comes along and flushes a file which has been mmap()'d SHARED/RW, with
dirty pages, it was flushing the underlying VM object asynchronously,
resulting in thousands of 8K writes. With this change the VM Object flushing
code will cluster dirty pages in 64K blocks.
Note that until the low memory deadlock issue is reviewed, it is not safe
to allow the pageout daemon to use this feature. Forced pageouts still
use fs block size'd ops for the moment.
MFC after: 3 days
skipping read-only pages, which can result in valuable non-text-related
data not getting dumped, the ELF loader and the dynamic loader now mark
read-only text pages NOCORE and the coredump code only checks (primarily) for
complete inaccessibility of the page or NOCORE being set.
Certain applications which map large amounts of read-only data will
produce much larger cores. A new sysctl has been added,
debug.elf_legacy_coredump, which will revert to the old behavior.
This commit represents collaborative work by all parties involved.
The PR contains a program demonstrating the problem.
PR: kern/45994
Submitted by: "Peter Edwards" <pmedwards@eircom.net>, Archie Cobbs <archie@dellroad.org>
Reviewed by: jdp, dillon
MFC after: 7 days
resource starvation we clean-up as much of the vmspace structure as we
can when the last process using it exits. The rest of the structure
is cleaned up when it is reaped. But since exit1() decrements the ref
count it is possible for a double-free to occur if someone else, such as
the process swapout code, references and then dereferences the structure.
Additionally, the final cleanup of the structure should not occur until
the last process referencing it is reaped.
This commit solves the problem by introducing a secondary reference count,
calling 'vm_exitingcnt'. The normal reference count is decremented on exit
and vm_exitingcnt is incremented. vm_exitingcnt is decremented when the
process is reaped. When both vm_exitingcnt and vm_refcnt are 0, the
structure is freed for real.
MFC after: 3 weeks
intended to be used by significant memory consumers so that they may drain
some of their caches.
Inspired by: phk
Approved by: re
Tested on: x86, alpha
to reflect its new location, and add page queue and flag locking.
Notes: (1) alpha, i386, and ia64 had identical implementations
of pmap_collect() in terms of machine-independent interfaces;
(2) sparc64 doesn't require it; (3) powerpc had it as a TODO.
protected. Furthermore, in some RISC architectures with no normal
byte operations, the surrounding 3 bytes are also affected by the
read-modify-write that has to occur.
indirectly through vm_page_protect(). The one remaining page flag that
is updated by vm_page_protect() is already being updated by our various
pmap implementations.
Note: A later commit will similarly change the VM_PROT_READ case and
eliminate vm_page_protect().
sysctls to MI code; this reduces code duplication and makes all of them
available on sparc64, and the latter two on powerpc.
The semantics by the i386 and pc98 hw.availpages is slightly changed:
previously, holes between ranges of available pages would be included,
while they are excluded now. The new behaviour should be more correct
and brings i386 in line with the other architectures.
Move physmem to vm/vm_init.c, where this variable is used in MI code.
because it's no longer used. (See revision 1.215.)
- Fix a harmless bug: the number of vm_page structures allocated wasn't
properly adjusted when uma_bootstrap() was introduced. Consequently,
we were allocating 30 unused vm_page structures.
- Wrap a long line.
vm_page_alloc not to insert this page into an object. The pindex is
still used for colorization.
- Rework vm_page_select_* to accept a color instead of an object and
pindex to work with VM_PAGE_NOOBJ.
- Document other VM_ALLOC_ flags.
Reviewed by: peter, jake
mac_check_system_swapon(), to reflect the fact that the primary
object of this change is the running kernel as a whole, rather
than just the vnode. We'll drop additional checks of this
class into the same check namespace, including reboot(),
sysctl(), et al.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
extra function calls. Refactor uma_zalloc_internal into seperate functions
for finding the most appropriate slab, filling buckets, allocating single
items, and pulling items off of slabs. This makes the code significantly
cleaner.
- This also fixes the "Returning an empty bucket." panic that a few people
have seen.
Tested On: alpha, x86
held. This avoids a lock order reversal when destroying zones.
Unfortunately, this also means that the free checks are not done before
the destructor is called.
Reported by: phk
permitting policies to restrict access to memory mapping based on
the credential requesting the mapping, the target vnode, the
requested rights, or other policy considerations.
Approved by: re
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
perform authorization checks during swapon() events; policies
might choose to enforce protections based on the credential
requesting the swap configuration, the target of the swap operation,
or other factors such as internal policy state.
Approved by: re
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
trying to acquire it's proc lock since the proc lock may not have been
constructed yet.
- Split up the one big comment at the top of the loop and put the pieces
in the right order above the various checks.
Reported by: kris (1)
on-write (COW) mechanism. (This mechanism is used by the zero-copy
TCP/IP implementation.)
- Extend the scope of the page queues lock in vm_fault()
to cover vm_page_cowfault().
- Modify vm_page_cowfault() to release the page queues lock
if it sleeps.
be no major change in performance from this change at this time but this
will allow other work to progress: Giant lock removal around VM system
in favor of per-object mutexes, ranged fsyncs, more optimal COMMIT rpc's for
NFS, partial filesystem syncs by the syncer, more optimal object flushing,
etc. Note that the buffer cache is already using a similar splay tree
mechanism.
Note that a good chunk of the old hash table code is still in the tree.
Alan or I will remove it prior to the release if the new code does not
introduce unsolvable bugs, else we can revert more easily.
Submitted by: alc (this is Alan's code)
Approved by: re
- Begin moving scheduler specific functionality into sched_4bsd.c
- Replace direct manipulation of scheduler data with hooks provided by the
new api.
- Remove KSE specific state modifications and single runq assumptions from
kern_switch.c
Reviewed by: -arch
name instead. (e.g., SLOCK instead of SMTX, TD_ON_LOCK() instead of
TD_ON_MUTEX()) Eventually a turnstile abstraction will be added that
will be shared with mutexes and other types of locks. SLOCK/TDI_LOCK will
be used internally by the turnstile code and will not be specific to
mutexes. Making the change now ensures that turnstiles can be dropped
in at a later date without affecting the ABI of userland applications.
doesn't give them enough stack to do much before blowing away the pcb.
This adds MI and MD code to allow the allocation of an alternate kstack
who's size can be speficied when calling kthread_create. Passing the
value 0 prevents the alternate kstack from being created. Note that the
ia64 MD code is missing for now, and PowerPC was only partially written
due to the pmap.c being incomplete there.
Though this patch does not modify anything to make use of the alternate
kstack, acpi and usb are good candidates.
Reviewed by: jake, peter, jhb
constants VM_MIN_ADDRESS, VM_MAXUSER_ADDRESS, USRSTACK and PS_STRINGS.
This is mainly so that they can be variable even for the native abi, based
on different machine types. Get stack protections from the sysentvec too.
This makes it trivial to map the stack non-executable for certain abis, on
machines that support it.
- Remove all instances of the mallochash.
- Stash the slab pointer in the vm page's object pointer when allocating from
the kmem_obj.
- Use the overloaded object pointer to find slabs for malloced memory.
v_tag is now const char * and should only be used for debugging.
Additionally:
1. All users of VT_NTS now check vfsconf->vf_type VFCF_NETWORK
2. The user of VT_PROCFS now checks for the new flag VV_PROCDEP, which
is propagated by pseudofs to all child vnodes if the fs sets PFS_PROCDEP.
Suggested by: phk
Reviewed by: bde, rwatson (earlier version)
address space yet.
- Check whether a process is a system process prior to dereferencing
its p_vmspace. Aio assumes that only the curthread switches the address
space of a system process.
The process allocator now caches and hands out complete process structures
*including substructures* .
i.e. it get's the process structure with the first thread (and soon KSE)
already allocated and attached, all in one hit.
For the average non threaded program (non KSE that is) the allocated thread and its stack remain attached to the process, even when the process is
unused and in the process cache. This saves having to allocate and attach it
later, effectively bringing us (hopefully) close to the efficiency
of pre-KSE systems where these were a single structure.
Reviewed by: davidxu@freebsd.org, peter@freebsd.org
s/SNGL/SINGLE/
s/SNGLE/SINGLE/
Fix abbreviation for P_STOPPED_* etc flags, in original code they were
inconsistent and difficult to distinguish between them.
Approved by: julian (mentor)
in the original hardwired sysctl implementation.
The buf size calculator still overflows an integer on machines with large
KVA (eg: ia64) where the number of pages does not fit into an int. Use
'long' there.
Change Maxmem and physmem and related variables to 'long', mostly for
completeness. Machines are not likely to overflow 'int' pages in the
near term, but then again, 640K ought to be enough for anybody. This
comes for free on 32 bit machines, so why not?
pmap_zero_page() and pmap_zero_page_area() were modified to accept
a struct vm_page * instead of a physical address, vm_page_zero_fill()
and vm_page_zero_fill_area() have served no purpose.
- v_vflag is protected by the vnode lock and is used when synchronization
with VOP calls is needed.
- v_iflag is protected by interlock and is used for dealing with vnode
management issues. These flags include X/O LOCK, FREE, DOOMED, etc.
- All accesses to v_iflag and v_vflag have either been locked or marked with
mp_fixme's.
- Many ASSERT_VOP_LOCKED calls have been added where the locking was not
clear.
- Many functions in vfs_subr.c were restructured to provide for stronger
locking.
Idea stolen from: BSD/OS
by pmap_qenter() and pmap_qremove() is pointless. In fact, it probably
leads to unnecessary pmap_page_protect() calls if one of these pages is
paged out after unwiring.
Note: setting PG_MAPPED asserts that the page's pv list may be
non-empty. Since checking the status of the page's pv list isn't any
harder than checking this flag, the flag should probably be eliminated.
Alternatively, PG_MAPPED could be set by pmap_enter() exclusively
rather than various places throughout the kernel.
vm_page_sleep_busy() with vm_page_sleep_if_busy(). At the same time,
increase the scope of the page queues lock. (This should significantly
reduce the locking overhead in vm_object_page_remove().)
o Apply some style fixes.
swapped in, we do not have to ask for the scheduler thread to do
that.
- Assert that a process is not swapped out in runq functions and
swapout().
- Introduce thread_safetoswapout() for readability.
- In swapout_procs(), perform a test that may block (check of a
thread working on its vm map) first. This lets us call swapout()
with the sched_lock held, providing a better atomicity.
except for the fact tha they are presently swapped out. Also add a process
flag to indicate that the process has started the struggle to swap
back in. This will be needed for the case where multiple threads
start the swapin action top a collision. Also add code to stop
a process fropm being swapped out if one of the threads in this
process is actually off running on another CPU.. that might hurt...
Submitted by: Seigo Tanimura <tanimura@r.dl.itc.u-tokyo.ac.jp>
vm_page_rename() from vm_object_backing_scan(). vm_page_rename()
also performs vm_page_deactivate() on pages in the cache queues,
making the removed vm_page_deactivate() redundant.
handler in the kernel at the same time. Also, allow for the
exec_new_vmspace() code to build a different sized vmspace depending on
the executable environment. This is a big help for execing i386 binaries
on ia64. The ELF exec code grows the ability to map partial pages when
there is a page size difference, eg: emulating 4K pages on 8K or 16K
hardware pages.
Flesh out the i386 emulation support for ia64. At this point, the only
binary that I know of that fails is cvsup, because the cvsup runtime
tries to execute code in pages not marked executable.
Obtained from: dfr (mostly, many tweaks from me).
the loadav. This is not real load. If you have a nice process running in
the background, pagezero may sit in the run queue for ages and add one to
the loadav, and thereby affecting other scheduling decisions.
when VM_ALLOC_WIRED is specified: set the PG_MAPPED bit in flags.
o In both vm_page_wire() and vm_page_allocate() add a comment saying
that setting PG_MAPPED does not belong there.
that pre-zeroes free pages.
o Remove GIANT_REQUIRED from some low-level page queue functions. (Instead
assertions on the page queue lock are being added to the higher-level
functions, like vm_page_wire(), etc.)
In collaboration with: peter
Use lmin(long, long), not min(u_int, u_int). This is a problem here on
ia64 which has *way* more than 2^32 pages of KVA. 281474976710655 pages
to be precice.
to return a wired page.
o Use VM_ALLOC_WIRED within Alpha's pmap_growkernel(). Also, because
Alpha's pmap_growkernel() calls vm_page_alloc() from within a critical
section, specify VM_ALLOC_INTERRUPT instead of VM_ALLOC_SYSTEM. (Only
VM_ALLOC_INTERRUPT is implemented entirely with a spin mutex.)
o Assert that the page queues mutex is held in vm_page_wire()
on Alpha, just like the other platforms.
vm_page_zero_idle() instead of partially duplicated implementations.
In particular, this change guarantees that the number of free pages
in the free queue(s) matches the global free page count when Giant
is released.
Submitted by: peter (via his p4 "pmap" branch)