Use the same scheme implemented to manage credentials.
Code needing to look at a process's credentials (as opposed to a thread's) is
provided with *_proc variants of the relevant functions.
Places which possibly had to take the proc lock anyway still use the proc
pointer to access limits.
The point of this is to be able to add RACCT (with RACCT_DISABLED)
to GENERIC, to avoid having to rebuild the kernel to use rctl(8).
Differential Revision: https://reviews.freebsd.org/D2369
Reviewed by: kib@
MFC after: 1 month
Relnotes: yes
Sponsored by: The FreeBSD Foundation
the wired attribute of the mapping. As a result, some pmap
implementations clear the wired state of the page table entries, which
breaks invariants and allows the entries to be lost. Avoid calling
vm_map_pmap_enter() for MADV_WILLNEED on a wired entry; the pages must
already be mapped.
Noted and reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
pmap_unwire() and vm_object_unwire().
Retire vm_fault_{un,}wire(), since they are no longer used.
(See r268327 and r269134 for the motivation behind this change.)
Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division
underlying physical pages are mapped by the pmap. If, for example, the
application has performed an mprotect(..., PROT_NONE) on any part of the
wired region, then those pages will no longer be mapped by the pmap.
So, using the pmap to look up the wired pages in order to unwire them
doesn't always work, and when it doesn't work wired pages are leaked.
To avoid the leak, introduce and use a new function vm_object_unwire()
that locates the wired pages by traversing the object and its backing
objects.
At the same time, switch from using pmap_change_wiring() to the recently
introduced function pmap_unwire() for unwiring the region's mappings.
pmap_unwire() is faster, because it operates on a range of virtual addresses
rather than a single virtual page at a time. Moreover, by operating on
a range, it is superpage friendly. It doesn't waste time performing
unnecessary demotions.
Reported by: markj
Reviewed by: kib
Tested by: pho, jmg (arm)
Sponsored by: EMC / Isilon Storage Division
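The leak scenario fixed above can be illustrated with a minimal userland
sketch (sizes are arbitrary; the program merely performs the sequence
that could previously leave pages wired):

#include <sys/mman.h>
#include <err.h>
#include <unistd.h>

int
main(void)
{
	size_t pgsz = (size_t)getpagesize();
	size_t len = 16 * pgsz;
	char *p;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_ANON | MAP_PRIVATE,
	    -1, 0);
	if (p == MAP_FAILED)
		err(1, "mmap");
	if (mlock(p, len) != 0)			/* wire the whole region */
		err(1, "mlock");
	if (mprotect(p, pgsz, PROT_NONE) != 0)	/* drop the pmap mappings */
		err(1, "mprotect");
	if (munlock(p, len) != 0)	/* unwiring must still find every page */
		err(1, "munlock");
	return (0);
}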
These changes prevent sysctl(8) from returning proper output, causing
problems such as:
1) no output from sysctl(8)
2) erroneously returning ENOMEM with tools like truss(1)
or uname(1)
truss: can not get etype: Cannot allocate memory
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case where a tunable sysctl has a custom initialization
function, allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating the children of a static/extern SYSCTL
node. This operation should probably be factored out into a common
macro, since some device drivers use it. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading kludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.
Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection with removing unneeded
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, because malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.
MFC after: 2 weeks
Sponsored by: Mellanox Technologies
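As a hedged illustration of the behaviour described above (the driver,
node, and variable names are hypothetical), a sysctl declared with the
tunable flag is now initialized from the kernel environment at early
boot without a separate TUNABLE_INT() statement:

#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/sysctl.h>

/* Hypothetical driver knob. */
static int mydrv_debug = 0;

SYSCTL_NODE(_hw, OID_AUTO, mydrv, CTLFLAG_RD, 0, "mydrv parameters");
SYSCTL_INT(_hw_mydrv, OID_AUTO, debug, CTLFLAG_RDTUN, &mydrv_debug, 0,
    "Debug level; may be preset from loader.conf as hw.mydrv.debug");

A tunable sysctl that has its own initialization function can
additionally set CTLFLAG_NOFETCH to skip the automatic fetch from the
environment, as described above.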
have to undo it by calling crfree(). This reduces the total number of calls
by vm_map_insert() to crhold() and crfree() by 45% in my tests.
Eliminate an unnecessary variable from vm_map_insert().
Reviewed by: kib
Tested by: pho
that it creates (r267645), we can place the check that blocks map entry
coalescing on stack entries in vm_map_simplify_entry() where it properly
belongs.
Reviewed by: kib
numerical values as MAP_STACK_GROWS_*, but the former is for entries'
eflags, while the latter is for the cow argument of vm_map_insert().
Submitted by: alc
map entries list, and that it does not overlap with the previous and
next entries.
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
corresponding flag(s) in the new map entry. Previously, the caller was
responsible for setting them after vm_map_insert() returned.
Pass MAP_STACK_GROWS_DOWN to vm_map_insert() from vm_map_growstack() when
extending the stack in the downward direction.
Together these changes slightly simplify the caller's task when creating a
downward growing stack. In particular, the caller no longer needs to clip
the previous entry, because the new stack entry can't possibly coalesce
with the previous entry.
Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division
and prevents the request from deleting existing mappings in the
region, failing instead.
Reviewed by: alc
Discussed with: jhb
Tested by: markj, pho (previous version, as part of the bigger patch)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
would be read once and cached in a local variable so that the resource limit
check and map entry insertion would be guaranteed to use the same value.
However, the value being passed to vm_map_insert() is still from "sgrowsiz"
and not the local variable. Correct this oversight.
Reviewed by: kib
to !MAP_STACK mapping requests. For MAP_STACK | MAP_FIXED, clear any
mappings which could previously exist in the used range.
For this, teach vm_map_find() and vm_map_fixed() to handle
MAP_STACK_GROWS_DOWN or _UP cow flags, by calling a new
vm_map_stack_locked() helper, which is factored out from
vm_map_stack().
The side effect of the change is that MAP_STACK started obeying
MAP_ALIGNMENT and MAP_32BIT flags.
Reported by: rwatson
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
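A small userland sketch of the mentioned side effect (mapping size
arbitrary; MAP_32BIT is only available on 64-bit platforms), assuming a
MAP_STACK request may now carry placement flags:

#include <sys/mman.h>
#include <err.h>

int
main(void)
{
	size_t len = 1024 * 1024;
	void *stk;

	/* Anonymous stack-style mapping placed below the 2GB boundary. */
	stk = mmap(NULL, len, PROT_READ | PROT_WRITE,
	    MAP_STACK | MAP_32BIT, -1, 0);
	if (stk == MAP_FAILED)
		err(1, "mmap");
	return (0);
}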
a partially populated reservation becomes fully populated, and decrease this
field when a fully populated reservation becomes partially populated.
Use this field to simplify the implementation of pmap_enter_object() on
amd64, arm, and i386.
On all architectures where we support superpages, the cost of creating a
superpage mapping is roughly the same as creating a base page mapping. For
example, both kinds of mappings entail the creation of a single PTE and PV
entry. With this in mind, use the page size field to make the
implementation of vm_map_pmap_enter(..., MAP_PREFAULT_PARTIAL) a little
smarter. Previously, if MAP_PREFAULT_PARTIAL was specified to
vm_map_pmap_enter(), that function would only map base pages. Now, it will
create up to 96 base page or superpage mappings.
Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division
situation checked by the assert is verified not to take place in
vm_map_wire(), and protection permissions on the wired entry can be
revoked afterward.
Reported by: markj
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
on execve(2), it calls vmspace_exec(), which frees the current
vmspace. The thread executing an exec syscall gets a new vmspace
assigned, and the old vmspace is freed if it is only referenced by the
current process. The free operation includes pmap_release(), which
de-constructs the paging structures used by hardware.
If the calling process is multithreaded, other threads are suspended
in thread_suspend_check(), and need to be unsuspended and run to be
able to exit on successful exec. Now, since the old vmspace is
destroyed, paging structures are invalid, threads are resumed on the
non-existent pmaps (page tables), which leads to triple fault on x86.
To fix, postpone the free of old vmspace until the threads are resumed
and exited. To avoid modifications to all image activators all of
which use exec_new_vmspace(), remember the current (old) vmspace in
kern_execve(), and notify it about the need to call vmspace_free()
with a thread-private flag TDP_EXECVMSPC.
http://bugs.debian.org/743141
Reported by: Ivo De Decker <ivo.dedecker@ugent.be> through secteam
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
parent and child processes. Previously, we copied these pages even though
they are read only. However, the reason for copying them is historical and
no longer exists. In recent times, vm_map_protect() has developed the
ability to copy pages when write access is added to wired copy-on-write
pages. So, in this case, copy-on-write sharing of wired pages is not to be
feared. It is not going to lead to copy-on-write faults on wired memory.
Reviewed by: kib
MFC after: 1 week
Sponsored by: EMC / Isilon Storage Division
avoid soft page faults when adding write access to user wired entries in
vm_map_protect(). Previously, we only avoided the soft page fault when
the underlying pages were copy-on-write. In other words, we avoided the
page faults that might sleep on page allocation, but not the trivial
page faults to update the physical map.
Reviewed by: kib
MFC after: 1 week
Sponsored by: EMC / Isilon Storage Division
page queues for the backing objects. The queues are huge and clutter
the display, when mostly the map entries and their backing storage are
interesting.
The page queues can be seen with ddb 'show object' command.
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
To reduce the diff, the struct pcu.cnt field was not renamed, so
PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in
kvm(3) and vmstat(8). The goal was to not affect the externally used KPI.
Bump __FreeBSD_version in case some out-of-tree module/code relies on
the global cnt variable.
Exp-run revealed no ports using it directly.
No objection from: arch@
Sponsored by: EMC / Isilon Storage Division
region is claimed by a new entry.
Pass MAP_STACK_GROWS_DOWN and MAP_STACK_GROWS_UP flags to
vm_map_insert() from vm_map_stack(), to really turn off the coalescing
code and the call to vm_map_simplify_entry() [1].
Reported by: avg, peter, many
Tested by: avg, peter
Noted by: avg [1]
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
larger than the operational region. If the op region size is zero,
clipping would create a zero-sized map entry. The result is that vm
map splay starts behaving inconsistently, sometimes returning the
zero-sized entry and sometimes the next (or previous) entry.
One step further, it could result in e.g. vm_map_wire() setting
MAP_ENTRY_IN_TRANSITION on the zero-sized entry, but failing to clear
it in the done part. vm_map_delete() then hangs forever waiting
for the flag removal.
Check for zero-length requests and treat them as always successful
without performing any action on the address space.
Diagnosed by: pho
Tested by: pho (previous version)
Reviewed by: alc (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
- add fields to 'struct pmap' that are required to manage nested page tables.
- add a parameter to 'vmspace_alloc()' that can be used to override the
default pmap initialization routine 'pmap_pinit()'.
These changes are pushed ahead of the remaining changes in 'bhyve_npt_pmap'
in anticipation of the upcoming KBI freeze for 10.0.
Reviewed by: kib@, alc@
Approved by: re (glebius)
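A hedged kernel-side sketch of the new parameter (the nested-page-table
initializer and wrapper below are hypothetical; passing NULL as the
third argument is assumed to keep the default pmap_pinit() behaviour):

#include <sys/param.h>
#include <vm/vm.h>
#include <vm/pmap.h>
#include <vm/vm_map.h>
#include <vm/vm_extern.h>

/* Hypothetical initializer for a nested-page-table pmap. */
static int
ept_pmap_pinit(pmap_t pmap)
{
	/* ... allocate and install nested paging structures ... */
	return (1);
}

static struct vmspace *
guest_vmspace_alloc(vm_offset_t min, vm_offset_t max)
{
	/* NULL here would select the default pmap_pinit(). */
	return (vmspace_alloc(min, max, ept_pmap_pinit));
}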
an address in the first 2GB of the process's address space. This flag should
have the same semantics as the corresponding flag on Linux.
To facilitate this, add a new parameter to vm_map_find() that specifies an
optional maximum virtual address. While here, fix several callers of
vm_map_find() to use a VMFS_* constant for the findspace argument instead of
TRUE and FALSE.
Reviewed by: alc
Approved by: re (kib)
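A minimal userland example of the new flag (amd64; the mapping size is
arbitrary):

#include <sys/mman.h>
#include <err.h>
#include <stdio.h>

int
main(void)
{
	void *p;

	/* Request an anonymous mapping placed below the 2GB boundary. */
	p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_32BIT, -1, 0);
	if (p == MAP_FAILED)
		err(1, "mmap");
	printf("mapped at %p\n", p);
	return (0);
}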
MADV_DONTNEED) and madvise(..., MADV_FREE). Specifically, introduce a new
pmap function, pmap_advise(), that operates on a range of virtual addresses
within the specified pmap, allowing for a more efficient implementation of
MADV_DONTNEED and MADV_FREE. Previously, the implementation of
MADV_DONTNEED and MADV_FREE relied on per-page pmap operations, such as
pmap_clear_reference(). Intuitively, the problem with this implementation
is that the pmap-level locks are acquired and released and the page table
traversed repeatedly, once for each resident page in the range
that was specified to madvise(2). A more subtle flaw with the previous
implementation is that pmap_clear_reference() would clear the reference bit
on all mappings to the specified page, not just the mapping in the range
specified to madvise(2).
Since our malloc(3) makes heavy use of madvise(2), this change can have a
measurable impact. For example, the system time for completing a parallel
"buildworld" on a 6-core amd64 machine was reduced by about 1.5% to 2.0%.
Note: This change only contains pmap_advise() implementations for a subset
of our supported architectures. I will commit implementations for the
remaining architectures after further testing. For now, a stub function is
sufficient because of the advisory nature of pmap_advise().
Discussed with: jeff, jhb, kib
Tested by: pho (i386), marcel (ia64)
Sponsored by: EMC / Isilon Storage Division
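A short userland illustration of the range-based advice that
pmap_advise() speeds up (allocation size arbitrary):

#include <sys/mman.h>
#include <err.h>
#include <string.h>

int
main(void)
{
	size_t len = 64 * 4096;
	char *p;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_ANON | MAP_PRIVATE,
	    -1, 0);
	if (p == MAP_FAILED)
		err(1, "mmap");
	memset(p, 0xa5, len);

	/*
	 * Advise that the contents are no longer needed.  With
	 * pmap_advise() the whole range is handled with one pmap-level
	 * call rather than one operation per resident page.
	 */
	if (madvise(p, len, MADV_FREE) != 0)
		err(1, "madvise");
	return (0);
}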
which is part of struct vmspace, allocated from the UMA_ZONE_NOFREE
zone. Initialize the pmap lock in the vmspace zone init function, and
remove pmap lock initialization and destruction from pmap_pinit() and
pmap_release().
Suggested and reviewed by: alc (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation
address alignment of mappings.
- MAP_ALIGNED(n) requests a mapping aligned on a boundary of (1 << n).
Requests for n >= the number of bits in a pointer, or for an alignment
smaller than a page, fail with EINVAL. This matches the API provided by NetBSD.
- MAP_ALIGNED_SUPER is a special case of MAP_ALIGNED. It can be used
to optimize the chances of using large pages. By default it will align
the mapping on a large page boundary (the system is free to choose any
large page size to align to that seems best for the mapping request).
However, if the object being mapped is already using large pages, then
it will align the virtual mapping to match the existing large pages in
the object instead.
- Internally, VMFS_ALIGNED_SPACE is now renamed to VMFS_SUPER_SPACE, and
VMFS_ALIGNED_SPACE(n) is repurposed for specifying a specific alignment.
MAP_ALIGNED(n) maps to using VMFS_ALIGNED_SPACE(n), while
MAP_ALIGNED_SUPER maps to VMFS_SUPER_SPACE.
- mmap() of a device object now uses VMFS_OPTIMAL_SPACE rather than
explicitly using VMFS_SUPER_SPACE. All device objects are forced to
use a specific color on creation, so VMFS_OPTIMAL_SPACE is effectively
equivalent.
Reviewed by: alc
MFC after: 1 month
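Minimal userland examples of the two new alignment requests (sizes are
arbitrary):

#include <sys/mman.h>
#include <err.h>

int
main(void)
{
	void *a, *b;

	/* Ask for alignment on a (1 << 21), i.e. 2MB, boundary. */
	a = mmap(NULL, 2 * 1024 * 1024, PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_ALIGNED(21), -1, 0);
	if (a == MAP_FAILED)
		err(1, "mmap MAP_ALIGNED(21)");

	/* Let the kernel pick a large-page-friendly alignment. */
	b = mmap(NULL, 2 * 1024 * 1024, PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_ALIGNED_SUPER, -1, 0);
	if (b == MAP_FAILED)
		err(1, "mmap MAP_ALIGNED_SUPER");
	return (0);
}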
transparent layering and better fragmentation.
- Normalize functions that allocate memory to use kmem_*
- Those that allocate address space are named kva_*
- Those that operate on maps are named kmap_*
- Implement recursive allocation handling for kmem_arena in vmem.
Reviewed by: alc
Tested by: pho
Sponsored by: EMC / Isilon Storage Division
locks don't accidentally appear to have been already
initialized.
In particular, this fixes a consistent kernel crash on
armv6 with:
panic: lock "vm map (user)" 0xc09cc050 already initialized
that appeared with r251709.
PR: arm/180820
- Add a new address space allocation method (VMFS_OPTIMAL_SPACE) for
vm_map_find() that will try to alter the alignment of a mapping to match
any existing superpage mappings of the object being mapped. If no
suitable address range is found with the necessary alignment,
vm_map_find() will fall back to using the simple first-fit strategy
(VMFS_ANY_SPACE).
- Change mmap() without MAP_FIXED, shmat(), and the GEM mapping ioctl to
use VMFS_OPTIMAL_SPACE instead of VMFS_ANY_SPACE.
Reviewed by: alc (earlier version)
MFC after: 2 weeks
parallel creation of the map entries, e.g. by mmap() or stack growing.
It also breaks when another entry is wired in parallel.
vm_map_wire() iterates over the map entries in the region and assumes
that the map entries it finds were marked as in transition earlier, and
that any entry marked as in transition was marked by the current
invocation of vm_map_wire(). This is not true for new entries in the
holes.
Add the thread owner of the MAP_ENTRY_IN_TRANSITION flag to struct
vm_map_entry. In vm_map_wire() and vm_map_unwire(), only process the
entries whose transition owner is the current thread.
Reported and tested by: pho
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
to a memory-mapped file in the traced process's address space
even if neither the traced process nor the tracing process had
write access to that file.
Security: CVE-2013-2171
Security: FreeBSD-SA-13:06.mmap
Approved by: so
o Relax locking assertions for pmap_enter_object() and add them also
to architectures that currently don't have any
o Introduce VM_OBJECT_LOCK_DOWNGRADE() which is basically a downgrade
operation on the per-object rwlock
o Use all the mechanisms above to make vm_map_pmap_enter() work most
of the time with only read locks.
Sponsored by: EMC / Isilon storage division
Reviewed by: alc
with the MAP_ENTRY_VN_WRITECNT flag:
- Move the assertion that verifies the state of the v_writecount and
vnp.writecount under the block where the object is locked.
- Check that the object type is OBJT_VNODE before asserting.
Reported by: avg
Reviewed by: alc
MFC after: 1 week
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
is accessed for reading purposes.
The change is mostly mechanical, but a few notes are reported:
* The KPI changes as follows:
- VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
- VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
- VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
- VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
(in order to avoid visibility of implementation details)
- The read-mode operations are added:
VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (which forced
consumers to include sys/mutex.h directly to cater to its inline
functions using VM_OBJECT_LOCK()) means that all vm/vm_pager.h
consumers must now also include sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
the compat layer, because the name clash between the FreeBSD and
Solaris versions must be avoided.
For this purpose zfs redefines the vm_object locking functions
directly, isolating the FreeBSD components in specific compat stubs.
This commit heavily breaks the KPI. Third-party ports must be updated
accordingly (off-hand, I can think of VirtualBox, for example).
Sponsored by: EMC / Isilon storage division
Reviewed by: jeff
Reviewed by: pjd (ZFS specific review)
Discussed with: alc
Tested by: pho
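A hedged sketch of the mechanical part of the conversion (the lookup
helper is hypothetical and only demonstrates the renamed and newly
added lock operations):

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/rwlock.h>
#include <vm/vm.h>
#include <vm/vm_object.h>
#include <vm/vm_page.h>

static vm_page_t
example_lookup(vm_object_t obj, vm_pindex_t pindex)
{
	vm_page_t m;

	/*
	 * Writers now take the object rwlock in write mode
	 * (formerly VM_OBJECT_LOCK()/VM_OBJECT_UNLOCK()).
	 */
	VM_OBJECT_WLOCK(obj);
	m = vm_page_lookup(obj, pindex);
	VM_OBJECT_WUNLOCK(obj);

	/* Read-only consumers may use the new read-mode operations. */
	VM_OBJECT_RLOCK(obj);
	VM_OBJECT_ASSERT_LOCKED(obj);
	VM_OBJECT_RUNLOCK(obj);

	return (m);
}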
Replace the sub-optimal uma_zone_set_obj() primitive with the more modern
uma_zone_reserve_kva(). The new primitive reserves the necessary KVA
space beforehand to cater to the zone allocations and allocates pages
with ALLOC_NOOBJ. More specifically:
- uma_zone_reserve_kva() does not need an object to cater to the backend
allocator.
- uma_zone_reserve_kva() can cater to M_WAITOK requests, in order to
serve zones which need to do uma_prealloc() too.
- When possible, uma_zone_reserve_kva() uses the direct mapping provided
by uma_small_alloc() directly, rather than relying on the KVA / offset
combination.
The removal of the object attribute allows 2 further changes:
1) _vm_object_allocate() becomes static within vm_object.c
2) VM_OBJECT_LOCK_INIT() is removed. This function is replaced by
direct calls to mtx_init() as there is no need to export it anymore
and the calls are no longer homogeneous: there are now small
differences between arguments passed to mtx_init().
Sponsored by: EMC / Isilon storage division
Reviewed by: alc (who also offered almost all the comments)
Tested by: pho, jhb, davide
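A hedged sketch of the replacement KPI (the zone name, item type, and
item limit are hypothetical):

#include <sys/param.h>
#include <vm/uma.h>

/* Hypothetical fixed-size item backed by the zone. */
struct example_item {
	uint64_t	payload[4];
};

#define	EXAMPLE_MAXITEMS	1024	/* hypothetical upper bound */

static uma_zone_t example_zone;

static void
example_zone_init(void)
{
	example_zone = uma_zcreate("example", sizeof(struct example_item),
	    NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0);
	/*
	 * Reserve the KVA for up to EXAMPLE_MAXITEMS items up front;
	 * unlike the old uma_zone_set_obj(), no VM object is needed and
	 * M_WAITOK requests can be served.
	 */
	uma_zone_reserve_kva(example_zone, EXAMPLE_MAXITEMS);
}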