Since vm objects are allocated from type-stable memory, we don't need to
initialize the trie's root in _vm_object_allocate() on every vm object
allocation. We can instead do it once in vm_object_zinit().
We don't need to call vm_radix_reclaim_allnodes() in vm_object_terminate()
unless the resident page count is non-zero.
Reviewed by: attilio
Sponsored by: EMC / Isilon Storage Division
valid if the flag OBJ_COLORED is set. Since _vm_object_allocate()
doesn't set this flag, it needn't initialize pg_color.
Sponsored by: EMC / Isilon Storage Division
- Capability is no longer separate descriptor type. Now every descriptor
has set of its own capability rights.
- The cap_new(2) system call is left, but it is no longer documented and
should not be used in new code.
- The new syscall cap_rights_limit(2) should be used instead of
cap_new(2), which limits capability rights of the given descriptor
without creating a new one.
- The cap_getrights(2) syscall is renamed to cap_rights_get(2).
- If CAP_IOCTL capability right is present we can further reduce allowed
ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed
ioctls can be retrived with cap_ioctls_get(2) syscall.
- If CAP_FCNTL capability right is present we can further reduce fcntls
that can be used with the new cap_fcntls_limit(2) syscall and retrive
them with cap_fcntls_get(2).
- To support ioctl and fcntl white-listing the filedesc structure was
heavly modified.
- The audit subsystem, kdump and procstat tools were updated to
recognize new syscalls.
- Capability rights were revised and eventhough I tried hard to provide
backward API and ABI compatibility there are some incompatible changes
that are described in detail below:
CAP_CREATE old behaviour:
- Allow for openat(2)+O_CREAT.
- Allow for linkat(2).
- Allow for symlinkat(2).
CAP_CREATE new behaviour:
- Allow for openat(2)+O_CREAT.
Added CAP_LINKAT:
- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
- Allow to be target for renameat(2).
Added CAP_SYMLINKAT:
- Allow for symlinkat(2).
Removed CAP_DELETE. Old behaviour:
- Allow for unlinkat(2) when removing non-directory object.
- Allow to be source for renameat(2).
Removed CAP_RMDIR. Old behaviour:
- Allow for unlinkat(2) when removing directory.
Added CAP_RENAMEAT:
- Required for source directory for the renameat(2) syscall.
Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
- Allow for unlinkat(2) on any object.
- Required if target of renameat(2) exists and will be removed by this
call.
Removed CAP_MAPEXEC.
CAP_MMAP old behaviour:
- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
PROT_WRITE.
CAP_MMAP new behaviour:
- Allow for mmap(2)+PROT_NONE.
Added CAP_MMAP_R:
- Allow for mmap(PROT_READ).
Added CAP_MMAP_W:
- Allow for mmap(PROT_WRITE).
Added CAP_MMAP_X:
- Allow for mmap(PROT_EXEC).
Added CAP_MMAP_RW:
- Allow for mmap(PROT_READ | PROT_WRITE).
Added CAP_MMAP_RX:
- Allow for mmap(PROT_READ | PROT_EXEC).
Added CAP_MMAP_WX:
- Allow for mmap(PROT_WRITE | PROT_EXEC).
Added CAP_MMAP_RWX:
- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).
Renamed CAP_MKDIR to CAP_MKDIRAT.
Renamed CAP_MKFIFO to CAP_MKFIFOAT.
Renamed CAP_MKNODE to CAP_MKNODEAT.
CAP_READ old behaviour:
- Allow pread(2).
- Disallow read(2), readv(2) (if there is no CAP_SEEK).
CAP_READ new behaviour:
- Allow read(2), readv(2).
- Disallow pread(2) (CAP_SEEK was also required).
CAP_WRITE old behaviour:
- Allow pwrite(2).
- Disallow write(2), writev(2) (if there is no CAP_SEEK).
CAP_WRITE new behaviour:
- Allow write(2), writev(2).
- Disallow pwrite(2) (CAP_SEEK was also required).
Added convinient defines:
#define CAP_PREAD (CAP_SEEK | CAP_READ)
#define CAP_PWRITE (CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_R (CAP_MMAP | CAP_SEEK | CAP_READ)
#define CAP_MMAP_W (CAP_MMAP | CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_X (CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
#define CAP_MMAP_RW (CAP_MMAP_R | CAP_MMAP_W)
#define CAP_MMAP_RX (CAP_MMAP_R | CAP_MMAP_X)
#define CAP_MMAP_WX (CAP_MMAP_W | CAP_MMAP_X)
#define CAP_MMAP_RWX (CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
#define CAP_RECV CAP_READ
#define CAP_SEND CAP_WRITE
#define CAP_SOCK_CLIENT \
(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
#define CAP_SOCK_SERVER \
(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
CAP_SETSOCKOPT | CAP_SHUTDOWN)
Added defines for backward API compatibility:
#define CAP_MAPEXEC CAP_MMAP_X
#define CAP_DELETE CAP_UNLINKAT
#define CAP_MKDIR CAP_MKDIRAT
#define CAP_RMDIR CAP_UNLINKAT
#define CAP_MKFIFO CAP_MKFIFOAT
#define CAP_MKNOD CAP_MKNODAT
#define CAP_SOCK_ALL (CAP_SOCK_CLIENT | CAP_SOCK_SERVER)
Sponsored by: The FreeBSD Foundation
Reviewed by: Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with: rwatson, benl, jonathan
ABI compatibility discussed with: kib
VM_OBJECT_LOCKED() macro is only used to implement a custom version
of lock assertions right now (which likely spread out thanks to
copy and paste).
Remove it and implement actual assertions.
Sponsored by: EMC / Isilon storage division
Reviewed by: alc
Tested by: pho
Replace the sub-optimal uma_zone_set_obj() primitive with more modern
uma_zone_reserve_kva(). The new primitive reserves before hand
the necessary KVA space to cater the zone allocations and allocates pages
with ALLOC_NOOBJ. More specifically:
- uma_zone_reserve_kva() does not need an object to cater the backend
allocator.
- uma_zone_reserve_kva() can cater M_WAITOK requests, in order to
serve zones which need to do uma_prealloc() too.
- When possible, uma_zone_reserve_kva() uses directly the direct-mapping
by uma_small_alloc() rather than relying on the KVA / offset
combination.
The removal of the object attribute allows 2 further changes:
1) _vm_object_allocate() becomes static within vm_object.c
2) VM_OBJECT_LOCK_INIT() is removed. This function is replaced by
direct calls to mtx_init() as there is no need to export it anymore
and the calls aren't either homogeneous anymore: there are now small
differences between arguments passed to mtx_init().
Sponsored by: EMC / Isilon storage division
Reviewed by: alc (which also offered almost all the comments)
Tested by: pho, jhb, davide
the objects zone ensures type-stability and thus we want to execute
actual lock initialization only when the objects are brought into the
zone otherwise there could be races between lock threads doing
re-initilization and other threads that want to acquire the lock
without a reference.
Sponsored by: EMC / Isilon storage division
Reported by: alc
macro VM_OBJECT_SLEEP().
This hides some implementation details like the usage of the msleep()
primitive and the necessity to access to the lock address directly.
For this reason VM_OBJECT_MTX() macro is now retired.
Sponsored by: EMC / Isilon storage division
Reviewed by: alc
Tested by: pho
really make sense for this comment to name specific backend allocators,
instead simply refer to backend allocators.
Sponsored by: EMC / Isilon Storage Division
more modern uma_zone_reserve_kva(). The difference is that it doesn't
rely anymore on an obj to allocate pages and the slab allocator doesn't
use any more any specific locking but atomic operations to complete
the operation.
Where possible, the uma_small_alloc() is instead used and the uk_kva
member becomes unused.
The subsequent cleanups also brings along the removal of
VM_OBJECT_LOCK_INIT() macro which is not used anymore as the code
can be easilly cleaned up to perform a single mtx_init(), private
to vm_object.c.
For the same reason, _vm_object_allocate() becomes private as well.
Sponsored by: EMC / Isilon storage division
Reviewed by: alc
* VM_OBJECT_LOCK and VM_OBJECT_UNLOCK are mapped to write operations
* VM_OBJECT_SLEEP() is introduced as a general purpose primitve to
get a sleep operation using a VM_OBJECT_LOCK() as protection
* The approach must bear with vm_pager.h namespace pollution so many
files require including directly rwlock.h
machine to another. Therefore, VM_MAX_KERNEL_ADDRESS can't be a constant.
Instead, #define it to be a variable, vm_max_kernel_address, just like we
do on sparc64.
Reviewed by: kib
Tested by: ian
the actual number of vm_page_t that will be derived, so v_page_count
should be used appropriately.
Besides that, add a panic condition in case UMA fails to properly
restrict the area in a way to keep all the desired objects.
Sponsored by: EMC / Isilon storage division
Reported by: alc
- Use predict_false() to tag boot-time cache decisions
- Compact boot-time cache allocation into a separate, non-inline,
function that won't be called most of the times.
Sponsored by: EMC / Isilon storage division
assertions and other code in this file.
- Reinsert some comments that were lost during the work but which are
actual yet, reducing differences with HEAD.
Sponsoed by: EMC / Isilon storage division
use a different scheme for preallocation: reserve few KB of nodes to be
used to cater page allocations before the memory can be efficiently
pre-allocated by UMA.
This at all effects remove boot_pages further carving and along with
this modifies to the boot_pages allocation system and necessity to
initialize the UMA zone before pmap_init().
Reported by: pho, jhb
includes path-compression. This greatly helps with sparsely populated
tries, where an uncompressed trie may end up by having a lot of
intermediate nodes for very little leaves.
The new algorithm introduces 2 main concepts: the node level and the
node owner. Every node represents a branch point where the leaves share
the key up to the level specified in the node-level (current level
excluded, of course). Such key partly shared is the one contained in
the owner. Of course, the root branch is exempted to keep a valid
owner, because theoretically all the keys are contained in the space
designed by the root branch node. The search algorithm seems very
intuitive and that is where one should start reading to understand the
full approach.
In the end, the algorithm ends up by demanding only one node per insert
and this is not necessary in all the cases. To stay safe, we basically
preallocate as many nodes as the number of physical pages are in the
system, using uma_preallocate(). However, this raises 2 concerns:
* As pmap_init() needs to kmem_alloc(), the nodes must be pre-allocated
when vm_radix_init() is currently called, which is much before UMA
is fully initialized. This means that uma_prealloc() will dig into the
UMA_BOOT_PAGES pool of pages, which is often not enough to keep track
of such large allocations.
In order to fix this, change a bit the concept of UMA_BOOT_PAGES and
vm.boot_pages. More specifically make the UMA_BOOT_PAGES an initial "value"
as long as vm.boot_pages and extend the boot_pages physical area by as
many bytes as needed with the information returned by
vm_radix_allocphys_size().
* A small amount of pages will be held in per-cpu buckets and won't be
accessible from curcpu, so the vm_radix_node_get() could really panic
when the pre-allocation pool is close to be exhausted.
In theory we could pre-allocate more pages than the number of physical
frames to satisfy such request, but as many insert would happen without
a node allocation anyway, I think it is safe to assume that the
over-allocation is already compensating for such problem.
On the field testing can stand me correct, of course. This could be
further helped by the case where we allow a single-page insert to not
require a complete root node.
The use of pre-allocation gets rid all the non-direct mapping trickery
and introduced lock recursion allowance for vm_page_free_queue.
The nodes children are reduced in number from 32 -> 16 and from 16 -> 8
(for respectively 64 bits and 32 bits architectures).
This would make the children to fit into cacheline for amd64 case,
for example, and in general spawn less cacheline, which may be
helpful in lookup_ge() case.
Also, path-compression cames to help in cases where there are many levels,
making the fallouts of such change less hurting.
Sponsored by: EMC / Isilon storage division
Reviewed by: jeff (partially)
Tested by: flo
- Avoid the return value for vm_radix_insert()
- Name the functions argument per-style(9)
- Avoid to get and return opaque objects but use vm_page_t as vm_radix is
thought to not really be general code but to cater specifically page
cache and resident cache.
in devfs if a particular race condition is hit in the device pager
code.
This was a side effect of change 227530 which changed the device
pager interface to call a new destructor routine for the cdev.
That destructor routine, old_dev_pager_dtor(), takes a VM object
handle.
The object handle is cast to a struct cdev *, and passed into
dev_rel().
That works in most cases, except the case in cdev_pager_allocate()
where there is a race condition between two threads allocating an
object backed by the same device. The loser of the race
deallocates its object at the end of the function.
The problem is that before inserting the object into the
dev_pager_object_list, the object's handle is changed from the
struct cdev pointer to the object's own address. This is to avoid
conflicts with the winner of the race, which already inserted an
object in the list with a handle that is a pointer to the same cdev
structure.
The object is then passed to vm_object_deallocate(), and eventually
makes its way down to old_dev_pager_dtor(). That function passes
the handle pointer (which is actually a VM object, not a struct
cdev as usual) into dev_rel(). dev_rel() decrements the reference
count in the assumed struct cdev (which happens to be 0), and
that triggers the assertion in dev_rel() that the reference count
is greater than or equal to 0.
The fix is to add a cdev pointer to the VM object, and use that
pointer when calling the cdev_pg_dtor() routine.
vm_object.h: Add a struct cdev pointer to the VM object
structure.
device_pager.c: In cdev_pager_allocate(), populate the new cdev
pointer.
In dev_pager_dealloc(), use the new cdev pointer
when calling the object's cdev_pg_dtor() routine.
Reviewed by: kib
Sponsored by: Spectra Logic Corporation
MFC after: 1 week