These changes prevent sysctl(8) from returning proper output,
such as:
1) no output from sysctl(8)
2) erroneously returning ENOMEM with tools like truss(1)
or uname(1)
truss: can not get etype: Cannot allocate memory
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.
Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.
MFC after: 2 weeks
Sponsored by: Mellanox Technologies
To reduce the diff struct pcu.cnt field was not renamed, so
PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in
kvm(3) and vmstat(8). The goal was to not affect externally used KPI.
Bump __FreeBSD_version_ in case some out-of-tree module/code relies on the
the global cnt variable.
Exp-run revealed no ports using it directly.
No objection from: arch@
Sponsored by: EMC / Isilon Storage Division
reclaim the last preexisting cached page in the object, resulting in a call
to vdrop(). Detect this scenario so that the vnode's hold count is
correctly maintained. Otherwise, we panic.
Reported by: scottl
Tested by: pho
Discussed with: attilio, jeff, kib
for nodes used in vm_radix.
On architectures supporting direct mapping, also avoid to pre-allocate
the KVA for such nodes.
In order to do so make the operations derived from vm_radix_insert()
to fail and handle all the deriving failure of those.
vm_radix-wise introduce a new function called vm_radix_replace(),
which can replace a leaf node, already present, with a new one,
and take into account the possibility, during vm_radix_insert()
allocation, that the operations on the radix trie can recurse.
This means that if operations in vm_radix_insert() recursed
vm_radix_insert() will start from scratch again.
Sponsored by: EMC / Isilon storage division
Reviewed by: alc (older version)
Reviewed by: jeff
Tested by: pho, scottl
functions, reverse the numbering scheme for the levels. The highest
numbered level in the tree now appears near the root instead of the leaves.
Sponsored by: EMC / Isilon Storage Division
change the way that these functions ascend the tree when the search for a
matching leaf fails at an interior node. Rather than returning to the root
of the tree and repeating the lookup with an updated key, maintain a stack
of interior nodes that were visited during the descent and use that stack
to resume the lookup at the closest ancestor that might have a matching
descendant.
Sponsored by: EMC / Isilon Storage Division
Reviewed by: attilio
Tested by: pho
the number of interior nodes, we have previously created a level zero
interior node at the root of every non-empty trie, even when that node is
not strictly necessary, i.e., it has only one child. This change is the
second (and final) step in eliminating those unnecessary level zero interior
nodes. Specifically, it updates the deletion and insertion functions so
that they do not require a level zero interior node at the root of the trie.
For a "buildworld" workload, this change results in a 16.8% reduction in the
number of interior nodes allocated and a similar reduction in the average
execution time for lookup functions. For example, the average execution
time for a call to vm_radix_lookup_ge() is reduced by 22.9%.
Reviewed by: attilio, jeff (an earlier version)
Sponsored by: EMC / Isilon Storage Division
the number of interior nodes, we always create a level zero interior node at
the root of every non-empty trie, even when that node is not strictly
necessary, i.e., it has only one child. This change is the first step in
eliminating those unnecessary level zero interior nodes. Specifically, it
updates all of the lookup functions so that they do not require a level zero
interior node at the root.
Reviewed by: attilio, jeff (an earlier version)
Sponsored by: EMC / Isilon Storage Division
arrange for all of the fields to start at a short offset from the
beginning of the structure.
Eliminate unnecessary masking of VM_RADIX_FLAGS from the root pointer in
vm_radix_getroot().
Sponsored by: EMC / Isilon Storage Division
vm_radix_topage(). This transformation eliminates some unnecessary
conditional branches from the inner loops of vm_radix_insert(),
vm_radix_lookup{,_ge,_le}(), and vm_radix_remove().
Simplify the control flow of vm_radix_lookup_{ge,le}().
Reviewed by: attilio (an earlier version)
Tested by: pho
Sponsored by: EMC / Isilon Storage Division
using vm_radix_node_page() == NULL, the compiler is able to generate one
less conditional branch when vm_radix_isleaf() is used. More use cases
involving the inner loops of vm_radix_insert(), vm_radix_lookup{,_ge,_le}(),
and vm_radix_remove() will follow.
Reviewed by: attilio
Sponsored by: EMC / Isilon Storage Division
that could never be reached in vm_radix_insert(). (If the pointer being
checked by the panic call were ever NULL, the immmediately preceding loop
would have already crashed on a NULL pointer dereference.)
Reviewed by: attilio (an earlier version)
Sponsored by: EMC / Isilon Storage Division
vm_radix_node_get() with a small change to vm_radix_reclaim_allnodes_int().
This change further reduced the average number of cycles per
vm_page_insert() call from 532 to 519.
Reviewed by: attilio
Sponsored by: EMC / Isilon Storage Division
"index". The content of a radix tree leaf, or at least its "key", is not
opaque to the other radix tree operations. Specifically, they know how to
extract the "key" from a leaf. So, eliminating the parameter "index" isn't
breaking the abstraction. Moreover, eliminating the parameter "index"
effectively prevents the caller from passing an inconsistent "index" and
leaf to vm_radix_insert().
Reviewed by: attilio
Sponsored by: EMC / Isilon Storage Division
for allocating the nodes before to have the possibility to carve
directly from the UMA subsystem.
Sponsored by: EMC / Isilon storage division
Reviewed by: alc
more modern uma_zone_reserve_kva(). The difference is that it doesn't
rely anymore on an obj to allocate pages and the slab allocator doesn't
use any more any specific locking but atomic operations to complete
the operation.
Where possible, the uma_small_alloc() is instead used and the uk_kva
member becomes unused.
The subsequent cleanups also brings along the removal of
VM_OBJECT_LOCK_INIT() macro which is not used anymore as the code
can be easilly cleaned up to perform a single mtx_init(), private
to vm_object.c.
For the same reason, _vm_object_allocate() becomes private as well.
Sponsored by: EMC / Isilon storage division
Reviewed by: alc
the actual number of vm_page_t that will be derived, so v_page_count
should be used appropriately.
Besides that, add a panic condition in case UMA fails to properly
restrict the area in a way to keep all the desired objects.
Sponsored by: EMC / Isilon storage division
Reported by: alc
- Use predict_false() to tag boot-time cache decisions
- Compact boot-time cache allocation into a separate, non-inline,
function that won't be called most of the times.
Sponsored by: EMC / Isilon storage division
use a different scheme for preallocation: reserve few KB of nodes to be
used to cater page allocations before the memory can be efficiently
pre-allocated by UMA.
This at all effects remove boot_pages further carving and along with
this modifies to the boot_pages allocation system and necessity to
initialize the UMA zone before pmap_init().
Reported by: pho, jhb
includes path-compression. This greatly helps with sparsely populated
tries, where an uncompressed trie may end up by having a lot of
intermediate nodes for very little leaves.
The new algorithm introduces 2 main concepts: the node level and the
node owner. Every node represents a branch point where the leaves share
the key up to the level specified in the node-level (current level
excluded, of course). Such key partly shared is the one contained in
the owner. Of course, the root branch is exempted to keep a valid
owner, because theoretically all the keys are contained in the space
designed by the root branch node. The search algorithm seems very
intuitive and that is where one should start reading to understand the
full approach.
In the end, the algorithm ends up by demanding only one node per insert
and this is not necessary in all the cases. To stay safe, we basically
preallocate as many nodes as the number of physical pages are in the
system, using uma_preallocate(). However, this raises 2 concerns:
* As pmap_init() needs to kmem_alloc(), the nodes must be pre-allocated
when vm_radix_init() is currently called, which is much before UMA
is fully initialized. This means that uma_prealloc() will dig into the
UMA_BOOT_PAGES pool of pages, which is often not enough to keep track
of such large allocations.
In order to fix this, change a bit the concept of UMA_BOOT_PAGES and
vm.boot_pages. More specifically make the UMA_BOOT_PAGES an initial "value"
as long as vm.boot_pages and extend the boot_pages physical area by as
many bytes as needed with the information returned by
vm_radix_allocphys_size().
* A small amount of pages will be held in per-cpu buckets and won't be
accessible from curcpu, so the vm_radix_node_get() could really panic
when the pre-allocation pool is close to be exhausted.
In theory we could pre-allocate more pages than the number of physical
frames to satisfy such request, but as many insert would happen without
a node allocation anyway, I think it is safe to assume that the
over-allocation is already compensating for such problem.
On the field testing can stand me correct, of course. This could be
further helped by the case where we allow a single-page insert to not
require a complete root node.
The use of pre-allocation gets rid all the non-direct mapping trickery
and introduced lock recursion allowance for vm_page_free_queue.
The nodes children are reduced in number from 32 -> 16 and from 16 -> 8
(for respectively 64 bits and 32 bits architectures).
This would make the children to fit into cacheline for amd64 case,
for example, and in general spawn less cacheline, which may be
helpful in lookup_ge() case.
Also, path-compression cames to help in cases where there are many levels,
making the fallouts of such change less hurting.
Sponsored by: EMC / Isilon storage division
Reviewed by: jeff (partially)
Tested by: flo