Even on Illumos, with its much larger KVA, the ZFS ARC steps back if KVA usage
reaches a certain threshold (3/4 on i386 or 16/17 otherwise). FreeBSD has
even less KVA, but had no such limit on archs with a direct map, such as amd64.
As a result, on machines with a lot of RAM, under load with very little
user-space memory pressure, such as `zfs send`, it was possible to reach a state
where there is enough of both physical RAM and KVA (I've seen up to 25-30%),
but no contiguous KVA range to allocate even a single 128KB I/O request.
Address this situation from two sides:
- restore KVA usage limitations in the way closest to Illumos;
- introduce a new requirement for KVA fragmentation, specifying that we
should have at least one contiguous KVA range of zfs_max_recordsize bytes.
Experiments show that the first limitation alone is not sufficient. On a
machine with 64GB of RAM it is sometimes necessary to drop up to half of the
ARC size to get at least one 1MB KVA chunk. Statically limiting ARC to half
of KVA/RAM is too strict, so the second limitation makes it work in cycles:
accumulate trash up to a certain critical mass, do a massive spring-cleaning,
and then start littering again. :)
MFC after: 1 month
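
The two limits described above can be sketched in plain C. This is only a
userspace illustration, not the ARC code: `struct free_seg`, `largest_free()`
and `kva_needs_reclaim()` are hypothetical names, and the array of free
segments stands in for whatever the kernel allocator actually tracks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one free KVA segment (not a kernel structure). */
struct free_seg {
	uint64_t start;
	uint64_t len;
};

/* Size of the largest single free segment, or 0 if there is none. */
static uint64_t
largest_free(const struct free_seg *segs, size_t nsegs)
{
	uint64_t best = 0;

	for (size_t i = 0; i < nsegs; i++)
		if (segs[i].len > best)
			best = segs[i].len;
	return (best);
}

/* Returns true if the cache should give memory back. */
static bool
kva_needs_reclaim(uint64_t kva_total, uint64_t kva_used,
    const struct free_seg *segs, size_t nsegs, uint64_t max_recordsize,
    bool is_i386)
{
	uint64_t limit;

	assert(kva_used <= kva_total);
	/* Usage limit: 3/4 of KVA on i386, 16/17 otherwise. */
	limit = is_i386 ? kva_total / 4 * 3 : kva_total / 17 * 16;
	if (kva_used >= limit)
		return (true);
	/* Fragmentation limit: one free range of max_recordsize bytes. */
	return (largest_free(segs, nsegs) < max_recordsize);
}
```

The second test is what makes the behaviour cyclic: usage can stay under the
first limit for a long time while the largest free range shrinks, and only
then does the function start returning true.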
Amd64 uses relocatable object files as the module format. This is good
in that it avoids the unneeded overhead of PIC code, in particular the
useless GOT and PLT. But the cost is that the module linking process
cannot use a hash to speed up symbol lookup, and that each reference
to a symbol requires its own relocation, instead of a single-place
relocation in the GOT.
Cache the successful symbol lookup results in the module symbol
table, using the newly allocated SHN_FBSD_CACHED value from the
SHN_LOOS-SHN_HIOS range as an indicator. The SHN_FBSD_CACHED marker,
together with the non-existent definition of the found symbol, is
reverted after successful relocation, which is done under the kld_sx
lock, so it should not be visible to other consumers of the symbol table.
Submitted by: Conrad Meyer
Differential Revision: https://reviews.freebsd.org/D1718
MFC after: 3 weeks
A couple of internal functions used by malloc(9) and uma truncated
a size_t down to an int. This could cause any number of issues
(e.g. indefinite sleeps, memory corruption) if any kernel
subsystem tried to allocate 2GB or more through malloc. zfs would
attempt such an allocation when run on a system with 2TB or more
of RAM.
Note to self: When this is MFCed, sparc64 needs the same fix.
Differential revision: https://reviews.freebsd.org/D2106
Reviewed by: kib
Reported by: Michael Fuckner <michael@fuckner.net>
Tested by: Michael Fuckner <michael@fuckner.net>
MFC after: 2 weeks
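
A small C sketch of the failure mode described above, assuming a platform
where int is 32 bits wide. The helper names `size_fits_int()` and
`pages_for()` are made up for illustration and are not the malloc(9) or UMA
internals; the point is only that a request of 2GB no longer fits in an int,
while arithmetic kept in size_t stays exact.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* True if a request size survives being passed through an int parameter. */
static bool
size_fits_int(size_t size)
{
	return (size <= (size_t)INT_MAX);
}

/* Page count for a request, computed without ever leaving size_t. */
static size_t
pages_for(size_t size, size_t page_size)
{
	assert(page_size != 0);
	return (size / page_size + (size % page_size != 0));
}
```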
ask for resource reclamation again.
This is kind of a dirty hack, but as a last resort it is better than being
stuck indefinitely because of KVA fragmentation, waiting until some random
event frees something sufficient. OpenSolaris also has this hack in its vmem(9).
MFC after: 2 weeks
In particular, the following DDB commands were added:
    show vmem <addr>
    show all vmem
    show vmemdump <addr>
    show all vmemdump
One possible use is to inspect KVA usage and fragmentation.
in this area and by the Clang static analyzer.
Remove some dead assignments.
Fix a typo in a panic string.
Use umtx_pi_disown() instead of duplicate code.
Use an existing variable instead of curthread.
Approved by: kib (mentor)
MFC after: 3 days
Sponsored by: Dell Inc
CPU, also add protection against invalid CPUs, and
split c_flags and c_iflags so that if a user plays with the active
flag (the one expected to be played with by callers in MPSAFE) without
a lock, it won't adversely affect the callout system by causing a corrupt
list. This also means that all callers need to use the macros and *not*
play with the flags directly (like netgraph used to).
Differential Revision: https://reviews.freebsd.org/D1894
Reviewed by: .. timed out but looked at by jhb, imp, adrian hselasky
tested by hiren and netflix.
Sponsored by: Netflix Inc.
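
The idea of the flag split can be sketched as below. All names are
hypothetical (prefixed with `sk_` / `SK_` to keep them apart from the real
callout(9) definitions) and the flag values are arbitrary: the caller-visible
"active" bit lives in one word, the bits the callout code relies on for list
management live in another, and callers only touch the former through macros.

```c
#include <assert.h>

#define	SK_CALLOUT_ACTIVE	0x0002	/* caller-visible, kept in c_flags */
#define	SK_CALLOUT_PENDING	0x0004	/* internal, kept in c_iflags */

struct sk_callout {
	int c_flags;	/* callers may change this, via the macros only */
	int c_iflags;	/* owned by the callout code, under its lock */
};

#define	sk_callout_active(c)					\
	(((c)->c_flags & SK_CALLOUT_ACTIVE) != 0)
#define	sk_callout_deactivate(c)				\
	((c)->c_flags &= ~SK_CALLOUT_ACTIVE)
#define	sk_callout_pending(c)					\
	(((c)->c_iflags & SK_CALLOUT_PENDING) != 0)
```

Because an unlocked read-modify-write by a caller can only ever rewrite
c_flags, it cannot lose an update to the internal bits in c_iflags.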
number of dynamically created and destroyed SYSCTLs during runtime it
is very likely that the current new OID number limit of 0x7fffffff can
be reached. Especially if dynamic OID creation and destruction results
from automatic tests. Additional changes:
- Optimize the typical use case by decrementing the next automatic OID
sequence number instead of incrementing it. This saves searching time
when inserting new OIDs into a fresh parent OID node.
- Add a simple check for duplicate non-automatic OID numbers.
MFC after: 1 week
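
Why counting down can save search time is easiest to see on a toy list. The
sketch below assumes that the children of a node are kept sorted by ascending
OID number, which the text above does not state; `struct sk_oid` and
`sk_oid_insert()` are hypothetical and not the sysctl code. The function
returns how many nodes it had to walk, or -1 when the number is already taken.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical child entry of a parent OID node. */
struct sk_oid {
	int number;
	struct sk_oid *next;
};

/*
 * Insert n into a list sorted by ascending number.  Returns the number
 * of nodes walked before the insertion point, or -1 on a duplicate.
 */
static int
sk_oid_insert(struct sk_oid **head, struct sk_oid *n)
{
	struct sk_oid **pp = head;
	int walked = 0;

	while (*pp != NULL && (*pp)->number < n->number) {
		pp = &(*pp)->next;
		walked++;
	}
	if (*pp != NULL && (*pp)->number == n->number)
		return (-1);	/* duplicate OID number */
	n->next = *pp;
	*pp = n;
	return (walked);
}
```

With numbers handed out in decreasing order every new entry lands at the head
of a fresh list, so nothing is walked; with increasing numbers the walk grows
with the list.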
delist_dev() function. In addition to this change:
- add a proper description of this function
- add a proper witness assert inside this function
- switch a nearby line to use the "cdp" pointer instead of cdev2priv()
MFC after: 3 days
This allows us to get rid of bzero which was added specifically to make
mtx_init on p_mtx reliable.
This also fixes a potential problem where mtx_init on other mutexes
could trip over uninitialized memory and fire an assertion.
Reviewed by: kib
proc_set_cred_init can be used to set the first credentials of a new
process.
Update the proc_set_cred assertions so that it only expects processes
that are already in use.
This fixes panics where p_ucred of a new process happens to be non-NULL.
Reviewed by: kib
Prior to this change the kernel would take p1's credentials and assign
them temporarily to p2. But p1 could change credentials at that time
and in effect give us a use-after-free.
No objections from: kib
named objects to zero before the virtual address is selected. Previously,
the color setting was delayed until after the virtual address was
selected. In rtld, this delay effectively prevented the mapping of a
shared library's code section using superpages. Now, for example, we see
the first 1 MB of libc's code on armv6 mapped by a superpage after we've
gotten through the initial cold misses that bring the first 1 MB of code
into memory. (With the page clustering that we perform on read faults,
this happens quickly.)
Differential Revision: https://reviews.freebsd.org/D2013
Reviewed by: jhb, kib
Tested by: Svatopluk Kraus (armv6)
MFC after: 6 weeks
- Use real locking, replace Giant with a global sx protecting the
subsystem. Since the subsystem's lock is no longer dropped during
the sleeps, remove the no longer needed SHMSEG_WANTED segment flag, and
revert r278963.
- To do the proper code simplification possible after the change of the
lock, restructure several functions into a _locked body and an
originally-named wrapper which calls into the _locked variant. This
allows eliminating the 'goto done2' spread over the code.
- Merge shm_find_segment_by_shmid() and shm_find_segment_by_shmidx().
- Consistently change all function prototypes to ANSI C.
Reviewed by: mjg (who has earlier version of the similar patch to
introduce real locking)
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
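
The "_locked body plus wrapper" shape mentioned above looks roughly like
this. It is a userspace sketch under stated assumptions: `struct fake_lock`
is a stand-in for the global sx lock, and `seg_get()` / `seg_get_locked()`
with their small table are hypothetical, not functions from the shm code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Stand-in for the subsystem-wide lock; it only records ownership. */
struct fake_lock {
	bool held;
};

static struct fake_lock subsys_lock;
static int seg_table[4];

static void
lock_acquire(struct fake_lock *l)
{
	assert(!l->held);
	l->held = true;
}

static void
lock_release(struct fake_lock *l)
{
	assert(l->held);
	l->held = false;
}

/* Body: expects the lock to be held, so it can simply return on error. */
static int
seg_get_locked(int idx, int *out)
{
	assert(subsys_lock.held);
	if (idx < 0 || idx >= 4)
		return (EINVAL);
	*out = seg_table[idx];
	return (0);
}

/* Wrapper: the single place that takes and drops the lock. */
static int
seg_get(int idx, int *out)
{
	int error;

	lock_acquire(&subsys_lock);
	error = seg_get_locked(idx, out);
	lock_release(&subsys_lock);
	return (error);
}
```

Since the unlock happens in exactly one place, the body needs no shared
cleanup label for its error paths.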
Previously format string traversal could happen while the string itself was
being modified.
Use allproc_lock as coredumping is a rare operation and as such we don't
have to create a dedicated lock.
Submitted by: Tiwei Bie <btw mail.ustc.edu.cn>
Reviewed by: kib
X-Additional: JuniorJobs project
Values smaller than two lead to strange asserts that have nothing to do
with the actual problem (in the case of size=0), or to writing beyond the
end of the allocated buffer in sbuf_finish() (in the case of size=1).
without a commit message...
Use sbuf_new() + SYSCTL_OUT() instead of wiring the userland buffer and
using sbuf_new_for_sysctl(). The preallocated 256 byte buffer is always
going to be big enough to hold these results, and this should be more
efficient than wiring the old buffer.
INCLUDENUL is set and sbuf_finish() has been called, the length has been
incremented to count the nulterm byte, and in that case current length is
allowed to be equal to buffer size, otherwise it must be less than.
Add a predicate macro to test for SBUF_INCLUDENUL, and use it in tests, to
be consistent with the style in the rest of this file.
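
The length rule described above can be written down as a small predicate.
This is a sketch, not the sbuf(9) source: `struct sk_sbuf`, the `SK_*` flag
values and `sk_len_ok()` are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define	SK_INCLUDENUL	0x0002	/* nulterm byte is counted in the length */
#define	SK_FINISHED	0x0004	/* the finish step has already run */

/* Hypothetical, stripped-down buffer descriptor. */
struct sk_sbuf {
	size_t size;	/* capacity of the buffer */
	size_t len;	/* current length */
	int flags;
};

#define	SK_ISNULINCLUDED(s)	(((s)->flags & SK_INCLUDENUL) != 0)
#define	SK_ISFINISHED(s)	(((s)->flags & SK_FINISHED) != 0)

/*
 * After finishing with INCLUDENUL set, the length already counts the
 * nulterm byte and may equal the size; otherwise it must be smaller.
 */
static bool
sk_len_ok(const struct sk_sbuf *s)
{
	if (SK_ISNULINCLUDED(s) && SK_ISFINISHED(s))
		return (s->len <= s->size);
	return (s->len < s->size);
}
```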
A comment in the code stated that we PROC_LOCK and as a side effect guarantee
that all writers have released the process lock. But at that point the lock had
already been taken while we were removing the process from all lists, so the
process should already be unreachable.
The goal here is to provide one place for altering process credentials.
This eases debugging and opens up possibilities to do additional work when
such an action is performed.
strings returned to userland include the nulterm byte.
Some uses of sbuf_new_for_sysctl() write binary data rather than strings;
clear the SBUF_INCLUDENUL flag after calling sbuf_new_for_sysctl() in
those cases. (Note that the sbuf code still automatically adds a nulterm
byte in sbuf_finish(), but since it's not included in the length it won't
get copied to userland along with the binary data.)
Remove explicit adding of a nulterm byte in a couple places now that it
gets done automatically by the sbuf drain code.
PR: 195668
The SBUF_INCLUDENUL flag causes the nulterm byte at the end of the string
to be counted in the length of the data. If copying the data using the
sbuf_data() and sbuf_len() functions, or if writing it automatically with
a drain function, the net effect is that the nulterm byte is copied along
with the rest of the data.
drivers can use it. This avoids some code duplication. Add missing
default case to all switch statements while at it. Also move the
hashing of the IPv6 flow field to layer 4 because the IPv6 flow field
is constant on a per L4 connection basis and not on a per L3 network.
Differential Revision: https://reviews.freebsd.org/D1987
Sponsored by: Mellanox Technologies
MFC after: 1 month
A late change to the SR-IOV infrastructure broke passthrough of
VFs. device_set_devclass() was being used to try to force the
ppt driver to attach to the device, but this didn't work because
the DF_FIXEDCLASS flag wasn't being set on the device, so the
ppt driver probe routine would not match when it returned
BUS_NOWILDCARD. Fix this by adding a new device function that
both sets the devclass and sets the DF_FIXEDCLASS flag, and use
that to force the ppt driver to attach to VFs.
Differential Revision: https://reviews.freebsd.org/D2041
Reviewed by: jhb
MFC after: 3 weeks
in kern_gzio.c. The old gzio interface was somewhat inflexible and has not
worked properly since r272535: currently, the gzio functions are called with
a range lock held on the output vnode, but kern_gzio.c does not pass the
IO_RANGELOCKED flag to vn_rdwr() calls, resulting in deadlock when vn_rdwr()
attempts to reacquire the range lock. Moreover, the new gzio interface can
be used to implement kernel core compression.
This change also modifies the kernel configuration options needed to enable
userland core dump compression support: gzio is now an option rather than a
device, and the COMPRESS_USER_CORES option is removed. Core dump compression
is enabled using the kern.compress_user_cores sysctl/tunable.
Differential Revision: https://reviews.freebsd.org/D1832
Reviewed by: rpaulo
Discussed with: kib
executables. The goal here, not yet accomplished, is to let the e500 kernel
run under QEMU by setting KERNBASE to something that fits in low memory and
then having the kernel relocate itself at runtime.