it to a random value between 100 and 1123, rather than 0 as before.
Submitted by: Marie Helene Kvello-Aune <marieheleneka@gmail.com>
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D5336
namecache_ts differs from mere namecache by few fields placed mid struct.
The access to the last element (the name) is thus special-cased.
The standard solution is to put new fields at the very beginning anad
embedd the original struct. The pointer shuffled around points to the
embedded part. If needed, access to new fields can be gained through
__containerof.
MFC after: 1 week
While these locks are guarnteed to not share their respective cache lines,
their current placement leaves unnecessary holes in lines which preceeded them.
For instance the annotation of vm_page_queue_free_mtx allows 2 neighbour
cachelines (previously separate by the lock) to be collapsed into 1.
The annotation is only effective on architectures which have it implemented in
their linker script (currently only amd64). Thus locks are not converted to
their not-padaligned variants as to not affect the rest.
MFC after: 1 week
It would be better to fix API consumers to not pass NULL there - most of them,
such as gmirror, already contain the neccessary checks - but this is easier
and much less error-prone.
One known user-visible result is that it fixes panic on a failed "graid label".
PR: 221846
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
This adds support in pass(4) for data to be described with a
scatter-gather list (sglist) to augment the existing (single) virtual
address.
Differential Revision: https://reviews.freebsd.org/D11361
Submitted by: Chuck Tuffli
Reviewed by: imp@, scottl@, kenm@
The old code allowed calling vdrop() before insmntque() to place the vnode back
onto the freelist for later recycling. Some downstream consumers may rely on
this support. Normally insmntque() failing is fine since is uses vgone() and
immediately frees the vnode rather than attempting to add it to the freelist if
vdrop() were used instead.
Also assert that vhold() cannot be used on such a vnode.
Reviewed by: kib, cem, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D12126
Print the full conflicting oid path, and include the function name in the
warning so it is clear that the warnings are sysctl-related.
PR: 221853
Submitted by: Fabian Keil <fk AT fabiankeil.de> (earlier version)
Sponsored by: Dell EMC Isilon
Improve scheduler performance by flattening nonsensical topology layers
(layers with only one child don't serve any purpose).
This is especially relevant on non-AMD Zen systems after r322776. On my
dual core Intel laptop, this brings the kern.sched.topology_spec table down
from three levels to two.
Submitted by: jeff
Reviewed by: attilio
Sponsored by: Dell EMC Isilon
The AIO job holds a reference on the associated file descriptor, so the
socket's count should already be > 0. This fixes a LOR with the socket
buffer lock after recent socket locking changes in HEAD.
Sponsored by: Chelsio Communications
removal of the "blk" parameter from blst_meta_alloc() had the unintended
effect of generating an out-of-range allocation when the cursor reaches
the end of the tree if the number of managed blocks in the tree equals
the so-called "radix" (which in the blist code is not the standard notion
of what a radix is but rather the maximum number of leaves in a tree of
the current height.) In other words, only certain swap configurations
were affected, which is why earlier testing did not reveal the problem.
Submitted by: Doug Moore <dougm@rice.edu>
Reported by: pho, kib
Tested by: pho
X-MFC with: r322459
Differential Revision: https://reviews.freebsd.org/D12106
via soisdisconnected(), and in the earlier unlock earlier to avoid lock
recursion.
This fixes a situation when a socket on accept queue is reset before being
accepted.
Reported by: Jason Eggleston <jeggleston llnw.com>
A follow-up to r322836.
Warnings for the unused declaration were breaking some second tier
architectures, but did not show up in Clang on x86.
Reported by: markj (ddb.4), emaste (declaration)
Sponsored by: Dell EMC Isilon
When debugging a deadlock, it is useful to follow the full chain of locks as
far as possible.
Reviewed by: jhb
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D12115
Rather than repeatedly nesting loops, separate concerns with a single loop
per call stack level. Use a table to drive the recursive routine. Handle
missing topology layers more gracefully (infer a single unit).
Analyze some additional optional layers which may be present on e.g. AMD Zen
systems (groups, aka dies, per package; and cachegroups, aka CCXes, per
group).
Display that additional information in the boot-time topology information,
when it is relevent (non-one).
Reviewed by: markj@, mjoras@ (earlier version)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D12019
This mode allows other clean buffers to arrive while we flush the buf
lists for the vnode, which is fine for the targeted use. We only need
that all buffers existed at the time of the function start were
flushed. In fact, only one assert has to be relaxed.
In collaboration with: pho
Reviewed by: rmacklem
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
X-Differential revision: https://reviews.freebsd.org/D12083
Right now we only need to pad when writing kernel dump headers, so
flatten three related subroutines into one. The encrypted kernel dump
code already writes out its key in a dumper.blocksize-sized block.
No functional change intended.
Reviewed by: cem, def
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D11647
This helps simplify the code in kern_shutdown.c and reduces the number
of globally visible functions.
No functional change intended.
Reviewed by: cem, def
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D11603
dump_start() and dump_finish() are responsible for writing kernel dump
headers, optionally writing the key when encryption is enabled, and
initializing the initial offset into the dump device.
Also remove the unused dump_pad(), and make some functions static now that
they're only called from kern_shutdown.c.
No functional change intended.
Reviewed by: cem, def
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D11584
sbuf is filled to capacity by vsnprintf(), the loop exits without error, and
the sbuf is not marked as auto-extendable.
SBUF_HASROOM() evaluates true if there is room for one or more non-NULL
characters, but in the case that the sbuf was filled exactly to capacity,
SBUF_HASROOM() evaluates false. Consequently, sbuf_vprintf() incorrectly
assigns an ENOMEM error to the sbuf when in fact everything is fine, in turn
poisoning the buffer for all subsequent operations.
Correct by moving the ENOMEM assignment into the loop where it can be made
unambiguously.
As a related safety net change, explicitly check for the zero bytes drained
case in sbuf_drain() and set EDEADLK as the error. This avoids an infinite loop
in sbuf_vprintf() if a drain function were to inadvertently return a value of
zero to sbuf_drain().
Reviewed by: cem, jtl, gallatin
MFC after: 2 weeks
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D8535
during drain operations. When an sbuf is configured to use this feature by way
of the SBUF_DRAINTOEOR sbuf_new() flag, top-level sections started with
sbuf_start_section() create a record boundary marker that is used to avoid
flushing partial records.
Reviewed by: cem,imp,wblock
MFC after: 2 weeks
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D8536
it automatically after it runs.
The config_intrhook mechanism allows a driver to stall the boot process
until device(s) required for booting are available, by not allowing system
inits to proceed until all intrhook functions have been unregistered.
Virtually all existing code simply unregisters from within the hook function
when it gets called.
This new function makes that common usage more convenient. Instead of
allocating and filling in a struct, passing it to a function that might (in
theory) fail, and checking the return code, now a driver can simply call
this cannot-fail routine, passing just the intrhook function and its arg.
Differential Revision: https://reviews.freebsd.org/D11963
another parameter that identifies a starting point in the memory address
block. Radix is a power of two, blk is a multiple of radix, and the
starting point is in the range [blk, blk+radix), so that blk can always be
computed from the other two. This change drops the blk parameter from the
meta functions and computes it instead. (On amd64, for example, this
change reduces subr_blist.o's text size by 7%.)
It also makes the radix parameters unsigned to address concerns that the
calculation of '-radix' might overflow without the -fwrapv option. (See
https://reviews.freebsd.org/D11819.)
Submitted by: Doug Moore <dougm@rice.edu>
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D11964
cpumask it probably means we are unable to sent interrupts to CPUs outside
the map. As such only return the current CPU when it's within the mask
otherwise return the first valid CPU.
This is needed on ThunderX as, in a dual socket configuration, we are
unable to send MSI/MSI-X interrupts between sockets.
Reviewed by: mmel
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D11957
vm_page_grab() on consecutive page indices. Besides simplifying the code
in the caller, vm_page_grab_pages() allows for batching optimizations.
For example, the current implementation replaces calls to vm_page_lookup()
on consecutive page indices by cheaper calls to vm_page_next().
Reviewed by: kib, markj
Tested by: pho (an earlier version)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D11926
p1003_1b.aio_listio_max is now a tunable. Its value is reflected in the
sysctl of the same name, and the sysconf(3) variable _SC_AIO_LISTIO_MAX.
Its value will be bounded from below by the compile-time constant
AIO_LISTIO_MAX and from above by the compile-time constant
MAX_AIO_QUEUE_PER_PROC and the tunable vfs.aio.max_aio_queue.
Reviewed by: jhb, kib
MFC after: 3 weeks
Relnotes: yes
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D11601
o Replace __riscv64 with (__riscv && __riscv_xlen == 64)
This is required to support new GCC 7.1 compiler.
This is compatible with current GCC 6.1 compiler.
RISC-V is extensible ISA and the idea here is to have built-in define
per each extension, so together with __riscv we will have some subset
of these as well (depending on -march string passed to compiler):
__riscv_compressed
__riscv_atomic
__riscv_mul
__riscv_div
__riscv_muldiv
__riscv_fdiv
__riscv_fsqrt
__riscv_float_abi_soft
__riscv_float_abi_single
__riscv_float_abi_double
__riscv_cmodel_medlow
__riscv_cmodel_medany
__riscv_cmodel_pic
__riscv_xlen
Reviewed by: ngie
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D11901
When all instances of a lock type are destroyed (for example, after a
module unload), the corresponding witness entry remains associated with
that lock type. In this case, we shouldn't panic if a new instance of the
lock type is created and its lock class does not match that recorded in the
witness entry.
Reviewed by: jhb
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D11788
'skip', which denote, respectively, the largest number of blocks that can be
managed by a subtree of that height, and one less than the number of nodes
in a subtree of that height. This change removes the 'skip' argument from
those functions because 'skip' can be trivially computed from 'radius'.
This change also redefines 'skip' so that it denotes the number of nodes in
the subtree, and so changes loop upper bound tests from '<= skip' to '<
skip' to account for the change.
The 'skip' field is also removed from the blist struct.
The self-test program is changed so that the print command includes the
cursor value in the output.
Submitted by: Doug Moore <dougm@rice.edu>
MFC after: 1 week
Linux specific things to the native fdescfs file system.
Unlike FreeBSD, the Linux fdescfs is a directory containing a symbolic
links to the actual files, which the process has open.
A readlink(2) call on this file returns a full path in case of regular file
or a string in a special format (type:[inode], anon_inode:<file-type>, etc..).
As well as in a FreeBSD, opening the file in the Linux fdescfs directory is
equivalent to duplicating the corresponding file descriptor.
Here we have mutually exclusive requirements:
- in case of readlink(2) call fdescfs lookup() method should return VLNK
vnode otherwise our kern_readlink() fail with EINVAL error;
- in the other calls fdescfs lookup() method should return non VLNK vnode.
For what new vnode v_flag VV_READLINK was added, which is set if fdescfs has beed
mounted with linrdlnk option an modified kern_readlinkat() to properly handle it.
For now For Linux ABI compatibility mount fdescfs volume with linrdlnk option:
mount -t fdescfs -o linrdlnk null /compat/linux/dev/fd
Reviewed by: kib@
MFC after: 1 week
Relnotes: yes
Atomic updates to v_wire_count are a significant source of contention, so
combine multiple updates into one in this easy case. Also remove an old
printf that gets executed if the page is shared-busied, which is a case
that will lead to a panic anyway.
Reviewed by: alc, kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D11791
request that their clock_settime() methods be called at a given offset
from top-of-second. This adds a timeout_task to the rtc_instance so that
each clock can be separately added to taskqueue_thread with the scheduling
it prefers, instead of looping through all the clocks at once with a
single task on taskqueue_thread. If a driver doesn't call clock_schedule()
the default is the old behavior: clock_settime() is queued immediately.
The motivation behind this is that I was on the path of adding identical
code to a third RTC driver to figure out a delta to top-of-second and
sleep for that amount of time because writing the the RTC registers resets
the hardware's concept of top-of-second. (Sometimes it's not top-of-second,
some RTC clocks tick over a half second after you set their time registers.)
Worst-case would be to sleep for almost a full second, which is a rude thing
to do on a shared task queue thread.
over the scheduling precision than 'ticks' can offer, and because sometimes
you're already working with sbintime_t units and it's dumb to convert them
to ticks just so they can get converted back to sbintime_t under the hood.
No functional change.
This is handy for FreeBSD derivatives that want to modify the value of
MAXPATHLEN, but not the kld_file_stat ABI.
Submitted by: Siddhant Agarwal <sagarwal AT isilon.com>
Sponsored by: Dell EMC Isilon
New kern.lognosys values are
1 - log to ctty
2 - log to console
3 - log to both.
Inspired by: eugen
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
"leaf" functions for alloc, free, and fill. After the change, the interface
functions call "meta" unconditionally, and the "meta" functions recur
unconditionally in looping over their descendants. The "meta" functions
start with a validity test, and then a test for the "leaf" case, before
falling into the general recursive case. This simplifies and shrinks the
code, and, for "free" and "fill" moves panic tests that check the same meta
node repeatedly in a loop to a place that will have each node tested once.
Remove irrelevant null checks from blist_free and blist_fill.
Make the code that initializes a meta node the same in blist_meta_alloc and
blist_meta_fill.
Parenthesize return expressions in blst_meta_fill.
Submitted by: Doug Moore <dougm@rice.edu>
MFC after: 1 week
Most realtime clocks store the year as 2 BCD digits. Some add a century bit
to extend the range another hundred years. Every clock driver has its own
code to determine the century and pass a full year value to clock_ct_to_ts().
Now clock drivers can just convert BCD to bin and store the result in the
clocktime struct and let the common code figure out the century. Clocks
with a century bit can just add 100 to year if the century bit is on.
of-line whitespace, remove excessive whitespace and blank lines, remove
dead code, follow our standard style for function definitions, and
correct grammatical and factual errors in some of the comments.
Submitted by: Doug Moore <dougm@rice.edu>
MFC after: 1 week
that start in 1970, assume most conversions are going to be for recent dates
and use a precomputed number of days through the end of 2016.
This is a do-over of r320997, hopefully this time with 100% more workiness.
The first attempt had an off-by-one error, but instead of just adding
another mysterious +1 adjustment, this rearranges the relationship between
recent_base_year and recent_base_days so that the latter is the number of
days that occurred before the start of the associated year (instead of the
count thru the end of that year). This makes the recent_base stuff work
more like the original loop logic that didn't need any +1 adjustments.
This appears to have been an oversight in r213536.
Reviewed by: markj
MFC after: 1 week
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D11521
the IO type (Admin or NVM) using XPT op-codes XPT_NVME_ADMIN or
XPT_NVME_IO.
Submitted by: Chuck Tuffli <chuck@tuffli.net>
Differential Revision: https://reviews.freebsd.org/D10247
Using the https://github.com/google/capsicum-test/ suite, the
PosixMqueue.CapModeForked test was failing due to an ECAPMODE after
calling kmq_notify(). On further inspection, the dynamically
loaded syscall entry was initialized with sy_flags zeroed out, since
SYSCALL_INIT_HELPER() left sysent.sy_flags with the default value.
Add a new helper SYSCALL{,32}_INIT_HELPER_F() which takes an
additional argument to specify the sy_flags value.
Submitted by: Siva Mahadevan <smahadevan@freebsdfoundation.org>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D11576
Make the %b formatter accept number formatting flags. It will now accept
alternate form, precision, and length modifiers. It also now partially
supports field width (but forces left justification).
Reviewed by: markj
Approved by: markj (mentor)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D11284
on clock drivers.
This tracks multiple concurrent realtime clock drivers in a list sorted by
clock resolution. When system time changes (and periodically) the
clock_settime() methods of all registered clocks are invoked.
To initialize system time, each driver is tried in turn from best to worst
resolution, until one succesfully returns a valid time.
The code no longer holds a mutex while calling the clock_settime() and
clock_gettime() methods of the registered clocks. This allows clock drivers
to do whatever kind of locking or sleeping is necessary (this is especially
important for i2c clock chips since i2c drivers often need to sleep).
A new clock_register_flags() function allows the clock driver to pass
flags. The flags currently defined help support drivers that use their own
techniques to avoid roundoff errors (prevents the 4/5 rounding done by the
subr_rtc code). A driver which may need to wait for resources (such as bus
ownership) may pass a flag to indicate that it will obtain system time for
itself after waiting for resources; this is merely an optimization to avoid
the common code retrieving a timespec that will never get used.
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D11484
Uiomove can only block when the segflag is UIO_USERSPACE,
otherwise we end up just doing a bcopy (or nothing) and
moving cursors. So only emit witness warnings and
set deadlock thread flags in the UIO_USERSPACE case.
Reviewed by: kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D11489
socket structure into a listening socket. This resulted in an invalid
instruction fault for all 32-bit platforms.
When INVARIANTS is set the union where the two uninitialized fields
reside gets properly zeroed. This patch ensures the two uninitialized
fields are zeroed when INVARIANTS is undefined.
For 64-bit platforms this issue was not visible because so->sol_upcall
which is uninitialized overlaps with so->so_rcv.sb_state which is
already zero during soalloc();
For 32-bit platforms this issue was visible and resulted in an invalid
instruction fault, because so->sol_upcall overlaps with
so->so_rcv.sb_sel which is always initialized to a valid data pointer
during soalloc().
Verifying the offset locations mentioned above are identical is left
as an exercise to the reader.
PR: 220452
PR: 220358
Reviewed by: ae (network), gallatin
Differential Revision: https://reviews.freebsd.org/D11475
Sponsored by: Mellanox Technologies
The vm_map_fixed() and vm_map_stack() VM functions return Mach error
codes. Convert them into errno values before returning result from
exec_new_vmspace().
While there, modernize the comment and do minor style adjustments.
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
performance.
To find in the leaf bitmap all ranges of sufficient length, use a doubling
strategy with shift-and-and until each bit still set represents a bit
sequence of length 'count', or until the bitmask is zero. In the latter
case, update the hint based on the first bit sequence length not found to
be available. For example, seeking an interval of length 12, the set bits
of the bitmap would represent intervals of length 1, then 2, then 3, then
6, then 12. If no bits are set at the point when each bit represents an
interval of length 6, then the hint can be updated to 5 and the search
terminated.
If long-enough intervals are found, discard those before the cursor. If
any remain, use binary search to find the position of the first of them,
and allocate that interval.
Submitted by: Doug Moore <dougm@rice.edu>
Reviewed by: kib, markj
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D11426
recycles the current vm space. Otherwise, an mlockall(MCL_FUTURE) could
still be in effect on the process after an execve(2), which violates the
specification for mlockall(2).
It's pointless for vm_map_stack() to check the MEMLOCK limit. It will
never be asked to wire the stack. Moreover, it doesn't even implement
wiring of the stack.
Reviewed by: kib, markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D11421
Process core notes for a 32-bit process running on a 64-bit host need to
use 32-bit structures so that the note layout matches the layout of notes
of a core dump of a 32-bit process under a 32-bit kernel.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D11407
struct g_kevent_args.
On some architectures, e.g. PowerPC, there is additional padding in uap.
Reported and tested by: andreast
Sponsored by: The FreeBSD Foundation
and "next_skip" variables. The "skip" value in struct blist has long been
a 64-bit quantity but various functions have implicitly truncated this
value to 32 bits. Now, all arithmetic involving the "skip" value is 64
bits wide. (This should allow us to relax the size limit on a swap device
in the swap pager.)
Maintain the ability to test this allocator as a user-space application by
including <stdbool.h>.
Remove an unused variable from blst_radix_print().
Reviewed by: kib, markj
MFC after: 4 weeks
Differential Revision: https://reviews.freebsd.org/D11358
It distinguishes between data flow sockets and listening sockets, and
in case of the latter doesn't change resource limits, since listening
sockets don't hold any buffers, they only carry values to be inherited
by their children.
Most of the lock slowpaths assert that the calling thread isn't an idle
thread. However, this may not be true if the system has panicked, and in
some cases the assertion appears before a SCHEDULER_STOPPED() check.
MFC after: 3 days
Sponsored by: Dell EMC Isilon
device nodes.
Otherwise, the current check of aio_offset == -1LL makes it possible
to pass negative file offsets down to the filesystems. This trips
assertions and is even unsafe for e.g. FFS which keeps metadata at
negative offsets.
Reported and tested by: pho
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D11266
that disk writes are more likely to be sequential. This change is
beneficial on both the solid state and mechanical disks that I've
tested. (A similar change in allocation policy was made by DragonFly
BSD in 2013 to speed up Poudriere with "stressful memory parameters".)
Increase the width of blst_meta_alloc()'s parameter "skip" and the local
variables whose values are derived from it to 64 bits. (This matches the
width of the field "skip" that is stored in the structure "blist" and
passed to blst_meta_alloc().)
Eliminate a pointless check for a NULL blist_t.
Simplify blst_meta_alloc()'s handling of the ALL-FREE case.
Address nearby style errors.
Reviewed by: kib, markj
MFC after: 5 weeks
Differential Revision: https://reviews.freebsd.org/D11247
By making MAXBCACHEBUF a tunable, it can be increased to allow for
larger read/write data sizes for the NFS client.
The tunable is limited to MAXPHYS, which is currently 128K.
Making MAXPHYS a tunable or increasing its value is being discussed,
since it would be nice to support a read/write data size of 1Mbyte
for the NFS client when mounting the AmazonEFS file service.
Reviewed by: kib
MFC after: 2 weeks
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D10991
This change implements NOTE_ABSTIME flag for EVFILT_TIMER, which
specifies that the data field contains absolute time to fire the
event.
To make this useful, data member of the struct kevent must be extended
to 64bit. Using the opportunity, I also added ext members. This
changes struct kevent almost to Apple struct kevent64, except I did
not changed type of ident and udata, the later would cause serious API
incompatibilities.
The type of ident was kept uintptr_t since EVFILT_AIO returns a
pointer in this field, and e.g. CHERI is sensitive to the type
(discussed with brooks, jhb).
Unlike Apple kevent64, symbol versioning allows us to claim ABI
compatibility and still name the new syscall kevent(2). Compat shims
are provided for both host native and compat32.
Requested by: bapt
Reviewed by: bapt, brooks, ngie (previous version)
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D11025
Display the mbuf/cluster count for a sockbuf and fix a couple whitespace
issues in the output.
Reviewed by: jhb, markj (both previous version)
Approved by: markj (mentor)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D11062
This makes ddb show files more descriptive and also adjusts the
whitespace to align the columns for non-32-bit architectures.
Reviewed by: cem (previous version), jhb
Approved by: markj (mentor)
Differential Revision: https://reviews.freebsd.org/D11061
additional allocation overhead. Previously, blst_meta_alloc() updated the
hint after every successful allocation. However, these "eager" hint
updates are of no actual benefit if, instead, the "lazy" hint update at
the start of blst_meta_alloc() is generalized to handle all cases where
the number of available blocks is less than the requested allocation.
Previously, the lazy hint update at the start of blst_meta_alloc() only
handled the ALL-FULL case. (I would also note that this change provides
consistency between blist_alloc() and blist_fill() in that their hint
maintenance is now entirely lazy.)
Eliminate unnecessary checks for terminators in blst_meta_alloc() and
blst_meta_fill() when handling ALL-FREE meta nodes.
Eliminate the field "bl_free" from struct blist. It is redundant. Unless
the entire radix tree is a single leaf, the count of free blocks is stored
in the root node. Instead, provide a function blist_avail() for obtaining
the number of free blocks.
In blst_meta_alloc(), perform a sanity check on the allocation once rather
than repeating it in a loop over the meta node's children.
In blst_leaf_fill(), use the optimized bitcount*() function instead of a
loop to count the blocks being allocated.
Add or improve several comments.
Address some nearby style errors.
Reviewed by: kib
MFC after: 6 weeks
Differential Revision: https://reviews.freebsd.org/D11146
struct thread.
For all architectures, the syscall trap handlers have to allocate the
structure on the stack. The structure takes 88 bytes on 64bit arches
which is not negligible. Also, it cannot be easily found by other
code, which e.g. caused duplication of some members of the structure
to struct thread already. The change removes td_dbg_sc_code and
td_dbg_sc_nargs which were directly copied from syscall_args.
The structure is put into the copied on fork part of the struct thread
to make the syscall arguments information correct in the child after
fork.
This move will also allow several more uses shortly.
Reviewed by: jhb (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
X-Differential revision: https://reviews.freebsd.org/D11080
what this field represented was also inaccurate.) Suggested by: kib
In r178792, blist_create() grew a malloc flag, allowing M_NOWAIT to be
specified. However, blist_create() was not modified to handle the
possibility that a malloc() call failed. Address this omission.
Increase the width of the local variable "radix" to 64 bits. (This
matches the width of the corresponding field in struct blist.)
Reviewed by: kib
MFC after: 6 weeks
quantity as the size of the range to fill, but returns a 32-bit quantity
as the number of blocks that were allocated to fill that range. This
revision corrects that mismatch. Currently, swaponsomething() limits
the size of a swap area to prevent arithmetic arithmetic overflow in
other parts of the blist allocator. That limit has also prevented this
type mismatch from causing problems.
Reviewed by: kib, markj
MFC after: 6 weeks
Differential Revision: https://reviews.freebsd.org/D11096
Provide a new mode "2" which returns a special overflow indicator in
the non-representable field instead of the silent truncation (mode
"0") or EOVERFLOW (mode "1").
In particular, the typical use of st_ino to detect hard links with
mode "2" reports false positives, which might be more suitable for
some uses.
Discussed with: bde
Sponsored by: The FreeBSD Foundation
o Separate fields of struct socket that belong to listening from
fields that belong to normal dataflow, and unionize them. This
shrinks the structure a bit.
- Take out selinfo's from the socket buffers into the socket. The
first reason is to support braindamaged scenario when a socket is
added to kevent(2) and then listen(2) is cast on it. The second
reason is that there is future plan to make socket buffers pluggable,
so that for a dataflow socket a socket buffer can be changed, and
in this case we also want to keep same selinfos through the lifetime
of a socket.
- Remove struct struct so_accf. Since now listening stuff no longer
affects struct socket size, just move its fields into listening part
of the union.
- Provide sol_upcall field and enforce that so_upcall_set() may be called
only on a dataflow socket, which has buffers, and for listening sockets
provide solisten_upcall_set().
o Remove ACCEPT_LOCK() global.
- Add a mutex to socket, to be used instead of socket buffer lock to lock
fields of struct socket that don't belong to a socket buffer.
- Allow to acquire two socket locks, but the first one must belong to a
listening socket.
- Make soref()/sorele() to use atomic(9). This allows in some situations
to do soref() without owning socket lock. There is place for improvement
here, it is possible to make sorele() also to lock optionally.
- Most protocols aren't touched by this change, except UNIX local sockets.
See below for more information.
o Reduce copy-and-paste in kernel modules that accept connections from
listening sockets: provide function solisten_dequeue(), and use it in
the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4),
infiniband, rpc.
o UNIX local sockets.
- Removal of ACCEPT_LOCK() global uncovered several races in the UNIX
local sockets. Most races exist around spawning a new socket, when we
are connecting to a local listening socket. To cover them, we need to
hold locks on both PCBs when spawning a third one. This means holding
them across sonewconn(). This creates a LOR between pcb locks and
unp_list_lock.
- To fix the new LOR, abandon the global unp_list_lock in favor of global
unp_link_lock. Indeed, separating these two locks didn't provide us any
extra parralelism in the UNIX sockets.
- Now call into uipc_attach() may happen with unp_link_lock hold if, we
are accepting, or without unp_link_lock in case if we are just creating
a socket.
- Another problem in UNIX sockets is that uipc_close() basicly did nothing
for a listening socket. The vnode remained opened for connections. This
is fixed by removing vnode in uipc_close(). Maybe the right way would be
to do it for all sockets (not only listening), simply move the vnode
teardown from uipc_detach() to uipc_close()?
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D9770
subtree is already zero, then setting the "largest contiguous free block"
hint for that subtree to anything other than zero makes no sense. To be
clear, assigning a value to the hint that is too large is not a correctness
problem, only a pessimization.
Dragonfly BSD has applied the same change to blst_meta_alloc() but not
blst_meta_fill().
MFC after: 6 weeks
This happens when closing a socket with upcall, and trace is: soclose()->
... protocol ... -> soisdisconnected() -> socantrcvmore_locked() ->
sowakeup() -> soisconnected().
Right now this case is innocent for two reasons. First, soisconnected()
doesn't clear SS_ISDISCONNECTED flag. Second, the mutex to lock the
socket is the socket receive buffer mutex, and sodisconnected() first
disables the receive buffer. But in future code, the mutex to lock
socket is different to buffer mutex, and we would get undesired mutex
recursion.
The fix is to check SS_ISDISCONNECTED flag before calling upcall.
testing purposes. However, over the years, various changes to the kernel
have broken this feature. This revision applies some fixes to get user-
space compilation working again. There are no changes in this revision
to code that is used by the kernel.
MFC after: 3 days
pager used a different scheme for striping the allocation of swap space
across multiple devices. And, although blist_fill() was intended to support
fill operations with large counts, the old striping scheme never performed a
fill larger than the stripe size. Consequently, the misplacement of a
sanity check in blst_meta_fill() went undetected. Now, moving forward in
time to r118390, a new scheme for striping was introduced that maintained a
blist allocator per device, but as noted in r318995, swapoff_one() was not
fully and correctly converted to the new scheme. This change completes what
was started in r318995 by fixing the underlying bug in blst_meta_fill() that
stops swapoff_one() from simply performing a single blist_fill() operation.
Reviewed by: kib
MFC after: 5 days
Differential Revision: https://reviews.freebsd.org/D11043
You may now optionally specify allow.noreserved_ports to prevent root
inside a jail from using privileged ports (less than 1024)
PR: 217728
Submitted by: Matt Miller <mattm916@pulsar.neomailbox.ch>
Reviewed by: jamie, cem, smh
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D10202
short, half of the memory that is allocated to implement the radix tree is
wasted because we did not change "u_daddr_t" to be a 64-bit unsigned int
when we changed "daddr_t" to be a 64-bit (signed) int. (See r96849 and
r96851.)
Reviewed by: kib, markj
Tested by: pho
MFC after: 5 days
Differential Revision: https://reviews.freebsd.org/D11028
inode number or link count for the ABI compat binaries.
Right now, and by default after the change, too large 64bit values are
silently truncated to 32 bits. Enabling the knob causes the system to
return EOVERFLOW for stat(2) family of compat syscalls when some
values cannot be completely represented by the old structures. For
getdirentries(2), knob skips the dirents which would cause non-trivial
truncation of d_ino.
EOVERFLOW error is specified by the X/Open 1996 LFS document
('Adding Support for Arbitrary File Sizes to the Single UNIX
Specification').
Based on the discussion with: bde
Sponsored by: The FreeBSD Foundation
Since ino64 expanded dev_t to 64bit, make VOP_GETATTR(9) provide all
bits of mnt_stat.f_fsid as va_fsid for vnodes on filesystems which use
f_fsid. In particular, NFSv3 and sometimes NFSv4, and ZFS use this
method or reporting st_dev by stat(2).
Provide a new helper vn_fsid() to avoid duplicating code to copy
f_fsid to va_fsid.
Note that the change is mostly cosmetic. Its motivation is to avoid
sign-extension of f_fsid[0] into 64bit dev_t value which happens after
dev_t becomes 64bit..
Reviewed by: avg(zfs), rmacklem (nfs) (both for previous version)
Sponsored by: The FreeBSD Foundation
bhyve was recently sandboxed with capsicum, and needs to be able to
control the CPU sets of its vcpu threads
Reviewed by: emaste, oshogbo, rwatson
MFC after: 2 weeks
Sponsored by: ScaleEngine Inc.
Differential Revision: https://reviews.freebsd.org/D10170
Extend the ino_t, dev_t, nlink_t types to 64-bit ints. Modify
struct dirent layout to add d_off, increase the size of d_fileno
to 64-bits, increase the size of d_namlen to 16-bits, and change
the required alignment. Increase struct statfs f_mntfromname[] and
f_mntonname[] array length MNAMELEN to 1024.
ABI breakage is mitigated by providing compatibility using versioned
symbols, ingenious use of the existing padding in structures, and
by employing other tricks. Unfortunately, not everything can be
fixed, especially outside the base system. For instance, third-party
APIs which pass struct stat around are broken in backward and
forward incompatible ways.
Kinfo sysctl MIBs ABI is changed in backward-compatible way, but
there is no general mechanism to handle other sysctl MIBS which
return structures where the layout has changed. It was considered
that the breakage is either in the management interfaces, where we
usually allow ABI slip, or is not important.
Struct xvnode changed layout, no compat shims are provided.
For struct xtty, dev_t tty device member was reduced to uint32_t.
It was decided that keeping ABI compat in this case is more useful
than reporting 64-bit dev_t, for the sake of pstat.
Update note: strictly follow the instructions in UPDATING. Build
and install the new kernel with COMPAT_FREEBSD11 option enabled,
then reboot, and only then install new world.
Credits: The 64-bit inode project, also known as ino64, started life
many years ago as a project by Gleb Kurtsou (gleb). Kirk McKusick
(mckusick) then picked up and updated the patch, and acted as a
flag-waver. Feedback, suggestions, and discussions were carried
by Ed Maste (emaste), John Baldwin (jhb), Jilles Tjoelker (jilles),
and Rick Macklem (rmacklem). Kris Moore (kris) performed an initial
ports investigation followed by an exp-run by Antoine Brodin (antoine).
Essential and all-embracing testing was done by Peter Holm (pho).
The heavy lifting of coordinating all these efforts and bringing the
project to completion were done by Konstantin Belousov (kib).
Sponsored by: The FreeBSD Foundation (emaste, kib)
Differential revision: https://reviews.freebsd.org/D10439
Previously open(2) was allowed in capability mode, with a comment that
suggested this was likely the case to facilitate debugging. The system
call would still fail later on, but it's better to disallow the syscall
altogether.
We now have the kern.trap_enotcap sysctl or PROC_TRAPCAP_CTL proccontrol
to aid in debugging.
In any case libc has translated open() to the openat syscall since
r277032.
Reviewed by: kib, rwatson
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D10850
A long long time ago the register keyword told the compiler to store
the corresponding variable in a CPU register, but it is not relevant
for any compiler used in the FreeBSD world today.
ANSIfy related prototypes while here.
Reviewed by: cem, jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D10193
This function permits a range of one scatter/gather list to be appended to
another sglist. This can be used to construct a scatter/gather list that
reorders or duplicates ranges from one or more existing scatter/gather
lists.
Sponsored by: Chelsio Communications
Previously, when the VI_TRYLOCK failed, we would spin under the mutex
that protects the vnode active list until we either succeeded or
noticed that we had hogged the CPU. Since we were violating the lock
order, this would guarantee that we would become a hog under any
deadlock condition (e.g. a race with vdrop(9) on the same vnode). In
the presence of many concurrent threads in sync(2) or vdrop etc, the
victim could hang for a long time.
Now, avoid spinning by dropping and reacquiring the locks in the
conventional lock order when the trylock fails. This requires a dance
with the vnode hold count.
Submitted by: Tom Rix <trix@juniper.net>
Tested by: pho
Differential revision: https://reviews.freebsd.org/D10692
is blocked. The spurious wakeup might result in spurious EINTR.
The reschedule_signals() function is called when the calling thread
has the signal mask changed. For each newly blocked signal, we try to
find a thread which might have the signal not blocked. If no such
thread exists, sigtd() returns random thread, which must not be waken
up. I decided that re-checking, as suggested by PR submitter, is more
reasonable change than to change sigtd() interface, due to other uses
of sigtd(). signotify() already performs this check.
Submitted by: Duane <parakleta@darkreality.org>
PR: 219228
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
When a thread enters ptracestop(), for example because it had received
SIGSTOP from ptrace(PT_ATTACH), it attempts to suspend other threads in
the same process. In the case of a thread sleeping interruptibly in an
SBDRY section, sig_suspend_threads() must wake the thread and allow it to
reach the user-mode boundary. However, sig_suspend_threads() would
erroneously avoid waking up such threads, resulting in an apparent hang.
Reviewed by: kib
Tested by: pho
MFC after: 2 weeks
Sponsored by: Dell EMC Isilon
the code auto-generated for *.m - kobj_lookup_method(9) is useful;
for example in back-ends or base class device drivers in order to
determine whether a default method has been overridden. Thus, allow
for the kobj_method_t pointer argument - used by KOBJOPLOOKUP in
order to update the cache entry - of kobj_lookup_method(9), to be
NULL. Actually, that pointer is redundant as it's just set to the
same kobj_method_t that the kobj_lookup_method(9) function returns
in the first place, but probably it serves to reduce the number of
instructions generated for KOBJOPLOOKUP.
- For the same reason, move updating kobj_lookup_{hits,misses} (if
KOBJ_STATS is defined) from kobj_lookup_method(9) to KOBJOPLOOKUP.
As a side-effect, this gets rid of the convoluted approach of always
incrementing kobj_lookup_hits in KOBJOPLOOKUP and then in case of
a cache miss, decrementing it in kobj_lookup_method(9) again.
The previous misuse of sys_sigqueue() was sending random register or
stack garbage to 64-bit targets. The freebsd32 implementation preserves
the sival_int member of value when signaling a 64-bit process.
Document the mixed ABI implementation of union sigval and the
incompability of sival_ptr with pointer integrity schemes.
Reviewed by: kib, wblock
MFC after: 1 week
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D10605
Add IRQ placement-only and ithread-only API variants. intr_event_bind
has been extended with sibling methods, as it has many more callsites in
existing code.
Reviewed by: kib@, adrian@ (earlier version)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D10586
Some notes:
- Only i386 and amd64 layouts are checked, other Tier-1 (or close to
it) architectures would benefit from the same check.
- Unconditional enabling of the asserts depend on the stability of locks
memory layout. If locks are optimized to avoid bloat when some debugging
or profiling features turned off, it makes sense to only assert layout
for production configs.
Reviewed by: badger, emaste, jhb, vangyzen
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D10526
This check has been redundant since it was introduced in r162554.
Reviewed by: emaste, glebius
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D10322
in place. To do per-cpu stats, convert all fields that previously were
maintained in the vmmeters that sit in pcpus to counter(9).
- Since some vmmeter stats may be touched at very early stages of boot,
before we have set up UMA and we can do counter_u64_alloc(), provide an
early counter mechanism:
o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter.
o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter,
so that at early stages of boot, before counters are allocated we already
point to a counter that can be safely written to.
o For sparc64 that required a whole dummy pcpu[MAXCPU] array.
Further related changes:
- Don't include vmmeter.h into pcpu.h.
- vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit,
to match kernel representation.
- struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exclusion.
This is based on benno@'s 4-year old patch:
https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html
Reviewed by: kib, gallatin, marius, lidl
Differential Revision: https://reviews.freebsd.org/D10156
This fixes some panics after disconnecting mounted disks.
Submitted by: imp (slightly different version, which I've then lost)
Reviewed by: kib, imp, mckusick
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D9674
by the shutdown(2) system call. This ability has been lost as part of the svn
revision 285910.
Reviewed by: ed, rwatson, glebius, hiren
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D10351
do for streaming sockets.
And do more cleanup in the sbappendaddr_locked_internal() to prevent
leak information from existing mbuf to the one, that will be possible
created later by netgraph.
Suggested by: glebius
Tested by: Irina Liakh <spell at itl ua>
MFC after: 1 week
The arm64 binutils only accepts 0 as an offset to the Load-Acquire Register
instructions where llvm will acceps both 0 and 0x0. The thread switching
code uses these with SCHED_ULE to block waiting for a lock to be released.
As the offset of the data to be loaded is zero this is safe, however it is
useful to keep the offset in the instruction to document what is being
loaded.
To work around this issue in binutils only generate the 0x prefix for
non-zero values.
Reported by: kan
Sponsored by: DARPA, AFRL
The MFC will include a compat definition of smp_no_rendevous_barrier()
that calls smp_no_rendezvous_barrier().
Reviewed by: gnn, kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D10313
This matches the getcwd() definition.
This is technically an ABI change, but that would only effect 64-bit
big-endian platforms that pass arguments on the stack. We have none of
those.
Reviewed by: jhb
Obtained from: CheriABI
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D9428
was issued during VM-initiated i/o (pageout), so that the function
does not try to flush or remove pages or wait for the vm object
paging-in-progress counter.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
X-Differential revision: https://reviews.freebsd.org/D10241
Don't zero unused pointer members again.
Per discussion with secteam we are not issuing an advisory for this
issue as we have no current evidence it leaks exploitable information.
Reviewed by: rwatson, glebius, delphij
MFC after: 1 day
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D10227
As posix_fadvise() does not lock the vnode argument, don't capture
detailed vnode information for the time being.
Obtained from: TrustedBSD Project
MFC after: 3 weeks
Sponsored by: DARPA, AFRL
This requires minor changes to the audit framework to allow capturing
paths that are not filesystem paths (i.e., will not be canonicalised
relative to the process current working directory and/or filesystem
root).
Obtained from: TrustedBSD Project
MFC after: 3 weeks
Sponsored by: DARPA, AFRL
resulting in a process dumping core in the corefile.
Also extend procstat to view select members of 'struct ptrace_lwpinfo'
from the contents of the note.
Sponsored by: Dell EMC Isilon
map the 'which' argument into a suitable audit event identifier for the
specific operation requested.
Obtained from: TrustedBSD Project
MFC after: 3 weeks
Sponsored by: DARPA, AFRL
that used to work via the bold hack).
Fix the table entry for bright black. Fix spelling of plain black in
nearby table entries (use the macro for black everywhere everywhere).
Fix the currently-unused non-bright color table to not have bright
colors in entries 9-15.
Improve nearby comments. Start converting to the xterm terminology
and default rendering of "bright" instead of "light" for bright
colors.
Syscons wasn't affected by the bug since I optimized it a little by
converting colors 0-15 directly. This also fixes the layering of
the conversion for these colors.
Apply the same optimization to vt (actually the layer above it). This
also moves the conversion 1 closer to the correct layer for colors
0-15.
The optimization of just avoiding 2 calls to a trivial function is worth
about 10% for simple output to the virtual buffer with occasional
rendering. The optimization is so large because the 2 calls are done
on every character, so although there are too many other calls and
other instructions per character, there are only about 10 times as
many. Old versions of syscons were about 10 times faster for simple
output, by using a fast path with about 12 instructions per character.
Rendering to even slow hardware takes relatively little time provided
it is rarely actually done.
crash when the file shrinks. This also fixes sendfile(2) not sending more
data in a case when the file grows, and the request is open-ended or
specifies a size that is greater than old file size.
PR: 217789
Reviewed by: gallatin
MFC after: 10 days
The existing ELF image activator requires the brandinfo to provide such
a string unconditionally, even if the executable format in question
doesn't use this type of branding. Skip matching when it's a null
pointer.
Reviewed by: kib
MFC after: 2 weeks
This is done so that the thread state changes during the switch
are not confused with the thread state changes reported when the thread
spins on a lock.
Here is an example, three consecutive entries for the same thread (from top to
bottom):
KTRGRAPH group:"thread", id:"zio_write_intr_3 tid 100260", state:"sleep", attributes: prio:84, wmesg:"-", lockname:"(null)"
KTRGRAPH group:"thread", id:"zio_write_intr_3 tid 100260", state:"spinning", attributes: lockname:"sched lock 1"
KTRGRAPH group:"thread", id:"zio_write_intr_3 tid 100260", state:"running", attributes: none
The above trace could leave an impression that the final state of
the thread was "running".
After this change the sleep state will be reported after the "spinning"
and "running" states reported for the sched lock.
Reviewed by: jhb, markj
MFC after: 1 week
Sponsored by: Panzura
Differential Revision: https://reviews.freebsd.org/D9961
matches static binaries.
Interpretation of the 'static' there is that the binary must not
specify an interpreter. In particular, shared objects are matched by
the brand if BI_CAN_EXEC_DYN is also set.
This improves precision of the brand matching, which should eliminate
surprises due to brand ordering.
Revert r315701.
Discussed with and tested by: ed (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
This will provide a slightly better smoking gun than just stating
"can't remove non-dynamic nodes!" when calling sysctl_ctx_free(9)
and sysctl_remove_{name,oid}(9) with a non-dynamic (likely
static) sysctl.
MFC after: 1 week
Sponsored by: Dell EMC Isilon
Because of integer types, the timeout calculation result was limited to
INT_MAX / (1000 * hz) seconds. For systems with hz=10000, this is only 215
seconds. Perform the calculation with 64-bit math to allow sleeping for the
full INT_MAX / hz interval (215000 seconds on such hz=10000 systems).
Submitted by: Scott Ferris <sferris at isilon.com>
Sponsored by: Dell EMC Isilon
We must ensure there's space for the terminating null in the temporary
buffer in imgact_binmisc_populate_interp().
Note that there's no buffer overflow here because xbe->xbe_interpreter's
length and null termination is checked in imgact_binmisc_add_entry()
before imgact_binmisc_populate_interp() is called. However, the latter
should correctly enforce its own bounds.
Reviewed by: sbruno
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D10042
vm_pindex_t's into a vm_ooffset_t.
The length given to shm_dotruncate() must never be negative. Assert this.
Tidy up a comment.
Reviewed by: kib
MFC after: 1 week
Add a clock_nanosleep() syscall, as specified by POSIX.
Make nanosleep() a wrapper around it.
Attach the clock_nanosleep test from NetBSD. Adjust it for the
FreeBSD behavior of updating rmtp only when interrupted by a signal.
I believe this to be POSIX-compliant, since POSIX mentions the rmtp
parameter only in the paragraph about EINTR. This is also what
Linux does. (NetBSD updates rmtp unconditionally.)
Copy the whole nanosleep.2 man page from NetBSD because it is complete
and closely resembles the POSIX description. Edit, polish, and reword it
a bit, being sure to keep any relevant text from the FreeBSD page.
Reviewed by: kib, ngie, jilles
MFC after: 3 weeks
Relnotes: yes
Sponsored by: Dell EMC
Differential Revision: https://reviews.freebsd.org/D10020
Typically, when elf_load_section() unconditionally passed VM_PROT_ALL to
elf_map_insert(), it was needlessly enabling execute access on the
mapping, and it would later have to call vm_map_protect() to correct the
mapping's access rights. Now, instead, elf_load_section() always passes
its parameter "prot" to elf_map_insert(). So, elf_load_section() must
only call vm_map_protect() if it needs to remove the write access that
was temporarily granted to perform a copyout().
Reviewed by: kib
MFC after: 1 week
nanosleep() updates rmtp on EINVAL. In that case, kern_nanosleep()
has not updated rmt, so sys_nanosleep() updates the user-space rmtp
by copying garbage from its stack frame. This is not only a kernel
memory disclosure, it's also not POSIX-compliant. Fix it to update
rmtp only on EINTR.
Reviewed by: jilles (via D10020), dchagin
MFC after: 3 days
Security: possibly
Sponsored by: Dell EMC
Differential Revision: https://reviews.freebsd.org/D10044
for vt. Restore syscons' rendering of background (bg) brightness as
foreground (fg) blinking and vice versa, and add rendering of blinking
as background brightness to vt.
Bright/saturated is conflated with light/white in the implementation
and in this description.
Bright colors were broken in all cases, but appeared to work in the
only case shown by "vidcontrol show". A boldness hack was applied
only in 1 layering-violation place (for some syscons sequences) where
it made some cases seem to work but was undone by clearing bold using
ANSI sequences, and more seriously was not undone when setting
ANSI/xterm dark colors so left them bright. Move this hack to drivers.
The boldness hack is only for fg brightness. Restore/add a similar hack
for bg brightness rendered as fg blinking and vice versa. This works
even better for vt, since vt changes the default text mode to give the
more useful bg brightness instead of fg blinking.
The brightness bit in colors was unnecessarily removed by the boldness
hack. In other cases, it was lost later by teken_256to8(). Use
teken_256to16() to not lose it. teken_256to8() was intended to be
used for bg colors to allow finer or bg-specific control for the more
difficult reduction to 8; however, since 16 bg colors actually work
on VGA except in syscons text mode and the conversion isn't subtle
enough to significantly in that mode, teken_256to8() is not used now.
There are still bugs, especially in vidcontrol, if bright/blinking
background colors are set.
Restore XOR logic for bold/bright fg in syscons (don't change OR
logic for vt). Remove broken ifdef on FG_UNDERLINE and its wrong
or missing bit and restore the correct hard-coded bit. FG_UNDERLINE
is only for mono mode which is not really supported.
Restore XOR logic for blinking/bright bg in syscons (in vt, add
OR logic and render as bright bg). Remove related broken ifdef
on BG_BLINKING and its missing bit and restore the correct
hard-coded bit. The same bit means blinking or bright bg depending
on the mode, and we want to ignore the difference everywhere.
Simplify conversions of attributes in syscons. Don't pretend to
support bold fonts. Don't support unusual encodings of brightness.
It is as good as possible to map 16 VGA colors to 16 xterm-16
colors. E.g., VGA brown -> xterm-16 Olive will be converted back
to VGA brown, so we don't need to convert to xterm-256 Brown. Teken
cons25 compatibility code already does the same, and duplicates some
small tables. This is mostly for the sc -> te direction. The other
direction uses teken_256to16() which is too generic.
The ptrace() user has the option of discarding the signal. In such a
case, p_ptevents should not be modified. If the ptrace() user decides to
send a SIGKILL, ptevents will be cleared in ptracestop(). procfs events
do not have the capability to discard the signal, so continue to clear
the mask in that case.
Reviewed by: jhb (initial revision)
MFC after: 1 week
Sponsored by: Dell EMC
Differential Revision: https://reviews.freebsd.org/D9939
uma_zcreate()'s alignment argument is supposed to be sizeof(foo) - 1,
and uma.h provides a set of helper macros for common types. Passing
sizeof(void *) results in all of the members being misaligned triggering
unaligned access faults on certain architectures (notably MIPS).
Reported by: brooks
Obtained from: CheriBSD
MFC after: 3 days
Sponsored by: DARPA / AFRL
reviewing all uses of OFF_TO_IDX(), I observed that
vm_object_page_noreuse() is requiring an exclusive lock on the object
when, in fact, a shared lock suffices.
Reviewed by: kib, markj
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D10011
The callout may reschedule itself and execute again before callout_drain()
returns, but we should not clear CALLOUT_ACTIVE until the callout is
stopped.
Tested by: pho
MFC after: 2 weeks
Sponsored by: Dell EMC Isilon
I moved this branch from github to a private server, and pulled from the
wrong one when committing r315280, so I failed to include two recent commits.
Thankfully, they were only cosmetic and were included in the review.
Specifically:
Add documentation, polish comments, and improve style(9).
Tested by: pho (r315280)
MFC after: 2 weeks
Sponsored by: Dell EMC
Differential Revision: https://reviews.freebsd.org/D9791
POSIX 2008 says this about clock_settime(2):
If the value of the CLOCK_REALTIME clock is set via clock_settime(),
the new value of the clock shall be used to determine the time
of expiration for absolute time services based upon the
CLOCK_REALTIME clock. This applies to the time at which armed
absolute timers expire. If the absolute time requested at the
invocation of such a time service is before the new value of
the clock, the time service shall expire immediately as if the
clock had reached the requested time normally.
Setting the value of the CLOCK_REALTIME clock via clock_settime()
shall have no effect on threads that are blocked waiting for
a relative time service based upon this clock, including the
nanosleep() function; nor on the expiration of relative timers
based upon this clock. Consequently, these time services shall
expire when the requested relative interval elapses, independently
of the new or old value of the clock.
When the real-time clock is adjusted, such as by clock_settime(3),
wake any threads sleeping until an absolute real-clock time.
Such a sleep is indicated by a non-zero td_rtcgen. The sleep functions
will set that field to zero and return zero to tell the caller
to reevaluate its sleep duration based on the new value of the clock.
At present, this affects the following functions:
pthread_cond_timedwait(3)
pthread_mutex_timedlock(3)
pthread_rwlock_timedrdlock(3)
pthread_rwlock_timedwrlock(3)
sem_timedwait(3)
sem_clockwait_np(3)
I'm working on adding clock_nanosleep(2), which will also be affected.
Reported by: Sebastian Huber <sebastian.huber@embedded-brains.de>
Reviewed by: jhb, kib
MFC after: 2 weeks
Relnotes: yes
Sponsored by: Dell EMC
Differential Revision: https://reviews.freebsd.org/D9791
When sending SIGCHLD informing reaper that a zombie was reparented to
it, we might race with the situation where the previous parent still
not finished delivering SIGCHLD and having its p_ksi structure on the
signal queue. While on queue, the ksi should not be used for another
send.
Fix this by copying p_ksi into newly allocated ksi, which is directly
put onto reaper sigqueue. The later ensures that siginfo for reaper
SIGCHLD is always present, similar to guarantees for siginfo of child.
Reported by: bdrewery
Discussed with: jilles
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
For such segments, GNU bfd linker writes knowingly incorrect value
into the the file offset field of the program header entry, with the
motivation that file should not be mapped for creation of this segment
at all.
Relax checks for the ELF structure validity when on-disk segment
length is zero, and explicitely set mapping length to zero for such
segments to avoid validating rounding arithmetic.
PR: 217610
Reported by: Robert Clausecker <fuz@fuz.su>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
overly large allocation requests.
When ktrace-ing io, sys_kevent() allocates memory to copy the
requested changes and reported events. Allocations are sized by the
incoming syscall lengths arguments, which are user-controlled, and
might cause overflow in calculations or too large allocations.
Since io trace chunks are limited by ktr_geniosize, there is no sense
it even trying to satisfy unbounded allocations. Export ktr_geniosize
and clamp the buffers sizes in advance.
PR: 217435
Reported by: Tim Newsham <tim.newsham@nccgroup.trust>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
This function may be called recursively, when a module pulls its dependencies.
Under certain circumstances, e.g. quad chain of dependencies and presence
of dtrace we may run out of stack.
Suppose a traced process is stopped in ptracestop() due to receipt of a
SIGSTOP signal, and is awaiting orders from the tracing process on how
to handle the signal. Before sending any such orders, the tracing
process exits. This should kill the traced process. But suppose a second
thread handles the SIGKILL and proceeds to exit1(), calling
thread_single(). The first thread will now awaken and will have a chance
to check once more if it should go to sleep due to the SIGSTOP. It must
not sleep after P_SINGLE_EXIT has been set; this would prevent the
SIGKILL from taking effect, leaving a stopped orphan behind after the
tracing process dies.
Also add new tests for this condition.
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: Dell EMC
Differential Revision: https://reviews.freebsd.org/D9890
interpreter exactly matches the one requested by the activated image.
This change applies r295277, which did the same for note branding, to
the old brand selection, with the same reasoning of fixing compat32
interpreter substitution.
PR: 211837
Reported by: kenji@kens.fm
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
elf_load_section.
The values passed currently as vm_offset_t are phdr.p_offset, which
have the native Elf word size. Since elf_load_section interprets them
as the file offset, use vm object offset type.
Noted and reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
stuck spinning at 100% cpu around sbcut_internal(). Inside
sbflush_internal(), sb_ccc reached to about 4GB and before passing it
to sbcut_internal(), we type-cast it from uint to int making it -ve.
The root cause of sockbuf growing this large is unknown. Correct fix
is also not clear but based on mailing list discussions, adding
KASSERTs to panic instead of looping endlessly.
Reviewed by: glebius
Sponsored by: Limelight Networks
header. This will help to correlate console server logs with dump files,
no matter how precise is clock on a console server appliance, and how
buggy the appliance is.
This KPI explicitely indicates the intent of creating the mapping at
the fixed address, and incorporates the map locking into the callee.
Suggested and reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
all the clocks that they provide.
Each clocks are exported under the node 'clock.<clkname>' and have the following
children nodes :
- frequency
- parent (The selected parent, if any)
- parents (The list of parents, if any)
- childrens (The list of childrens, if any)
- enable_cnt (The enabled counter)
This give us the possibility to examine clocks at runtime and make graph of
the clock flow.
Reviewed by: mmel
MFC after: 2 month
Differential Revision: https://reviews.freebsd.org/D9833
We may fail to reset the %CPU tracking window if a thread does not run
for over half of the ticks rollover period, resulting in a bogus %CPU
value for the thread until ticks fully rolls over. Handle this by comparing
the unsigned difference ticks - ts_ltick with SCHED_TICK_TARG instead.
Reviewed by: cem, jeff
MFC after: 1 week
Sponsored by: Dell EMC Isilon
Elf_map_insert() needs to create mapping at the known fixed address.
Usage of vm_map_find() assumes, on the other hand, that any suitable
address space range above or equal the specified hint, is acceptable.
Due to operating on the fresh or cleared address space, vm_map_find()
usually creates mapping starting exactly at hint.
Switch to vm_map_insert() use to clearly request fixed mapping from
the VM.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
vm_map_insert() failure, drop the vnode lock around the call to
vm_object_deallocate().
Since the deallocated object is the vm object of the vnode, we might
get the vnode lock recursion there. In fact, it is almost impossible
to make vm_map_insert() failing there on stock kernel.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Unclear how, but the locking routine for mutexes was using the *release*
barrier instead of acquire. This must have been either a copy-pasto or bad
completion.
Going through other uses of atomics shows no barriers in:
- upgrade routines (addressed in this patch)
- sections protected with turnstile locks - this should be fine as necessary
barriers are in the worst case provided by turnstile unlock
I would like to thank Mark Millard and andreast@ for reporting the problem and
testing previous patches before the issue got identified.
ps.
.-'---`-.
,' `.
| \
| \
\ _ \
,\ _ ,'-,/-)\
( * \ \,' ,' ,'-)
`._,) -',-')
\/ ''/
) / /
/ ,'-'
Hardware provided by: IBM LTC
to stdout in the non-kernel case and to the console+log
in the kernel case. For the kernel case it hooks the
putbuf() machinery underneath printf(9) so that the buffer
is written completely atomically and without a copy into
another temporary buffer. This is useful for fixing
compound console/log messages that become broken and
interleaved when multiple threads are competing for the
console.
Reviewed by: ken, imp
Sponsored by: Netflix
Thread might create a condition for delayed SU cleanup, which creates
a reference to the mount point in td_su, but exit without returning
through userret(), e.g. when terminating due to single-threading or
process exit. In this case, td_su reference is not dropped and mount
point cannot be freed.
Handle the situation by clearing td_su also in the thread destructor
and in exit1(). softdep_ast_cleanup() has to receive the thread as
argument, since e.g. thread destructor is executed in different
context.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
On Core2 and older Intel CPUs, where TSC stops in C2, system does not
allow C2 entrance if timecounter hardware is TSC. This is done by
tc_windup() which tests for TC_FLAGS_C2STOP flag of the new
timecounter and increases cpu_disable_c2_sleep if flag is set. Right
now init_TSC_tc() only sets the flag if cpu_deepest_sleep >= 2, but
TSC is initialized too early for this variable to be set by
acpi_cpu.c.
There is no reason to require that ACPI reported C2 and deeper states
to set TC_FLAGS_C2STOP, so remove cpu_deepest_sleep test from
init_TSC_tc() condition. And since this is the only use of the
variable, remove it at all.
Reported and submitted by: Jia-Shiun Li <jiashiun@gmail.com>
Suggested by: jhb
MFC after: 2 weeks
This function allows the caller to specify the reference clock
and choose between absolute and relative mode. In relative mode,
the remaining time can be returned.
The API is similar to clock_nanosleep(3). Thanks to Ed Schouten
for that suggestion.
While I'm here, reduce the sleep time in the semaphore "child"
test to greatly reduce its runtime. Also add a reasonable timeout.
Reviewed by: ed (userland)
MFC after: 2 weeks
Relnotes: yes
Sponsored by: Dell EMC
Differential Revision: https://reviews.freebsd.org/D9656
data structures.
vt_change_font() calls vtbuf_grow() to change some vt driver data
structures. It uses TF_MUTE to prevent the console from trying to use those
data structures while it changes them.
During the early stage of the boot process, the vt driver's tc_done routine
uses those data structures; however, it is currently called outside the
TF_MUTE check.
Move the tc_done routine inside the locked TF_MUTE check.
PR: 217282
Reviewed by: ed, ray
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D9709
then return EAGAIN. The current code just returns that if the LAST buf
failed.
Reviewed by: kib@, trasz@
Differential Revision: https://reviews.freebsd.org/D9677
Previously, the first lines of various generated files from system call
tables were generated in two sections. Some of the initialization was
done in BEGIN, and the rest was done when the first line was encountered.
The main reason for this split before r313564 was that most of the
initialization done in the second section depended on the $FreeBSD$ tag
extracted from the system call table. Now that the $FreeBSD$ tag is no
longer used, consolidate all of the file initialization in the BEGIN
section.
This change was tested by confirming that the content of generated files
did not change.
When a thread is stopped in ptracestop(), the ptrace(2) user may request
a signal be delivered upon resumption of the thread. Heretofore, those signals
were discarded unless ptracestop()'s caller was issignal(). Fix this by
modifying ptracestop() to queue up signals requested by the ptrace user that
will be delivered when possible. Take special care when the signal is SIGKILL
(usually generated from a PT_KILL request); no new stop events should be
triggered after a PT_KILL.
Add a number of tests for the new functionality. Several tests were authored
by jhb.
PR: 212607
Reviewed by: kib
Approved by: kib (mentor)
MFC after: 2 weeks
Sponsored by: Dell EMC
In collaboration with: jhb
Differential Revision: https://reviews.freebsd.org/D9260
Right now the noexec mount option disallows image activators to try
execve the files on the mount point. Also, after r127187, noexec
also limits max_prot map entries permissions for mappings of files
from such mounts, but not the actual mapping permissions.
As result, the API behaviour is inconsistent. The files from noexec
mount can be mapped with PROT_EXEC, but if mprotect(2) drops execution
permission, it cannot be re-enabled later. Make this consistent
logically and aligned with behaviour of other systems, by disallowing
PROT_EXEC for mmap(2).
Note that this change only ensures aligned results from mmap(2) and
mprotect(2), it does not prevent actual code execution from files
coming from noexec mount. Such files can always be read into
anonymous executable memory and executed from there.
Reported by: shamaz.mazum@gmail.com
PR: 217062
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Since fcmpset can fail without lock contention e.g. on arm, it was possible
to get spurious failures when the caller was expecting the primitive to succeed.
Reported by: mmel
called for all threads belonging to a procedure. Currently the first
thread in a procedure is kept around as an optimisation step and is
never freed. Because the first thread in a procedure is never freed
nor allocated, its destructor and constructor callbacks are never
called which means per thread structures allocated by dtrace and the
Linux emulation layers for example, might be present for threads which
don't need these structures.
This patch adds a thread construction and destruction call for the
first thread in a procedure.
Tested: dtrace, linux emulation
Reviewed by: kib @
MFC after: 1 week
Sponsored by: Mellanox Technologies
Implement get_pcpu() for amd64/sparc64/mips/powerpc, and use it to
replace pcpu_find(curcpu) in MI code.
Reviewed by: andreast, kan, lidl
Tested by: lidl(mips, sparc64), andreast(powerpc)
Differential Revision: https://reviews.freebsd.org/D9587
This denotes changes which went in by accident in r313877.
On most production kernels both said parameters are zeroed and have nothing
reading them in either __mtx_lock_sleep or __mtx_unlock_sleep. Thus this change
stops passing them by internal consumers which this is the case.
Kernel modules use _flags variants which are not affected kbi-wise.
sx primitives use inlines as opposed to macros. Change the tested condition
to LOCK_DEBUG which covers the case, but is slightly overzelaous.
Reported by: kib
It is only needed if the LOCK_PROFILING is enabled. It has to always check if
the lock is about to be released which requires an avoidable read if the option
is not specified..
They all fallback to the slow path if necessary and the check is there.
This means a panicked kernel executing code from modules will be able to
succeed doing actual lock/unlock, but this was already the case for core code
which has said primitives inlined.
Something evidently got mangled in my git tree in between testing and
review, as an old and broken version of the patch was apparently submitted
to svn. Revert this while I work out what went wrong.
Reported by: tuexen
Pointy hat to: rstone
inet_ntoa() cannot be used safely in a multithreaded environment
because it uses a static local buffer. Instead, use inet_ntoa_r()
with a buffer on the caller's stack.
Suggested by: glebius, emaste
Reviewed by: gnn
MFC after: 2 weeks
Sponsored by: Dell EMC
Differential Revision: https://reviews.freebsd.org/D9625
Somehow in the late stages of testing my sched_ule patch, a character was
accidentally deleted from the file. Correct this.
While I'm committing anyway, the previous commit message requires some
clarification: in the normal case of unlending priority after releasing
a mutex, the thread that was doing the lending will be woken up and
immediately become the highest-priority thread, and in that case no
priority inversion would take place. However, if that thread is pinned
to a different CPU, then the currently running thread that just had its
priority lowered will not be preempted and then priority inversion can
occur.
Reported by: O. Hartmann (typo), jhb (scheduler clarification)
MFC after: 1 month
Pointy hat to: rstone
When a high-priority thread is waiting for a mutex held by a
low-priority thread, it temporarily lends its priority to the
low-priority thread to prevent priority inversion. When the mutex
is released, the lent priority is revoked and the low-priority
thread goes back to its original priority.
When the priority of that thread is lowered (through a call to
sched_priority()), the schedule was not checking whether
there is now a high-priority thread in the run queue. This can
cause threads with real-time priority to be starved in the run
queue while the low-priority thread finishes its quantum.
Fix this by explicitly checking whether preemption is necessary
when a thread's priority is lowered.
Sponsored by: Dell EMC Isilon
Obtained from: Sandvine Inc
Differential Revision: https://reviews.freebsd.org/D9518
Reviewed by: Jeff Roberson (ule)
MFC after: 1 month
This effectively provides the same benefit as applying MADV_FREE inline
upon every execve, since the page daemon invokes lowmem handlers prior to
scanning the inactive queue. It also has less overhead; the cost of
applying MADV_FREE is very noticeable on many-CPU systems since it includes
that of a TLB shootdown of global PTEs. For instance, this change nearly
halves the system CPU usage during a buildkernel on a 128-vCPU EC2
instance (with some other patches applied).
Benchmarked by: cperciva (earlier version)
Reviewed by: kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D9586
Since locks are dropped when a thread suspends, it's possible for another
thread to deliver a signal to the suspended thread. If the thread awakens from
suspension without checking for signals, it may go to sleep despite having
a pending signal that should wake it up. Therefore the suspension check is
done first, so any signals sent while suspended will be caught in the
subsequent signal check.
Reviewed by: kib
Approved by: kib (mentor)
MFC after: 2 weeks
Sponsored by: Dell EMC
Differential Revision: https://reviews.freebsd.org/D9530
There could be a race between the vm daemon setting RACCT_RSS based on
the vm space and vmspace_exit (called from exit1) resetting RACCT_RSS to
zero. In that case we can get a zombie process with non-zero RACCT_RSS.
If the process is jailed, that may break accounting for the jail.
There could be other consequences.
Fix this race in the vm daemon by updating RACCT_RSS only when a process
is in the normal state. Also, make accounting a little bit more
accurate by refreshing the page resident count after calling
vm_pageout_map_deactivate_pages().
Finally, add an assert that the RSS is zero when a process is reaped.
PR: 210315
Reviewed by: trasz
Differential Revision: https://reviews.freebsd.org/D9464
Rename kern_vm_* functions to kern_*. Move the prototypes to
syscallsubr.h. Also change Mach VM types to uintptr_t/size_t as
needed, to avoid headers pollution.
Requested by: alc, jhb
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D9535
For regular files and posix shared memory, POSIX requires that
[offset, offset + size) range is legitimate. At the maping time,
check that offset is not negative. Allowing negative offsets might
expose the data that filesystem put into vm_object for internal use,
esp. due to OFF_TO_IDX() signess treatment. Fault handler verifies
that the mapped range is valid, assuming that mmap(2) checked that
arithmetic gives no undefined results.
For device mappings, leave the semantic of negative offsets to the
driver. Correct object page index calculation to not erronously
propagate sign.
In either case, disallow overflow of offset + size.
Update mmap(2) man page to explain the requirement of the range
validity, and behaviour when the range becomes invalid after mapping.
Reported and tested by: royger (previous version)
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
This is both a microoptimization and a move of the consumer to more
commonly used vm function.
Suggested and reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
The main lockmgr routine takes 8 arguments which makes it impossible to
tail-call it by the intermediate vop_stdlock/unlock routines.
The routine itself starts with an if-forest and reads from the lock itself
several times.
This slows things down both single- and multi-threaded. With the patch
single-threaded fstats go 4% up and multithreaded up to ~27%.
Note that there is still a lot of room for improvement.
Reviewed by: kib
Tested by: pho
This information is less useful when the generated files are included in
source control along with the source. If needed it can be reconstructed
from the $FreeBSD$ tag in the generated file. Removing this information
from the generated output permits committing the generated files along
with the change to the system call master list without having inconsistent
metadata in the generated files.
Reviewed by: emaste, kib
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D9497
The file type DTYPE_VNODE can be assigned as a fallback if VOP_OPEN()
did not initialized file type. This is a typical code path used by
normal file systems.
Also, change error returned for inappropriate file type used for
O_EXLOCK to EOPNOTSUPP, as declared in the open(2) man page.
Reported by: cy, dhw, Iblis Lin <iblis@hs.ntnu.edu.tw>
Tested by: dhw
Sponsored by: The FreeBSD Foundation
MFC after: 13 days
If a file opened over a vnode has an advisory lock set at close,
vn_closefile() acquires additional vnode use reference to prevent
freeing the vnode in vn_close(). Side effect is that for device
vnodes, devfs_close() sees that vnode reference count is greater than
one and refuses to call d_close(). Create internal version of
vn_close() which can avoid dropping the vnode reference if needed, and
use this to execute VOP_CLOSE() without acquiring a new reference.
Note that any parallel reference to the vnode would still prevent
d_close call, if the reference is not from an opened file, e.g. due to
stat(2).
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
for files which do not have DTYPE_VNODE type.
Both flock(2) and fcntl(2) syscalls refuse to acquire advisory lock on
a file which type is not DTYPE_VNODE. Do the same when lock is
requested from open(2).
Restructure the block in vn_open_vnode() which handles O_EXLOCK and
O_SHLOCK open flags to make it easier to quit its execution earlier
with an error.
Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Update comments to note these functions are reachable if lockstat is
enabled.
Check if the lock has any bits set before attempting unlock, which saves
an unnecessary atomic operation.
This improves singlethreaded throughput on my test machine from ~247 mln
ops/s to ~328 mln.
It is mostly about avoiding the setup cost of lockstat.
Reviewed by: jhb (previous version)
In the kernel, cache the machine and flags fields from ELF header to use in
the ELF header of a core dump. For gcore, the copy these fields over from
the ELF header in the binary.
This matters for platforms which encode ABI information in the flags field
(such as o32 vs n32 on MIPS).
Reviewed by: kib
Sponsored by: DARPA / AFRL
Differential Revision: https://reviews.freebsd.org/D9392
Previous implementation would use a random factor to spread readers and
reduce chances of starvation. This visibly reduces effectiveness of the
mechanism.
Switch to the more traditional exponential variant. Try to limit starvation
by imposing an upper limit of spins after which spinning is half of what
other threads get. Note the mechanism is turned off by default.
Reviewed by: kib (previous version)
reasons. First is rerooting into USB-mounted device that happens
to be not yet enumerated. The second is when mounting with (non-root)
filesystem on USB device on a hub that's enumerated later than the root
mount: the rc scripts explicitly mount for the root mount holds to be
released, but each USB bus takes the hold asynchronously, and if that
happens after root mount, it would just get ignored.
Reviewed by: marcel
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D9388
for mount hold release if the root device already exists. So, unless your
rootdev is not on USB - ie in the usual case - the root mount won't wait
for USB. However, the old behaviour was sometimes used as "wait until USB
is fully enumerated", and r290196 broke that.
This commit adds vfs.root_mount_always_wait tunable, to force the kernel
to always wait for root mount holds, even if the root is already there.
Reviewed by: kib
MFC after: 2 weeks
Relnotes: yes
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D9387
controller drivers handle either MSI/MSI-X interrupts, or regular
interrupts, as such enforce this in the interrupt handling framework.
If a later driver was to handle both it would need to create one of each.
This will allow future changes to allow the xref space to overlap, but
refer to different drivers.
Obtained from: ABT Systems Ltd
Sponsored by: The FreeBSD Foundation
X-Differential Revision: https://reviews.freebsd.org/D8616
When a relevant lockstat probe is enabled the fallback primitive is called with
a constant signifying a free lock. This works fine for typical cases but breaks
with recursion, since it checks if the passed value is that of the executing
thread.
Read the value if necessary.
See r313275 for details.
One difference here is that recursion handling was removed from the fallback
routine. As it is it was never supposed to see a recursed lock in the first
place. Future changes will move it out of inline variants, but right now
there is no easy to way to test if the lock is recursed without reading
additional words.
and use it in compats instead of their sys_*() counterparts.
Reviewed by: kib, jhb, dchagin
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D9383
Lockstat requires checking if it is enabled and if so, calling a 6 argument
function. Further, determining whether to call it on unlock requires
pre-reading the lock value.
This is problematic in at least 3 ways:
- more branches in the hot path than necessary
- additional cacheline ping pong under contention
- bigger code
Instead, check first if lockstat handling is necessary and if so, just fall
back to regular locking routines. For this purpose a new macro is introduced
(LOCKSTAT_PROFILE_ENABLED).
LOCK_PROFILING uninlines all primitives. Fold in the current inline lock
variant into the _mtx_lock_flags to retain the support. With this change
the inline variants are not used when LOCK_PROFILING is defined and thus
can ignore its existence.
This results in:
text data bss dec hex filename
22259667 1303208 4994976 28557851 1b3c21b kernel.orig
21797315 1303208 4994976 28095499 1acb40b kernel.patched
i.e. about 3% reduction in text size.
A remaining action is to remove spurious arguments for internal kernel
consumers.
Shared locking routines explicitly read the value and test it. If the
change attempt fails, they fall back to a regular function which would
retry in a loop.
The problem is that with many concurrent readers the risk of failure is pretty
high and even the value returned by fcmpset is very likely going to be stale
by the time the loop in the fallback routine is reached.
Uninline said primitives. It gives a throughput increase when doing concurrent
slocks/sunlocks with 80 hardware threads from ~50 mln/s to ~56 mln/s.
Interestingly, rwlock primitives are already not inlined.
The found value is passed to locking routines in order to reduce cacheline
accesses.
mtx_unlock grows an explicit check for regular unlock. On ll/sc architectures
the routine can fail even if the lock could have been handled by the inline
primitive.
Discussed with: jhb
Tested by: pho (previous version)
witness_warn() either breaks into the debugger or panics the system, so its
output should go to the console regardless of the witness(4) output channel
configuration.
MFC after: 1 week
Sponsored by: Dell EMC Isilon
The switch to get_pcpu() in MI code seems to cause hangs on MIPS.
Back out until we can get a better idea of what's happening there.
Reported by: kan, lidl
for EVFILT_READ at the point of the check not when the event is registers.
This fixes a problem with asio when accepting a connection.
Reviewed by: kib@, Scott Mitchell
of their sys_*() counterparts. The svr4 is left unchanged.
Reviewed by: kib@
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D9379
The checks have quadratic complexity over a number of advisory locks
active for a file and that could be a lot. What's the worse is that the
checks are done while holding ls_lock. That could lead to a long a very
long backlog and performance degradation even if all requested locks are
compatible (e.g. all shared locks).
The checks used to be under INVARIANTS.
Discussed with: kib
MFC after: 2 weeks
Sponsored by: Panzura
instead of their sys_*() counterparts in various compats. The svr4
is left untouched, because there's no point.
Reviewed by: ed@, kib@
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D9367
Add additionally safety and overflow checks to clock_ts_to_ct and the
BCD routines while we're here.
Perform a safety check in sys_clock_settime() first to avoid easy local
root panic, without having to propagate an error value back through
dozens of APIs currently lacking error returns.
PR: 211960, 214300
Submitted by: Justin McOmie <justin.mcomie at gmail.com>, kib@
Reported by: Tim Newsham <tim.newsham at nccgroup.trust>
Reviewed by: kib@
Sponsored by: Dell EMC Isilon, FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D9279
Add internal tracking of smp startup status to reliably figure out
what methods are to be used to get gtaskqueue up and running.
e1000:
Calculating this pointer gives undefined behaviour when (last == -1)
(it is before the buffer). The pointer is always followed. Panics
occurred when it points to an unmapped page. Otherwise, the pointed-to
garbage tends to not have the E1000_TXD_STAT_DD bit set in it, so in the
broken case the loop was usually null and the function just returned, and
this was acidentally correct.
Submitted by: bde
Reported by: Matt Macy <mmacy@nextbsd.org>
Add internal tracking of smp startup status to reliably figure out
what methods are to be used to get gtaskqueue up and running.
e1000:
Calculating this pointer gives undefined behaviour when (last == -1)
(it is before the buffer). The pointer is always followed. Panics
occurred when it points to an unmapped page. Otherwise, the pointed-to
garbage tends to not have the E1000_TXD_STAT_DD bit set in it, so in the
broken case the loop was usually null and the function just returned, and
this was acidentally correct.
Submitted by: bde
Reviewed by: Matt Macy <mmacy@nextbsd.org>
critical_exit().
Based on the discussion with: jhb
Reviewed by: imp
Sponsored by: The FreeBSD Foundation
Differential revision: D9276
MFC after: 1 week
Stop testing for LK_RETRY and error multiple times. Also postpone the
VI_DOOMED until after LK_RETRY was seen as it reads from the vnode.
No functional changes.
configtimer().
During normal operation "state->nextcallopt" will always be less than
or equal to "state->nextcall" and checking only "state->nextcallopt"
before calling "callout_process()" is sufficient. However when
"configtimer()" is called a race might happen requiring both of these
binary times to be checked.
Short description of race:
1) A configtimer() call will reset both "state->nextcall" and
"state->nextcallopt" to the same binary time.
2) If a "callout_reset()" call happens between "configtimer()" and the
next "callout_process()" call, "state->nextcallopt" will get updated
and "state->nextcall" will remain at the current time. Refer to logic
inside cpu_new_callout().
3) getnextcpuevent() only respects "state->nextcall" and returns this
value over and over again, even if it is in the past, until "now >=
state->nextcallopt" becomes true. Then these two time variables are
corrected by a "callout_process()" call and the situation goes back to
normal.
The problem manifests itself in different ways. The common factor is
the timer process(es) consume all CPU on one or more CPU cores for a
long time, blocking other kernel processes from getting execution
time. This can be seen by very high interrupt counts as displayed by
"vmstat -i | grep timer" right after boot.
When EARLY_AP_STARTUP was enabled in r310177 the likelyhood of hitting
this bug apparently increased.
Example output from "vmstat -i" before patch:
cpu0:timer 7591 69
cpu9:timer 39031773 358089
cpu4:timer 9359 85
cpu3:timer 9100 83
cpu2:timer 9620 88
Example output from "vmstat -i" after patch:
cpu0:timer 4242 34
cpu6:timer 5531 44
cpu3:timer 6450 52
cpu1:timer 4545 36
cpu9:timer 7153 58
Before the patch cpu9 in the example above, was spinning in a loop in
order to reach 39 million interrupts just a few seconds after
bootup. After the patch the timer interrupt counts are more or less
consistent.
Discussed with: mav @
Reported by: several people
MFC after: 1 week
Sponsored by: Mellanox Technologies
It's possible to get EFAULT when writing a segment backed by a file
if the segment extends beyond the file.
The core dump could still be useful if we skip the rest of the segment
and proceed to other segements.
The skipped segment (or a portion of it) will be zero-filled.
While there, use 'const' to signify that core_write() only reads the
buffer and use __DECONST before calling vn_rdwr_inchunks() because it
can be used for both reading and writing.
Before the change:
kernel: Failed to write core file for process mmap_trunc_core (error 14)
kernel: pid 77718 (mmap_trunc_core), uid 1001: exited on signal 6
After the change:
kernel: Failed to fully fault in a core file segment at VA 0x800645000 with size 0x4000 to be written at offset 0x29000 for process mmap_trunc_core
kernel: pid 4901 (mmap_trunc_core), uid 1001: exited on signal 6 (core dumped)
Reviewed by: julian, kib
Obtained from: Panzura (older version of the change)
MFC after: 5 days
Sponsored by: Panzura
Differential Revision: https://reviews.freebsd.org/D9233
Commit r270423 fixed a regression in sched_yield() that was introduced
in earlier changes. Unfortunately, at the same time it introduced an
new regression. The problem is that SWT_RELINQUISH (6), like all other
SWT_* constants and unlike SW_* flags, is not a bit flag. So, (flags &
SWT_RELINQUISH) is true in cases where that was not really indended,
for example, with SWT_OWEPREEMPT (2) and SWT_REMOTEPREEMPT (11).
A straight forward fix would be to use (flags & SW_TYPE_MASK) ==
SWT_RELINQUISH, but my impression is that the switch types are designed
mostly for gathering statistics, not for influencing scheduling
decisions.
So, I decided that it would be better to check for SW_PREEMPT flag
instead. That's also the same flag that was checked before r239157.
I double-checked how that flag is used and I am confident that the flag
is set only in the places where we really have the preemption:
- critical_exit + td_owepreempt
- sched_preempt in the ULE scheduler
- sched_preempt in the 4BSD scheduler
Reviewed by: kib, mav
MFC after: 4 days
Sponsored by: Panzura
Differential Revision: https://reviews.freebsd.org/D9230
- Add RATELIMIT kernel configuration keyword which must be set to
enable the new functionality.
- Add support for hardware driven, Receive Side Scaling, RSS aware, rate
limited sendqueues and expose the functionality through the already
established SO_MAX_PACING_RATE setsockopt(). The API support rates in
the range from 1 to 4Gbytes/s which are suitable for regular TCP and
UDP streams. The setsockopt(2) manual page has been updated.
- Add rate limit function callback API to "struct ifnet" which supports
the following operations: if_snd_tag_alloc(), if_snd_tag_modify(),
if_snd_tag_query() and if_snd_tag_free().
- Add support to ifconfig to view, set and clear the IFCAP_TXRTLMT
flag, which tells if a network driver supports rate limiting or not.
- This patch also adds support for rate limiting through VLAN and LAGG
intermediate network devices.
- How rate limiting works:
1) The userspace application calls setsockopt() after accepting or
making a new connection to set the rate which is then stored in the
socket structure in the kernel. Later on when packets are transmitted
a check is made in the transmit path for rate changes. A rate change
implies a non-blocking ifp->if_snd_tag_alloc() call will be made to the
destination network interface, which then sets up a custom sendqueue
with the given rate limitation parameter. A "struct m_snd_tag" pointer is
returned which serves as a "snd_tag" hint in the m_pkthdr for the
subsequently transmitted mbufs.
2) When the network driver sees the "m->m_pkthdr.snd_tag" different
from NULL, it will move the packets into a designated rate limited sendqueue
given by the snd_tag pointer. It is up to the individual drivers how the rate
limited traffic will be rate limited.
3) Route changes are detected by the NIC drivers in the ifp->if_transmit()
routine when the ifnet pointer in the incoming snd_tag mismatches the
one of the network interface. The network adapter frees the mbuf and
returns EAGAIN which causes the ip_output() to release and clear the send
tag. Upon next ip_output() a new "snd_tag" will be tried allocated.
4) When the PCB is detached the custom sendqueue will be released by a
non-blocking ifp->if_snd_tag_free() call to the currently bound network
interface.
Reviewed by: wblock (manpages), adrian, gallatin, scottl (network)
Differential Revision: https://reviews.freebsd.org/D3687
Sponsored by: Mellanox Technologies
MFC after: 3 months
Previously "panic: msleep" could happen for a few different reasons.
Break the KASSERTs out into individual cases to identify the failing
condition. Found during the investigation that resulted in r308288.
Reviewed by: kib, jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D8604
sources to return timestamps when SO_TIMESTAMP is enabled. Two additional
clock sources are:
o nanosecond resolution realtime clock (equivalent of CLOCK_REALTIME);
o nanosecond resolution monotonic clock (equivalent of CLOCK_MONOTONIC).
In addition to this, this option provides unified interface to get bintime
(equivalent of using SO_BINTIME), except it also supported with IPv6 where
SO_BINTIME has never been supported. The long term plan is to depreciate
SO_BINTIME and move everything to using SO_TS_CLOCK.
Idea for this enhancement has been briefly discussed on the Net session
during dev summit in Ottawa last June and the general input was positive.
This change is believed to benefit network benchmarks/profiling as well
as other scenarios where precise time of arrival measurement is necessary.
There are two regression test cases as part of this commit: one extends unix
domain test code (unix_cmsg) to test new SCM_XXX types and another one
implementis totally new test case which exchanges UDP packets between two
processes using both conventional methods (i.e. calling clock_gettime(2)
before recv(2) and after send(2)), as well as using setsockopt()+recv() in
receive path. The resulting delays are checked for sanity for all supported
clock types.
Reviewed by: adrian, gnn
Differential Revision: https://reviews.freebsd.org/D9171
gtaskqueue bits at SI_SUB_INIT_IF instead of waiting until SI_SUB_SMP
which is far too late.
Add an assertion in taskqgroup_attach() to catch startup initialization
failures in the future.
Reported by: kib bde
Replace archaic "busses" with modern form "buses."
Intentionally excluded:
* Old/random drivers I didn't recognize
* Old hardware in general
* Use of "busses" in code as identifiers
No functional change.
http://grammarist.com/spelling/buses-busses/
PR: 216099
Reported by: bltsrc at mail.ru
Sponsored by: Dell EMC Isilon