The manipulations done by mountcheckdirs() are not that useful during
the unmount, they can bring about unexpected security consequences.
Thic change effectively reverts the change in r73241.
The change also allows to simplify the handling of rootvnode global
variable.
Discussed with: mckusick, mjg, kib
Reviewed by: trasz
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D12366
posix_fallocate is logically equivalent to writing zero blocks to the
desired file size and there is no reason to prevent calling it in
capability mode. posix_fallocate already checked for the CAP_WRITE
right, so we merely need to list it in capabilities.conf.
Reviewed by: allanjude
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D12640
Previously before you could call unrhdr_delete you needed to
individually free every allocated unit. It is useful to be able to tear
down the unr without having to go through this process, as it is
significantly faster than freeing the individual units.
Reviewed by: cem, lidl
Approved by: rstone (mentor)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D12591
all cache the last system time (uptime + boottime). Only the format
differs. Do not re-calculate the bintime and simply use the value
used to calculate the microtime and nanotime.
Group all the updates under the relevant comment. Remove obsoleted
XXX part.
Submitted by: Sebastian Huber <sebastian.huber@embedded-brains.de>
MFC after: 1 week
Sendfile() should match the error checking order of send() which
is currently:
SBS_CANTSENDMORE
so_error
SS_ISCONNECTED
Submitted by: Jason Eggleston <jason@eggnet.com>
Reviewed by: glebius
MFC after: 2 weeks
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12633
o Fall back to default m_ext free mech, using function pointer in
m_ext_free, and remove sf_ext_free() called directly from mbuf code.
Testing on modern CPUs showed no regression.
o Provide internally used flag EXT_FLAG_SYNC, to mark that I/O uses
SF_SYNC flag. Lack of the flag allows us not to dereference
ext_arg2, saving from a cache line miss.
o Create function sendfile_free_page() that later will be used, for
multi-page mbufs. For now compiler will inline it into
sendfile_free_mext().
In collaboration with: gallatin
Differential Revision: https://reviews.freebsd.org/D12615
in mb_free_ext() always use fields from the original refcount holding
mbuf (see. r296242) mbuf. Cuts another cache miss from mb_free_ext().
However, treat EXT_EXTREF mbufs differently, since they are different -
they don't have a refcount holding mbuf.
Provide longer comments in m_ext declaration to explain this change
and change from r296242.
In collaboration with: gallatin
Differential Revision: https://reviews.freebsd.org/D12615
All of these arguments are stored in m_ext, so there is no reason
to pass them in the argument list. Not all functions need the second
argument, some don't even need the first one. The second argument
lives in next cache line, so not dereferencing it is a performance
gain. This was discovered in sendfile(2), which will be covered by
next commits.
The second goal of this commit is to bring even more flexibility
to m_ext mbufs, allowing to create more fields in m_ext, opaque to
the generic mbuf code, and potentially set and dereferenced by
subsystems.
Reviewed by: gallatin, kbowling
Differential Revision: https://reviews.freebsd.org/D12615
properly dump all the sendqueues and not just the first one
History:
It appears that in the commit which introduced the code,
r165272, the array indexes of "sq_blocked[0]" and "td_name[i]"
were interchanged. In r180927 "td_name[i]" was corrected to
"td_name[0]", but "sq_blocked[0]" was left unchanged.
PR: 222624
Discussed with: kmacy @
MFC after: 1 week
Sponsored by: Mellanox Technologies
appearing only where the code explicitly set it, but since much of the
data was not initialized, '-1' appeared other places too, and led to
panics. Clear the allocated data before initializing nonzero values by
allocating with M_ZERO.
Submitted by: Doug Moore <dougm@rice.edu>
Reported by: Oleg V. Nauman <oleg@theweb.org.ua>, cy
Tested by: Oleg V. Nauman <oleg@theweb.org.ua>
MFC after: 1 week
X-MFC with: r324420
Differential Revision: https://reviews.freebsd.org/D12627
nodes to allocate for the blist, and to initialize them. The computation
can be done much more quickly by identifying the terminating node, if any,
at every level of the tree and then summing the number of nodes at each
level that precedes the topmost terminator. The initialization can also be
done quickly, since settings at the root mark the tree as all-allocated, and
only a few terminator nodes need to be marked in the rest of the tree.
Eliminate blst_radix_init, and perform its two functions more simply in
blist_create.
The allocation of the blist takes places in two pieces, but there's no good
reason to do so, when a single allocation is sufficient, and simpler.
Allocate the blist struct, and the array of nodes associated with it, with a
single allocation.
Submitted by: Doug Moore <dougm@rice.edu>
Reviewed by: markj (an earlier version)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D11968
The detach case is slightly complicated by the fact that some in-kernel
consumers may want to know before a device detaches (so they can release
related resources, stop using the device, etc), but the detach can fail. So
there are pre- and post-detach notifications for those consumers who need to
handle all cases.
A couple salient comments from the review, they amount to some helpful
documentation about these events, but there's currently no good place for
such documentation...
Note that in the current newbus locking model, DETACH_BEGIN and
DETACH_COMPLETE/FAILED sequence of event handler invocation might interweave
with other attach/detach events arbitrarily. The handlers should be prepared
for such situations.
Also should note that detach may be called after the parent bus knows the
hardware has left the building. In-kernel consumers have to be prepared to
cope with this race.
Differential Revision: https://reviews.freebsd.org/D12557
When the EVENTHANDLER(9) subsystem was created, it was a documented feature
that an eventhandler callback function could safely deregister itself. In
r200652 that feature was inadvertantly broken by adding drain-wait logic to
eventhandler_deregister(), so that it would be safe to unload a module upon
return from deregistering its event handlers.
There are now 145 callers of EVENTHANDLER_DEREGISTER(), and it's likely many
of them are depending on the drain-wait logic that has been in place for 8
years. So instead of creating a separate eventhandler_drain() and adding it
to some or all of those 145 call sites, this creates a separate
eventhandler_drain_nowait() function for the specific purpose of
deregistering a callback from within the running callback.
Differential Revision: https://reviews.freebsd.org/D12561
connection was reset by the remote end, sendfile() would just report
ENOTCONN instead of ECONNRESET.
Submitted by: Jason Eggleston <jason@eggnet.com>
Reviewed by: glebius
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12575
Lookups of the sort are rare compared to regular ones and succesfull ones
result in removing entries from the cache.
In the current code buckets are rlocked and a trylock dance is performed,
which can fail and cause a restart. Fixing it will require a little bit
of surgery and in order to keep the code maintaineable the 2 cases have
to split.
MFC after: 1 week
When using the default address list nam is still valid, the code in
r324054 assumed that is was NULL.
Reported by: Guy Yur <guyyur@gmail.com>
Tested by: Guy Yur <guyyur@gmail.com>
Previous code would always spin once before checking the lock. But a lock
with e.g. 6 readers is not going to become free in the duration of once spin
even if they start draining immediately.
Conservatively perform one for each reader.
Note that the total number of allowed spins is still extremely small and is
subject to change later.
MFC after: 1 week
Improved logging added in r323879 exposed an error during
attach. We need the irq, not the rid to work correctly. em uses
shared irqs, so it will use the same irq for TX as RX. bnxt does
not use shared irqs, or TX irqs at all, so there's no need to set
the TX irq affinity.
Reviewed by: sbruno
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12496
spin first instant of instantly re-readoing and don't re-read after
spinning is finished - the state is already known.
Note the code is subject to significant changes later.
MFC after: 1 week
A sysctl can have a custom handler that may access data that is initialized
via SYSINIT(9) or via a module event handler (also invoked via SYSINIT).
Thus, it is not safe to allow access to the module's sysctl-s until
the initialization is performed. Likewise, we should not allow access
to teh sysctl-s after the module is uninitialized.
The latter is easy to achieve by properly ordering linker_file_unregister_sysctls
and linker_file_sysuninit.
The former is not as easy for two reasons:
- the initialization may depend on tunables which get set when sysctl-s are
registered, so we need to set the tunables before running sysinit-s
- the initialization may try to dynamically add more sysctl-s under statically
defined sysctl nodes
So, this change splits the sysctl setup into two phases. In the first phase
the sysctl-s are registered as before but they are disabled and hidden from
consumers. In the second phase, done after sysinit-s, normal access to the
sysctl-s is enabled.
The change should affect only dynamic module loading and unloading after
the system boot-up. Nothing changes for sysctl-s compiled into the kernel
and sysctl-s in preloaded modules.
Discussed with: hselasky, ian, jhb
Reviewed by: julian, kib
MFC after: 2 weeks
Sponsored by: Panzura
Differential Revision: https://reviews.freebsd.org/D12545
Violators may define _WANT_SOCKET and _WANT_UNPCB respectively and
are not guaranteed for stability of the structures. The violators
list is the the usual one: libprocstat(3) and netstat(1) internally
and lsof in ports.
In struct xunpcb remove the inclusion of kernel structure and add
a bunch of spare fields. The xsocket already has socket not included,
but add there spares as well. Embed xsockbuf into xsocket.
Sort declarations in sys/socketvar.h to separate kernel only from
userland available ones.
PR: 221820 (exp-run)
Previously, uiomove_object_page() would maintain LRU by requeuing the
accessed page. This involves acquiring one of the heavily contended page
queue locks. Moreover, it is unnecessarily expensive for pages in the
active queue.
As of r254304 the page daemon continually performs a slow scan of the
active queue, with the effect that unreferenced pages are gradually
moved to the inactive queue, from which they can be reclaimed. Prior to
that revision, the active queue was scanned only during shortages of
free and inactive pages, meaning that unreferenced pages could get
"stuck" in the queue. Thus, tmpfs was required to use the inactive queue
and requeue pages in order to maintain LRU. Now that this is no longer
the case, tmpfs I/O operations can use the active queue and avoid the
page queue locks in most cases, instead setting PGA_REFERENCED on
referenced pages to provide pseudo-LRU.
Reviewed by: alc (previous version)
MFC after: 2 weeks
some still useful bits of the reverted revision.
The problem with the committed fix is that there are still issues with
returning from NMI, when NMI interrupted kernel in a moment where the
kernel segments selectors were still not loaded into registers. If
this happens, the NMI return would loose the userspace selectors
because r323722 does not reload segment registers on return to kernel
mode.
Fixing the problem is complicated. Since an alternative approach to
handle the original bug exists, it makes sence to stop adding more
complexity.
Discussed with: bde
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
This fixes kernel crashes due to misaligned accesses to the 64-bit
time_t embedded in struct namecache_ts in MIPS n32 kernels.
MFC after: 1 week
Sponsored by: DARPA / AFRL
If the filesystem is not exported directly return NULL.
If no address is given and filesystem is exported using some default
one return it directly, if it doesn't have a default one directly
return NULL.
Reviewed by: kib, bapt
MFC after: 1 week
Sponsored by: Gandi.net
Differential Revision: https://reviews.freebsd.org/D12505
Said checks were inherently racy anyway as jokers could unmap target areas
before the handler got around to accessing them.
This saves time by avoiding locking the address space.
MFC after: 1 week
tid must be equal to curthread and the target routine was already reading
it anyway, which is not a problem. Not passing it as a parameter allows for
a little bit shorter code in callers.
MFC after: 1 week
Add a DDB command that mirrors sysctl debug.witness.badstacks.
Reapply r323935 after fixing trivial deficiency. I forgot to compile with
WITNESS enabled. Thanks emaste@ for fixing the build while I was asleep.
Reported by: rstone
Reviewed by: rstone (previous version)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D12468
Previously, it was just a pointer which was copied, but
some callers pass in a stack variable which will go out of scope.
Add GROUPTASK_NAMELEN macro (32) and snprintf() the name into it,
using "grouptask" if name is NULL. We can now safely include
gtask->gt_name in console messages.
Reviewed by: sbruno
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12449
check hash to cylinder groups. If a check hash fails when a cylinder
group is read, no further allocations are attempted in that cylinder
group until it has been fixed by fsck. This avoids a class of
filesystem panics related to corrupted cylinder group maps. The
hash is done using crc32c.
Check hases are added only to UFS2 and not to UFS1 as UFS1 is primarily
used in embedded systems with small memories and low-powered processors
which need as light-weight a filesystem as possible.
Specifics of the changes:
sys/sys/buf.h:
Add BX_FSPRIV to reserve a set of eight b_xflags that may be used
by individual filesystems for their own purpose. Their specific
definitions are found in the header files for each filesystem
that uses them. Also add fields to struct buf as noted below.
sys/kern/vfs_bio.c:
It is only necessary to compute a check hash for a cylinder
group when it is actually read from disk. When calling bread,
you do not know whether the buffer was found in the cache or
read. So a new flag (GB_CKHASH) and a pointer to a function to
perform the hash has been added to breadn_flags to say that the
function should be called to calculate a hash if the data has
been read. The check hash is placed in b_ckhash and the B_CKHASH
flag is set to indicate that a read was done and a check hash
calculated. Though a rather elaborate mechanism, it should
also work for check hashing other metadata in the future. A
kernel internal API change was to change breada into a static
fucntion and add flags and a function pointer to a check-hash
function.
sys/ufs/ffs/fs.h:
Add flags for types of check hashes; stored in a new word in the
superblock. Define corresponding BX_ flags for the different types
of check hashes. Add a check hash word in the cylinder group.
sys/ufs/ffs/ffs_alloc.c:
In ffs_getcg do the dance with breadn_flags to get a check hash and
if one is provided, check it.
sys/ufs/ffs/ffs_vfsops.c:
Copy across the BX_FFSTYPES flags in background writes.
Update the check hash when writing out buffers that need them.
sys/ufs/ffs/ffs_snapshot.c:
Recompute check hash when updating snapshot cylinder groups.
sys/libkern/crc32.c:
lib/libufs/Makefile:
lib/libufs/libufs.h:
lib/libufs/cgroup.c:
Include libkern/crc32.c in libufs and use it to compute check
hashes when updating cylinder groups.
Four utilities are affected:
sbin/newfs/mkfs.c:
Add the check hashes when building the cylinder groups.
sbin/fsck_ffs/fsck.h:
sbin/fsck_ffs/fsutil.c:
Verify and update check hashes when checking and writing cylinder groups.
sbin/fsck_ffs/pass5.c:
Offer to add check hashes to existing filesystems.
Precompute check hashes when rebuilding cylinder group
(although this will be done when it is written in fsutil.c
it is necessary to do it early before comparing with the old
cylinder group)
sbin/dumpfs/dumpfs.c
Print out the new check hash flag(s)
sbin/fsdb/Makefile:
Needs to add libufs now used by pass5.c imported from fsck_ffs.
Reviewed by: kib
Tested by: Peter Holm (pho)
It doesn't appear to be safe to use gtask->gt_name.
Reported by: Mark Johnston, Jenkins
Reviewed by: sbruno
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12448
Move handling of these three pathconf() variables out of vop_stdpathconf()
and into devfs_pathconf() as TTY devices can only be devfs files. In
addition, only return settings for these three variables for devfs devices
whose device switch has the D_TTY flag set.
Discussed with: bde, kib
Sponsored by: Chelsio Communications
Check the return code of intr_setaffinity() and log any errors
it returns. When a qid is not located, log an error before returning
failure. Also, use __func__ rather than hardcoding the function name
Reviewed by: sbruno
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12436
Previously had the same short and long description as taskqueues.
This could cause problems with memguard(9) and vmstat -m which use
the short description as a unique identifier.
Reviewed by: sbruno
Approved by: sbruno (mentor)
MFC after: 3 days
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12438
If vrele() changes the hold count to zero, it needs to acquire the
vnode lock.
Sponsored by: The FreeBSD Foundation
Discussed with: avg
X-MFC with: r323578
One consequence of the patch is that msyncing unlinked file mappings
no longer reduces the amount of the dirty memory in the system, but I
do not think that there are users of msync(2) that utilize it for such
side-effect.
Reported and tested by: tjil
PR: 222356
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D12411
Suppose that userspace is executing with the non-standard segment
descriptors. Then, until exception or interrupt handler executed
SET_KERNEL_SEGS, kernel is still executing with user %ds, %es and %fs.
If an interrupt occurs in this window, the interrupt handler is
executed unsafely, relying on usability of the usermode registers. If
the interrupt results in the context switch on return, the
contamination of the kernel state spreads to the thread we switched
to. As result, kernel data accesses might fault or, if only the base
is changed, completely messed up.
More, if the user segment was allocated in LDT, another thread might
mark the descriptor as invalid before doreti code tried to reload
them. In this case kernel panics.
The issue exists for all exception entry points which use trap gate,
and thus do not automatically disable interrupts on entry, and for
lcall_handler.
Fix is two-fold: first, we need to disable interrupts for all kernel
entries, changing the IDT descriptor types from trap gate to interrupt
gate. Interrupts are re-enabled not earlier than the kernel segments
are loaded into the segment registers. Second, we only load the
segment registers from the trap frame when returning to usermode. For
the later, all interrupt return paths must happen through the doreti
common code.
There is no way to disable interrupts on call gate, which is the
supposed mode of servicing for lcall $7,$0 syscalls. Change the LDT
descriptor 0 into a code segment type and point it to the userspace
trampoline which redirects the syscall to int $0x80.
All the measures make the segment register handling similar to that of
amd64. We do not apply amd64 optimizations of not reloading segment
registers on return from the syscall.
Reported by: Maxime Villard <max@m00nbsd.net>
Tested by: pho (the non-lcall part)
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D12402
Modify blst_leaf_alloc to find allocations that cross the boundary between
one leaf node and the next when those two leaves descend from the same
meta node.
Update the hint field for leaves so that it represents a bound on how
large an allocation can begin in that leaf, where it currently represents
a bound on how large an allocation can be found within the boundaries of
the leaf.
The first phase of blst_leaf_alloc currently shrinks sequences of
consecutive 1-bits in mask until each has been shrunken by count-1 bits,
so that any bits remaining show where an allocation can begin, or until
all the bits have disappeared, in which case the allocation fails. This
change amends that so that the high-order bit is copied, as if, when the
last block was free, it was followed by an endless stream of free
blocks. It also amends the early stopping condition, so that the shrinking
of 1-sequences stops early when there are none, or there is only one
unbounded one remaining.
The search for the first set bit is unchanged, and the code path
thereafter is mostly unchanged unless the first set bit is in a position
that makes some of those copied sign bits matter. In that case, we look
for a next leaf, and at what blocks it can provide, to see if a
cross-boundary allocation is possible.
The hint is updated on a successful allocation that clears the last bit,
but it not updated on a failed allocation that leaves the last bit
set. So, as long as the last block is free, the hint value for the leaf is
large. As long as the last block is free, and there's a next leaf, a large
allocation can begin here, perhaps. A stricter rule than this would mean
that allocations and frees in one leaf could require hint updates to the
preceding leaf, and this change seeks to leave the freeing code
unmodified.
Define BLIST_BMAP_MASK, and use it for bit masking in blst_leaf_free and
blist_leaf_fill, as well as in blst_leaf_alloc.
Correct a panic message in blst_leaf_free.
Submitted by: Doug Moore <dougm@rice.edu>
Reviewed by: markj (an earlier version)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D11819
This was really too big of a commit even if everything worked, but there
are multiple new issues introduced in the one huge commit, so it's not
worth keeping this until it's fixed.
I'll work on splitting this up into logical chunks and introduce them one
at a time over the next week or two.
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks
When a newborn socket moves from incomplete queue to complete
one, we need to obtain the listening socket lock after the child,
which is a wrong order. The old code did that in potentially
endless loop of mtx_trylock(). The new one does only one attempt
of mtx_trylock(), and in case of failure references listening
socket, unlocks child and locks everything in right order. In
case if listening socket shuts down during that, just bail out.
Reported & tested by: Jason Eggleston <jeggleston llnw.com>
Reported & tested by: Jason Wolfe <jason llnw.com>