struct thread.
For all architectures, the syscall trap handlers have to allocate the
structure on the stack. The structure takes 88 bytes on 64bit arches
which is not negligible. Also, it cannot be easily found by other
code, which e.g. caused duplication of some members of the structure
to struct thread already. The change removes td_dbg_sc_code and
td_dbg_sc_nargs which were directly copied from syscall_args.
The structure is put into the copied on fork part of the struct thread
to make the syscall arguments information correct in the child after
fork.
This move will also allow several more uses shortly.
Reviewed by: jhb (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
X-Differential revision: https://reviews.freebsd.org/D11080
what this field represented was also inaccurate.) Suggested by: kib
In r178792, blist_create() grew a malloc flag, allowing M_NOWAIT to be
specified. However, blist_create() was not modified to handle the
possibility that a malloc() call failed. Address this omission.
Increase the width of the local variable "radix" to 64 bits. (This
matches the width of the corresponding field in struct blist.)
Reviewed by: kib
MFC after: 6 weeks
quantity as the size of the range to fill, but returns a 32-bit quantity
as the number of blocks that were allocated to fill that range. This
revision corrects that mismatch. Currently, swaponsomething() limits
the size of a swap area to prevent arithmetic arithmetic overflow in
other parts of the blist allocator. That limit has also prevented this
type mismatch from causing problems.
Reviewed by: kib, markj
MFC after: 6 weeks
Differential Revision: https://reviews.freebsd.org/D11096
Provide a new mode "2" which returns a special overflow indicator in
the non-representable field instead of the silent truncation (mode
"0") or EOVERFLOW (mode "1").
In particular, the typical use of st_ino to detect hard links with
mode "2" reports false positives, which might be more suitable for
some uses.
Discussed with: bde
Sponsored by: The FreeBSD Foundation
o Separate fields of struct socket that belong to listening from
fields that belong to normal dataflow, and unionize them. This
shrinks the structure a bit.
- Take out selinfo's from the socket buffers into the socket. The
first reason is to support braindamaged scenario when a socket is
added to kevent(2) and then listen(2) is cast on it. The second
reason is that there is future plan to make socket buffers pluggable,
so that for a dataflow socket a socket buffer can be changed, and
in this case we also want to keep same selinfos through the lifetime
of a socket.
- Remove struct struct so_accf. Since now listening stuff no longer
affects struct socket size, just move its fields into listening part
of the union.
- Provide sol_upcall field and enforce that so_upcall_set() may be called
only on a dataflow socket, which has buffers, and for listening sockets
provide solisten_upcall_set().
o Remove ACCEPT_LOCK() global.
- Add a mutex to socket, to be used instead of socket buffer lock to lock
fields of struct socket that don't belong to a socket buffer.
- Allow to acquire two socket locks, but the first one must belong to a
listening socket.
- Make soref()/sorele() to use atomic(9). This allows in some situations
to do soref() without owning socket lock. There is place for improvement
here, it is possible to make sorele() also to lock optionally.
- Most protocols aren't touched by this change, except UNIX local sockets.
See below for more information.
o Reduce copy-and-paste in kernel modules that accept connections from
listening sockets: provide function solisten_dequeue(), and use it in
the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4),
infiniband, rpc.
o UNIX local sockets.
- Removal of ACCEPT_LOCK() global uncovered several races in the UNIX
local sockets. Most races exist around spawning a new socket, when we
are connecting to a local listening socket. To cover them, we need to
hold locks on both PCBs when spawning a third one. This means holding
them across sonewconn(). This creates a LOR between pcb locks and
unp_list_lock.
- To fix the new LOR, abandon the global unp_list_lock in favor of global
unp_link_lock. Indeed, separating these two locks didn't provide us any
extra parralelism in the UNIX sockets.
- Now call into uipc_attach() may happen with unp_link_lock hold if, we
are accepting, or without unp_link_lock in case if we are just creating
a socket.
- Another problem in UNIX sockets is that uipc_close() basicly did nothing
for a listening socket. The vnode remained opened for connections. This
is fixed by removing vnode in uipc_close(). Maybe the right way would be
to do it for all sockets (not only listening), simply move the vnode
teardown from uipc_detach() to uipc_close()?
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D9770
subtree is already zero, then setting the "largest contiguous free block"
hint for that subtree to anything other than zero makes no sense. To be
clear, assigning a value to the hint that is too large is not a correctness
problem, only a pessimization.
Dragonfly BSD has applied the same change to blst_meta_alloc() but not
blst_meta_fill().
MFC after: 6 weeks
This happens when closing a socket with upcall, and trace is: soclose()->
... protocol ... -> soisdisconnected() -> socantrcvmore_locked() ->
sowakeup() -> soisconnected().
Right now this case is innocent for two reasons. First, soisconnected()
doesn't clear SS_ISDISCONNECTED flag. Second, the mutex to lock the
socket is the socket receive buffer mutex, and sodisconnected() first
disables the receive buffer. But in future code, the mutex to lock
socket is different to buffer mutex, and we would get undesired mutex
recursion.
The fix is to check SS_ISDISCONNECTED flag before calling upcall.
testing purposes. However, over the years, various changes to the kernel
have broken this feature. This revision applies some fixes to get user-
space compilation working again. There are no changes in this revision
to code that is used by the kernel.
MFC after: 3 days
pager used a different scheme for striping the allocation of swap space
across multiple devices. And, although blist_fill() was intended to support
fill operations with large counts, the old striping scheme never performed a
fill larger than the stripe size. Consequently, the misplacement of a
sanity check in blst_meta_fill() went undetected. Now, moving forward in
time to r118390, a new scheme for striping was introduced that maintained a
blist allocator per device, but as noted in r318995, swapoff_one() was not
fully and correctly converted to the new scheme. This change completes what
was started in r318995 by fixing the underlying bug in blst_meta_fill() that
stops swapoff_one() from simply performing a single blist_fill() operation.
Reviewed by: kib
MFC after: 5 days
Differential Revision: https://reviews.freebsd.org/D11043
You may now optionally specify allow.noreserved_ports to prevent root
inside a jail from using privileged ports (less than 1024)
PR: 217728
Submitted by: Matt Miller <mattm916@pulsar.neomailbox.ch>
Reviewed by: jamie, cem, smh
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D10202
short, half of the memory that is allocated to implement the radix tree is
wasted because we did not change "u_daddr_t" to be a 64-bit unsigned int
when we changed "daddr_t" to be a 64-bit (signed) int. (See r96849 and
r96851.)
Reviewed by: kib, markj
Tested by: pho
MFC after: 5 days
Differential Revision: https://reviews.freebsd.org/D11028
inode number or link count for the ABI compat binaries.
Right now, and by default after the change, too large 64bit values are
silently truncated to 32 bits. Enabling the knob causes the system to
return EOVERFLOW for stat(2) family of compat syscalls when some
values cannot be completely represented by the old structures. For
getdirentries(2), knob skips the dirents which would cause non-trivial
truncation of d_ino.
EOVERFLOW error is specified by the X/Open 1996 LFS document
('Adding Support for Arbitrary File Sizes to the Single UNIX
Specification').
Based on the discussion with: bde
Sponsored by: The FreeBSD Foundation
Since ino64 expanded dev_t to 64bit, make VOP_GETATTR(9) provide all
bits of mnt_stat.f_fsid as va_fsid for vnodes on filesystems which use
f_fsid. In particular, NFSv3 and sometimes NFSv4, and ZFS use this
method or reporting st_dev by stat(2).
Provide a new helper vn_fsid() to avoid duplicating code to copy
f_fsid to va_fsid.
Note that the change is mostly cosmetic. Its motivation is to avoid
sign-extension of f_fsid[0] into 64bit dev_t value which happens after
dev_t becomes 64bit..
Reviewed by: avg(zfs), rmacklem (nfs) (both for previous version)
Sponsored by: The FreeBSD Foundation
bhyve was recently sandboxed with capsicum, and needs to be able to
control the CPU sets of its vcpu threads
Reviewed by: emaste, oshogbo, rwatson
MFC after: 2 weeks
Sponsored by: ScaleEngine Inc.
Differential Revision: https://reviews.freebsd.org/D10170
Extend the ino_t, dev_t, nlink_t types to 64-bit ints. Modify
struct dirent layout to add d_off, increase the size of d_fileno
to 64-bits, increase the size of d_namlen to 16-bits, and change
the required alignment. Increase struct statfs f_mntfromname[] and
f_mntonname[] array length MNAMELEN to 1024.
ABI breakage is mitigated by providing compatibility using versioned
symbols, ingenious use of the existing padding in structures, and
by employing other tricks. Unfortunately, not everything can be
fixed, especially outside the base system. For instance, third-party
APIs which pass struct stat around are broken in backward and
forward incompatible ways.
Kinfo sysctl MIBs ABI is changed in backward-compatible way, but
there is no general mechanism to handle other sysctl MIBS which
return structures where the layout has changed. It was considered
that the breakage is either in the management interfaces, where we
usually allow ABI slip, or is not important.
Struct xvnode changed layout, no compat shims are provided.
For struct xtty, dev_t tty device member was reduced to uint32_t.
It was decided that keeping ABI compat in this case is more useful
than reporting 64-bit dev_t, for the sake of pstat.
Update note: strictly follow the instructions in UPDATING. Build
and install the new kernel with COMPAT_FREEBSD11 option enabled,
then reboot, and only then install new world.
Credits: The 64-bit inode project, also known as ino64, started life
many years ago as a project by Gleb Kurtsou (gleb). Kirk McKusick
(mckusick) then picked up and updated the patch, and acted as a
flag-waver. Feedback, suggestions, and discussions were carried
by Ed Maste (emaste), John Baldwin (jhb), Jilles Tjoelker (jilles),
and Rick Macklem (rmacklem). Kris Moore (kris) performed an initial
ports investigation followed by an exp-run by Antoine Brodin (antoine).
Essential and all-embracing testing was done by Peter Holm (pho).
The heavy lifting of coordinating all these efforts and bringing the
project to completion were done by Konstantin Belousov (kib).
Sponsored by: The FreeBSD Foundation (emaste, kib)
Differential revision: https://reviews.freebsd.org/D10439
Previously open(2) was allowed in capability mode, with a comment that
suggested this was likely the case to facilitate debugging. The system
call would still fail later on, but it's better to disallow the syscall
altogether.
We now have the kern.trap_enotcap sysctl or PROC_TRAPCAP_CTL proccontrol
to aid in debugging.
In any case libc has translated open() to the openat syscall since
r277032.
Reviewed by: kib, rwatson
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D10850
A long long time ago the register keyword told the compiler to store
the corresponding variable in a CPU register, but it is not relevant
for any compiler used in the FreeBSD world today.
ANSIfy related prototypes while here.
Reviewed by: cem, jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D10193
This function permits a range of one scatter/gather list to be appended to
another sglist. This can be used to construct a scatter/gather list that
reorders or duplicates ranges from one or more existing scatter/gather
lists.
Sponsored by: Chelsio Communications
Previously, when the VI_TRYLOCK failed, we would spin under the mutex
that protects the vnode active list until we either succeeded or
noticed that we had hogged the CPU. Since we were violating the lock
order, this would guarantee that we would become a hog under any
deadlock condition (e.g. a race with vdrop(9) on the same vnode). In
the presence of many concurrent threads in sync(2) or vdrop etc, the
victim could hang for a long time.
Now, avoid spinning by dropping and reacquiring the locks in the
conventional lock order when the trylock fails. This requires a dance
with the vnode hold count.
Submitted by: Tom Rix <trix@juniper.net>
Tested by: pho
Differential revision: https://reviews.freebsd.org/D10692
is blocked. The spurious wakeup might result in spurious EINTR.
The reschedule_signals() function is called when the calling thread
has the signal mask changed. For each newly blocked signal, we try to
find a thread which might have the signal not blocked. If no such
thread exists, sigtd() returns random thread, which must not be waken
up. I decided that re-checking, as suggested by PR submitter, is more
reasonable change than to change sigtd() interface, due to other uses
of sigtd(). signotify() already performs this check.
Submitted by: Duane <parakleta@darkreality.org>
PR: 219228
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
When a thread enters ptracestop(), for example because it had received
SIGSTOP from ptrace(PT_ATTACH), it attempts to suspend other threads in
the same process. In the case of a thread sleeping interruptibly in an
SBDRY section, sig_suspend_threads() must wake the thread and allow it to
reach the user-mode boundary. However, sig_suspend_threads() would
erroneously avoid waking up such threads, resulting in an apparent hang.
Reviewed by: kib
Tested by: pho
MFC after: 2 weeks
Sponsored by: Dell EMC Isilon
the code auto-generated for *.m - kobj_lookup_method(9) is useful;
for example in back-ends or base class device drivers in order to
determine whether a default method has been overridden. Thus, allow
for the kobj_method_t pointer argument - used by KOBJOPLOOKUP in
order to update the cache entry - of kobj_lookup_method(9), to be
NULL. Actually, that pointer is redundant as it's just set to the
same kobj_method_t that the kobj_lookup_method(9) function returns
in the first place, but probably it serves to reduce the number of
instructions generated for KOBJOPLOOKUP.
- For the same reason, move updating kobj_lookup_{hits,misses} (if
KOBJ_STATS is defined) from kobj_lookup_method(9) to KOBJOPLOOKUP.
As a side-effect, this gets rid of the convoluted approach of always
incrementing kobj_lookup_hits in KOBJOPLOOKUP and then in case of
a cache miss, decrementing it in kobj_lookup_method(9) again.
The previous misuse of sys_sigqueue() was sending random register or
stack garbage to 64-bit targets. The freebsd32 implementation preserves
the sival_int member of value when signaling a 64-bit process.
Document the mixed ABI implementation of union sigval and the
incompability of sival_ptr with pointer integrity schemes.
Reviewed by: kib, wblock
MFC after: 1 week
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D10605
Add IRQ placement-only and ithread-only API variants. intr_event_bind
has been extended with sibling methods, as it has many more callsites in
existing code.
Reviewed by: kib@, adrian@ (earlier version)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D10586
Some notes:
- Only i386 and amd64 layouts are checked, other Tier-1 (or close to
it) architectures would benefit from the same check.
- Unconditional enabling of the asserts depend on the stability of locks
memory layout. If locks are optimized to avoid bloat when some debugging
or profiling features turned off, it makes sense to only assert layout
for production configs.
Reviewed by: badger, emaste, jhb, vangyzen
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D10526
This check has been redundant since it was introduced in r162554.
Reviewed by: emaste, glebius
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D10322