- Document the kern_kevent_anonymous() function.
- Add assertions to ensure that we don't silently leave the kqueue
linked from a file descriptor table.
Reviewed by: jmg
Differential Revision: https://reviews.freebsd.org/D3364
We already properly return ENOTDIR when calling *at() on a non-directory
vnode, but it turns out that if you call it on a socket, we see EINVAL.
Patch up namei to properly translate this to ENOTDIR.
As CloudABI processes cannot adjust their signal handlers, we need to
make sure that we start up CloudABI processes with consistent signal
masks. Though the POSIx standard signal behavior is all right, we do
need to make sure that we ignore SIGPIPE, as it would otherwise be
hard to interact with pipes and sockets.
Extend execsigs() to iterate over ps_sigignore and call sigdflt() for
each of the ignored signals.
Reviewed by: kib
Obtained from: https://github.com/NuxiNL/freebsd
Differential Revision: https://reviews.freebsd.org/D3365
CloudABI's polling system calls merge the concept of one-shot polling
(poll, select) and stateful polling (kqueue). They share the same data
structures.
Extend FreeBSD's kqueue to provide support for waiting for events on an
anonymous kqueue. Unlike stateful polling, there is no need to support
timeouts, as an additional timer event could be used instead.
Furthermore, it makes no sense to use a different number of input and
output kevents. Merge this into a single argument.
Obtained from: https://github.com/NuxiNL/freebsd
Differential Revision: https://reviews.freebsd.org/D3307
The existing sys_cap_rights_limit() expects that a cap_rights_t object
lives in userspace. It is therefore hard to call into it from
kernelspace.
Move the interesting bits of sys_cap_rights_limit() into
kern_cap_rights_limit(), so that we can call into it from the CloudABI
compatibility layer.
Obtained from: https://github.com/NuxiNL/freebsd
Differential Revision: https://reviews.freebsd.org/D3314
initial thread stack is not adjusted by the tunable, the stack is
allocated too early to get access to the kernel environment. See
TD0_KSTACK_PAGES for the thread0 stack sizing on i386.
The tunable was tested on x86 only. From the visual inspection, it
seems that it might work on arm and powerpc. The arm
USPACE_SVC_STACK_TOP and powerpc USPACE macros seems to be already
incorrect for the threads with non-default kstack size. I only
changed the macros to use variable instead of constant, since I cannot
test.
On arm64, mips and sparc64, some static data structures are sized by
KSTACK_PAGES, so the tunable is disabled.
Sponsored by: The FreeBSD Foundation
MFC after: 2 week
This makes the PPS API behave correctly, but isn't ideal -- we still end
up capturing PPS data for non-enabled edges, we just don't process the
data into an event that becomes visible outside of kern_tc. That's because
the event type isn't passed to pps_capture(), so it can't do the filtering.
Any solution for capture filtering is going to require touching every driver.
On CloudABI we want to create file descriptors with just the minimal set
of Capsicum rights in place. The reason for this is that it makes it
easier to obtain uniform behaviour across different operating systems.
By explicitly whitelisting the operations, we can return consistent
error codes, but also prevent applications from depending OS-specific
behaviour.
Extend kern_kqueue() to take an additional struct filecaps that is
passed on to falloc_caps(). Update the existing consumers to pass in
NULL.
Differential Revision: https://reviews.freebsd.org/D3259
It looks like EVFILT_READ and EVFILT_WRITE trigger under the same
conditions as poll()'s POLLRDNORM and POLLWRNORM as described by POSIX.
The only difference is that POLLRDNORM has to be triggered on regular
files unconditionally, whereas EVFILT_READ only triggers when not EOF.
Introduce a new flag, NOTE_FILE_POLL, that can be used to make
EVFILT_READ and EVFILT_WRITE behave identically to poll(). This flag
will be used by cloudlibc's poll() function.
Reviewed by: jmg
Differential Revision: https://reviews.freebsd.org/D3303
of the timehands, from the kern_tc.c implementation to vdso. Add
comments giving hints where to look for the algorithm explanation.
To compensate the removal of rmb() in userspace binuptime(), add
explicit lfence instruction before rdtsc. On i386, add usual
complications to detect SSE2 presence; assume that old CPUs which do
not implement SSE2 also execute rdtsc almost in order.
Reviewed by: alc, bde (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
It looks like umtx_key_get() has the addition and subtraction the wrong
way around, meaning that it fails to match in certain cases. This causes
the cloudlibc unit tests to deadlock in certain cases.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D3287
a pcb from stoppcbs[] rather than the thread's PCB. However, exited threads
retained td_oncpu from the last time they ran, and newborn threads had their
CPU fields cleared to zero during fork and thread creation since they are
in the set of fields zeroed when threads are setup. To fix, explicitly
update the CPU fields for exiting threads in sched_throw() to reflect the
switch out and reset the CPU fields for new threads in sched_fork_thread()
to NOCPU.
Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D3193
CloudABI processes should run in capabilities mode automatically. There
is no need to switch manually (e.g., by calling cap_enter()). Add a
flag, SV_CAPSICUM, that can be used to call into cap_enter() during
execve().
Reviewed by: kib
This field is only used in a KASSERT that verifies that no locks are held
when returning to user mode. Moreover, the td_locks accounting is only
correct when LOCK_DEBUG > 0, which is implied by INVARIANTS.
Reviewed by: jhb
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D3205
original parent. Otherwise the debugee will be set as an orphan of
the debugger.
Add tests for tracing forks via PT_FOLLOW_FORK.
Reviewed by: kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D2809
This allows you to specify the capabilities that the new file descriptor
should have. This allows us to create shared memory objects that only
have the rights we're interested in.
The idea behind restricting the rights is that it makes it a lot easier
for CloudABI to get consistent behaviour across different operating
systems. We only need to make sure that a shared memory implementation
consistently implements the operations that are whitelisted.
Approved by: kib
Obtained from: https://github.com/NuxiNL/freebsd
On CloudABI, the rights bits returned by cap_rights_get() match up with
the operations that you can actually perform on the file descriptor.
Limiting the rights is good, because it makes it easier to get uniform
behaviour across different operating systems. If process descriptors on
FreeBSD would suddenly gain support for any new file operation, this
wouldn't become exposed to CloudABI processes without first extending
the rights.
Extend fork1() to gain a 'struct filecaps' argument that allows you to
construct process descriptors with custom rights. Use this in
cloudabi_sys_proc_fork() to limit the rights to just fstat() and
pdwait().
Obtained from: https://github.com/NuxiNL/freebsd
has observable overhead when the buffer pages are not resident or not
mapped. The overhead comes at least from two factors, one is the
additional work needed to detect the situation, prepare and execute
the rollbacks. Another is the consequence of the i/o splitting into
the batches of the held pages, causing filesystems see series of the
smaller i/o requests instead of the single large request.
Note that expected case of the resident i/o buffer does not expose
these issues. Provide a prefaulting for the userspace i/o buffers,
disabled by default. I am careful of not enabling prefaulting by
default for now, since it would be detrimental for the applications
which speculatively pass extra-large buffers of anonymous memory to
not deal with buffer sizing (if such apps exist).
Found and tested by: bde, emaste
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Right now there is a chance that sysctl unregister will cause reader to
block on the sx lock associated with sysctl rmlock, in which case kernels
with debug enabled will panic.
The check added in r285872 can trigger for valid buffers if the buffer space
used happens to be just after unmapped_buf in KVA space.
Discussed with: kib
Sponsored by: Citrix Systems R&D
Summary:
Pipes in CloudABI are unidirectional. The reason for this is that
CloudABI attempts to provide a uniform runtime environment across
different flavours of UNIX.
Instead of implementing a custom pipe that is unidirectional, we can
simply reuse Capsicum permission bits to support this. This is nice,
because CloudABI already attempts to restrict permission bits to
correspond with the operations that apply to a certain file descriptor.
Replace kern_pipe() and kern_pipe2() by a single kern_pipe() that takes
a pair of filecaps. These filecaps are passed to the newly introduced
falloc_caps() function that creates the descriptors with rights in
place.
Test Plan:
CloudABI pipes seem to be created with proper rights in place:
https://github.com/NuxiNL/cloudlibc/blob/master/src/libc/unistd/pipe_test.c#L44
Reviewers: jilles, mjg
Reviewed By: mjg
Subscribers: imp
Differential Revision: https://reviews.freebsd.org/D3236
falloc_noinstall() followed by finstall() allows you to create and
install file descriptors with custom capabilities. Add falloc_caps()
that can do both of these actions in one go.
This will be used by CloudABI to create pipes with custom capabilities.
Reviewed by: mjg
'buf' is inconvenient and has lead me to some irritating to discover
bugs over the years. It also makes it more challenging to refactor
the buf allocation system.
- Move swbuf and declare it as an extern in vfs_bio.c. This is still
not perfect but better than it was before.
- Eliminate the unused ffs function that relied on knowledge of the buf
array.
- Move the shutdown code that iterates over the buf array into vfs_bio.c.
Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division
attached to bufs to avoid the overhead of the vm. This purposes is now
better served by vmem. Freeing the kva immediately when a buf is
destroyed leads to lower fragmentation and a much simpler scan algorithm.
Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division
Summary:
Back in 2005, maxim@ attempted to fix shutdown() to return ENOTCONN in case the socket was not connected (r150152). This had to be rolled back (r150155), as it broke some of the existing programs that depend on this behavior. I reapplied this change on my system and indeed, syslogd failed to start up. I fixed this back in February (279016) and MFC'ed it to the supported stable branches. Apart from that, things seem to work out all right.
Since at least Linux and Mac OS X do the right thing, I'd like to go ahead and give this another try. To keep old copies of syslogd working, only start returning ENOTCONN for recent binaries.
I took a look at the XNU sources and they seem to test against both SS_ISCONNECTED, SS_ISCONNECTING and SS_ISDISCONNECTING, instead of just SS_ISCONNECTED. That seams reasonable, so let's do the same.
Test Plan:
This issue was uncovered while writing tests for shutdown() in CloudABI:
https://github.com/NuxiNL/cloudlibc/blob/master/src/libc/sys/socket/shutdown_test.c#L26
Reviewers: glebius, rwatson, #manpages, gnn, #network
Reviewed By: gnn, #network
Subscribers: bms, mjg, imp
Differential Revision: https://reviews.freebsd.org/D3039
Currently LOCK_DEBUG is always defined in sys/lock.h (0 or 1).
This means that debugging code always built. In addition the kernel
modules have always defined LOCK_DEBUG as 1. So, debugging rmlock code
is always used by kernel modules.
MFC after: 1 week
b_kvabase when the buffer is reclaimed. Otherwise, if b_data for the
mapped buffer was adjusted with the page-offset portion of b_offset,
nothing would re-adjust the b_data, which breaks buffer management
code which expects page-aligned b_data (see e.g. bpman_qenter(), which
skips partial pages).
Fix a minor issue with the GB_KVAALLOC requests, which could result in
returning the mapped buffer if the reused buffer is mapped and have
the right amount of KVA reserved.
Improve assertion in the vfs_buf_check_mapped() to catch unmapped
buffers which have their b_data incorrectly adjusted with offset.
Reported and tested by: pho (previous version)
Reviewed by: jeff (previous version)
Sponsored by: The FreeBSD Foundation
from x86 to use smp_ipi_mtx spin lock not only for smp_rendezvous_cpus()
but also for the MD cache invalidation, TLB demapping and remote register
reading IPIs due to the following reasons:
- The cross-IPI SMP deadlock x86 otherwise is subject to can't happen on
sparc64. That's because on sparc64, spin locks don't disable interrupts
completely but only raise the processor interrupt level to PIL_TICK. This
means that IPIs still get delivered and direct dispatch IPIs such as the
cache invalidation etc. IPIs in question are still executed.
- In smp_rendezvous_cpus(), smp_ipi_mtx is held not only while sending an
IPI_RENDEZVOUS, but until all CPUs have processed smp_rendezvous_action().
Consequently, smp_ipi_mtx may be locked for an extended amount of time as
queued IPIs (as opposed to the direct ones) such as IPI_RENDEZVOUS are
scheduled via a soft interrupt. Moreover, given that this soft interrupt
is only delivered at PIL_RENDEZVOUS, processing of smp_rendezvous_action()
on a target may be interrupted by f. e. a tick interrupt at PIL_TICK, in
turn leading to the target in question trying to send an IPI by itself
while IPI_RENDEZVOUS isn't fully handled, yet, and, thus, resulting in a
deadlock.
o As mentioned in the commit message of r245850, on least some sun4u platforms
concurrent sending of IPIs by different CPUs is fatal. Therefore, hold the
reintroduced MD ipi_mtx also while delivering cross-traps via MI helpers,
i. e. ipi_{all_but_self,cpu,selected}().
o Akin to x86, let the last CPU to process cpu_mp_bootstrap() set smp_started
instead of the BSP in cpu_mp_unleash(). This ensures that all APs actually
are started, when smp_started is no longer 0.
o In all MD and MI IPI helpers, check for smp_started == 1 rather than for
smp_cpus > 1 or nothing at all. This avoids races during boot causing IPIs
trying to be delivered to APs that in fact aren't up and running, yet.
While at it, move setting of the cpu_ipi_{selected,single}() pointers to
the appropriate delivery functions from mp_init() to cpu_mp_start() where
it's better suited and allows to get rid of the global isjbus variable.
o Given that now concurrent IPI delivery no longer is possible, also nuke
the delays before completely disabling interrupts again in the CPU-specific
cross-trap delivery functions, previously giving other CPUs a window for
sending IPIs on their part. Actually, we now should be able to entirely get
rid of completely disabling interrupts in these functions. Such a change
needs more testing, though.
o In {s,}tick_get_timecount_mp(), make the {s,}tick variable static. While not
necessary for correctness, this avoids page faults when accessing the stack
of a foreign CPU as {s,}tick now is locked into the TLBs as part of static
kernel data. Hence, {s,}tick_get_timecount_mp() always execute as fast as
possible, avoiding jitter.
PR: 201245
MFC after: 3 days
- Use pointer assignment rather than a combination of pointers and
flags to switch buffers between unmapped and mapped. This eliminates
multiple flags and generally simplifies the logic.
- Eliminate b_saveaddr since it is only used with pager bufs which have
their b_data re-initialized on each allocation.
- Gather up some convenience routines in the buffer cache for
manipulating buf space and buf malloc space.
- Add an inline, buf_mapped(), to standardize checks around unmapped
buffers.
In collaboration with: mlaier
Reviewed by: kib
Tested by: pho (many small revisions ago)
Sponsored by: EMC / Isilon Storage Division
most recently used buffer when we are under paging pressure. This is
a perversion of the buffer and page replacement algorithms and recent
improvements to the page daemon have rendered it unnecessary. In the
event that low-memory deadlocks become an issue it would be possible
to make a daemon or event handler that performs a similar action on
the oldest buffers rather than the newest. Since the buf cache
is analogous to the page cache and some minimum working set is desired
another possibility is to simply shrink the minimum working set which
has less downside now that file pages are not directly mapped.
Sponsored by: EMC / Isilon
Reviewed by: alc, kib (with some minor objection)
Tested by: pho
done by the functions called on other CPUs, are visible to the caller.
Pair otherwise useless acquire on smp_rv_waiters[3] with a release add
to ensure synchronized with relation, which guarantees visibility.
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
it_need was wrong [*]. Restore the releases and add a comment
explaining why it is needed.
Noted by: alc [*]
Reviewed by: bde [*]
Sponsored by: The FreeBSD Foundation
This change refactors the existing create_thread() function to be more
generic. It replaces almost all of its arguments by a callback that can
be used to extract the thread ID and copy it out to the right place, but
also to perform additional initialization steps, such as setting the
trapframe. This also makes the difference between thr_new() and
thr_create() more clear in my opinion.
This function is going to be used by the CloudABI compatibility layer.
It looks like the OpenSolaris compatibility framework already provides a
function called thread_create(). Rename this function to
do_thread_create() and use a macro to deal with the namespacing
conflict. A similar approach is already used for thread_exit().
MFC after: 1 month
in lockstat.ko. This means that lockstat probes now have typed arguments and
will utilize SDT probe hot-patching support when it arrives.
Reviewed by: gnn
Differential Revision: https://reviews.freebsd.org/D2993