locks/unlocks the vnode and does a VOP_GETATTR()
for the SEEK_END case. This is safe to do, since
lf_advlock{async}() only uses the size argument
for the SEEK_END case.
The NFSv4 server needs this when
vfs.nfsd.enable_locallocks!=0 since locking the
vnode results in a LOR that can cause a deadlock
for the nfsd threads.
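
A minimal user-space sketch of why skipping the size query is safe outside SEEK_END. It is not the kernel code: resolve_lock_start(), size_fn and example_size() are hypothetical names, and the callback merely stands in for the locked VOP_GETATTR() call.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>	/* SEEK_SET, SEEK_CUR, SEEK_END */
#include <sys/types.h>	/* off_t */

/* Hypothetical callback standing in for the locked VOP_GETATTR() query. */
typedef off_t (*size_fn)(void *file);

static int size_queries;	/* counts how often the size was fetched */

/* Example callback: "file" is just a pointer to the stored size. */
static off_t
example_size(void *file)
{
	size_queries++;
	return (*(off_t *)file);
}

/*
 * Resolve the absolute start offset of a lock request.  The size is
 * fetched only for SEEK_END; the other cases never touch it.
 * Overflow checks are omitted to keep the sketch short.
 */
static int
resolve_lock_start(int whence, off_t l_start, off_t cur_offset,
    size_fn get_size, void *file, off_t *startp)
{
	off_t base;

	switch (whence) {
	case SEEK_SET:
		base = 0;
		break;
	case SEEK_CUR:
		base = cur_offset;
		break;
	case SEEK_END:
		base = get_size(file);
		break;
	default:
		return (EINVAL);
	}
	if (base + l_start < 0)
		return (EINVAL);
	*startp = base + l_start;
	return (0);
}
```

In this model a caller that cannot take the lock can pass any size source for SEEK_SET and SEEK_CUR requests, since the callback is never invoked for them.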
Reviewed by: kib
MFC after: 1 week
the created file name was cached. Use the flag for core dumps.
Requested by: rpaulo
Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
(UTC) rather than the archaic (GMT) in comments. Except where the
comments are making fun of people doing this (and pedants who insist
on the new terms).
why zero-sized mbufs could appear in socket buffers.
A proper fix would be to divorce record socket buffers and stream
socket buffers, and divorce pru_send that accepts normal data from
pru_send that accepts control data.
in sbappend_locked() and sbappendrecord_locked().
This is a quick fix to the panic introduced by r274712.
A proper solution should be to make sosend_generic() avoid calling
pru_send() with NULL mbuf for the protocols that do not understand
control messages. Those protocols that understand control messages,
should be able to receive NULL mbuf, if control is non-NULL.
into namecache, to avoid cache trashing when doing large operations.
E.g., tar archive extraction is not usually followed by access to many
of the files created.
Right now, each VOP_LOOKUP() implementation explicitly knows about
this quirk and tests for both MAKEENTRY flag presence and op != CREATE
to make the call to cache_enter(). Centralize the handling of the
quirk into VFS, by deciding to cache only by MAKEENTRY flag in VOP.
VFS now sets NOCACHE flag for CREATE namei() calls.
Note that the change in semantics is backward-compatible and could be
merged to the stable branch, and is compatible with non-changed
third-party filesystems which correctly handle MAKEENTRY.
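
The centralization can be modelled in user space roughly as follows. This is a sketch under assumptions: the EX_* flag values and the function names are illustrative, and deriving MAKEENTRY as "NOCACHE not set" is a simplification of what the VFS lookup code does.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative values only; the kernel's real flag values differ. */
#define	EX_MAKEENTRY	0x1
#define	EX_NOCACHE	0x2

enum ex_op { EX_LOOKUP, EX_CREATE, EX_DELETE, EX_RENAME };

/* Old scheme: every filesystem repeats both tests. */
static bool
fs_should_cache_old(int flags, enum ex_op op)
{
	return ((flags & EX_MAKEENTRY) != 0 && op != EX_CREATE);
}

/*
 * New scheme, VFS side: CREATE lookups get NOCACHE, and MAKEENTRY is
 * derived from the absence of NOCACHE (an assumption of this sketch).
 */
static int
vfs_namei_flags(int flags, enum ex_op op)
{
	if (op == EX_CREATE)
		flags |= EX_NOCACHE;
	if ((flags & EX_NOCACHE) != 0)
		flags &= ~EX_MAKEENTRY;
	else
		flags |= EX_MAKEENTRY;
	return (flags);
}

/* New scheme, filesystem side: only the flag is consulted. */
static bool
fs_should_cache_new(int flags)
{
	return ((flags & EX_MAKEENTRY) != 0);
}
```

With flags prepared this way, an unchanged filesystem that still performs the old double test reaches the same decision as one that checks only the flag, which is the compatibility property described above.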
Suggested by: Chris Torek <torek@pi-coral.com>
Reviewed by: mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
the orphaned descendants. The base of the API is modelled after the same
feature in DragonFlyBSD.
Requested by: bapt
Reviewed by: jilles (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
A _NEW flag passed to _init_flags() to avoid check for double-init.
Differential Revision: https://reviews.freebsd.org/D1208
Reviewed by: jhb, wblock
MFC after: 1 Month
feature is to quiesce the system before suspend.
Stop is implemented by reusing the thread_single(9) with the special
mode SINGLE_ALLPROC. SINGLE_ALLPROC differs from the existing
single-threading modes by allowing (requiring) the caller to operate on
another process. Interruptible sleeps for !TDF_SBDRY threads are
suspended like SIGSTOP does it, instead of aborting the sleep, like
SINGLE_NO_EXIT, to avoid spurious EINTRs on resume.
Provide debugging sysctl debug.stop_all_proc, which causes total stop
and suspends syncer, while waiting for variable reset for resume. It
is used for debugging and should be removed once a real user of the
interface is added.
In collaboration with: pho
Discussed with: avg
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
filesystem specified VFCF_SBDRY flag, i.e. for NFS.
There are two issues with the sleeps. First, applications may get
unexpected EINTR from the disk i/o syscalls. Second, interruptible
sleep allows the process to be stopped, and since the mount point is
referenced while the thread sleeps, unmount cannot free the mount point
structure's memory, blocking unmount indefinitely.
Even for NFS, it is probably only reasonable to enable PCATCH for intr
mounts, but this information is currently not available at VFS level.
Reported and tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
creating delayed write buffers belonging to the reclaimed vnode. Put
the buffer cleanup code after inactivation.
Add asserts that ensure that buffer queues are empty and add BO_DEAD
flag for bufobj to check that no buffers are added after the cleanup.
BO_DEAD is only used by INVARIANTS-enabled kernels.
Reported and tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Since VFS does not/cannot stop writes, sync might run indefinitely, or
be a wrong thing to do at all. E.g., NFS ignores VFS_SYNC() for
forced unmounts, since non-responding server does not allow sync to
finish. On the other hand, filesystems can and do stop writes using
fs-specific facilities, and should already fully flush caches in
VFS_UNMOUNT() due to the race.
Adjust msdosfs to sync in unmount for forced call, to accommodate the
new behaviour. Note that it is still racy, since writes are not
stopped.
Discussed with: avg, bjk, mckusick
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
to be called before suspension and after resume, correspondingly. The
syncer_suspend() ensures that all filesystems' dirty data and metadata
are saved to the permanent storage, and stops kernel threads which
might modify filesystems. The syncer_resume() restores stopped
threads.
For now, only syncer is stopped. This is needed, because each sync
loop causes superblock updates for UFS.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
the vnode owning the buffer is not locked. Moreover, it cannot be locked
safely, since getnewbuf_reuse_bp() is called from newbuf(), and some
other vnode is already locked, for which reused buffer will be
reassigned.
As a consequence, reclamation of the owning vnode could go in
parallel, in particular, the call to vnode_destroy_vobject(), which
deallocates the vm object and zeroes the v_bufobj->bo_object. Note
that the pages wired by the buffer are left wired and can be safely
freed by the vfs_vmio_release() without the need for the vm object
lock. Also, seeing stale pointer to the v_object is safe due to vm
object type stability.
Check for bo_bufobj != NULL and cache the value in local variable to
avoid trying to lock NULL vm object.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
preparation for the global stop commit.
Move the code to weed suspended or sleeping threads into the
appropriate state, into the helper weed_inhib(). Current code already
has deep nesting and is hard to follow [1].
Add currently useless helper remain_for_mode(), which returns the
count of threads which are allowed to run, according to the
single-threading mode.
In thread_single_end(), do not save curthread into a local variable; it
is unused afterwards, except to find curproc.
Remove stray empty line.
Requested by: avg [1]
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
for the suspension.
Currently, the loop performs uninterruptible cv_wait(9) call, which
prevents suspension until child allows further execution of parent.
If child is stopped, suspension or single-threading is delayed
indefinitely.
Create a helper thread_suspend_check_needed() to identify the need for
a call to thread_suspend_check(). It is required since call to the
thread_suspend_check() cannot be safely done while owning the child
(p2) process lock. Only when suspension is needed, drop p2 lock and
call thread_suspend_check(). Perform wait for cv with timeout, in
case suspend is requested after wait started; I do not see a better
way to interrupt the wait.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
multithreaded status of the process.
The stopped state must be cleared before P_WEXIT is set. A stop
signal delivered just before first PROC_LOCK() block in exit1(9) would
put the process into pending stop with P_WEXIT set or assertion
triggered. Also recheck for the suspension after failed
thread_single(9) call, since process lock could be dropped.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
In the old days callout(9) had 1 tick precision and that was inadequate
for some uses, e.g. DTrace profile module, so we had to emulate cyclic
API and behavior. Now we can directly use callout(9) in the very few
places where cyclic was used.
Differential Revision: https://reviews.freebsd.org/D1161
Reviewed by: gnn, jhb, markj
MFC after: 2 weeks
Provide pru_ready for AF_LOCAL sockets. Local sockets send data directly
to the receive buffer of the peer, thus pru_ready also works on the peer
socket.
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
sending not ready data:
o Add new flag to pru_send() flags - PRUS_NOTREADY.
o Add new protocol method pru_ready().
Sponsored by: Nginx, Inc.
Sponsored by: Netflix
o Introduce a notion of "not ready" mbufs in socket buffers. These
mbufs are now being populated by some I/O in background and are
referenced outside. This has the following implications:
- An mbuf which is "not ready" can't be taken out of the buffer.
- Neither can an mbuf that is behind a "not ready" one in the queue.
- If the socket buffer is flushed, "not ready" mbufs shouldn't be
freed.
o In struct sockbuf the sb_cc field is split into sb_ccc and sb_acc.
The sb_ccc stands for "claimed character count", or "committed
character count". And the sb_acc is "available character count".
Consumers of socket buffer API should no longer access them directly,
but use sbused() and sbavail() respectively.
o Not ready mbufs are marked with M_NOTREADY, and ready but blocked ones
with M_BLOCKED.
o New field sb_fnrdy points to the first not ready mbuf, to avoid linear
search.
o New function sbready() is provided to activate certain amount of mbufs
in a socket buffer.
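
The accounting described in the list above can be illustrated with a toy user-space model. This is only a sketch: the ex_* names are hypothetical, an array replaces the mbuf chain, and ex_sbready() always starts at the first not ready mbuf, which is a simplification of the real sbready().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define	EX_MAXMBUF	8

/* Toy stand-in for an mbuf: just a length and a readiness flag. */
struct ex_mbuf {
	unsigned len;
	bool notready;
};

/* Toy stand-in for struct sockbuf; counter names follow the text above. */
struct ex_sockbuf {
	struct ex_mbuf m[EX_MAXMBUF];
	size_t n;		/* mbufs queued */
	size_t fnrdy;		/* index of first not ready mbuf, n if none */
	unsigned sb_ccc;	/* claimed character count */
	unsigned sb_acc;	/* available character count */
};

static void
ex_sbinit(struct ex_sockbuf *sb)
{
	sb->n = sb->fnrdy = 0;
	sb->sb_ccc = sb->sb_acc = 0;
}

static unsigned ex_sbused(const struct ex_sockbuf *sb) { return (sb->sb_ccc); }
static unsigned ex_sbavail(const struct ex_sockbuf *sb) { return (sb->sb_acc); }

/* Append one mbuf; data behind a not ready mbuf is claimed but blocked. */
static bool
ex_sbappend(struct ex_sockbuf *sb, unsigned len, bool notready)
{
	bool blocked;

	if (sb->n == EX_MAXMBUF)
		return (false);
	blocked = sb->fnrdy < sb->n;
	sb->m[sb->n].len = len;
	sb->m[sb->n].notready = notready;
	sb->sb_ccc += len;
	sb->n++;
	if (!blocked && !notready) {
		sb->sb_acc += len;
		sb->fnrdy = sb->n;
	}
	return (true);
}

/* Activate up to count mbufs, starting at the first not ready one. */
static void
ex_sbready(struct ex_sockbuf *sb, size_t count)
{
	size_t i;

	for (i = sb->fnrdy; count > 0 && i < sb->n; i++, count--)
		sb->m[i].notready = false;
	/* Everything up to the next not ready mbuf becomes available. */
	for (i = sb->fnrdy; i < sb->n && !sb->m[i].notready; i++)
		sb->sb_acc += sb->m[i].len;
	sb->fnrdy = i;
}
```

Keeping the index of the first not ready mbuf is what lets the model update the available count without rescanning the whole queue on every append.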
A special note on SCTP:
SCTP has its own sockbufs. Unfortunately, FreeBSD stack doesn't yet
allow protocol specific sockbufs. Thus, SCTP does some hacks to make
itself compatible with FreeBSD: it manages sockbufs on its own, but keeps
sb_cc updated to inform the stack of amount of data in them. The new
notion of "not ready" data isn't supported by SCTP. Instead, only a
mechanical substitute is done: s/sb_cc/sb_ccc/.
A proper solution would be to take away struct sockbuf from struct
socket and allow protocols to implement their own socket buffers, like
SCTP already does. This was discussed with rrs@.
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
crowded as we now are at about 70k. Bump the limit to 1MB instead
which is still quite a reasonable limit and allows for future growth
of this file and possible future expansion to additional data.
MFC After: 2 weeks
recursion on mutex initialization.
The only places where the recursive acquire is performed are read and
write filters, since knlist, which uses the pipe pair mutex as lock,
is locked when filter is called.
The recursion was added in r93296, and consistent locking for
kn_fop->f_event() introduced in r133741.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
This allows one to make a kernel module to tune the
number of queues before the driver loads.
This is needed so that a module at SI_SUB_CPU can set
tunables for these drivers to take. Otherwise getenv
is called too early by the TUNABLE macros.
Reviewed by: smh
Phabric: https://reviews.freebsd.org/D1149
- Threads lifetime cycle, in particular, counting of the threads in
the process, and interlocking with process mutex and thread lock.
The main reason for this is that turnstile locks are after thread
locks, so you e.g. cannot unlock blockable mutex (think process
mutex) while owning thread lock.
- Virtual and profiling itimers, since the timers activation is done
from the clock interrupt context. Replace the p_slock by p_itimmtx
and PROC_ITIMLOCK().
- Profiling code (profil(2)), for similar reason. Replace the p_slock
by p_profmtx and PROC_PROFLOCK().
- Resource usage accounting. The need for the spinlock there is subtle;
my understanding is that spinlock blocks context switching for the
current thread, which prevents td_runtime and similar fields from
changing (updates are done at the mi_switch()). Replace the p_slock
by p_statmtx and PROC_STATLOCK().
The split is done mostly for code clarity, and should not affect
scalability.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
method needs pre-reset state of the ps_siginfo to correctly construct
signal frame.
Move sigdflt() call after the sv_sendsig() invocation in postsig().
Simultaneously extract common code from trapsignal() and postsig()
into new helper postsig_done().
Submitted by: rea
MFC after: 1 week
a new per-device '%domain' sysctl node that returns the NUMA domain a
device is associated with if it is associated with one.
Note that this API is still a WIP and might change before 11.0 actually
ships.
Differential Revision: https://reviews.freebsd.org/D930
Reviewed by: kib, adrian
o Provide a new VOP_GETPAGES_ASYNC(), which works like VOP_GETPAGES(), but
doesn't sleep. It returns immediately, and will execute the I/O done handler
function that must be supplied as argument.
o Provide VOP_GETPAGES_ASYNC() for the FFS, which uses vnode_pager.
o Extend pagertab to support pgo_getpages_async method, and implement this
method for vnode_pager.
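
The calling convention of an asynchronous request with a mandatory completion handler can be sketched in user space as below. None of this is the kernel interface: the ex_* names, the fixed-size pending table, and the explicit ex_io_complete() call that stands in for I/O completion are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define	EX_MAXPENDING	4

/* Hypothetical "I/O done" handler type: caller argument plus error code. */
typedef void (*ex_iodone_t)(void *arg, int error);

struct ex_request {
	ex_iodone_t done;
	void *arg;
	int active;
};

static struct ex_request ex_pending[EX_MAXPENDING];

/*
 * Queue a request and return immediately without waiting for the data.
 * The handler is mandatory, as in the description above.
 */
static int
ex_getpages_async(ex_iodone_t done, void *arg)
{
	size_t i;

	assert(done != NULL);
	for (i = 0; i < EX_MAXPENDING; i++) {
		if (!ex_pending[i].active) {
			ex_pending[i].done = done;
			ex_pending[i].arg = arg;
			ex_pending[i].active = 1;
			return (0);
		}
	}
	return (-1);	/* table full */
}

/* Stand-in for I/O completion: run the handler of every queued request. */
static void
ex_io_complete(int error)
{
	size_t i;

	for (i = 0; i < EX_MAXPENDING; i++) {
		if (ex_pending[i].active) {
			ex_pending[i].active = 0;
			ex_pending[i].done(ex_pending[i].arg, error);
		}
	}
}

/* Example handler: count successful completions. */
static void
ex_count_done(void *arg, int error)
{
	if (error == 0)
		(*(int *)arg)++;
}
```

The point of the model is the ordering: the request function returns before the handler has run, and the handler fires later from the completion path.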
Reviewed by: kib
Tested by: pho
Sponsored by: Netflix
Sponsored by: Nginx, Inc.