reliability when tracing fast-moving processes or writing traces to
slow file systems by avoiding unbounded queueing and dropped records.
Record loss was previously possible when the global pool of records
became depleted as a result of record generation outstripping record
commit, which occurred quickly in many common situations.
These changes partially restore the 4.x model of committing ktrace
records at the point of trace generation (synchronous), but maintain
the 5.x deferred record commit behavior (asynchronous) for situations
where entering VFS and sleeping is not possible (i.e., in the
scheduler). Records are now queued per-process as opposed to
globally, with processes responsible for committing records from their
own context as required.
- Eliminate the ktrace worker thread and global record queue, as they
are no longer used. Keep the global free record list, as free records
are still allocated from a global pool.
- Add a per-process record queue, which will hold any asynchronously
generated records, such as those from context switches. This replaces
the global queue as the submission point for asynchronous records.
- When a record is committed asynchronously, simply queue it to the
process.
- When a record is committed synchronously, first drain any pending
per-process records in order to maintain ordering as best we can.
Currently ordering between competing threads is provided via a global
ktrace_sx, but a per-process flag or lock may be desirable in the
future.
- When a process returns to user space following a system call, trap,
signal delivery, etc., flush any pending records.
- When a process exits, flush any pending records.
- Assert on process tear-down that there are no pending records.
- Slightly abstract the notion of being "in ktrace", which is used to
prevent the recursive generation of records, as well as the generation
of traces for ktrace events themselves.
Future work here might look at changing the set of events marked for
synchronous and asynchronous record generation, re-balancing queue
depth, timeliness of commit to disk, and so on; e.g., performing a
drain every (n) records.
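As a rough sketch of the new commit discipline (names and structures
are hypothetical; this models the queueing logic, not the actual
kern_ktrace.c code):

    #include <sys/queue.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct ktr_rec {
            STAILQ_ENTRY(ktr_rec) link;
            int data;
    };

    struct proc_sketch {
            STAILQ_HEAD(, ktr_rec) pending;  /* per-process, not global */
    };

    /* Asynchronous commit (e.g., from a context switch): just queue. */
    static void
    commit_async(struct proc_sketch *p, struct ktr_rec *r)
    {
            STAILQ_INSERT_TAIL(&p->pending, r, link);
    }

    /* Synchronous commit: drain pending records first to keep ordering. */
    static void
    commit_sync(struct proc_sketch *p, struct ktr_rec *r)
    {
            struct ktr_rec *q;

            while ((q = STAILQ_FIRST(&p->pending)) != NULL) {
                    STAILQ_REMOVE_HEAD(&p->pending, link);
                    printf("write record %d\n", q->data);  /* the VFS write */
                    free(q);
            }
            printf("write record %d\n", r->data);
            free(r);
    }

    int
    main(void)
    {
            struct proc_sketch p;
            struct ktr_rec *a = malloc(sizeof(*a)), *b = malloc(sizeof(*b));

            STAILQ_INIT(&p.pending);
            a->data = 1;
            b->data = 2;
            commit_async(&p, a);  /* e.g., a context-switch record */
            commit_sync(&p, b);   /* a syscall record: drains a, writes b */
            return (0);
    }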
MFC after: 1 month
Discussed with: jhb
Requested by: Marc Olzheim <marcolz at stack dot nl>
happiness, as well as correct other bugs:
- Replace notion of current and saved accounting credential/vnode with a
single credential/vnode and an acct_suspended flag. This simplifies the
accounting logic substantially.
- Replace acct_mtx with acct_sx, a sleepable lock held exclusively during
reconfiguration and space polling, but shared during log entry
generation. This avoids holding a mutex over sleepable VFS operations.
- Hold the sx lock over the duration of the I/O so that the vnode I/O
cannot occur after vnode close, which could occur previously if
accounting was disabled as a process exited.
- Write the accounting log entry with Giant conditionally acquired based
on the file system where the log is stored. Previously, the accounting
code relied on the caller acquiring Giant.
- Acquire Giant conditionally in the accounting callout based on the file
system where the accounting log is stored. Run the callout MPSAFE.
- Expose acct_suspended via a read-only sysctl so it is possible to
programmatically determine whether accounting is suspended or not without
attempting to parse logs.
- Check both acct_vp and acct_suspended lock-free before acquiring the
accounting sx lock in acct().
- When accounting is disabled due to a VBAD vnode (i.e., a forcible unmount),
generate a log message indicating accounting has been disabled.
- Correct a long-standing bug in how free space is calculated and compared
to the required space: generate and compare signed results, not unsigned
results, or negative free space will cause accounting to not be suspended
when required, or worse, incorrectly resumed once negative free space is
reached.
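A minimal userland illustration of the signed-vs-unsigned comparison
bug described in the last item (threshold and values are made up):

    #include <stdint.h>
    #include <stdio.h>

    int
    main(void)
    {
            int64_t bavail = -100;     /* free space can go negative */
            int64_t threshold = 1000;  /* suspend below this */

            /* Buggy: an unsigned compare treats -100 as a huge number. */
            if ((uint64_t)bavail >= (uint64_t)threshold)
                    printf("buggy: space looks plentiful, accounting stays on\n");

            /* Fixed: a signed compare suspends accounting as intended. */
            if (bavail < threshold)
                    printf("fixed: low space detected, accounting suspended\n");
            return (0);
    }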
MFC after: 2 weeks
socket file descriptor garbage collection code, which is intended to
detect and clear cycles of orphaned file descriptors that are "in-flight"
in a socket when that socket is closed before they are received. The
algorithm previously in place both ran at poor times (resulting in
recursion and reentrance) and was buggy in the presence of parallelism.
In order to
fix these problems, make the following changes:
- When there are in-flight sockets and a UNIX domain socket is destroyed,
asynchronously schedule the garbage collector, rather than running it
synchronously in the current context. This avoids lock order
reversals, deadlocks, and related issues when the garbage collection
code reenters the UNIX domain socket code. Run the code asynchronously
in a task queue.
- In the garbage collector, when skipping file descriptors that have
entered a closing state (i.e., have f_count == 0), re-test the FDEFER
flag, and decrement unp_defer. As file descriptors can now transition
to a closed state while the garbage collector is running, it is no
longer the case that unp_defer will remain an accurate count of
deferred sockets in the mark portion of the GC algorithm. Otherwise,
the garbage collector will loop waiting for unp_defer to reach zero,
which it will never do, as it is skipping file descriptors that were
marked in an earlier pass but are now closed.
- Acquire the UNIX domain socket subsystem lock in unp_discard() when
modifying the unp_rights counter, or a read/write race is risked with
other threads also manipulating the counter.
While here:
- Remove #if 0'd code regarding acquiring the socket buffer sleep lock in
the garbage collector; this is not required, as we are able to use the
socket buffer receive lock to protect scanning the receive buffer for
in-flight file descriptors.
- Annotate that the description of the garbage collector implementation
is increasingly inaccurate and needs to be updated.
- Add counters of the number of deferred garbage collections and recycled
file descriptors. This will be removed and is here temporarily for
debugging purposes.
With these changes in place, the unp_passfd regression test now appears
to pass consistently on UP and SMP systems for extended runs,
whereas before it hung quickly or panicked, depending on which bug was
triggered.
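The deferred scheduling in the first item uses the stock taskqueue(9)
facility; a sketch of the pattern (handler body and names are
illustrative, not the literal uipc_usrreq.c code):

    #include <sys/param.h>
    #include <sys/kernel.h>
    #include <sys/taskqueue.h>

    static struct task unp_gc_task_sketch;

    static void
    unp_gc_sketch(void *arg, int pending)
    {
            /* The mark/sweep pass over in-flight descriptors runs here,
               in a clean context, instead of recursively from teardown. */
    }

    /* At subsystem initialization: */
    static void
    unp_init_sketch(void)
    {
            TASK_INIT(&unp_gc_task_sketch, 0, unp_gc_sketch, NULL);
    }

    /* At socket teardown, when file descriptors are in flight: */
    static void
    unp_detach_sketch(void)
    {
            taskqueue_enqueue(taskqueue_thread, &unp_gc_task_sketch);
    }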
Reported by: Philip Kizer <pckizer at nostrum dot com>
MFC after: 2 weeks
state about each open file, and identify the first process in the process
table that references the file. This is helpful in debugging leaks of
file descriptors.
MFC after: 1 week
The PR and patch have the details. The ultimate fix requires architectural
changes and clarifications to the VFS API, but this will prevent the system
from panicking when someone does "ls /dev" while running in a shell under the
linuxulator.
This issue affects HEAD and RELENG_6 only.
PR: 88249
Submitted by: "Devon H. O'Dell" <dodell@ixsystems.com>
MFC after: 3 days
with the file descriptor. When a file descriptor is closed as a result
of garbage collecting a UNIX domain socket, the file descriptor will
not have any associated thread, so the logic to identify advisory locks
held by that thread is not appropriate. Check the thread for NULL to
avoid this scenario. Expand an existing comment to say a bit more about
this.
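The shape of the guard, as a self-contained sketch (types and helper
names are hypothetical; the real change is in the file-close path):

    struct file_sketch;
    struct thread_sketch;

    void release_posix_locks(struct thread_sketch *, struct file_sketch *);
    void drop_file_ref(struct file_sketch *);

    void
    close_file_sketch(struct file_sketch *fp, struct thread_sketch *td)
    {
            /* GC-driven closes arrive with no associated thread; only
               attempt advisory-lock cleanup when a thread exists. */
            if (td != NULL)
                    release_posix_locks(td, fp);
            drop_file_ref(fp);
    }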
MFC after: 1 week
thread context. While it doesn't matter too much at the moment, in
the future we could be back in the same boat if/when more restrictions
are placed (or enforced) in a SWI.
Suggested by: njl, bde, jhb, scottl
in the hardware interrupt context (even if it is likely just an
ithread). We don't document that suspend/resume routines are run from
such a context and some of the things that happen in those routines
aren't interrupt safe. Since there's no real need to run from that
context, this restores assumptions that suspend routines have made.
This fixes Thierry Herbelot's 'Trying to sleep while sleeping is
prohibited' problem.
to user-space if a parameter named "errmsg" is passed into the iovec.
Used in conjunction with vfs_mount_error(), more useful error messages
than errno can be passed back to userspace when mounting a filesystem
fails.
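For example, a userland caller can pass an "errmsg" buffer in the
iovec and read the detail back when nmount(2) fails (the mount
arguments here are illustrative):

    #include <sys/uio.h>
    #include <sys/mount.h>
    #include <stdio.h>

    int
    main(void)
    {
            char errmsg[255] = "";
            struct iovec iov[] = {
                    { "fstype", sizeof("fstype") }, { "ufs", sizeof("ufs") },
                    { "fspath", sizeof("fspath") }, { "/mnt", sizeof("/mnt") },
                    { "errmsg", sizeof("errmsg") }, { errmsg, sizeof(errmsg) },
            };

            if (nmount(iov, 6, 0) < 0)
                    fprintf(stderr, "nmount failed: %s\n",
                        errmsg[0] != '\0' ? errmsg : "(no detail)");
            return (0);
    }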
Discussed with: phk, pjd
have started aio; instead, initialize the aio management structure if
that hasn't been done yet. The reason to adjust this behavior is to
make it a bit friendlier to threaded programs: consider two threads,
where one submits aio_write() and the other just calls
aio_waitcomplete() to wait for any I/O to complete and to recycle the
aio requests; the recycler may want to wait in the kernel before the
submitter has done any I/O. This also fixes an inconsistency with
other aio syscalls.
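As a userland illustration of the scenario just described (file name
and sizes are made up; error handling is omitted):

    #include <aio.h>
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static struct aiocb cb;

    static void *
    recycler(void *arg)
    {
            struct aiocb *done;

            /* May enter the kernel before the submitter queues any I/O;
               with this change the aio state is initialized here instead
               of the call failing. */
            if (aio_waitcomplete(&done, NULL) >= 0)
                    printf("completed request %p\n", (void *)done);
            return (NULL);
    }

    int
    main(void)
    {
            static char buf[512] = "hello";
            pthread_t t;
            int fd = open("/tmp/aio.test", O_RDWR | O_CREAT, 0644);

            pthread_create(&t, NULL, recycler, NULL);
            sleep(1);               /* let the recycler block first */
            cb.aio_fildes = fd;
            cb.aio_buf = buf;
            cb.aio_nbytes = sizeof(buf);
            aio_write(&cb);
            pthread_join(t, NULL);
            close(fd);
            return (0);
    }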
- Use curthread for calls to knlist_delete() and add a big comment
explaining why as well as appropriate assertions.
- Use TAILQ_FOREACH and TAILQ_FOREACH_SAFE instead of handrolling them.
- Use fget() family of functions to lookup file objects instead of
grovelling around in file descriptor tables.
- Destroy the aio_freeproc mutex if we are unloaded.
Tested on: i386
- Change unconditional acquisition of Giant to only pick up Giant if the
vnode for the controlling tty resides on a non-MPSAFE file system.
- Pick up Giant around executable vnode reference counting operations only
if the executable resides on a non-MPSAFE file system.
- If this process is being traced, pick up Giant for trace file reference
count operations only if it resides on a non-MPSAFE file system.
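The conditional acquisition follows the usual VFS_LOCK_GIANT(9)
pattern, roughly (fragment; vp stands for the vnode in question):

    int vfslocked;

    vfslocked = VFS_LOCK_GIANT(vp->v_mount);  /* Giant only if non-MPSAFE */
    vrele(vp);                                /* vnode refcount operation */
    VFS_UNLOCK_GIANT(vfslocked);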
Discussed with: jhb
Tested by: kris
For each child process whose status has been changed, a SIGCHLD instance
is queued; if the signal is still pending and the process has changed
status several times, the signal information is updated to reflect the
latest process status. If wait() returns because the status of a child
process is available, the pending SIGCHLD signal associated with that
child process is discarded. Any other pending SIGCHLD signals remain
pending.
The signal information is allocated at the same time the proc structure
is allocated, so even if the process signal queue is completely filled
or there is a memory shortage, the kernel can still send the signal to
the process.
There is a boot-time tunable, kern.sigqueue.queue_sigchild, which
controls this behavior; setting it to zero disables the SIGCHLD
queueing feature. The tunable will be removed once the feature has
proven stable enough.
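A userland illustration of the queued per-child siginfo (assumes the
queueing feature above is enabled; error handling is omitted):

    #include <signal.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int
    main(void)
    {
            sigset_t set;
            siginfo_t si;

            sigemptyset(&set);
            sigaddset(&set, SIGCHLD);
            sigprocmask(SIG_BLOCK, &set, NULL);

            for (int i = 0; i < 3; i++)
                    if (fork() == 0)
                            _exit(i);

            /* One queued SIGCHLD instance per status-changed child;
               wait() discards the instance for the child it reaps. */
            for (int i = 0; i < 3; i++) {
                    sigwaitinfo(&set, &si);
                    printf("child %d exited with status %d\n",
                        (int)si.si_pid, si.si_status);
                    waitpid(si.si_pid, NULL, 0);
            }
            return (0);
    }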
Tested on: i386 (SMP and UP)
from there. All others get broken up and freed individually to the mbuf
and cluster zones.
The packet zone is a secondary zone to the mbuf zone. There is currently
a limitation in UMA which prevents decreasing the packet zone stock when
the mbuf and cluster zones are drained and all their members are part of
packets. When this is fixed this change may be reverted.
current context in the IPI_STOP handler so that we can get accurate stack
traces of threads on other CPUs on these two archs like we do now on i386
and amd64.
Tested on: alpha, sparc64
both a proc pointer and a thread pointer: if the thread pointer is
NULL, tdsignal() automatically finds a thread; otherwise it sends the
signal to the given thread.
Add utility function psignal_event to send a realtime sigevent
to a process according to the delivery requirement specified in
struct sigevent.
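Illustrative call sites under the extended interface (the exact
signature is reproduced from memory and may differ):

    /* Approximately: int tdsignal(struct proc *p, struct thread *td,
       int sig, ksiginfo_t *ksi); */
    tdsignal(p, NULL, SIGHUP, NULL);  /* kernel picks a suitable thread */
    tdsignal(p, td, SIGHUP, NULL);    /* deliver to thread td specifically */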
based jumbo 9k and jumbo 16k cluster support.
All mbufs with external storage attached are mandatorily reference
counted. For clusters and jumbo clusters UMA provides the refcnt
storage directly; it does not have to be separately allocated. Any
other type of external storage gets its own refcnt allocated from
a UMA mbuf refcnt zone instead of normal kernel malloc.
The refcount API, MEXT_ADD_REF() and MEXT_REM_REF(), is no longer
publicly accessible; the proper m_* functions have to be used.
mb_ctor_clust() and mb_dtor_clust() both handle normal 2K as well
as 9k and 16k clusters.
Clusters and jumbo clusters may be obtained without attaching them
immediately to an mbuf. This is for high-performance cluster
allocation in network drivers where mbufs are attached after the
cluster has been filled.
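A hypothetical userland model of the shared, refcounted external
storage (this illustrates the scheme only; the real code lives in the
mbuf allocator):

    #include <stdio.h>
    #include <stdlib.h>

    struct ext_storage {
            unsigned *refcnt;  /* for clusters, provided by the zone */
            char *buf;
    };

    static void
    ext_hold(struct ext_storage *e)
    {
            (*e->refcnt)++;          /* another mbuf shares the buffer */
    }

    static void
    ext_free(struct ext_storage *e)
    {
            if (--(*e->refcnt) == 0) {
                    free(e->buf);    /* stands in for a return to UMA */
                    free(e->refcnt);
                    printf("storage freed\n");
            } else
                    printf("storage still referenced\n");
    }

    int
    main(void)
    {
            struct ext_storage e =
                { malloc(sizeof(unsigned)), malloc(9 * 1024) };

            *e.refcnt = 1;  /* first reference at allocation */
            ext_hold(&e);   /* a second mbuf attaches the cluster */
            ext_free(&e);   /* drop one: storage survives */
            ext_free(&e);   /* drop the last: storage released */
            return (0);
    }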
Tested by: rwatson
Sponsored by: TCP/IP Optimizations Fundraise 2005
Having an additional MT_HEADER mbuf type is superfluous and redundant,
as nothing depends on it. It only adds a layer of confusion. The
distinction between header mbufs and data mbufs is made solely through
the M_PKTHDR flag in m->m_flags.
Non-native code is not changed in this commit. For compatibility
MT_HEADER is mapped to MT_DATA.
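The compatibility shim amounts to an alias along these lines:

    #define MT_HEADER       MT_DATA         /* compatibility with old code */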
Sponsored by: TCP/IP Optimization Fundraise 2005
ktr_tid as part of gathering ktr header data for new ktrace
records. The continued use of intptr_t is required for file layout
reasons, and cannot be changed to lwpid_t at this point.
MFC after: 1 month
Reviewed by: davidxu
intptr_t. The buffer length needs to be written to disk as part
of the trace log, but the kernel pointer for the buffer does not.
Add a new ktr_buffer pointer to the kernel-only ktrace request
structure to hold that pointer. This frees up an integer in the
ktrace record format that can be used to hold the threadid,
although older ktrace files will have a garbage ktr_buffer field
(or more accurately, a kernel pointer value).
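For reference, the resulting on-disk header looks approximately like
this (field layout reproduced from memory; consult sys/ktrace.h for
the authoritative definition):

    struct ktr_header {
            int             ktr_len;                 /* length of buf */
            short           ktr_type;                /* trace record type */
            pid_t           ktr_pid;                 /* process id */
            char            ktr_comm[MAXCOMLEN + 1]; /* command name */
            struct timeval  ktr_time;                /* timestamp */
            intptr_t        ktr_tid;                 /* was the buffer pointer */
    };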
MFC after: 2 weeks
Space requested by: davidxu
before dereferencing it. Certain corrupt kernel modules might not have
a valid hash table, and would cause a kernel panic when they were loaded.
Instead of panicking, the kernel now prints a warning that the module
is missing its symbol hash table.
Tested by: Benjamin Close Benjamin dot Close at clearchain dot com
MFC after: 1 week
- Prefer '_' to ' ', as it results in more easily parsed results in
memory monitoring tools such as vmstat.
- Remove punctuation that is incompatible with using memory type names
as file names, such as '/' characters.
- Disambiguate some collisions by adding subsystem prefixes to some
memory types.
- Generally prefer lower case to upper case.
- If the same type is defined in multiple architecture directories,
attempt to use the same name in each of them.
Not all instances were caught in this change, so more work is required to
finish this conversion. Similar changes are required for UMA zone names.
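For a malloc(9) type, the conventions look like this (the type itself
is made up):

    #include <sys/param.h>
    #include <sys/kernel.h>
    #include <sys/malloc.h>

    /* Before: MALLOC_DEFINE(M_FOOBUF, "FOO buf/hdr", "foo buffers"); */
    MALLOC_DEFINE(M_FOOBUF, "foo_buf", "foo subsystem buffers");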
following the protocol pru_listen() call to solisten_proto(), so
that it occurs under the socket lock acquisition that also sets
SO_ACCEPTCONN. This requires passing the new backlog parameter
to the protocol, which also allows the protocol to be aware of
changes in queue limit should it wish to do something about the
new queue limit. This continues a move towards the socket layer
acting as a library for the protocol.
Bump __FreeBSD_version due to a change in the in-kernel protocol
interface. This change has been tested with IPv4 and UNIX domain
sockets, but not other protocols.
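A sketch of the new ordering in a protocol's pru_listen entry point
(illustrative; the solisten_proto_check()/solisten_proto() split
follows the socket layer, other details are made up):

    static int
    foo_listen(struct socket *so, int backlog, struct thread *td)
    {
            int error;

            SOCK_LOCK(so);
            error = solisten_proto_check(so);
            if (error == 0)
                    /* Sets SO_ACCEPTCONN and the queue limit under the lock. */
                    solisten_proto(so, backlog);
            SOCK_UNLOCK(so);
            return (error);
    }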
convert to or from timeval frequently.
Introduce function itimer_accept() to ack a timer signal in the signal
acceptance code; this allows us to return a fresher overrun counter
than the one recorded at signal generation time. While POSIX says
"the value returned by timer_getoverrun() shall apply to the most
recent expiration signal delivery or acceptance for the timer,..",
I prefer returning it at acceptance time.
Introduce the SIGEV_THREAD_ID notification mode. It is used by the
thread library to request that the kernel deliver a signal to a
specified thread; in turn, the thread library may use the mechanism
to implement SIGEV_THREAD, which is required by POSIX.
A timer signal is managed by the timer code, so it cannot fail even
if the signal queue has been completely filled by the sigqueue
syscall.
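A userland illustration of SIGEV_THREAD_ID delivery (assumes
FreeBSD's sigevent layout, where sigev_notify_thread_id names the
target LWP; error handling is omitted):

    #include <signal.h>
    #include <stdio.h>
    #include <sys/thr.h>
    #include <time.h>

    int
    main(void)
    {
            struct sigevent sev = { 0 };
            struct itimerspec its = { { 1, 0 }, { 1, 0 } };
            timer_t tid;
            long lwpid;
            sigset_t set;
            siginfo_t si;

            thr_self(&lwpid);
            sev.sigev_notify = SIGEV_THREAD_ID;
            sev.sigev_signo = SIGRTMIN;
            sev.sigev_notify_thread_id = lwpid;  /* deliver to this thread */

            sigemptyset(&set);
            sigaddset(&set, SIGRTMIN);
            sigprocmask(SIG_BLOCK, &set, NULL);
            timer_create(CLOCK_REALTIME, &sev, &tid);
            timer_settime(tid, 0, &its, NULL);
            sigwaitinfo(&set, &si);              /* overrun acked at acceptance */
            printf("overruns: %d\n", timer_getoverrun(tid));
            return (0);
    }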
set. When watchdogd(1) is terminated intentionally, it clears the bit,
which should then disable the watchdog in the kernel.
PR: kern/74386
Submitted by: Alex Hoff <ahoff at sandvine dot com>
Approved by: phk, rwatson (mentor)
can't acquire an sx lock in ttyinfo() because ttyinfo() can be called
from interrupt handlers (such as atkbd_intr()). Instead, go back to
locking the process group while we pick a thread to display information
for, and holding that lock until after we drop sched_lock to make sure
the process doesn't exit out from under us. sched_lock ensures that the
specific thread from that process doesn't go away. To protect against
the process exiting after we drop the proc lock but before we dereference
it to lookup the pid and p_comm in the call to ttyprintf(), we now copy
the pid and p_comm to local variables while holding the proc lock.
This problem was found by the recently added TD_NO_SLEEPING assertions for
interrupt handlers.
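The copy-then-drop pattern, sketched (an illustrative fragment, not
the literal ttyinfo() diff):

    void
    info_sketch(struct tty *tp, struct proc *p)
    {
            char comm[MAXCOMLEN + 1];
            pid_t pid;

            PROC_LOCK(p);
            pid = p->p_pid;
            strlcpy(comm, p->p_comm, sizeof(comm));
            PROC_UNLOCK(p);
            /* p may exit now, but the locals remain valid to print. */
            ttyprintf(tp, "cmd: %s %d\n", comm, pid);
    }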
Tested by: emaste
MFC after: 1 week