modified bit emulation traps on Alpha while holding locks in the
sysctl handler.
A better solution would be to pass a hint to the Alpha pmap code to
tell it to mark these pages as modified as they are being wired,
but that appears to be more difficult to implement.
Suggested by: jhb
MFC after: 3 days
placeholder similar to KTR_DEV. Explain the use of KTR_DEV and
KTR_SUBSYS in a comment as well.
- Retire KTR_WITNESS and instead have witness logging default to off but use
KTR_SUBSYS if it is enabled.
1) unregister kqueue filter for EVFILT_LIO.
2) free uma_zones.
3) call setsid directly to enter another session rather than
implementing it itself.
Submitted by: jhb
The success of the cluster allocation is checked through a field in the
mbuf structure. This change is non-functional but properly silences code
inspection tools.
Found by: Coverity Prevent(tm)
Coverity ID: CID807
Sponsored by: TCP/IP Optimization Fundraise 2005
lookup() instead of EPERM when a DELETE or RENAME operation is
attempted on "..".
In kern_unlink(), remap EINVAL errors returned from namei() to EPERM
to match existing (and POSIX required) behaviour.
Discussed with: bde
MFC after: 3 days
used by utilities to reset moused(8), for example. The syntax is:
!system=kern subsystem=power type=resume
Note that it would be nice to have notification of suspend, but it's more
difficult since there would have to be a method of doing request/ack
to userland and automatically timing out if no response. apm(4) has a
similar mechanism.
MFC after: 2 weeks
An executable contains at most one PT_INTERP program header. Therefore,
the loop that searches for it can terminate after it is found rather than
iterating over the entire set of program headers.
Eliminate an unneeded initialization.
Reviewed by: tegge
the last component of the path name is "..". This keeps VOP_LOOKUP()
from locking vnodes in reverse order.
Tested by: Denis Shaposhnikov <dsh AT vlink DOT ru>
MFC after: 3 days
prototypes, as the majority of new functions added have been in this
style. Changing prototype style now results in gcc noticing that the
implementation of vn_pollrecord() has a 'short' argument instead of
'int' as prototyped in vnode.h, so correct that definition. In practice
this didn't matter as only poll flags in the lower 16 bits are used.
MFC after: 1 week
devclass's parent pointer if the two drivers share the same devclass. This
can happen if the drivers use the same new-bus name. For example, we
currently have 3 drivers that use the name "pci": the generic PCI bus
driver, the ACPI PCI bus driver, and the OpenFirmware PCI bus driver. If
the ACPI PCI bus driver was defined as a subclass of the generic PCI bus
driver, then without this check the "pci" devclass would point to itself
as its parent and device_probe_child() would spin forever when it
encountered the first PCI device that did not have a matching driver.
Reviewed by: dfr, imp, new-bus@
equal to NULL several times later. p_ucred "should probably not" be NULL
if the process isn't PRS_NEW anyway. This is strongly reinforced by the fact
that we don't see frequent crashes here. Remove the checks after p_cansee and
add a KASSERT right before it.
Found by: Coverity Prevent (tm)
Also trim one nearby trailing space.
lock_object objects (see the sketch after this list):
- Add new lock_init() and lock_destroy() functions to setup and teardown
lock_object objects including KTR logging and registering with WITNESS.
- Move all the handling of LO_INITIALIZED out of witness and the various
lock init functions into lock_init() and lock_destroy().
- Remove the constants for static indices into the lock_classes[] array
and change the code outside of subr_lock.c to use LOCK_CLASS to compare
against a known lock class.
- Move the 'show lock' ddb function and lock_classes[] array out of
kern_mutex.c over to subr_lock.c.
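As a rough sketch of the consolidated interface (the prototypes and the
lock_class_mtx_sleep/LO_WITNESS names below are assumed from the
description above, not quoted verbatim from the commit):

    void lock_init(struct lock_object *lock, struct lock_class *class,
        const char *name, const char *type, int flags);
    void lock_destroy(struct lock_object *lock);

    /* An individual lock's init routine now reduces to roughly: */
    struct lock_object lo;

    lock_init(&lo, &lock_class_mtx_sleep, "example", NULL, LO_WITNESS);
    /* ... use the lock ... */
    lock_destroy(&lo);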
Since we are using vfs_busy() on a freshly allocated mount structure, use
(void) to show that we do not care about the return value.
Found with: Coverity Prevent (tm)
MFC after: 2 weeks
taskqueue_start_threads(struct taskqueue **, int count, int pri,
const char *name, ...);
This allows the creation of 1 or more threads that will service a single
taskqueue. Also rework the taskqueue_create() API to remove the API change
that was introduced a while back. Creating a taskqueue doesn't rely on
the presence of a process structure, and the proc mechanics are much better
encapsulated in taskqueue_start_threads(). Also clean up the
taskqueue_terminate() and taskqueue_free() functions to safely drain
pending tasks and remove all associated threads.
The TASKQUEUE_DEFINE and TASKQUEUE_DEFINE_THREAD macros have been changed
to use the new API, but drivers compiled against the old definitions will
still work. Thus, recompiling drivers is not a strict requirement.
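A minimal sketch of the intended usage (queue name, thread count, and
priority are illustrative; taskqueue_thread_enqueue is the stock
thread-based enqueue hook):

    #include <sys/param.h>
    #include <sys/kernel.h>
    #include <sys/malloc.h>
    #include <sys/priority.h>
    #include <sys/taskqueue.h>

    static struct taskqueue *my_tq;

    static void
    my_tq_init(void *arg)
    {
            /* Create the queue first, then attach service threads. */
            my_tq = taskqueue_create("my_tq", M_WAITOK,
                taskqueue_thread_enqueue, &my_tq);
            taskqueue_start_threads(&my_tq, 4, PWAIT, "my_tq taskq");
    }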
intended for use solely with atomic datagram socket types, and relies
on the previous break-out of sosend_copyin(). Changes to allow UDP to
optionally use this instead of sosend() will be committed as a
follow-up.
to COMPAT_43TTY.
Add COMPAT_43TTY to NOTES and */conf/GENERIC
Compile tty_compat.c only under the new option.
Spit out
#warning "Old BSD tty API used, please upgrade."
if ioctl_compat.h gets #included from userland.
fast taskqueues. The following have been added:
TASKQUEUE_FAST_DEFINE() - create a global taskqueue whose tasks run in
an arbitrary execution context.
TASKQUEUE_FAST_DEFINE_THREAD() - create a global taskqueue that uses a
dedicated kthread.
taskqueue_create_fast() - create a local/private taskqueue.
These are all complementary to the standard taskqueue functions. They are
primarily useful for fast interrupt handlers that can only use spinlocks for
synchronization.
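For example, a fast interrupt handler might defer its heavy lifting
like this (all names are illustrative; taskqueue_enqueue_fast() is the
fast-queue enqueue entry point):

    /* Global fast taskqueue serviced by a dedicated kthread. */
    TASKQUEUE_FAST_DEFINE_THREAD(my_fast);

    static struct task my_task;    /* TASK_INIT() done at attach time */

    static void
    my_intr_fast(void *arg)
    {
            /* Only spinlocks may be used here; punt the real work. */
            taskqueue_enqueue_fast(taskqueue_my_fast, &my_task);
    }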
I personally think that the taskqueue API is starting to get too narrow and
hairy, but fixing it will require a major redesign on the API. Such a
redesign would be good but would break compatibility with FreeBSD 6.x, so
it really isn't desirable at this time.
Submitted by: sam
and subsequently broke the build. This change is supposed to fix the
case where doing a mtx_destroy() off a spin mutex while you hold it fails.
If it had been tested I would just leave it in, but it hasn't been tested
yet, so it will have to wait until later.
struct sx). Instead of storing a direct pointer to our lock_class
struct in lock_object, reserve 4 bits in the lo_flags field to serve as an
index into a global lock_classes array that contains pointers to the lock
classes. Only debugging code such as WITNESS or INVARIANTS checks and KTR
logging need to access the lock_class member, so this shouldn't add any
overhead to production kernels. It might add some slight overhead to
kernels using those debug options however.
As with the previous set of changes to lock_object, this is going to
completely obliterate the kernel ABI, so be sure to recompile all your
modules.
returns EBADF. That errno is correct and is mandated by POSIX. It also
goes back to revision 1.1 of our CVS history (i.e. 4.4BSD).
The _fget() function should probably also be updated as it currently returns
EINVAL in that case rather than EBADF. (It does return EBADF for reads
on a write-only descriptor without any XXX comments oddly enough.)
Discussed with: scottl, grog, mjacob, bde
that a file's atime and mtime are only set to correct fractional
second values (0-999999000ns with the current interface).
Prior to this change users could create files with values outside
that range. Moreover, on 32-bit machines tv_usec offsets larger than
4.3s would result in an unnormalized AND wrong timestamp value,
due to overflow.
MFC after: 1 week
- provide an interface (macros) to the page coloring part of the VM system,
this allows trying different coloring algorithms without the need to
touch every file [1]
- make the page queue tuning values readable: sysctl vm.stats.pagequeue
- autotuning of the page coloring values based upon the cache size instead
of options in the kernel config (disabling of the page coloring as a
kernel option is still possible)
MD changes:
- detection of the cache size: only IA32 and AMD64 (untested) contain
cache size detection code, every other arch just comes with a dummy
function (this results in the use of default values like it was the
case without the autotuning of the page coloring)
- print some more info on Intel CPU's (like we do on AMD and Transmeta
CPU's)
Note to AMD owners (IA32 and AMD64): please run "sysctl vm.stats.pagequeue"
and report if the cache* values are zero (= bug in the cache detection code)
or not.
Based upon work by: Chad David <davidc@acns.ab.ca> [1]
Reviewed by: alc, arch (in 2004)
Discussed with: alc, Chad David, arch (in 2004)
- Provide tunable vm.memguard.desc, so one can specify memory type without
changing the code and recompiling the kernel.
- Allow memguard to be used for kernel modules by providing sysctl
vm.memguard.desc, which can be changed to short description of memory
type before module is loaded.
- Move as much memguard code as possible to memguard.c.
- Add sysctl node vm.memguard. and move memguard-specific sysctl there.
- Add malloc_desc2type() function for finding memory type based on its
short description (ks_shortdesc field).
- Memory type can be changed (via vm.memguard.desc sysctl) only if it
doesn't exist (will be loaded later) or when no memory is allocated yet.
If there is allocated memory for the given memory type, return EBUSY.
- Implement two ways of comparing memory types and make the safer/slower
one the default.
The race is very real, but conditions needed for triggering it are rather
hard to meet now.
When gjournal is committed (where it is quite easy to trigger) we will need
to fix it.
For now, we will see whether it really is that hard to trigger.
Discussed with: kan
of msleep(). msleep_spin() doesn't support changing the priority of the
thread while it is asleep nor does it support interruptible sleeps (PCATCH)
or the PDROP flag. It does support timeouts however. It differs from
msleep() in that the passed in mutex is a spin mutex. This means one can
use msleep_spin() and wakeup() with a spin mutex similar to msleep() and
wakeup() with a regular mutex. Note that the spin mutex in question needs
to come before sched_lock and the sleepq locks in lock order.
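A minimal sketch of the pairing, assuming a softc protected by an
MTX_SPIN mutex (field names are made up):

    struct softc {
            struct mtx      sc_mtx;         /* initialized with MTX_SPIN */
            int             sc_done;
    };

    /* Waiter: */
    mtx_lock_spin(&sc->sc_mtx);
    while (!sc->sc_done)
            msleep_spin(&sc->sc_done, &sc->sc_mtx, "scdone", hz);
    mtx_unlock_spin(&sc->sc_mtx);

    /* Waker: */
    mtx_lock_spin(&sc->sc_mtx);
    sc->sc_done = 1;
    wakeup(&sc->sc_done);
    mtx_unlock_spin(&sc->sc_mtx);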
spin locks that are not in the static order list. It is not safe to call
printf while holding the witness spin mutex since the console drivers that
back printf may need to use their own spin locks which would try to talk
to witness when they were locked. Given this, it is possible for one
CPU to lock a console driver lock (such as sio) which then tries to lock
the witness lock while another CPU is doing the printf while holding the
witness lock. Fix this by moving the printf outside of the witness lock.
All other printf's in witness are already correct.
MFC after: 3 days
UMA_SLAB_MALLOC flag.
In some circumstances (I observed it when I was doing a lot of reallocs)
UMA_SLAB_MALLOC can be set even if us_keg != NULL.
If this is the case we have wonderful, silent data corruption, because less
data is copied to the newly allocated region than should be.
I'm not sure when this bug was introduced, it could be there undetected
for years now, as we don't have a lot of realloc(9) consumers and it was
hard to reproduce it...
...but what I know for sure is that I don't want to know who introduced
the bug:) It took me two/three days to track it down (of course most of
the time I was looking for the bug in my own code).
with flags bitfield and set BI_CAN_EXEC_DYN flag for all brands that usually
allow executing elf dynamic binaries (aka shared libraries). When it is
requested to execute an ET_DYN elf image, check if this flag is on after we
know the elf brand, allowing execution if so.
PR: kern/87615
Submitted by: Marcin Koziej <creep@desk.pl>
Specifically, it is required for the I/O that may be performed by
elfN_load_section().
Avoid an obscure deadlock in the a.out, elf, and gzip image
activators. Add a comment describing why the deadlock does not occur
in the common case and how it might occur in less usual circumstances.
Eliminate an unused variable from exec_aout_imgact().
In collaboration with: tegge
by debugger, e.g. the process is dumping core. Only access p_xthread if
P_STOPPED_TRACE is set; this means the thread is ready to exchange signals
with the debugger. Print a warning if P_STOPPED_TRACE is not set due to
bugs in other code, if there are any.
The patch has been tested by Anish Mistry mistry.7 at osu dot edu, and
is slightly adjusted.
passing a pointer to an opaque clockframe structure and requiring the
MD code to supply CLKF_FOO() macros to extract needed values out of the
opaque structure, just pass the needed values directly. In practice this
means passing the pair (usermode, pc) to hardclock() and profclock() and
passing the boolean (usermode) to hardclock_cpu() and hardclock_process().
Other details:
- Axe clockframe and CLKF_FOO() macros on all architectures. Basically,
all the archs were taking a trapframe and converting it into a clockframe
one way or another. Now they can just extract the PC and usermode values
directly out of the trapframe and pass it to fooclock().
- Renamed hardclock_process() to hardclock_cpu() as the latter is more
accurate.
- On Alpha, we now run profclock() at hz (profhz == hz) rather than at
the slower stathz.
- On Alpha, for the TurboLaser machines that don't have an 8254
timecounter, call hardclock() directly. This removes an extra
conditional check from every clock interrupt on Alpha on the BSP.
There is probably room for even further pruning here by changing Alpha
to use the simplified timecounter we use on x86 with the lapic timer
since we don't get interrupts from the 8254 on Alpha anyway.
- On x86, clkintr() shouldn't ever be called now unless using_lapic_timer
is false, so add a KASSERT() to that effect and remove a condition
to slightly optimize the non-lapic case.
- Change the prototype of arm_handler_execute() so that its first arg is a
trapframe pointer rather than a void pointer for clarity.
- Use KCOUNT macro in profclock() to lookup the kernel profiling bucket.
Tested on: alpha, amd64, arm, i386, ia64, sparc64
Reviewed by: bde (mostly)
it and reacquiring it in vrele(). Consequently, there is no reason to
increase the reference count on the vm object caching the file's pages.
Reviewed by: tegge
Eliminate unused parameters to elfN_load_file().
The purpose of this change is consistency (not performance improvement:)),
as it was hard to tell if fdrop() is MPSAFE or not when I saw it sometimes
under Giant and sometimes without it.
Glanced at by: ssouhlal, kan
means:
o Remove Elf64_Quarter,
o Redefine Elf64_Half to be 16-bit,
o Redefine Elf64_Word to be 32-bit,
o Add Elf64_Xword and Elf64_Sxword for 64-bit entities,
o Use Elf_Size in MI code to abstract the difference between
Elf32_Word and Elf64_Word.
o Add Elf_Ssize as the signed counterpart of Elf_Size.
MFC after: 2 weeks
and KTR_IO as they were never used. Remove KTR_CLK since it was only
used for hardclock firing and use KTR_INTR there instead. Remove
KTR_CRITICAL since it was only used for crit enter/exit and use
KTR_CONTENTION instead.
really should be a fptrdiff_t if we had that) in profclock().
- Don't try to profile kernel pc's that are not >= the kernel lowpc to avoid
underflows when computing a profiling index.
- Use the PC_TO_I() macro to compute the kernel profiling index rather than
doing it inline.
Discussed with: bde
ephemeral mappings that are used as the source for three copy
operations from kernel space to user space. There are two reasons for
making this change: (1) Under heavy load exec_map can fill up causing
vm_map_find() to fail. When it fails, the nascent process is aborted
(SIGABRT). Whereas, this reimplementation using sf_buf_alloc()
sleeps. (2) Although it is possible to sleep on vm_map_find()'s
failure until address space becomes available (see kmem_alloc_wait()),
using sf_buf_alloc() is faster. Furthermore, the reimplementation
uses a CPU private mapping, avoiding a TLB shootdown on
multiprocessors.
Problem uncovered by: kris@
Reviewed by: tegge@
MFC after: 3 weeks
mbuf chain that starts with a cluster containing just MHLEN bytes. This
happened because m_dup called m_get or m_getcl depending on the amount of
data to copy, but then always set the size available in the first mbuf to
MHLEN.
Submitted by: Matt Koivisto <mkoivisto at sandvine dot com>
Approved by: jmg
Silence from: rwatson (mentor)
class, then it displays various information about the lock and calls a
new function pointer in lock_class (lc_ddb_show) to dump class-specific
information about the lock as well (such as the owner of a mutex or
xlock'ed sx lock). This is easier than staring at hex dumps of locks to
figure out who owns the lock, etc. Note that extending lock_class doesn't
affect the ABI for any kernel modules as the only code that deals with
lock_class structures directly is kern_mutex.c, kern_sx.c, and witness.
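A hypothetical class hook could look like the following (the
designated-initializer layout and the MTX_FLAGMASK arithmetic are
assumptions for illustration, not the committed code):

    static void
    db_show_mtx(struct lock_object *lock)
    {
            struct mtx *m = (struct mtx *)lock;

            /* Class-specific detail: who owns this mutex? */
            db_printf(" owner: %p\n",
                (void *)(m->mtx_lock & ~MTX_FLAGMASK));
    }

    struct lock_class lock_class_mtx_sleep = {
            .lc_name = "sleep mutex",
            .lc_flags = LC_SLEEPLOCK | LC_RECURSABLE,
            .lc_ddb_show = db_show_mtx,
    };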
MFC after: 1 week
- Implement cv_wait_unlock() method which has semantics compatible
with the sv_wait() method in IRIX. For cv_wait_unlock(), the lock
must be held before entering the function, but is not held when the
function is exited.
- Implement the existing cv_wait() function in terms of cv_wait_unlock().
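A minimal usage sketch, assuming the mutex-taking form of the
interface (the softc fields are made up):

    mtx_lock(&sc->sc_mtx);
    if (!sc->sc_ready) {
            /* Returns with sc_mtx released, like IRIX sv_wait(). */
            cv_wait_unlock(&sc->sc_cv, &sc->sc_mtx);
            /* sc_mtx-protected state may not be touched here
               without relocking. */
    } else
            mtx_unlock(&sc->sc_mtx);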
Submitted by: kan
Feedback from: jhb, trhodes, Christoph Hellwig <hch at infradead dot org>
being held by the current thread or ignored by the current process;
otherwise, it is very possible the thread will enter an infinite loop,
which would lead to an administrator's nightmare.
4k clusters in addition to 9k and 16k ones.
struct mbuf *m_getjcl(int how, short type, int flags, int size)
void *m_cljget(struct mbuf *m, int how, int size)
m_getjcl() returns an mbuf with a cluster of the specified size attached
like m_getcl() does for 2k clusters.
m_cljget() is different from m_clget() as it can allocate clusters
without attaching them to an mbuf. In that case the return value
is the pointer to the cluster of the requested size. If an mbuf was
specified, it gets the cluster attached to it and the return value
can be safely ignored.
For size both take MCLBYTES, MJUM4BYTES, MJUM9BYTES, MJUM16BYTES.
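For instance (M_NOWAIT is used for brevity; the wait-flag spelling of
the era may differ):

    struct mbuf *m;
    void *buf;

    /* mbuf with a 9k jumbo cluster already attached: */
    m = m_getjcl(M_NOWAIT, MT_DATA, M_PKTHDR, MJUM9BYTES);
    if (m == NULL)
            return (ENOBUFS);

    /* Bare 16k cluster, to be attached to an mbuf later: */
    buf = m_cljget(NULL, M_NOWAIT, MJUM16BYTES);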
Reviewed by: glebius
Tested by: glebius
Sponsored by: TCP/IP Optimization Fundraise 2005
lock object (and thus off of each mutex and sx lock):
- Rename the all_locks list to pending_locks and only put locks initialized
before SI_SUB_WITNESS on the list so that witness can add them
once it starts up.
- Now that pending_locks is only used during early startup, change it from
a TAILQ to an STAILQ. This removes a pointer from the STAILQ_ENTRY in
struct lock_object.
- Since the pending_locks list is only used during the single-threaded
early boot it no longer needs to be protected by a mutex, so remove
all_mtx.
- Since the lo_list member of struct lock_object is now only used during
early boot before witness is running, collapse lo_list and lo_witness
into a union. This shaves the second pointer off of struct lock_object.
- Axe lock_cur_cnt and lock_max_cnt.
With these changes, struct mtx shrinks from 36 to 28 bytes on 32-bit
platforms and from 72 to 56 bytes on 64-bit platforms. Note that this
commit will completely and utterly destroy the kernel ABI, so no MFC.
Tested on: alpha, amd64, i386, sparc64
sosend(). Robert accidentally changed the snderr() macro to jump to the
out label which assumes the lock is already released rather than the
release label which drops the lock in his previous change to sosend().
This should fix the recent panics about returning from write(2) with the
socket lock held and the most recent LOR on current@.
process as over the limit when its time is >= to the limit rather than >
the limit. Technically, if p->p_rux.rux_runtime.sec == p->p_pcpulimit
and p->p_rux.rux_runtime.frac == 0, the process hasn't exceeded the limit
yet. However, having the fraction exactly equal to 0 is rather rare, and
it is not worth the overhead to handle that edge case. With just the >
comparison, the process would have to exceed its limit by almost a second
before it was killed.
PR: kern/83192
Submitted by: Maciej Zawadzinski mzawadzinski at gmail dot com
Reviewed by: bde
MFC after: 1 week
chains and copying in mbufs from the body of the send logic, creating
a new function sosend_copyin(). This change makes sosend() almost
readable, and will allow the same logic to be used by tailored socket
send routines.
MFC after: 1 month
Reviewed by: andre, glebius
application wishes to request high precision time stamps be returned:
Alias Existing
CLOCK_REALTIME_PRECISE CLOCK_REALTIME
CLOCK_MONOTONIC_PRECISE CLOCK_MONOTONIC
CLOCK_UPTIME_PRECISE CLOCK_UPTIME
Add experimental low-precision clockid_t names corresponding to these
clocks, but implemented using cached timestamps in kernel rather than
a full time counter query. This offers a minimum update rate of 1/HZ,
but in practice will often be more frequent due to the frequency of
time stamping in the kernel:
New clockid_t name Approximates existing clockid_t
CLOCK_REALTIME_FAST CLOCK_REALTIME
CLOCK_MONOTONIC_FAST CLOCK_MONOTONIC
CLOCK_UPTIME_FAST CLOCK_UPTIME
Add one additional new clockid_t, CLOCK_SECOND, which returns the
current second without performing a full time counter query or cache
lookup overhead to make sure the cached timestamp is stable. This is
intended to support very low granularity consumers, such as time(3).
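For example, a consumer that tolerates coarse timestamps could do:

    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    int
    main(void)
    {
            struct timespec ts;

            /* Cached timestamp, updated at least once per tick. */
            if (clock_gettime(CLOCK_REALTIME_FAST, &ts) == 0)
                    printf("%jd.%09ld\n", (intmax_t)ts.tv_sec,
                        ts.tv_nsec);
            return (0);
    }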
The names, visibility, and implementation of the above are subject
to change, and will not be MFC'd any time soon. The goal is to
expose lower quality time measurement to applications willing to
sacrifice accuracy in performance critical paths, such as when taking
time stamps for the purpose of rescheduling select() and poll()
timeouts. Future changes might include retrofitting the time counter
infrastructure to allow the "fast" time query mechanisms to use a
different time counter, rather than a cached time counter (i.e.,
TSC).
NOTE: With different underlying time mechanisms exposed, using
different time query mechanisms in the same application may result in
relative non-monotonicity or the appearance of clock stalling for a
single clockid_t, as a cached time stamp queried after a precision
time stamp lookup may be "before" the time returned by the earlier
live time counter query.
directly. We need to copyin() the strings in the iovec before
we can strcmp() them. Also, when we want to send the errmsg back
to userspace, we need to copyout()/copystr() the string.
Add a small helper function vfs_getopt_pos() which takes in the
name of an option, and returns the array index of the name in the iovec,
or -1 if not found. This allows us to locate an option in
the iovec without actually manipulating the iovec members directly via
strcmp().
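A sketch of the intended use inside a filesystem's nmount path
(mnt_optnew is the option list handed to VFS_MOUNT(); error handling
trimmed):

    int idx;

    /* Locate "errmsg" in the iovec without copying or mutating it. */
    idx = vfs_getopt_pos(mp->mnt_optnew, "errmsg");
    if (idx == -1) {
            /* The option was not supplied by userland. */
    }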
Noticed by: kris on sparc64
connection queue for a new connection. It was removing connections
from the wrong list.
Submitted by: Paul Mikesell
Sponsored by: Isilon Systems
MFC after: 1 week
When all file systems have a time stamp of zero, which is the case
for example when the root file system is on a read-only medium, we
ended up not calling inittodr() at all. A potential uncleanliness
existed as well. If multiple file systems had a non-zero time stamp,
we would call inittodr() multiple times. While this should not be
harmful, it's definitely not ideal.
Fix both issues by iterating over the mounted file systems to find
the largest time stamp and call inittodr() exactly once with that
time stamp. This could of course be a zero time stamp if none of the
mounted file systems have a non-zero time stamp. In that case the
annoying errors mentioned in the commit log for revision 1.186 still
haven't been avoided. The bottom line is that inittodr() should not
complain when it gets a time base of zero. At the time of this
commit only alpha seems to have that problem.
Reported by: Dario Freni (saturnero at freesbie dot org)
MFC after: 1 week
is called. It looks like there are lots of different mount flags checked
in vfs_domount(), so we need to do the parsing for these particular
mount flags earlier on. The new flags parsed are:
async, force, multilabel, noasync, noatime, noclusterr, noclusterw,
noexec, nosuid, nosymfollow, snapshot, suiddir, sync, union.
Existing code which uses mount() to mount UFS filesystems is not
affected, but new code which uses nmount() to mount UFS filesystems
should behave better.
in, and if so, set MNT_UPDATE filesystem flag.
vfs_nmount() calls vfs_domount(), and there is special logic
inside vfs_domount() if MNT_UPDATE is set. This is very important
when we want to do an update mount of the root filesystem, using nmount().
execute an ET_DYN binary (shared object).
This does not make much sense, but some linux scripts expect to be able to
execute /lib/ld-linux.so.2 (ldd comes to mind).
The sysctl defaults to 0.
MFC after: 3 days
currently present is minor and offers no real semantic issues, it also
doesn't make sense since an earlier lockless check has already
occurred. Also hold the mutex longer, over a manipulation of
per-process ktrace state, which requires synchronization.
MFC after: 1 month
Pointed out by: jhb
reliability when tracing fast-moving processes or writing traces to
slow file systems by avoiding unbounded queueing and dropped records.
Record loss was previously possible when the global pool of records
became depleted as a result of record generation outstripping record
commit, which occurred quickly in many common situations.
These changes partially restore the 4.x model of committing ktrace
records at the point of trace generation (synchronous), but maintain
the 5.x deferred record commit behavior (asynchronous) for situations
where entering VFS and sleeping is not possible (i.e., in the
scheduler). Records are now queued per-process as opposed to
globally, with processes responsible for committing records from their
own context as required.
- Eliminate the ktrace worker thread and global record queue, as they
are no longer used. Keep the global free record list, as records
are still used.
- Add a per-process record queue, which will hold any asynchronously
generated records, such as from context switches. This replaces the
global queue as the place to submit asynchronous records to.
- When a record is committed asynchronously, simply queue it to the
process.
- When a record is committed synchronously, first drain any pending
per-process records in order to maintain ordering as best we can.
Currently ordering between competing threads is provided via a global
ktrace_sx, but a per-process flag or lock may be desirable in the
future.
- When a process returns to user space following a system call, trap,
signal delivery, etc, flush any pending records.
- When a process exits, flush any pending records.
- Assert on process tear-down that there are no pending records.
- Slightly abstract the notion of being "in ktrace", which is used to
prevent the recursive generation of records, as well as generating
traces for ktrace events.
Future work here might look at changing the set of events marked for
synchronous and asynchronous record generation, re-balancing queue
depth, timeliness of commit to disk, and so on. I.e., performing a
drain every (n) records.
MFC after: 1 month
Discussed with: jhb
Requested by: Marc Olzheim <marcolz at stack dot nl>
happiness, as well as correct other bugs:
- Replace notion of current and saved accounting credential/vnode with a
single credential/vnode and an acct_suspended flag. This simplifies the
accounting logic substantially.
- Replace acct_mtx with acct_sx, a sleepable lock held exclusively during
reconfiguration and space polling, but shared during log entry
generation. This avoids holding a mutex over sleepable VFS operations.
- Hold the sx lock over the duration of the I/O so that the vnode I/O
cannot occur after vnode close, which could occur previously if
accounting was disabled as a process exited.
- Write the accounting log entry with Giant conditionally acquired based
on the file system where the log is stored. Previously, the accounting
code relied on the caller acquiring Giant.
- Acquire Giant conditionally in the accounting callout based on the file
system where the accounting log is stored. Run the callout MPSAFE.
- Expose acct_suspended via a read-only sysctl so it is possible to
programmatically determine whether accounting is suspended or not without
attempting to parse logs.
- Check both acct_vp and acct_suspended lock-free before entering the
accounting sx lock in acct().
- When accounting is disabled due to a VBAD vnode (i.e., forcible unmount),
generate a log message indicating accounting has been disabled.
- Correct a long-standing bug in how free space is calculated and compared
to the required space: generate and compare signed results, not unsigned
results, or negative free space will cause accounting to not be suspended
when required, or worse, incorrectly resumed once negative free space is
reached.
MFC after: 2 weeks
socket file descriptor garbage collection code, which is intended to
detect and clear cycles of orphaned file descriptors that are "in-flight"
in a socket when that socket is closed before they are received. The
algorithm present was both run at poor times (resulting in recursion and
reentrance), and also buggy in the presence of parallelism. In order to
fix these problems, make the following changes:
- When there are in-flight sockets and a UNIX domain socket is destroyed,
asynchronously schedule the garbage collector, rather than running it
synchronously in the current context. This avoids lock order issues
when the garbage collection code reenters the UNIX domain socket code,
avoiding lock order reversals, deadlocks, etc. Run the code
asynchronously in a task queue.
- In the garbage collector, when skipping file descriptors that have
entered a closing state (i.e., have f_count == 0), re-test the FDEFER
flag, and decrement unp_defer. As file descriptors can now transition
to a closed state, while the garbage collector is running, it is no
longer the case that unp_defer will remain an accurate count of
deferred sockets in the mark portion of the GC algorithm. Otherwise,
the garbage collector will loop waiting for unp_defer to reach
zero, which it will never do as it is skipping file descriptors that
were marked in an earlier pass, but now closed.
- Acquire the UNIX domain socket subsystem lock in unp_discard() when
modifying the unp_rights counter, or a read/write race is risked with
other threads also manipulating the counter.
While here:
- Remove #if 0'd code regarding acquiring the socket buffer sleep lock in
the garbage collector, this is not required as we are able to use the
socket buffer receive lock to protect scanning the receive buffer for
in-flight file descriptors on the socket buffer.
- Annotate that the description of the garbage collector implementation
is increasingly inaccurate and needs to be updated.
- Add counters of the number of deferred garbage collections and recycled
file descriptors. This will be removed and is here temporarily for
debugging purposes.
With these changes in place, the unp_passfd regression test now appears
to be passed consistently on UP and SMP systems for extended runs,
whereas before it hung quickly or panicked, depending on which bug was
triggered.
Reported by: Philip Kizer <pckizer at nostrum dot com>
MFC after: 2 weeks
state about each open file, and identify the first process in the process
table that references the file. This is helpful in debugging leaks of
file descriptors.
MFC after: 1 week
The PR and patch have the details. The ultimate fix requires architectural
changes and clarifications to the VFS API, but this will prevent the system
from panicking when someone does "ls /dev" while running in a shell under the
linuxulator.
This issue affects HEAD and RELENG_6 only.
PR: 88249
Submitted by: "Devon H. O'Dell" <dodell@ixsystems.com>
MFC after: 3 days
with the file descriptor. When a file descriptor is closed as a result
of garbage collecting a UNIX domain socket, the file descriptor will
not have any associated thread, so the logic to identify advisory locks
held by that thread is not appropriate. Check the thread for NULL to
avoid this scenario. Expand an existing comment to say a bit more about
this.
MFC after: 1 week
thread context. While it doesn't matter too much at the moment, in
the future we could be back in the same boat if/when more restrictions
are placed (or enforced) in a SWI.
Suggested by: njl, bde, jhb, scottl
in the hardware interrupt context (even if it is likely just an
ithread). We don't document that suspend/resume routines are run from
such a context and some of the things that happen in those routines
aren't interrupt safe. Since there's no real need to run from that
context, this restores assumptions that suspend routines have made.
This fixes Thierry Herbelot's 'Trying to sleep while sleeping is
prohibited' problem.
to user-space if a parameter named "errmsg" is passed into the iovec.
Used in conjunction with vfs_mount_error(), more useful error messages
than errno can be passed back to userspace when mounting a filesystem
fails.
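From userland the convention looks roughly like this (the fstype and
paths are illustrative):

    #include <sys/param.h>
    #include <sys/mount.h>
    #include <sys/uio.h>
    #include <err.h>
    #include <errno.h>
    #include <string.h>

    static void
    build_iov(struct iovec *iov, const char *name, void *val, size_t len)
    {
            iov[0].iov_base = __DECONST(void *, name);
            iov[0].iov_len = strlen(name) + 1;
            iov[1].iov_base = val;
            iov[1].iov_len = len;
    }

    int
    main(void)
    {
            char errmsg[255] = "";
            struct iovec iov[8];

            build_iov(&iov[0], "fstype", __DECONST(void *, "nullfs"),
                sizeof("nullfs"));
            build_iov(&iov[2], "fspath", __DECONST(void *, "/mnt"),
                sizeof("/mnt"));
            build_iov(&iov[4], "target", __DECONST(void *, "/usr"),
                sizeof("/usr"));
            build_iov(&iov[6], "errmsg", errmsg, sizeof(errmsg));

            if (nmount(iov, 8, 0) < 0)
                    errx(1, "nmount: %s", errmsg[0] != '\0' ?
                        errmsg : strerror(errno));
            return (0);
    }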
Discussed with: phk, pjd
have started aio; instead, initialize the aio management structure
if it hasn't been done yet. The reason to adjust this behavior is
to make it a bit friendlier for threaded programs: consider two
threads, one of which submits an aio_write while the other just calls
aio_waitcomplete to wait for any I/O to be completed and recycle the
aio requests; before the submitter does any I/O, the recycler wants
to wait in the kernel. This also fixes an inconsistency with other aio
syscalls.
- Use curthread for calls to knlist_delete() and add a big comment
explaining why as well as appropriate assertions.
- Use TAILQ_FOREACH and TAILQ_FOREACH_SAFE instead of handrolling them.
- Use fget() family of functions to lookup file objects instead of
grovelling around in file descriptor tables.
- Destroy the aio_freeproc mutex if we are unloaded.
Tested on: i386
-Change unconditional acquisition of Giant to only pickup Giant if the vnode
for the controlling tty resides on a non-mpsafe file system.
-Pickup Giant around executable vnode reference counting operations only if
the executable resides on a non-mpsafe file system.
-If this process is being traced, pickup Giant for trace file reference count
operations only if it resides on a non-mpsafe file system.
Discussed with: jhb
Tested by: kris
For each child process whose status has been changed, a SIGCHLD instance
is queued; if the signal is still pending, and the process changed status
several times, the signal information is updated to reflect the latest
process status. If wait() returns because the status of a child process
is available, the pending SIGCHLD signal associated with that child
process is discarded. Any other pending SIGCHLD signals remain pending.
The signal information is allocated at the same time the proc structure
is allocated; if the process signal queue is fully filled or there is a
memory shortage, the kernel can still send the signal to the process.
There is a boot-time tunable kern.sigqueue.queue_sigchild which
can control the behavior; setting it to zero disables the SIGCHLD queueing
feature. The tunable will be removed if the feature is proved to be
stable enough.
Tested on: i386 (SMP and UP)
from there. All others get broken up and free'd individually to the mbuf
and cluster zones.
The packet zone is a secondary zone to the mbuf zone. There is currently
a limitation in UMA which prevents decreasing the packet zone stock when
the mbuf and cluster zone are drained and all their members are part of
packets. When this is fixed this change may be reverted.
current context in the IPI_STOP handler so that we can get accurate stack
traces of threads on other CPUs on these two archs like we do now on i386
and amd64.
Tested on: alpha, sparc64
both a proc pointer and a thread pointer; if the thread pointer is NULL,
tdsignal automatically finds a thread, otherwise it sends the signal
to the given thread.
Add utility function psignal_event to send a realtime sigevent
to a process according to the delivery requirement specified in
struct sigevent.
based jumbo 9k and jumbo 16k cluster support.
All mbuf's with external storage attached are mandatory reference
counted. For clusters and jumbo clusters UMA provides the refcnt
storage directly. It does not have to be separately allocated. Any
other type of external storage gets its own refcnt allocated from
an UMA mbuf refcnt zone instead of normal kernel malloc.
The refcount API MEXT_ADD_REF() and MEXT_REM_REF() is no longer
publicly accessible. The proper m_* functions have to be used.
mb_ctor_clust() and mb_dtor_clust() both handle normal 2K as well
as 9k and 16k clusters.
Clusters and jumbo clusters may be obtained without attaching them
immediately to an mbuf. This is for high performance cluster
allocation in network drivers where mbufs are attached after the
cluster has been filled.
Tested by: rwatson
Sponsored by: TCP/IP Optimization Fundraise 2005
Having an additional MT_HEADER mbuf type is superfluous and redundant
as nothing depends on it. It only adds a layer of confusion. The
distinction between header mbuf's and data mbuf's is solely done
through the m->m_flags M_PKTHDR flag.
Non-native code is not changed in this commit. For compatibility
MT_HEADER is mapped to MT_DATA.
Sponsored by: TCP/IP Optimization Fundraise 2005
ktr_tid as part of gathering of ktr header data for new ktrace
records. The continued use of intptr_t is required for file layout
reasons, and cannot be changed to lwpid_t at this point.
MFC after: 1 month
Reviewed by: davidxu
intptr_t. The buffer length needs to be written to disk as part
of the trace log, but the kernel pointer for the buffer does not.
Add a new ktr_buffer pointer to the kernel-only ktrace request
structure to hold that pointer. This frees up an integer in the
ktrace record format that can be used to hold the threadid,
although older ktrace files will have a garbage ktr_buffer field
(or more accurately, a kernel pointer value).
MFC after: 2 weeks
Space requested by: davidxu
before dereferencing it. Certain corrupt kernel modules might not have
a valid hash table, and would cause a kernel panic when they were loaded.
Instead of panic'ing, the kernel now prints out a warning that it is
missing the symbol hash table.
Tested by: Benjamin Close Benjamin dot Close at clearchain dot com
MFC after: 1 week
- Prefer '_' to ' ', as it results in more easily parsed results in
memory monitoring tools such as vmstat.
- Remove punctuation that is incompatible with using memory type names
as file names, such as '/' characters.
- Disambiguate some collisions by adding subsystem prefixes to some
memory types.
- Generally prefer lower case to upper case.
- If the same type is defined in multiple architecture directories,
attempt to use the same name in additional cases.
Not all instances were caught in this change, so more work is required to
finish this conversion. Similar changes are required for UMA zone names.
following the protocol pru_listen() call to solisten_proto(), so
that it occurs under the socket lock acquisition that also sets
SO_ACCEPTCONN. This requires passing the new backlog parameter
to the protocol, which also allows the protocol to be aware of
changes in queue limit should it wish to do something about the
new queue limit. This continues a move towards the socket layer
acting as a library for the protocol.
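The resulting pattern in a protocol's pru_listen method looks roughly
like this (error handling trimmed; solisten_proto_check() is the
companion check on the socket-layer side):

    static int
    foo_listen(struct socket *so, int backlog, struct thread *td)
    {
            int error;

            SOCK_LOCK(so);
            error = solisten_proto_check(so);
            if (error == 0) {
                    /* Set SO_ACCEPTCONN and the queue limit together. */
                    solisten_proto(so, backlog);
            }
            SOCK_UNLOCK(so);
            return (error);
    }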
Bump __FreeBSD_version due to a change in the in-kernel protocol
interface. This change has been tested with IPv4 and UNIX domain
sockets, but not other protocols.
convert to or from timeval frequently.
Introduce function itimer_accept() to ack a timer signal in the signal
acceptance code; this allows us to return a fresher overrun counter
than at signal generation time. While POSIX says:
"the value returned by timer_getoverrun() shall apply to the most
recent expiration signal delivery or acceptance for the timer,.."
I prefer returning it at acceptance time.
Introduce SIGEV_THREAD_ID notification mode; it is used by the thread
library to request that the kernel deliver a signal to a specified thread,
and in turn, the thread library may use the mechanism to implement
SIGEV_THREAD, which is required by POSIX.
A timer signal is managed by the timer code, so it can not fail even if
the signal queue has been filled up by the sigqueue syscall.
set. When watchdogd(1) is terminated intentionally it clears the bit,
which should then disable it in the kernel.
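The userland side of the protocol then looks roughly like this (the
timeout constant is illustrative):

    #include <sys/types.h>
    #include <sys/ioctl.h>
    #include <sys/watchdog.h>
    #include <fcntl.h>

    int
    main(void)
    {
            int fd = open("/dev/" _PATH_WATCHDOG, O_RDWR);
            u_int u;

            u = WD_ACTIVE | WD_TO_16SEC;    /* arm, ~16s timeout */
            ioctl(fd, WDIOCPATPAT, &u);
            /* ... pat periodically while running ... */
            u = 0;                          /* WD_ACTIVE clear: disarm */
            ioctl(fd, WDIOCPATPAT, &u);
            return (0);
    }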
PR: kern/74386
Submitted by: Alex Hoff <ahoff at sandvine dot com>
Approved by: phk, rwatson (mentor)
can't acquire an sx lock in ttyinfo() because ttyinfo() can be called
from interrupt handlers (such as atkbd_intr()). Instead, go back to
locking the process group while we pick a thread to display information for
and hold that lock until after we drop sched_lock to make sure the
process doesn't exit out from under us. sched_lock ensures that the
specific thread from that process doesn't go away. To protect against
the process exiting after we drop the proc lock but before we dereference
it to lookup the pid and p_comm in the call to ttyprintf(), we now copy
the pid and p_comm to local variables while holding the proc lock.
This problem was found by the recently added TD_NO_SLEEPING assertions for
interrupt handlers.
Tested by: emaste
MFC after: 1 week
debug.kdb.panic and debug.kdb.trap alongside the existing debug.kdb.enter
sysctl. 'panic' causes a panic, and 'trap' causes a page fault. We used
these to ensure that crash dumps succeed from those two common failure
modes. This avoids the need for creating a 'panic' kld module.
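For instance, a crash-dump smoke test no longer needs a custom module;
the C equivalent of 'sysctl debug.kdb.panic=1' is simply:

    #include <sys/types.h>
    #include <sys/sysctl.h>

    int
    main(void)
    {
            int one = 1;

            /* Panics the box; run only on a disposable test machine. */
            return (sysctlbyname("debug.kdb.panic", NULL, NULL,
                &one, sizeof(one)));
    }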
threads can wait for a thread to exit, and safely assume that the thread
has left userland and is no longer using its userland stack. This is
necessary for pthread_join when a thread is waiting for another thread
with a user-customized stack to exit; after pthread_join returns,
the userland stack can be reused for other purposes. Without this change,
the joiner thread has to spin at the address to ensure the thread has
really exited.
and increase flexibility to allow various different approaches to be tried
in the future.
- Split struct ithd up into two pieces. struct intr_event holds the list
of interrupt handlers associated with interrupt sources.
struct intr_thread contains the data relative to an interrupt thread.
Currently we still provide a 1:1 relationship of events to threads
with the exception that events only have an associated thread if there
is at least one threaded interrupt handler attached to the event. This
means that on x86 we no longer have 4 bazillion interrupt threads with
no handlers. It also means that interrupt events with only INTR_FAST
handlers no longer have an associated thread either.
- Renamed struct intrhand to struct intr_handler to follow the struct
intr_foo naming convention. This did require renaming the powerpc
MD struct intr_handler to struct ppc_intr_handler.
- INTR_FAST no longer implies INTR_EXCL on all architectures except for
powerpc. This means that multiple INTR_FAST handlers can attach to the
same interrupt and that INTR_FAST and non-INTR_FAST handlers can attach
to the same interrupt. Sharing INTR_FAST handlers may not always be
desirable, but having sio(4) and uhci(4) fight over an IRQ isn't fun
either. Drivers can always still use INTR_EXCL to ask for an interrupt
exclusively. The way this sharing works is that when an interrupt
comes in, all the INTR_FAST handlers are executed first, and if any
threaded handlers exist, the interrupt thread is scheduled afterwards.
This type of layout also makes it possible to investigate using interrupt
filters ala OS X where the filter determines whether or not its companion
threaded handler should run.
- Aside from the INTR_FAST changes above, the impact on MD interrupt code
is mostly just 's/ithread/intr_event/'.
- A new MI ddb command 'show intrs' walks the list of interrupt events
dumping their state. It also has a '/v' verbose switch which dumps
info about all of the handlers attached to each event.
- We currently don't destroy an interrupt thread when the last threaded
handler is removed because it would suck for things like ppbus(8)'s
braindead behavior. The code is present, though, it is just under
#if 0 for now.
- Move the code to actually execute the threaded handlers for an interrupt
event into a separate function so that ithread_loop() becomes more
readable. Previously this code was all in the middle of ithread_loop()
and indented halfway across the screen.
- Made struct intr_thread private to kern_intr.c and replaced td_ithd
with a thread private flag TDP_ITHREAD.
- In statclock, check curthread against idlethread directly rather than
curthread's proc against idlethread's proc. (Not really related to intr
changes)
Tested on: alpha, amd64, i386, sparc64
Tested on: arm, ia64 (older version of patch by cognet and marcel)
IPI_STOP IPIs.
- Change the i386 and amd64 MD IPI code to send an NMI if STOP_NMI is
enabled if an attempt is made to send an IPI_STOP IPI. If the kernel
option is enabled, there is also a sysctl to change the behavior at
runtime (debug.stop_cpus_with_nmi which defaults to enabled). This
includes removing stop_cpus_nmi() and making ipi_nmi_selected() a
private function for i386 and amd64.
- Fix ipi_all(), ipi_all_but_self(), and ipi_self() on i386 and amd64 to
properly handle bitmapped IPIs as well as IPI_STOP IPIs when STOP_NMI is
enabled.
- Fix ipi_nmi_handler() to execute the restart function on the first CPU
that is restarted making use of atomic_readandclear() rather than
assuming that the BSP is always included in the set of restarted CPUs.
Also, the NMI handler didn't clear the function pointer meaning that
subsequent stop and restarts could execute the function again.
- Define a new macro HAVE_STOPPEDPCBS on i386 and amd64 to control the use
of stoppedpcbs[] and always enable it for i386 and amd64 instead of
being dependent on KDB_STOP_NMI. It works fine in both the NMI and
non-NMI cases.
from being reclaimed before it was wired. Use pmap_extract_and_hold()
instead of pmap_extract() and retain the hold on the page until it has been
wired.
clock are supported. I plan to merge the XSI timer ITIMER_REAL and the
other two CPU timers into the new code; currently three slots are available
for the XSI timers.
The SIGEV_THREAD notification type is not supported yet because our
sigevent struct lacks two member fields:
sigev_notify_function
sigev_notify_attributes
I have found that sigevent is used in AIO, so I won't add the two members
unless the AIO code is adjusted.
2. Introduce flags KSI_EXT and KSI_INS. The flag KSI_EXT allows a ksiginfo
to be managed by outside code; the KSI_INS flag indicates sigqueue_add
should directly insert the passed ksiginfo into the queue rather than
copying it.
available kernel malloc types. Quite useful for post-mortem debugging of
memory leaks without a dump device configured on a panicked box.
MFC after: 2 weeks
to unload the usb.ko module after boot if it was originally preloaded
from "/boot/loader.conf". When processing preloaded modules, the
linker erroneously added self-dependencies to each module's reference
count. That prevented usb.ko's reference count from ever going to 0,
so it could not be unloaded.
Sponsored by: Isilon Systems
Reviewed by: pjd, peter
MFC after: 1 week
called during early init before cninit().
Tested on: i386, alpha, sparc64
Reviewed by: phk, imp
Reported by: Divacky Roman xdivac02 at stud dot fit dot vutbr dot cz
MFC after: 1 week
While here, support up to four sections because it was trivial to do
and cheap. (One pointer per section).
For amd64 with "-fpic -shared" format .ko files, using a single PT_LOAD
section is important to avoid wasting about 1MB of KVM and physical ram
for the 'gap' between the two PT_LOAD sections. amd64 normally uses
.o format kld files and isn't affected normally. But -fpic -shared modules
are actually possible to produce and load... (And with a bugfix to
binutils, we can build and use plain -shared .ko files without -fpic)
i386 only wastes 4K per .ko file, so that isn't such a big deal there.
so we are ready for mpsafevfs=1 by default on sparc64 too. I have been
running this on all my sparc64 machines for over 6 months, and have not
encountered MD problems.
MFC after: 1 week
correspond to the commit log. It changed the maxswzone and maxbcache
parameters from int to long, without changing the extern definitions
in <sys/buf.h>.
In fact it's a good thing it did not, because other parts of the system
are not yet ready for this, and on large-memory sparc machines it causes
severe filesystem damage if you try.
The worst effect of the change was that the tunables controlling the
above variables stopped working. These were necessary to allow such
large sparc64 machines (with >12GB RAM) to boot, since sparc64 did not
set a hard-coded upper limit on these parameters and they ended
up overflowing an int, causing an infinite loop at boot in bufinit().
Reviewed by: mlaier
changes in MD code are trivial. Before this change, trapsignal and
sendsig used discrete parameters; now they use member fields of the
ksiginfo_t structure. For sendsig, this change allows us to pass a
POSIX realtime signal value to user code.
2. Remove cpu_thread_siginfo, it is no longer needed because we now always
generate ksiginfo_t data and feed it to libpthread.
3. Add p_sigqueue to proc structure to hold shared signals which were
blocked by all threads in the proc.
4. Add td_sigqueue to thread structure to hold all signals delivered to
thread.
5. i386 and amd64 now return POSIX standard si_code, other arches will
be fixed.
6. In this sigqueue implementation, the pending signal set is kept as
before; an extra siginfo list holds additional siginfo_t data for signals.
Kernel code that uses psignal() still behaves as before: it won't fail
even under memory pressure. The only exception is when deleting a signal,
where we should call sigqueue_delete to remove the signal from the
sigqueue rather than SIGDELSET. Currently no kernel code delivers a signal
with additional data, so the kernel should be as stable as before;
a ksiginfo can carry more information, for example, allowing a signal to
be delivered but throwing away the siginfo data if memory is not enough.
SIGKILL and SIGSTOP have a fast path in sigqueue_add, because they can
not be caught or masked.
The sigqueue() syscall allows user code to queue a signal to a target
process; if the resource is unavailable, EAGAIN will be returned as the
specification says.
Just before a thread exits, its signal queue memory is freed by
sigqueue_flush.
Currently, all signals are allowed to be queued, not only realtime signals.
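Userland usage follows the POSIX interface, e.g.:

    #include <sys/types.h>
    #include <errno.h>
    #include <err.h>
    #include <signal.h>

    void
    notify(pid_t pid)
    {
            union sigval sv;

            sv.sival_int = 42;      /* payload delivered in si_value */
            if (sigqueue(pid, SIGRTMIN, sv) == -1 && errno == EAGAIN)
                    warnx("kernel signal queue is full");
    }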
Earlier patch reviewed by: jhb, deischen
Tested on: i386, amd64
notifications when LIO operations completed. These were the problems
with LIO event complete notification:
- Move all LIO/AIO event notification into one general function
so we don't have bugs in different data paths. This unification
got rid of several notification bugs, one of which could cause a
SIGILL to be sent to the process if kqueue was used.
- Change the LIO event accounting to count all AIO requests that
could have been split across the fast path and daemon mode.
The prior accounting only kept track of AIO op's in that
mode and not the entire list of operations. This could cause
a bogus LIO event complete notification to occur when all of
the fast path AIO op's completed and not the AIO op's that
ended up queued for the daemon.
Suggestions from: alc
opt_device_polling.h
- Include opt_device_polling.h into appropriate files.
- Wrap the include with HAVE_KERNEL_OPTION_HEADERS in the files that
can be compiled as loadable modules.
Reviewed by: bde
the MAC result, as well as avoid losing the DAC check result when MAC
is enabled.
MFC after: 3 days
Reported by: Patrick LeBlanc <Patrick dot LeBlanc at sparta dot com>
framework. This makes Giant protection around MAC operations which
interact with VFS conditional, based on the MPSAFE status of the file system.
Affected the following syscalls:
o __mac_get_fd
o __mac_get_file
o __mac_get_link
o __mac_set_fd
o __mac_set_file
o __mac_set_link
-Drop Giant all together in __mac_set_proc because the
mac_cred_mmapped_drop_perms_recurse routine no longer requires it.
-Move conditional Giant acquisitions to after label allocation routines.
-Move the conditional release of Giant to before label de-allocation
routines.
Discussed with: rwatson
dedicated sysctl handlers. Protect manipulations with
poll_mtx. The affected sysctls are:
- kern.polling.burst_max
- kern.polling.each_burst
- kern.polling.user_frac
- kern.polling.reg_frac
o Use CTLFLAG_RD on MIBs that are supposed to be read-only.
o u_int32t -> uint32_t
o Remove unneeded locking from poll_switch().
possible for do_execve() to call exit1() rather than returning. As a
result, the sequence "allocate memory; call kern_execve; free memory"
can end up leaking memory.
This commit documents this astonishing behaviour and adds a call to
exec_free_args() before the exit1() call in do_execve(). Since all
the users of kern_execve() in the tree use exec_free_args() to free
the command-line arguments after kern_execve() returns, this should
be safe, and it fixes the memory leak which can otherwise occur.
Submitted by: Peter Holm
MFC after: 3 days
Security: Local denial of service
calling sysctl_out_proc(). -- fix from jhb
Move the code in fill_kinfo_thread() that gathers data from struct proc
into the new function fill_kinfo_proc_only().
Change all callers of fill_kinfo_thread() to call both
fill_kinfo_proc_only() and fill_kinfo_thread(). When gathering
data from a multi-threaded process, fill_kinfo_proc_only() only needs
to be called once.
Grab sched_lock before accessing the process thread list or calling
fill_kinfo_thread().
PR: kern/84684
MFC after: 3 days
o Axe poll in trap.
o Axe IFF_POLLING flag from if_flags.
o Rework revision 1.21 (Giant removal), in such a way that
poll_mtx is not dropped during call to polling handler.
This fixes problem with idle polling.
o Make registration and deregistration from polling happen in a
functional way, instead of on the next tick/interrupt.
o Obsolete kern.polling.enable. Polling is turned on/off
with ifconfig.
Detailed kern_poll.c changes:
- Remove polling handler flags, introduced in 1.21. They are not
needed now.
- Forget and do not check if_flags, if_capenable and if_drv_flags.
- Call all registered polling handlers unconditionally.
- Do not drop poll_mtx, when entering polling handlers.
- In ether_poll() NET_LOCK_GIANT prior to locking poll_mtx.
- In netisr_poll() axe the block, where polling code asks drivers
to unregister.
- In netisr_poll() and ether_poll() do polling always, if any
handlers are present.
- In ether_poll_[de]register() remove a lot of error hiding code. Assert
that arguments are correct, instead.
- In ether_poll_[de]register() use standard return values in case of
error or success.
- Introduce poll_switch() that is a sysctl handler for kern.polling.enable.
poll_switch() goes through the interface list and enables/disables polling.
A message that kern.polling.enable is deprecated is printed.
Detailed driver changes:
- On attach driver announces IFCAP_POLLING in if_capabilities, but
not in if_capenable.
- On detach driver calls ether_poll_deregister() if polling is enabled.
- In the polling handler the driver obtains its lock and checks the
IFF_DRV_RUNNING flag. If it is not set, it unlocks and returns.
- In ioctl handler driver checks for IFCAP_POLLING flag requested to
be set or cleared. Driver first calls ether_poll_[de]register(), then
obtains driver lock and [dis/en]ables interrupts.
- In interrupt handler driver checks IFCAP_POLLING flag in if_capenable.
If present, then it returns. This is important to protect from spurious
interrupts.
Reviewed by: ru, sam, jhb
to avoid touching pageable memory while holding a mutex.
Simplify argument list replacement logic.
PR: kern/84935
Submitted by: "Antoine Pelisse" apelisse AT gmail.com (in a different form)
MFC after: 3 days
Add a new private thread flag to indicate that the thread should
not sleep if runningbufspace is too large.
Set this flag on the bufdaemon and syncer threads so that they skip
the waitrunningbufspace() call in bufwrite() rather than
checking the proc pointer vs. the known proc pointers for these two
threads. A way of preventing these threads from being starved for
I/O but still placing limits on their outstanding I/O would be
desirable.
Set this flag in ffs_copyonwrite() to prevent bufwrite() calls from
blocking on the runningbufspace check while holding snaplk. This
prevents snaplk from being held for an arbitrarily long period of
time if runningbufspace is high and greatly reduces the contention
for snaplk. The disadvantage is that ffs_copyonwrite() can start
a large amount of I/O if there are a large number of snapshots,
which could cause a deadlock in other parts of the code.
Call runningbufwakeup() in ffs_copyonwrite() to decrement runningbufspace
before attempting to grab snaplk so that I/O requests waiting on
snaplk are not counted in runningbufspace as being in-progress.
Increment runningbufspace again before actually launching the
original I/O request.
Prior to the above two changes, the system could deadlock if enough
I/O requests were blocked by snaplk to prevent runningbufspace from
falling below lorunningspace and one of the bawrite() calls in
ffs_copyonwrite() blocked in waitrunningbufspace() while holding
snaplk.
See <http://www.holm.cc/stress/log/cons143.html>
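In bufwrite() the change amounts to a flag test instead of proc pointer
comparisons; a sketch, assuming the flag is spelled TDP_NORUNNINGBUF
(the old pointer names below are illustrative):
    /* Before: */
    if (curproc != bufdaemonproc && curproc != updateproc)
        waitrunningbufspace();
    /* After: */
    if ((curthread->td_pflags & TDP_NORUNNINGBUF) == 0)
        waitrunningbufspace();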
bio may have been freed and reassigned by the wakeup before being
tested after releasing the bdonelock.
There's a non-zero chance this is the cause of a few of the crashes
knocking around with biodone() sitting in the stack backtrace.
Reviewed by: phk@
subdrivers to hook up.
It should probably be rewritten to implement a simple bus to which
the subdrivers attach using some kind of hint.
Until then, provide a couple of crutch functions with big warning
signs so it can survive the recent changes to struct resource.
osf1_signal.c:1.41, amd64/amd64/trap.c:1.291, linux_socket.c:1.60,
svr4_fcntl.c:1.36, svr4_ioctl.c:1.23, svr4_ipc.c:1.18, svr4_misc.c:1.81,
svr4_signal.c:1.34, svr4_stat.c:1.21, svr4_stream.c:1.55,
svr4_termios.c:1.13, svr4_ttold.c:1.15, svr4_util.h:1.10,
ext2_alloc.c:1.43, i386/i386/trap.c:1.279, vm86.c:1.58,
unaligned.c:1.12, imgact_elf.c:1.164, ffs_alloc.c:1.133:
Now that Giant is acquired in uprintf() and tprintf(), the caller no
longer needs to acquire Giant unless it also holds another mutex that
would generate a lock order reversal when calling into these functions.
Specifically not backed out is the acquisition of Giant in nfs_socket.c
and rpcclnt.c, where local mutexes are held and would otherwise violate
the lock order with Giant.
This aligns this code more with the eventual locking of ttys.
Suggested by: bde
of whether or not Giant was picked up by the filesystem. Add VFS_LOCK_GIANT
macros around vrele as it's possible that this can call into the VOP_INACTIVE
filesystem-specific code. Also, while we are here, remove the Giant assertion
from the sysctl handler; we do not actually require Giant there, so we
shouldn't assert it. Doing so would just complicate things when Giant is
removed from the sysctl framework.
sleep lock status while kdb_active, or we risk contending with the
mutex on another CPU, resulting in a panic when using "show
lockedvnods" while in DDB.
MFC after: 3 days
Reviewed by: jhb
Reported by: kris
the per-cpu data for all CPUs. This is easier to ask users to do than
"figure out how many CPUs you have, now run show pcpu, then run it
once for each CPU you have".
MFC after: 3 days
the vast majority of cases, these functions are called without mutexes
held, meaning that in all but two cases, there will be no ordering
issues with doing this, and it will eliminate the need for changes in
the caller. In two cases, mutexes are held, so Giant must be acquired
before those mutexes such that uprintf() and tprintf() recurse Giant
rather than generating a lock order reversal.
Suggested by: bde
remove the unconditional acquisition of Giant for extended attribute related
operations. If the file system is set as being MP safe and debug.mpsafevfs is
1, do not pick up Giant.
Mark the following system calls as being MP safe so we no longer pick up
Giant in the system call handler:
o extattrctl
o extattr_set_file
o extattr_get_file
o extattr_delete_file
o extattr_set_fd
o extattr_get_fd
o extattr_delete_fd
o extattr_set_link
o extattr_get_link
o extattr_delete_link
o extattr_list_file
o extattr_list_link
o extattr_list_fd
-Pass the MPSAFE flag to the namei(9) lookup and introduce a vfslocked
variable which will keep track of any Giant acquisitions (see the sketch
after this list).
-Wrap any fd operations which manipulate vnodes in VFS_{UN}LOCK_GIANT.
-Drop VFS_ASSERT_GIANT into functions which operate on vnodes to ensure
that we are sufficiently protected.
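A condensed sketch of the lookup pattern used by the *_file/_link
variants (error handling trimmed; path and td are the usual syscall
arguments):
    struct nameidata nd;
    int vfslocked, error;
    NDINIT(&nd, LOOKUP, MPSAFE | FOLLOW, UIO_USERSPACE, path, td);
    if ((error = namei(&nd)) != 0)
        return (error);
    vfslocked = NDHASGIANT(&nd);    /* did namei() pick up Giant? */
    NDFREE(&nd, NDF_ONLY_PNBUF);
    /* ... extattr operation on nd.ni_vp ... */
    vrele(nd.ni_vp);
    VFS_UNLOCK_GIANT(vfslocked);    /* no-op if Giant wasn't taken */
    return (error);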
I've tested these changes with various TrustedBSD MAC policies which use
extended attributes heavily on SMP and UP systems (thanks to Scott Long
for making some SMP hardware available to me for testing).
Discussed with: jeff
Requested by: jhb, rwatson
The external part is still called 'struct resource' but its contents
are now visible to drivers etc. This makes it part of the device
driver ABI, so it must not be changed lightly. A comment to this effect
is in place.
The internal part is called 'struct resource_i' and contains its
external counterpart as one field.
Move the bus_space tag+handle into the external struct resource; this
removes the need for device drivers to even know about these fields
in order to use bus_space to access hardware. (More in a following
commit.)
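The visible part now looks roughly like this sketch (compare sys/rman.h
for the authoritative layout):
    struct resource {
        struct resource_i       *__r_i;         /* internal, do not touch */
        bus_space_tag_t         r_bustag;       /* for bus_space access */
        bus_space_handle_t      r_bushandle;
    };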
and bus_free_resources(). These functions take a list of resources
and handle them all in one go. A flag makes it possible to mark
a resource as optional.
A typical device driver can save 10-30 lines of code by using these.
Usage examples will follow RSN.
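Pending those, a sketch of the intended use in a hypothetical foo(4)
driver, following the bus_alloc_resources(9) API as it exists in the
tree, where the free function is spelled bus_release_resources() and
RF_OPTIONAL marks an optional entry:
    static struct resource_spec foo_res_spec[] = {
        { SYS_RES_MEMORY, PCIR_BAR(0), RF_ACTIVE },
        { SYS_RES_IRQ,    0,           RF_ACTIVE | RF_SHAREABLE },
        { SYS_RES_IOPORT, PCIR_BAR(1), RF_ACTIVE | RF_OPTIONAL },
        { -1, 0, 0 }
    };
    static struct resource *foo_res[3];
    /* in foo_attach(): */
    if (bus_alloc_resources(dev, foo_res_spec, foo_res) != 0)
        return (ENXIO);
    /* in foo_detach(): */
    bus_release_resources(dev, foo_res_spec, foo_res);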
MFC: A good idea, eventually.
It confuses the lock manager since in some places thread0 is
then used for vnode locking while curthread is used for vnode unlocking.
Found by: Yahoo!
Reviewed by: ps@, jhb@
MFC after: 3 days
a thread holding a critical resource, e.g. a mutex or other implicit
synchronization flags. Give a thread which exceeds the nice threshold a
minimum time slice.
PR: kern/86087
NULL. The NFS client expects that a thread will always be present for a
VOP so that it can check for signal conditions, and will dereference a
NULL pointer if one isn't present.
MFC after: 3 days
nor uprintf() is believed to perform tsleep() or msleep() as written,
as ttycheckoutq() is called with '0' as its sleep argument.
Remove recently added WITNESS warnings for sleep as the comment was
incorrect. This should silence a warning from the nfs_timer() code.
Discussed with: bde
Give DEVFS a proper inode called struct cdev_priv. It is important
to keep in mind that this "inode" is shared between all DEVFS
mountpoints, therefore it is protected by the global device mutex.
Link the cdev_priv's into a list, protected by the global device
mutex. Keep track of each cdev_priv's state with a flag bit and
of references from mountpoints with a dedicated usecount.
Reap the benefits of the much-improved kernel memory allocator and the
generally better-defined device driver APIs to get rid of the tables
of pointers + serial numbers, their overflow tables, the atomics
to muck about in them, and all the trouble that resulted.
This makes RAM the only limit on how many devices we can have.
The cdev_priv is actually a super struct containing the normal cdev
as the "public" part, and therefore allocation and freeing has moved
to devfs_devs.c from kern_conf.c.
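Schematically (field names illustrative; the real layout lives in
fs/devfs/devfs_int.h):
    struct cdev_priv {
        struct cdev             cdp_c;      /* public part */
        TAILQ_ENTRY(cdev_priv)  cdp_list;   /* global list, dev mtx */
        u_int                   cdp_inode;
        u_int                   cdp_flags;
        u_int                   cdp_inuse;  /* mountpoint references */
        /* ... */
    };
    /* Recover the private pointer from the public one: */
    #define cdev2priv(c) \
        ((struct cdev_priv *)((char *)(c) - \
        offsetof(struct cdev_priv, cdp_c)))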
The overall responsibility is (to be) split such that kern/kern_conf.c
deals with drivers and struct cdev, fs/devfs handles filesystems and
struct cdev_priv, and their private liaison is exposed only in
devfs_int.h.
Move the inode number from cdev to cdev_priv and allocate inode
numbers properly with unr. Local dirents in the mountpoints
(directories, symlinks) allocate inodes from the same pool to
guarantee against overlaps.
Various other fields are going to migrate from cdev to cdev_priv
in the future in order to hide them. A few fields may migrate
from devfs_dirent to cdev_priv as well.
Protect the DEVFS mountpoint with an sx lock instead of lockmgr,
this lock also protects the directory tree of the mountpoint.
Give each mountpoint a unique integer index, allocated with unr.
Use it to index an array of devfs_dirent pointers in each cdev_priv.
Initially the array points to a single element also inside cdev_priv,
but as more devfs instances are mounted, the array is extended with
malloc(9) as necessary when the filesystem populates its directory
tree.
Retire the cdev alias lists; the cdev_priv now knows about all the
relevant devfs_dirents (and their vnodes) and devfs_revoke() will
pick them up from there. We still spelunk into other mountpoints
and fondle their data without 100% good locking. It may make better
sense to vector the revoke event into the tty code and there do a
destroy_dev/make_dev on the tty's devices, but that's for further
study.
Lots of shuffling of stuff and churn of bits for no good reason[2].
XXX: There is still nothing preventing the dev_clone EVENTHANDLER
from being invoked at the same time in two devfs mountpoints. It
is not obvious what the best course of action is here.
XXX: comment out an if statement that lost its body, until I can
find out what should go there so it doesn't do damage in the meantime.
XXX: Leave in a few extra malloc types and KASSERTS to help track
down any remaining issues.
Much testing provided by: Kris
Much confusion caused by (races in): md(4)
[1] You are not supposed to understand anything past this point.
[2] This line should simplify life for the peanut gallery.
as they both interact with the tty code (!MPSAFE) and may sleep if the
tty buffer is full (per comment).
Modify all consumers of uprintf() and tprintf() to hold Giant around
calls into these functions. In most cases, this means adding an
acquisition of Giant immediately around the function. In some cases
(nfs_timer()), it means acquiring Giant higher up in the callout.
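In most callers the change is mechanical, e.g. (message text
illustrative):
    mtx_lock(&Giant);
    uprintf("\n%s: write failed, filesystem is full\n", fs->fs_fsmnt);
    mtx_unlock(&Giant);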
With these changes, UFS no longer panics on SMP when either blocks are
exhausted or inodes are exhausted under load due to races in the tty
code when running without Giant.
NB: Some reduction in calls to uprintf() in the svr4 code is probably
desirable.
NB: In the case of nfs_timer(), calling uprintf() while holding a mutex,
or even in a callout at all, is a bad idea, and will generate warnings
and potential upset. This needs to be fixed, but was a problem before
this change.
NB: uprintf()/tprintf() sleeping is generally a bad idea, as is having
non-MPSAFE tty code.
MFC after: 1 week
provided access to the root file system before the start of the
init process. This was used briefly by SEBSD before it knew about
preloading data in the loader, and using that method to gain
access to data earlier results in fewer inconsistencies in the
approach. Policy modules still have access to the root file system
creation event through the mac_create_mount() entry point.
Removed now, and will be removed from RELENG_6, in order to avoid
third party policy dependencies on the entry point for the lifetime
of the 6.x branch.
MFC after: 3 days
Submitted by: Chris Vance <Christopher dot Vance at SPARTA dot com>
Sponsored by: SPARTA
so that UUIDs can be generated from within the kernel. The uuidgen(2)
syscall now allocates kernel memory, calls the generator, and does a
copyout() for the whole UUID store. This change is in support of GPT.
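Kernel consumers can now generate UUIDs directly; a minimal sketch,
assuming the in-kernel entry point is named kern_uuidgen() as in
today's sys/kern/kern_uuid.c:
    struct uuid uuid;
    kern_uuidgen(&uuid, 1);     /* generate a single UUID in-kernel */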
and other applications to query the state of the stack regarding the
accept queue on a listen socket:
SO_LISTENQLIMIT Return the value of so_qlimit (socket backlog)
SO_LISTENQLEN Return the value of so_qlen (complete sockets)
SO_LISTENINCQLEN Return the value of so_incqlen (incomplete sockets)
Minor white space tweaks to existing socket options to make them
consistent.
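A userland sketch querying the new options (show_listen_queues() is a
hypothetical helper; s must be a listening socket):
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <err.h>
    #include <stdio.h>
    static void
    show_listen_queues(int s)
    {
        int qlimit, qlen, incqlen;
        socklen_t len;
        len = sizeof(qlimit);
        if (getsockopt(s, SOL_SOCKET, SO_LISTENQLIMIT, &qlimit, &len) == -1)
            err(1, "SO_LISTENQLIMIT");
        len = sizeof(qlen);
        if (getsockopt(s, SOL_SOCKET, SO_LISTENQLEN, &qlen, &len) == -1)
            err(1, "SO_LISTENQLEN");
        len = sizeof(incqlen);
        if (getsockopt(s, SOL_SOCKET, SO_LISTENINCQLEN, &incqlen, &len) == -1)
            err(1, "SO_LISTENINCQLEN");
        printf("backlog %d, complete %d, incomplete %d\n",
            qlimit, qlen, incqlen);
    }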
Discussed with: andre
MFC after: 1 week
quite a bit of reading to figure it out, and I want to avoid figuring
it out again.
Convert an if (foo) else printf("this is almost a panic") into a
KASSERT.
MFC after: 3 days
unconditional acquisition of Giant for ACL related operations. If the file
system is set as being MP safe and debug.mpsafevfs is 1, do not pick up
Giant.
For any operations which require namei(9) lookups:
__acl_get_file
__acl_get_link
__acl_set_file
__acl_set_link
__acl_delete_file
__acl_delete_link
__acl_aclcheck_file
__acl_aclcheck_link
-Set the MPSAFE flag in NDINIT
-Initialize the vfslocked variable using the NDHASGIANT macro
For functions which operate on fds, make sure the operations are locked:
__acl_get_fd
__acl_set_fd
__acl_delete_fd
__acl_aclcheck_fd
-Initialize vfslocked using VFS_LOCK_GIANT before we manipulate the vnode
Discussed with: jeff
any other non-sleepable lock. In plain English: Giant comes before all
other mutexes.
- Add some extra description to the lock order reversal printf's to indicate
when a reversal is triggered by a hard-coded implicit rule.
Requested by: truckman (2)
MFC after: 1 week
state where sleeping on a sleep queue is not allowed. The facility
doesn't support recursion but uses a simple private per-thread flag
(TDP_NOSLEEPING). The sleepq_add() function will panic if the flag is
set and INVARIANTS is enabled.
- Use this new facility to replace the g_xup and g_xdown mutexes that were
(ab)used to achieve similar behavior.
- Disallow sleeping in interrupt threads when invoking interrupt handlers.
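Schematically (the kernel wraps the flag manipulation in helper macros;
the direct flag twiddling below is for illustration only):
    /* Enter the no-sleeping state around an interrupt handler: */
    curthread->td_pflags |= TDP_NOSLEEPING;
    ih->ih_handler(ih->ih_argument);
    curthread->td_pflags &= ~TDP_NOSLEEPING;
    /* And in sleepq_add(), under INVARIANTS: */
    KASSERT((td->td_pflags & TDP_NOSLEEPING) == 0,
        ("trying to sleep while in a no-sleep section"));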
MFC after: 1 week
Reviewed by: phk
links and the execution of ELF binaries. Two problems were found:
1) The link path wasn't tagged as being MP safe and thus was not properly
protected.
2) The ELF interpreter vnode wasn't being locked in namei(9) and thus was
insufficiently protected.
This commit makes the following changes:
-Sets the MPSAFE flag in NDINIT for symbolic link paths
-Sets the MPSAFE flag in NDINIT and introduces a vfslocked variable which
will be used to instruct VFS_UNLOCK_GIANT to unlock Giant if it has been
picked up.
-Drops an assertion into vfs_lookup which ensures that if the MPSAFE
flag is NOT set, we have picked up Giant; if not, panic (if WITNESS is
compiled into the kernel). This should help us find conditions where vnode
operations are insufficiently protected.
This is a RELENG_6 candidate.
Discussed with: jeff
MFC after: 4 days
shutdown procedures (which have a duration of more than 120 seconds).
We have two user-space affecting shutdown timeouts: a "soft" one in
/etc/rc.shutdown and a "hard" one in init(8). The first one can be
configured via /etc/rc.conf variable "rcshutdown_timeout" and defaults
to 30 seconds. The second one was originally (in 1998) intended to be
configured via sysctl(8) variable "kern.shutdown_timeout" and defaults
to 120 seconds.
Unfortunately, the "kern.shutdown_timeout" was declared "unused" in 1999
(as it is indeed not used within the kernel itself) and hence was
intentionally but misleadingly removed in revision 1.107 from
init_main.c. Kernel sysctl(8) variables are certainly the wrong way to
control user-space processes in general, but in this particular case the
sysctl(8) variable should have remained as it supports init(8), which
isn't passed command line flags (which in turn could have been set via
/etc/rc.conf), etc.
As there is already a similar "kern.init_path" sysctl(8) variable which
directly affects init(8), resurrect the init(8) shutdown timeout under
sysctl(8) variable "kern.init_shutdown_timeout". But this time document
it as being intentionally unused within the kernel and used by init(8).
Also document it in the manpages init(8) and rc.conf(5).
Reviewed by: phk
MFC after: 2 weeks
struct bufs that are persistently held by ext2fs. Ignore any buffers
with this flag in the code in boot() that counts "busy" and dirty
buffers and attempts to sync the dirty buffers, which is done before
attempting to unmount all the file systems during shutdown.
This fixes the problem caused by any ext2fs file systems that are
mounted at system shutdown time, which caused boot() to give up on
a non-zero number of buffers and skip the call to vfs_unmountall().
This left all the mounted file systems in a dirty state and caused
them to all require cleanup by fsck on reboot.
Move the two separate copies of the "busy" buffer test in boot()
to a separate function.
Nuke the useless spl() stuff in the ext2fs ULCK_BUF() macro.
Bring the PRINT_BUF_FLAGS definition in sys/buf.h up to date with
this and previous flag changes.
PR: kern/56675, kern/85163
Tested by: "Matthias Andree" matthias.andree at gmx.de
Reviewed by: bde
MFC after: 3 days
Also introduce an aclinit function which will be used to create the UMA zone
for use by file systems at system start up.
MFC after: 1 month
Discussed with: rwatson
instead. Detailed changelist:
o Add a flags field to struct pollrec, to indicate that
a particular entry is being worked on.
o Define a macro PR_VALID() to check that a pollrec
is valid and pollable.
o Mark ISRs as mpsafe.
o ether_poll()
- Acquire poll_mtx while traversing the pollrec array.
- Skip pollrecs that are being worked on.
- Conditionally acquire Giant when entering a handler (see the
sketch after this changelist).
o netisr_pollmore()
- Conditionally assert Giant.
- Acquire poll_mtx while working with statistics.
o netisr_poll()
- Conditionally assert Giant.
- Acquire poll_mtx while working with statistics
and traversing pollrec array.
o ether_poll_register(), ether_poll_deregister()
- Conditionally assert Giant.
- Acquire poll_mtx while working with pollrec array.
o poll_idle()
- Remove all strange manipulations with Giant.
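A condensed sketch of the resulting handler loop (the busy-flag name
PRF_RUNNING is illustrative; NET_LOCK_GIANT() acquires Giant only when
the network stack is not marked MPSAFE):
    mtx_lock(&poll_mtx);
    for (i = 0; i < poll_handlers; i++) {
        if (!PR_VALID(i))
            continue;
        pr[i].flags |= PRF_RUNNING;     /* others will skip this entry */
        mtx_unlock(&poll_mtx);
        NET_LOCK_GIANT();               /* conditional Giant */
        pr[i].handler(pr[i].ifp, POLL_ONLY, count);
        NET_UNLOCK_GIANT();
        mtx_lock(&poll_mtx);
        pr[i].flags &= ~PRF_RUNNING;
    }
    mtx_unlock(&poll_mtx);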
In collaboration with: ru, pjd
In collaboration with: Oleg Bulyzhin <oleg rinet.ru>
In collaboration with: dima <_pppp mail.ru>
remaining % arguments because the varargs are now out of sync and
there is a risk that we might for instance dereference an integer
in a %s argument.
Sponsored by: Napatech.com
link proctree and allproc to Giant since that order is already implicitly
enforced.
- Use a goto to handle the case where we want to enforce a reversal before
calling isitmydescendant() in witness_checkorder() so that the logic is
easier to follow and so that it is easier to add more forced-reversal
cases in the future.
MFC after: 3 days
mutex.
- Don't panic if a spin lock is held too long inside _mtx_lock_spin() if
panicstr is set (meaning that we are already in a panic). Just keep
spinning forever instead.
o for() instead of while() when looping over the mbuf chain
o parens around all flag checks
o more verbose function and purpose description
o some more style changes
Based on feedback from: sam
m_demote(m->m_next) if they wish to start at the second mbuf in the chain.
o Test m_type with == instead of &.
o Check m_nextpkt against NULL instead of implicit 0.
Based on feedback from: sam
1. Walk the absolute list in reverse to prefer duplicated levels that have
a lower absolute setting, i.e. 800 MHz/50% is better than 1600 MHz/25% even
though both have the same actual frequency. This also removes the need to
check for already-modified levels since, by definition, those will be added
later in the sorted list.
2. Compare the absolute settings for derived levels and don't use the new
level if it's higher. For example, a level of 800 MHz/75% is preferable to
1600 MHz/25% even though the latter has a lower total frequency.
This work is based on a patch from the submitter but reworked by myself.
Submitted by: Tijl Coosemans (tijl/ulyssis.org)
int prep, int how).
Copies the data portion of mbuf (chain) n starting from offset off
for length len to mbuf (chain) m. Depending on prep the copied
data will be appended or prepended. The function ensures that the
mbuf (chain) m will be fully writeable by making real (not refcnt)
copies of mbuf clusters. For the prepending the function returns
a pointer to the new start of mbuf chain m and leaves as much
leading space as possible in the new first mbuf.
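A usage sketch; the function name is cut off above, so m_copymdata()
below is an assumption based on the signature shown:
    struct mbuf *m2;
    /* Append the entire data portion of chain n to chain m. */
    m2 = m_copymdata(m, n, 0, n->m_pkthdr.len, 0 /* append */, M_DONTWAIT);
    if (m2 == NULL)
        return (ENOBUFS);
    m = m2;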
Reviewed by: glebius
checking on mbufs and mbuf chains. Set sanitize to 1 to garble
illegal things and have them blow up later when used/accessed.
m_sanity()'s main purpose is for KASSERT()s and debugging of non-
kosher mbuf manipulation (of which we have a number).
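Typical use, per the stated purpose:
    KASSERT(m_sanity(m, 0),     /* sanitize=0: check, don't garble */
        ("%s: mbuf chain failed sanity check", __func__));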
Reviewed by: glebius
any tags and packet headers. If "all" is set then the first mbuf
in the chain will be cleaned too.
This function is used before an mbuf that arrived as a packet with
m->flags & M_PKTHDR set is appended to an mbuf chain using m->m_next
(not m->m_nextpkt).
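A sketch of that use (m_last() fetches the tail of chain m):
    /* n arrived as a packet (M_PKTHDR set); strip its tags and
     * packet header before linking it onto m via m_next. */
    m_demote(n, 1);             /* all=1: clean the first mbuf too */
    m_last(m)->m_next = n;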
Reviewed by: glebius
but vm_map_wire() fails, then a vm object, vm map entries, and kernel_map
free space are leaked and (2) unwiring is handled automatically by
vm_map_remove().
Suggested by: tegge