maxsockets limit, not maxfiles limit. The question remains why those
limits are handled differently (with error code for maxfiles but with
sleep for maxsockets), but that will be addressed in a separate commit
if necessary.
Requested by: rwatson, jeff
before doing the very expensive cursig() and related locking. NEEDSIGCHK
is updated whenever our signal mask changes or when a signal is delivered, and
should be sufficient to avoid the more expensive tests. This eliminates
another source of PROC_LOCK contention in multithreaded programs.
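A minimal sketch of the fast path this enables (the helper name and surrounding code are illustrative assumptions, not the committed code):

    #include <sys/param.h>
    #include <sys/proc.h>

    /*
     * Cheap, unlocked test: TDF_NEEDSIGCHK is set whenever the signal
     * mask changes or a signal is delivered, so when it is clear the
     * caller can skip PROC_LOCK() and cursig() entirely.
     */
    static int
    sleepq_sigchk_needed(struct thread *td)
    {

        return ((td->td_flags & TDF_NEEDSIGCHK) != 0);
    }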
- In the last revision the code was changed to use maxfilesperproc rather than
the per-process file limit to restrict the size of the poll array. This
eliminates a significant source of process lock contention in multithreaded
programs and is cheaper. This had been committed with the wrong batch of
changes.
a simple (wmesg, count) tuple in a hash to keep track of how many times
we sleep at each wait message. We hash on message and not channel. No
line number information is given as typically wait messages are not used in
more than one place. Identical strings defined at different addresses will
show up with separate counters.
- Use debug.sleepq.enable to enable, .reset to reset, and .stats to dump stats.
- Do an unsynchronized check in sleepq_switch() prior to switching, before
  calling sleepq_profile(), which uses a global lock to synchronize the hash.
Only sleeps which actually cause a context switch are counted.
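A sketch of the per-wmesg counter such a hash might keep (field and type names are assumptions, not the committed layout):

    #include <sys/queue.h>

    struct sleepq_prof {
        LIST_ENTRY(sleepq_prof) sp_link;    /* hash chain, keyed on wmesg */
        const char             *sp_wmesg;   /* wait message string */
        long                    sp_count;   /* sleeps that context switched */
    };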
1.38 in 2001. Break out of the FOREACH_THREAD_IN_PROC loop when we've
discovered a new proc in the chain.
- Increment i and check for maxlockdepth once per matching process, not
  once per thread. Previously this did not properly terminate the loop.
- Fix a bug which has existed potentially since rev 1.1. waitblock->lf_next
can be NULL when a thread has been woken up but not yet scheduled. Check
for this condition rather than blindly dereferencing.
Found by: libMicro
requiring the per-process spinlock to only requiring the process lock.
- Reflect these changes in the proc.h documentation and consumers throughout
the kernel. This is a substantial reduction in locking cost for these
fields and was made possible by recent changes to threading support.
- Always include the ie_disable and ie_eoi methods in 'struct intr_event'
and collapse down to one intr_event_create() routine. The disable and
eoi hooks simply aren't used currently in the !INTR_FILTER case.
- Expand 'disab' to 'disable' in a few places.
- Use function casts for arm and i386:intr_eoi_src() instead of wrapper
routines to trim one extra indirection.
Compiled on: {arm,amd64,i386,ia64,ppc,sparc64} x {FILTER, !FILTER}
Tested on: {amd64,i386} x {FILTER, !FILTER}
drivers.
In the giant_XXX wrappers for the device methods of D_NEEDGIANT
drivers, do not dereference cdev->si_devsw; it races with
destroy_devl() clearing si_devsw. Instead, use dev_refthread() and
return ENXIO for a destroyed device. [1]
The check for D_INIT in prep_cdevsw() was not synchronized with the
call to fini_cdevsw() in destroy_devl(), which, under rapid device
creation/destruction, may result in the use of an uninitialized cdevsw [2].
Change the protocol for the prep_cdevsw(), requiring it to be called
under dev_mtx, where the check for D_INIT is done.
Do not free the memory allocated for the gianttrick cdevsw while holding
the dev_mtx, put it into the free list to be freed later. Reuse the
d_gianttrick pointer to keep the size and layout of the struct cdevsw
(requested by phk). Free the memory in dev_unlock_and_free(), and do
all the freeing after the dev_mtx is dropped (suggested by jhb).
Reported by: bsdimp + many [1], pho [2]
Reviewed by: phk, jhb
Tested by: pho
MFC after: 1 week
uidinfo structure. This entirely removes contention observed on the
ui_mtxp mutex (as it is now gone).
- Convert the uihashtbl_mtx mutex to a rwlock, as most of the time we just
need to read-lock it.
Reviewed by: jhb, jeff, kris & others
Tested by: kris
a jail, etc. by simply calling setpriority(PRIO_PROCESS, <PID>, 0) and
checking the return value: 0 means that the process exists and -1 that
it doesn't exist.
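A minimal userland sketch of the probe described above (the helper name is arbitrary):

    #include <sys/types.h>
    #include <sys/time.h>
    #include <sys/resource.h>
    #include <stdbool.h>

    /* Returns true when the kernel reports that the target process exists. */
    static bool
    pid_visible(pid_t pid)
    {
        return (setpriority(PRIO_PROCESS, pid, 0) == 0);
    }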
Reviewed by: rwatson
MFC after: 1 week
after each SYSINIT() macro invocation. This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.
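For illustration, a hypothetical SYSINIT() user with the trailing semicolon these parsers expect (the names and subsystem are arbitrary):

    #include <sys/param.h>
    #include <sys/kernel.h>
    #include <sys/systm.h>

    static void
    example_init(void *arg __unused)
    {
        printf("example initialized\n");
    }
    /* Note the explicit semicolon after the macro invocation. */
    SYSINIT(example, SI_SUB_DRIVERS, SI_ORDER_ANY, example_init, NULL);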
MFC after: 1 month
Discussed with: imp, rink
Otherwise the parameter is a no-op, since the zone by default limits the
number of descriptors to some 12K entries. Attempts to allocate more end
up sleeping on zonelimit.
MFC after: 2 weeks
- Add a new intr_event method ie_assign_cpu() that is invoked when the MI
code wishes to bind an interrupt source to an individual CPU. The MD
code may reject the binding with an error. If an assign_cpu function
is not provided, then the kernel assumes the platform does not support
binding interrupts to CPUs and fails all requests to do so.
- Bind ithreads to CPUs on their next execution loop once an interrupt
event is bound to a CPU. Only shared ithreads are bound. We currently
leave private ithreads for drivers using filters + ithreads in the
INTR_FILTER case unbound.
- A new intr_event_bind() routine is used to bind an interrupt event to
a CPU.
- Implement binding on amd64 and i386 by way of the existing pic_assign_cpu
PIC method.
- For x86, provide an 'intr_bind(IRQ, cpu)' wrapper routine that looks up
an interrupt source and binds its interrupt event to the specified CPU.
MI code can currently (ab)use this by doing:
intr_bind(rman_get_start(irq_res), cpu);
however, I plan to add a truly MI interface (probably a bus_bind_intr(9))
where the implementation in the x86 nexus(4) driver would end up calling
intr_bind() internally.
Requested by: kmacy, gallatin, jeff
Tested on: {amd64, i386} x {regular, INTR_FILTER}
- Close a sleepqueue signal race by interlocking with the per-process
spinlock. This was mistakenly omitted from the thread_lock patch and
has been a race ever since.
MFC after: 1 week
PR: bin/117603
Reported by: Danny Braniss <danny@cs.huji.ac.il>
While the KSE project was quite successful in bringing threading to
FreeBSD, the M:N approach taken by the kse library was never developed
to its full potential. Backwards compatibility will be provided via
libmap.conf for dynamically linked binaries, and static binaries will
be broken.
sched_sleep(). This removes extra thread_lock() acquisition and
allows the scheduler to decide what to do with the static boost.
- Change the priority arguments to cv_* to match sleepq/msleep/etc.
where 0 means no priority change. Catch -1 in cv_broadcastpri() and
convert it to 0 for now.
- Set a flag when sleeping in a way that is compatible with swapping
since direct priority comparisons are meaningless now.
- Add a sysctl to ule, kern.sched.static_boost, that defaults to on which
controls the boost behavior. Turning it off gives better performance
in some workloads but needs more investigation.
- While we're modifying sleepq, change signal and broadcast to both
return with the lock held as the lock was held on enter.
Reviewed by: jhb, peter
tdq_runq_add to select the runq rather than hoping we set it properly
when we adjusted the priority. This involves the same number of
branches as before so should perform identically without the extra
fragility.
Tested by: bz
Reviewed by: bz
- Only calculate timeshare priorities once per tick or when a thread is woken
from sleeping.
- Keep the ts_runq pointer valid after all priority changes.
- Call tdq_runq_add() directly from sched_switch() without passing in via
tdq_add(). We don't need to adjust loads or runqs anymore.
- Sort tdq and ts_sched according to utilization to improve cache behavior.
Sponsored by: Nokia
- Normalize the preemption/ipi setting code by introducing sched_shouldpreempt()
so the logic is identical and not repeated between tdq_notify() and
sched_setpreempt().
- In tdq_notify() don't set NEEDRESCHED, as we may not actually own the thread
  lock; this could have caused us to lose td_flags settings.
- Garbage collect some tunables that are no longer relevant.
know if it has siblings that need an actual probe. Introduce a special
return value called BUS_PROBE_NOWILDCARD. If the driver returns
this, the probe is only successful for devices that have had a
specific devclass set for them.
Reviewed by: current@, jhb@, grehan@
Solaris and AIX.
fcntl(fd, F_DUP2FD, arg) and dup2(fd, arg) are functionally equivalent.
Document it.
Add some regression tests (identical to the dup2(2) regression tests).
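A small sketch of the documented equivalence (the wrapper function is hypothetical):

    #include <fcntl.h>
    #include <unistd.h>

    /* Both calls make newfd refer to the same open file as oldfd. */
    static int
    dup_to(int oldfd, int newfd)
    {
        return (fcntl(oldfd, F_DUP2FD, newfd));  /* same as dup2(oldfd, newfd) */
    }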
PR: 120233
Submitted by: Jukka Ukkonen
Approved by: rwatson (mentor)
MFC after: 1 month
process lock leading to a hang. This bug was introduced in
kern_sig.c:1.351, when the call to expand_name() was moved earlier
but this particular error case was not updated.
passing it to cpuset_which(). Pass in 'set' instead. This argument
is not used, but for convenience cpuset_which() nulls all incoming
parameters.
Submitted by: davidxu
mask none of the upper bits are set.
- Be more careful about enforcing the boundaries of masks and child sets.
- Introduce a few more CPU_* macros for implementing these tests.
- Change the cpusetsize argument to be bytes rather than bits to match
other APIs.
Sponsored by: Nokia
The PQ3 is a high performance integrated communications processing system
based on the e500 core, which is an embedded RISC processor that implements
the 32-bit Book E definition of the PowerPC architecture. For details refer
to: http://www.freescale.com/webapp/sps/site/prod_summary.jsp?code=MPC8555E
This port was tested and successfully run on the following members of the PQ3
family: MPC8533, MPC8541, MPC8548, MPC8555.
The following major integrated peripherals are supported:
* On-chip peripherals bus
* OpenPIC interrupt controller
* UART
* Ethernet (TSEC)
* Host/PCI bridge
* QUICC engine (SCC functionality)
This commit brings the main functionality and will be followed by individual
drivers that are logically separate from this base.
Approved by: cognet (mentor)
Obtained from: Juniper, Semihalf
MFp4: e500
- When searching for affinity, search backwards in the tree from the last
  cpu we ran on, while the thread still has affinity for the group. This
can take advantage of knowledge of shared L2 or L3 caches among a
group of cores.
- When searching for the least loaded cpu, find it via
  the least loaded path through the tree. This load balances system bus
links, individual cache levels, and hyper-threaded/SMT cores.
- Make the periodic balancer recursively balance the highest and lowest
loaded cpu across each link.
Add support for cpusets:
- Convert the cpuset to a simple native cpumask_t while the kernel still
only supports cpumask.
- Pass the derived cpumask down through the cpu_search functions to
restrict the result cpus.
- Make the various steal functions resilient to failure since all threads
  can no longer run on all cpus.
General improvements:
- Precisely track the lowest priority thread on every runq with
tdq_setlowpri(). Before, it was more advisory, but this ended up having
pathological behaviors.
- Remove many #ifdef SMP conditions to simplify the code.
- Get rid of the old cumbersome tdq_group. This is more naturally
expressed via the cpu_group tree.
Sponsored by: Nokia
Testing by: kris
tree structure that encodes the level of cache sharing and other
properties.
- Provide several convenience functions for creating one and two level
cpu trees as well as a default flat topology. The system now always
has some topology.
- On i386 and amd64 create a separate level in the hierarchy for HTT
and multi-core cpus. This will allow the scheduler to intelligently
load balance non-uniform cores. Presently we don't detect what level
of the cache hierarchy is shared at each level in the topology.
- Add a mechanism for testing common topologies that have more information
than the MD code is able to provide via the kern.smp.topology tunable.
This should be considered a debugging tool only and not a stable api.
Sponsored by: Nokia
and assignment.
- Add a reference to a struct cpuset in each thread that is inherited from
the thread that created it.
- Release the reference when the thread is destroyed.
- Add prototypes for syscalls and macros for manipulating cpusets in
sys/cpuset.h
- Add syscalls to create, get, and set new numbered cpusets:
cpuset(), cpuset_{get,set}id()
- Add syscalls for getting and setting affinity masks for cpusets or
  individual threads: cpuset_{get,set}affinity() (see the usage sketch below)
- Add types for the 'level' and 'which' parameters for the cpuset. This
will permit expansion of the api to cover cpu masks for other objects
identifiable with an id_t integer. For example, IRQs and Jails may be
coming soon.
- The root set 0 contains all valid cpus. All threads initially belong to
cpuset 1. This permits migrating all threads off of certain cpus to
reserve them for special applications.
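A userland sketch of the new interfaces (assuming the prototypes described above; error handling is minimal and the helper is hypothetical):

    #include <sys/param.h>
    #include <sys/cpuset.h>
    #include <stdio.h>

    static void
    show_cpuset(void)
    {
        cpusetid_t setid;
        cpuset_t mask;
        int cpu;

        /* Which numbered set does the current process belong to? */
        if (cpuset_getid(CPU_LEVEL_CPUSET, CPU_WHICH_PID, -1, &setid) == 0)
            printf("cpuset id: %d\n", (int)setid);
        /* Which cpus may the current thread run on? */
        if (cpuset_getaffinity(CPU_LEVEL_WHICH, CPU_WHICH_TID, -1,
            sizeof(mask), &mask) == 0)
            for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
                if (CPU_ISSET(cpu, &mask))
                    printf("may run on cpu %d\n", cpu);
    }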
Sponsored by: Nokia
Discussed with: arch, rwatson, brooks, davidxu, deischen
Reviewed by: antoine
than rely on the lockmgr support [1]:
* bump the waiters only if the interlock is held
* let brelvp() return the waiters count
* rely on brelvp() rather than BUF_LOCKWAITERS() in order to check
the waiters count
- Remove a namespace pollution introduced recently with lockmgr.h
including lock.h by including lock.h directly in the consumers and
making it mandatory for using lockmgr.
- Modify the flags accepted by lockinit() (see the sketch after this list):
* introduce LK_NOPROFILE which disables lock profiling for the
specified lockmgr
* introduce LK_QUIET which disables ktr tracing for the specified
lockmgr [2]
* disallow LK_SLEEPFAIL and LK_NOWAIT to be passed there so that they
can only be used on a per-instance basis
- Remove BUF_LOCKWAITERS() and lockwaiters() as they are no longer
used
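A sketch of a consumer using the new per-instance flags (the lock name and priority are illustrative):

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/lockmgr.h>
    #include <sys/priority.h>

    static struct lock example_lock;

    static void
    example_lock_init(void)
    {
        /* Opt this particular lock out of profiling and ktr tracing. */
        lockinit(&example_lock, PVFS, "example", 0, LK_NOPROFILE | LK_QUIET);
    }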
This patch breaks the KPI, so __FreeBSD_version will be bumped and manpages
updated by further commits. Additionally, the 'struct buf' changes result
in a disturbed ABI as well.
[2] Really, currently there is no ktr tracing in the lockmgr, but it
will be added soon.
[1] Submitted by: kib
Tested by: pho, Andrea Barberio <insomniac at slackware dot it>
the vnode interlock is not held. vn_printf() already correctly handles
locked and unlocked vnode interlocks, and all the in-tree vop_print
methods are interlock-agnostic.
Some code calls vprintf() with the vnode interlock held, which causes
unjustified panics with INVARIANTS (ffs_syncvnode(), for example).
Reported by: Peter Holm
always curthread.
As KPI gets broken by this patch, manpages and __FreeBSD_version will be
updated by further commits.
Tested by: Andrea Barberio <insomniac at slackware dot it>
namespace in order to handle lockmgr fields in a controlled way instead
of spreading bogus stubs all around:
- VN_LOCK_AREC() allows lock recursion for a specified vnode
- VN_LOCK_ASHARE() allows lock sharing for a specified vnode
In FFS land:
- BUF_AREC() allows lock recursion for a specified buffer lock
- BUF_NOREC() disallows recursion for a specified buffer lock
Side note: union_subr.c::unionfs_node_update() is the only other function
directly handling lockmgr fields. As this is not simple to fix, it has
been left behind as the "sole" exception.
the provided trailers. This has been broken since revision 1.240.
Submitted by: Dan Nelson
PR: kern/120948
"sounds ok to me" from: phk
MFC after: 3 days
consists of the null-terminated name and the contents of any structure
you wish to record. A new ktrstruct() function constructs and emits a
KTR_STRUCT record. It is accompanied by convenience macros for struct
stat and struct sockaddr.
In kdump(1), KTR_STRUCT records are handled by a dispatcher function
that runs stringent sanity checks on its contents before handing it
over to individual decoding functions for each type of structure.
Currently supported structures are struct stat and struct sockaddr for
the AF_INET, AF_INET6 and AF_UNIX families; support for AF_APPLETALK
and AF_IPX is present but disabled, as I am unable to test it properly.
Since 's' was already taken, the letter 't' is used by ktrace(1) to
enable KTR_STRUCT trace points, and in kdump(1) to enable their
decoding.
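The convenience macros might look roughly like this (a sketch; the exact definitions in <sys/ktrace.h> may differ):

    #define ktrstat(s) \
        ktrstruct("stat", (s), sizeof(struct stat))
    #define ktrsockaddr(s) \
        ktrstruct("sockaddr", (s), ((struct sockaddr *)(s))->sa_len)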
Derived from patches by Andrew Li <andrew2.li@citi.com>.
PR: kern/117836
MFC after: 3 weeks
the same operation as lockmgr() but accepting a custom wmesg, prio and
timo for the particular lock instance, overriding default values
lkp->lk_wmesg, lkp->lk_prio and lkp->lk_timo.
- Use lockmgr_args() in order to implement BUF_TIMELOCK()
- Cleanup BUF_LOCK()
- Remove LK_INTERNAL as it is no longer used in the lockmgr namespace
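A sketch of a lockmgr_args() call with per-call overrides (the argument order is assumed from the description above; names and values are illustrative):

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/lockmgr.h>
    #include <sys/priority.h>

    static struct lock example_lock;

    static int
    example_acquire(void)
    {
        /* Per-call wmesg, prio and timo instead of the lockinit() defaults. */
        return (lockmgr_args(&example_lock, LK_EXCLUSIVE, NULL,
            "examplewait", PRIBIO, 5 * hz));
    }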
Tested by: Andrea Barberio <insomniac at slackware dot it>
is to be requested via a "ro" option. At the same time, MNT_RDONLY
is gradually becoming an indicator of the current state of the FS
instead of a command flag. Today passing MNT_RDONLY alone to the
kernel's mount machinery will lead to various glitches. (See the
PRs for examples.)
Therefore mount the root FS with a "ro" option instead of the
MNT_RDONLY flag. (Note that MNT_RDONLY still is added to the mount
flags internally, by vfs_donmount(), if "ro" was specified.)
To be able to pass "ro" cleanly to kernel_vmount(), teach the latter
function to accept options with NULL values.
Also correct the comment explaining how mount_arg() handles a length
of -1.
PR: bin/106636 kern/120319
Submitted by: Jaakko Heinonen <see PR kern/120319 for email> (originally)
modules using invalid ABI versions (e.g. a 7.x module with an 8.x kernel)
for a given kernel:
- Add a 'kernel' module version whose value is __FreeBSD_version.
- Add a version dependency on 'kernel' in every module that has an
acceptable version range of __FreeBSD_version up to the end of the
branch __FreeBSD_version is part of. E.g. a module compiled on 701000
would work on kernels with versions between 701000 and 799999 inclusive.
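Roughly, the mechanism amounts to the following (shown expanded for illustration only; the real dependency is generated by the module macros and may differ in detail):

    #include <sys/param.h>
    #include <sys/kernel.h>
    #include <sys/module.h>

    /* In the kernel proper: advertise its ABI version. */
    MODULE_VERSION(kernel, __FreeBSD_version);

    /* In each module: accept kernels from the build version up to the
     * end of that branch, e.g. 701000 .. 799999. */
    MODULE_DEPEND(example, kernel, __FreeBSD_version,
        __FreeBSD_version, (__FreeBSD_version / 100000) * 100000 + 99999);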
Discussed on: arch@
MFC after: 1 week
A couple of notes for this:
* WITNESS support, when enabled, is only used for shared locks in order
to avoid problems with the "disowned" locks
* KA_HELD and KA_UNHELD only exist in the lockmgr namespace in order
to assert that a generic thread (not curthread) does or does not own
the lock. Really, this kind of check is bogus, but it seems very
widespread in consumer code. So, for the moment, we cater to this
untrusted behaviour until the consumers are fixed and the options
can be removed (hopefully during the 8.0-CURRENT lifecycle)
* Implementing KA_HELD and KA_UNHELD (not supported natively by
WITNESS) made it necessary to introduce LA_MASKASSERT, which
specifies the range for default lock assertion flags
* In other respects, lockmgr_assert() follows exactly what other
locking primitives offer for this operation.
- Build real assertions for buffer cache locks on top of
lockmgr_assert(). They can be used with the BUF_ASSERT_*(bp)
paradigm.
- Add checks at lock destruction time and use a cookie for verifying
lock integrity at any operation.
- Redefine BUF_LOCKFREE() in order to not use a direct assert but
let it rely on the aforementioned destruction time check.
The KPI is evidently broken, so a __FreeBSD_version bump and manpage
updates are necessary and will be committed soon.
Side note: lockmgr_assert() will be used soon in order to implement
real assertions in the vnode namespace replacing the legacy and still
bogus "VOP_ISLOCKED()" way.
Tested by: kris (earlier version)
Reviewed by: jhb
through the FreeBSD ABI. IPC_INFO, SHM_INFO, SHM_STAT were added
specifically for Linux binary support. They are not documented
as being part of the FreeBSD ABI; also, the structures necessary
for them have been hidden away from users for a long time.
The Linux ABI layer uses its own structures to populate the
responses back to the user to ensure that the ABI is consistent.
I think there is a bit more separation work that needs to happen.
Reviewed by: jhb
Discussed with: jhb
Discussed on: freebsd-arch@ (very briefly)
MFC after: 1 month
directory, and jail directory within procstat. While this functionality
is available already in fstat, encapsulating it in the kern.proc.filedesc
sysctl makes it accessible without using kvm and thus without needing
elevated permissions.
The new procstat output looks like:
PID COMM FD T V FLAGS REF OFFSET PRO NAME
76792 tcsh cwd v d -------- - - - /usr/src
76792 tcsh root v d -------- - - - /
76792 tcsh 15 v c rw------ 16 9130 - -
76792 tcsh 16 v c rw------ 16 9130 - -
76792 tcsh 17 v c rw------ 16 9130 - -
76792 tcsh 18 v c rw------ 16 9130 - -
76792 tcsh 19 v c rw------ 16 9130 - -
I am also bumping __FreeBSD_version for this as this new feature will be
used in at least one port.
Reviewed by: rwatson
Approved by: rwatson
VOP_ISLOCKED(arg, curthread). Now, VOP_ISLOCKED() and lockstatus() should
only accept curthread as an argument; this will lead to axing the additional
argument from both functions, making the code cleaner.
Reviewed by: jeff, kib
the provided lock or &blocked_lock. The thread may be temporarily
assigned to the blocked_lock by the scheduler, so a direct comparison
cannot always be made.
- Use THREAD_LOCKPTR_ASSERT() in the primary consumers of the scheduling
interfaces. The schedulers themselves still use more explicit asserts.
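A sketch of what the assertion checks (the committed macro may differ in detail):

    #define THREAD_LOCKPTR_ASSERT(td, lock)                                 \
        KASSERT((td)->td_lock == (lock) || (td)->td_lock == &blocked_lock,  \
            ("thread %p lock %p is not %p", (td), (td)->td_lock, (lock)))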
Sponsored by: Nokia
- Move recursion checking into rwlock inlines to free a bit for use with
adaptive spinners.
- Clear the RW_LOCK_WRITE_SPINNERS flag whenever the lock state changes
causing write spinners to restart their loop.
- Write spinners are limited by a count while readers hold the lock as
there is no way to know for certain whether readers are still running.
- In the read path block if there are write waiters or spinners to avoid
starving writers. Use a new per-thread count, td_rw_rlocks, to skip
starvation avoidance if it might cause a deadlock.
- Remove or change invalid assertions in turnstiles.
Reviewed by: attilio (developed parts of the patch as well)
Sponsored by: Nokia
This support tries to be as parallel as possible with other locking
primitives, but there are differences; more specifically:
- The base witness support is already equipped to allow duplicate lock
acquisition, as lockmgr relies on this.
- In the case of lockmgr_disown() the lock is treated as unlocked by witness
even if it is still held by the "kernel context"
- In the case of upgrading we can have 3 different situations:
* Total unlocking of the shared lock and nothing else
* Real witness upgrade if the owner is the first upgrader
* Shared unlocking and exclusive locking if the owner is not the first
upgrader but is still allowed to upgrade
- LK_DRAIN is basically handled like an exclusive acquisition
Additionally, new options LK_NODUP and LK_NOWITNESS can now be used with
lockinit(): LK_NOWITNESS disables WITNESS for the specified lock while
LK_NODUP enables duplicated lock tracking. This will require manpage
updates and a __FreeBSD_version bump (addressed by further commits).
This patch also fixes a problem occurring if a lockmgr is held in
exclusive mode and the same owner tries to acquire it in shared mode:
currently there is a spurious shared locking acquisition while what
we really want is a lock downgrade. Probably, this situation can be
better served with an EDEADLK failing errno return.
Side note: first testing of this patch already revealed several reported
LORs, so please expect LOR cascades until resolved. NTFS also is
reported broken by WITNESS introduction. BTW, NTFS is exposing a lock
leak which needs to be fixed, and this patch can help it out if
rightly tweaked.
Tested by: kris, yar, Scot Hetzel <swhetzel at gmail dot com>
done in consumer code: using lock properties is much more appropriate.
Fix current code doing these bogus checks.
Note: Really, callouts are not usable by all !(LC_SPINLOCK | LC_SLEEPABLE)
primitives, as rmlocks don't implement the generic lock layer
functions, but they can be equipped for this, so the check is still
valid.
Tested by: matteo, kris (earlier version)
Reviewed by: jhb
- Expose sbrelease_internal(), a variant of sbrelease() with no
expectations about the validity of locks in the socket buffer.
- Use sbrelease_internal() in sorflush(), and as a result avoid initializing
  and destroying a socket buffer lock for the temporary stack copy of the
actual buffer, asb.
- Add a comment indicating why we do what we do, and remove an XXX since
things have gotten less ugly in sorflush() lately.
This makes socket close cleaner, and possibly also marginally faster.
MFC after: 3 weeks
referencing the file's VM pages are returned from the network stack,
making changes to the file safe.
This flag does not guarantee that the data has been transmitted to the
other end.
free function controllable, instead of passing the KVA of the buffer
storage as the first argument.
Fix all conventional users of the API to pass the KVA of the buffer
as the first argument, to make this a no-op commit.
Likely break the only non-conventional user of the API, after informing
the relevant committer.
Update the mbuf(9) manual page, which was already out of sync on
this point.
Bump __FreeBSD_version to 800016 as there is no way to tell how
many arguments a CPP macro needs any other way.
This paves the way for giving sendfile(9) a way to wait for the
passed storage to have been accessed before returning.
This does not affect the memory layout or size of mbufs.
Parental oversight by: sam and rwatson.
No MFC is anticipated.
read socket buffers in shutdown() and close():
- Call socantrcvmore() before sblock() to dislodge any threads that
might be sleeping (potentially indefinitely) while holding sblock(),
such as a thread blocked in recv().
- Flag the sblock() call as non-interruptible so that a signal
delivered to the thread calling sorflush() doesn't cause sblock() to
fail. The sblock() is required to ensure that all other socket
consumer threads have, in fact, left, and do not enter, the socket
buffer until we're done flushing it.
To implement the latter, change the 'flags' argument to sblock() to
accept two flags, SBL_WAIT and SBL_NOINTR, rather than one M_WAITOK
flag. When SBL_NOINTR is set, it forces a non-interruptible sx
acquisition, regardless of the disposition of SB_NOINTR
on the socket buffer; without this change it would be possible for
another thread to clear SB_NOINTR between when the socket buffer mutex
is released and sblock() is invoked.
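A sketch of the resulting sorflush()-style sequence (surrounding code omitted; the wrapper function is illustrative only):

    #include <sys/param.h>
    #include <sys/socket.h>
    #include <sys/socketvar.h>

    static void
    example_rcv_flush(struct socket *so)
    {
        /* Dislodge sleepers first, then take the lock uninterruptibly. */
        socantrcvmore(so);
        (void)sblock(&so->so_rcv, SBL_WAIT | SBL_NOINTR);
        /* ... flush the receive buffer ... */
        sbunlock(&so->so_rcv);
    }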
Reviewed by: bz, kmacy
Reported by: Jos Backus <jos at catnook dot com>
resulted in the argument to make_dev() being a unit number.
Correct this by supplying a minor number to make_dev(), and using
the unit number for the calculation of the slave tty name.
Reported and tested by: Peter Holm
Reviewed by: jhb
Yet another pointy hat to: kib
MFC after: 1 day
while the thread does not hold the thread lock would stop blocking for
subsequent interruptible sleeps and would always immediately fail the
sleep with EWOULDBLOCK instead (even sleeps that didn't have a timeout).
Some background:
- KSE has a facility for allowing one thread to interrupt another thread.
During this process, the target thread aborts any interruptible sleeps
much as if the target thread had a pending signal. Once the target
thread acknowledges the interrupt, normal sleep handling resumes. KSE
manages this via the TDF_INTERRUPTED flag. Specifically, it sets the
flag when it sends an interrupt to another thread and clears it when
the interrupt is acknowledged. (Note that this is purely a software
interrupt sort of thing and has no relation to hardware interrupts
or kernel interrupt threads.)
- The old code for handling the sleep timeout race handled the race
by setting the TDF_INTERRUPT flag and faking a KSE-style thread
interrupt to the thread in the process of going to sleep. It probably
should have just checked the TDF_TIMEOUT flag in sleepq_catch_signals()
instead.
- The bug was that the sleepq code would set TDF_INTERRUPT but it was
never cleared. The sleepq code couldn't safely clear it in case there
actually was a real KSE thread interrupt pending for the target thread
(in fact, the sleepq timeout actually stomped on said pending interrupt).
Thus, any future interruptible sleeps (*sleep(.. PCATCH ..) or
cv_*wait_sig()) would see the TDF_INTERRUPT flag set and immediately
fail with EWOULDBLOCK. The flag could be cleared if the thread belonged
to a KSE process and another thread posted an interrupt to the original
thread. However, in the more common case of a non-KSE process, the
thread would pretty much stop sleeping.
- Fix the bug by just setting TDF_TIMEOUT in the sleepq timeout code and
not messing with TDF_INTERRUPT and td_intrval. With yesterday's fix to
sleepq_switch() to check TDF_TIMEOUT, this is now sufficient.
MFC after: 3 days
being properly cancelled by a timeout. In general there is a race
when the sleepq timeout handler fires while the thread is still
in the process of going to sleep. In 6.x with sched_lock, the race was
largely protected by sched_lock. The only place it was "exposed" and had
to be handled was while checking for any pending signals in
sleepq_catch_signals().
With the thread lock changes, the thread lock is dropped in between
sleepq_add() and sleepq_*wait*() opening up a new window for this race.
Thus, if the timeout fired while the sleeping thread was in between
sleepq_add() and sleepq_*wait*(), the thread would be marked as timed
out, but the thread would not be dequeued and sleepq_switch() would
still block the thread until it was awakened via some other means. In
the case of pause(9) where there is no other wakeup, the thread would
never be awakened.
Fix this by teaching sleepq_switch() to check if the thread has had its
sleep canceled before blocking by checking the TDF_TIMEOUT flag and
aborting the sleep and dequeueing the thread if it is set.
MFC after: 3 days
Reported by: dwhite, peter
`kn_sdata' member of the newly registered knote. The problem is that
this member is overwritten by a call to kevent(2) with the EV_ADD flag,
targeted at the same kevent/knote. For instance, a userland application
may set the pointer to NULL, leading to a panic.
A testcase was provided by the submitter.
PR: kern/118911
Submitted by: MOROHOSHI Akihiko <moro@remus.dti.ne.jp>
MFC after: 1 day
- Remove the "thread" argument from the lockmgr() function as it is
always curthread now
- Axe lockcount() function as it is no longer used
- Axe LOCKMGR_ASSERT() as it is really bogus and not currently used.
  Hopefully this will soon be replaced by something suitable.
- Remove the prototype for dumplockinfo() as the function is no longer
present
Additionally:
- Introduce a KASSERT() in lockstatus() in order to let it accept only
  curthread or NULL, as those are the only values that should be passed
- Do a little bit of style(9) cleanup on lockmgr.h
The KPI is heavily broken by this change, so manpages and
__FreeBSD_version will be modified accordingly by further commits.
Tested by: matteo
Introduce a new privilege allowing certain IP header options
(hop-by-hop, routing headers) to be set.
Leave a few comments to be addressed later.
Reviewed by: rwatson (older version, before addressing his comments)
a run-queue. If the priority is numerically raised, only change lowpri
if we're certain it will be correct. Some slop is allowed; however,
previously we could erroneously raise lowpri for an idle cpu that a
thread had recently run on, which led to errors in load balancing
decisions.
lockmgr lkp, when held in exclusive mode, is recursed
- Introduce the function BUF_RECURSED() which does the same for bufobj
locks based on top of lockmgr_recursed()
- Introduce the function BUF_ISLOCKED() which works like the counterpart
VOP_ISLOCKED(9), showing the state of lockmgr linked with the bufobj
BUF_RECURSED() and BUF_ISLOCKED() entirely replace the usage of the bogus
BUF_REFCNT() in a more explicit and SMP-compliant way.
This allows us to axe BUF_REFCNT(), leaving the function
lockcount() totally unused in our stock kernel. Further commits will
axe lockcount() as well as part of lockmgr() cleanup.
The KPI is obviously broken, so further commits will update manpages
and __FreeBSD_version.
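The new helpers amount to something like the following (assumed definitions; the committed macros in <sys/buf.h> may differ):

    #define BUF_RECURSED(bp)                                \
        lockmgr_recursed(&(bp)->b_lock)
    #define BUF_ISLOCKED(bp)                                \
        lockstatus(&(bp)->b_lock, curthread)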
Tested by: kris (on UFS and NFS)