- Sort by date in license blocks, oldest copyright first.
- All rights reserved after all copyrights, not just the first.
- Use (c) to be consistent with other entries.
MFC after: 3 days
of max() when computing the divisor in SCHED_TICK_PRI(). This prevents
cases where rounding down would allow the quotient to exceed
SCHED_PRI_RANGE.
- Garbage collect some unused flags and fields.
- Replace TDF_HOLD with sched_pin_td()/sched_unpin_td() since it simply
duplicated this functionality.
- Re-enable the rebalancer by default and fix the sysctl so it can be
modified.
marked idle, thus breaking cpu load balancing.
- Change sched_interact_update() to fix cases where the stored history
has expanded significantly rather than handling them in the callers. This
fixes a case where sched_priority() could compute a bad value.
- Add a sysctl to disable the global load balancer for experimentation.
sysctl and socket teardown by adding a reference count to the UNIX domain
pcb object and fixing the sysctl that enumerates unpcbs to grab a
reference on each unpcb while it builds the list to copy out to userland.
- Close a race between UNIX domain pcb garbage collection (unp_gc()) and
file descriptor teardown (fdrop()) by adding a new garbage collection
flag FWAIT. unp_gc() sets FWAIT while it walks the message buffers
in a UNIX domain socket looking for nested file descriptor references
and clears the flag when it is finished. fdrop() checks to see if the
flag is set on a file descriptor whose refcount just dropped to 0 and
waits for unp_gc() to clear the flag before completely destroying the
file descriptor.
MFC after: 1 week
Reviewed by: rwatson
Submitted by: ups
Hopefully makes the panics go away: mx1
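The reference count added to the UNIX domain pcb above can be sketched in userland C roughly as follows. This is a minimal model under stated assumptions, not the kernel code: struct model_unpcb, model_unp_hold() and model_unp_rele() are hypothetical names, and the locking the real code needs is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a UNIX domain pcb; not the kernel's struct unpcb. */
struct model_unpcb {
	unsigned refcount;	/* references currently held on this pcb */
	bool	 freed;		/* set once the last reference is dropped */
};

/* Take a reference, e.g. while an enumerating sysctl builds its list. */
static void
model_unp_hold(struct model_unpcb *unp)
{
	assert(!unp->freed);
	unp->refcount++;
}

/* Drop a reference; returns true when the caller dropped the last one. */
static bool
model_unp_rele(struct model_unpcb *unp)
{
	assert(unp->refcount > 0);
	if (--unp->refcount == 0) {
		unp->freed = true;	/* real code would free the pcb here */
		return (true);
	}
	return (false);
}
```

In this model a socket teardown that races with the enumeration only drops its own reference, so the pcb stays valid until the enumerator releases the one it took.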
setting ftick = ltick = ticks in schedinit().
- Update the priority when we are pulled off of the run queue and when we
are inserted onto the run queue so that it more accurately reflects our
present status. This is important for efficient priority propagation
functioning.
- Move the frequency test into sched_pctcpu_update() so we don't repeat it
each time we'd like to call it.
- Put some temporary work-around code in sched_priority() in case the tick
mechanism produces a bad priority. Eventually this should revert to an
assert again.
the most recently chosen index. This significantly improves nice
behavior. This allows a lower priority thread to run some multiple of
times before the higher priority thread makes it to the front of
the queue. A nice +20 cpu hog now only gets ~5% of the cpu when running
with a nice 0 cpu hog and about 1.5% with a nice -20 hog. A nice
difference of 1 makes a 4% difference in cpu usage between two hogs.
- Track a separate insert and removal index. When the removal index is
empty it is updated to point at the current insert index.
- Don't remove and re-add a thread to the runq when it is being adjusted
down in priority.
- Pull some conditional code out of sched_tick(). It's looking a bit
large now.
- Remove the double queue mechanism for timeshare threads. It was slow
due to excess cache lines in play, caused suboptimal scheduling behavior
with niced and other non-interactive processes, complicated priority
lending, etc.
- Use a circular queue with a floating starting index for timeshare threads.
Enforces fairness by moving the insertion point closer to threads with
worse priorities over time.
- Give interactive timeshare threads real-time user-space priorities and
place them on the realtime/ithd queue.
- Select non-interactive timeshare thread priorities based on their cpu
utilization over the last 10 seconds combined with the nice value. This
gives us more sane priorities and behavior in a loaded system as
compared to the old method of using the interactivity score. The
interactive score quickly hit a ceiling if threads were non-interactive
and penalized new hog threads.
- Use one slice size for all threads. The slice is not currently
dynamically set to adjust scheduling behavior of different threads.
- Add some new sysctls for scheduling parameters.
Bug fixes/Clean up:
- Fix zeroing of td_sched after initialization in sched_fork_thread() caused
by recent ksegrp removal.
- Fix KSE interactivity issues related to frequent forking and exiting of
kse threads. We simply disable the penalty for thread creation and exit
for kse threads.
- Cleanup the cpu estimator by using tickincr here as well. Keep ticks and
ltick/ftick in the same frequency. Previously ticks were stathz and
others were hz.
- Lots of new and updated comments.
- Many many others.
Tested on: up x86/amd64, 8way amd64.
- runq_add_pri allows the caller to position the thread at any rqindex
regardless of priority.
- runq_choose_from() chooses the lowest priority thread starting from a given
index. The index is updated with the rqindex of the chosen thread. This
routine is used to pick the lowest priority relative to a given index.
- runq_remove_idx() updates the index if the run queue that held the removed
thread is now empty.
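The circular selection described for runq_choose_from() can be illustrated with a small userland model. This is only a sketch: struct model_runq, the 64-queue bitmap and the function names are assumptions made for illustration and do not reproduce the kernel's run queue layout.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NQS 64	/* number of queues in this model (an assumption) */

/* Hypothetical model: bit i of 'status' is set when queue i is non-empty. */
struct model_runq {
	uint64_t status;
};

static void
model_runq_setbit(struct model_runq *rq, int pri)
{
	assert(pri >= 0 && pri < MODEL_NQS);
	rq->status |= (uint64_t)1 << pri;
}

/*
 * Scan circularly from *idx and return the first non-empty queue index,
 * or -1 if every queue is empty.  On success *idx is updated to the
 * chosen index, as the text says of the real routine.
 */
static int
model_runq_choose_from(const struct model_runq *rq, int *idx)
{
	assert(*idx >= 0 && *idx < MODEL_NQS);
	for (int n = 0; n < MODEL_NQS; n++) {
		int i = (*idx + n) % MODEL_NQS;

		if (rq->status & ((uint64_t)1 << i)) {
			*idx = i;
			return (i);
		}
	}
	return (-1);
}
```

Starting the scan at a floating index rather than at queue 0 is what lets a thread queued "behind" the index wait until the scan wraps around.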
mac_framework.c Contains basic MAC Framework functions, policy
registration, sysinits, etc.
mac_syscalls.c Contains implementations of various MAC system calls,
including ENOSYS stubs when compiling without options
MAC.
Obtained from: TrustedBSD Project
consumes and implements, as well as the location of the framework and
policy modules.
Refactor MAC Framework versioning a bit so that the current ABI version can
be exported via a read-only sysctl.
Further update comments relating to locking/synchronization.
Update copyright to take into account these and other recent changes.
Obtained from: TrustedBSD Project
mbuf is dropped, to preserve the invariant in the PR_ADDR case.
Add a regression test to detect this condition, but do not hook it
up to the build for now.
PR: kern/38495
Submitted by: James Juran
Reviewed by: sam, rwatson
Obtained from: NetBSD
MFC after: 2 weeks
non-extattr functions from vfs_extattr.c, and extattr functions from
vfs_syscalls.c.
Change copyright/license on vfs_extattr.c to my copyright/license on
the extended attribute implementation (from extattr.h).
Clean up includes a bit.
Obtained from: TrustedBSD Project
Framework and security modules, to src/sys/security/mac/mac_policy.h,
completing the removal of kernel-only MAC Framework include files from
src/sys/sys. Update the MAC Framework and MAC policy modules. Delete
the old mac_policy.h.
Third party policy modules will need similar updating.
Obtained from: TrustedBSD Project
It always called MH_ALIGN for small lengths being
prepended (less than MHLEN). This meant that if you did
a prepend on a non-M_PKTHDR mbuf the system would panic with
the KASSERT in MH_ALIGN. Instead we are now aware of
this and do a MH_ALIGN or M_ALIGN as appropriate.
Reviewed by: andre
Approved by: gnn
subsystems will be a property of policy modules, which may require
access control check entry points to be invoked even when not actively
enforcing (i.e., to track information flow without providing
protection).
Obtained from: TrustedBSD Project
Suggested by: Christopher dot Vance at sparta dot com
copyin()/copyout() for message type is separated from msgsnd()/msgrcv() and
it is done from its wrapper functions to support 32-bit emulations. After I
implemented this, I have briefly referenced NetBSD and Darwin. NetBSD passes
copyin()/copyout() function pointers from wrappers. Darwin passes size of
message type as an argument, which is actually similar to my first
implementation (P4 109706). We may revisit these implementations later.
vnode v_flag. For cluster buffers this would result in dereferencing NULL
b_vp. To prevent the panic, cache relevant vnode flag before calling
bstrategy.
Reported by: Peter Holm, kris
Tested by: Peter Holm
Reviewed by: tegge
Pointy hat to: kib
running thread's id on each cpu. This allows us to add in-kernel adaptive
spinning for user-level mutexes. While spinning in user space is possible,
it can hardly be implemented efficiently without wasting cpu cycles unless
the kernel exports the correct thread running state, and exporting that
state is unlikely to be implemented soon because the interfaces still have
to be designed and stabilized. This implementation is transparent to user
space and can be disabled dynamically. With this change, the performance of
a mutex ping-pong program improves massively on SMP machines. Performance
of the mysql super-smack select benchmark increases by about 7% on an Intel
dual dual-core2 Xeon machine, which indicates that on systems with several
cpus and low system-call overhead (athlon64, opteron, and core-2 are known
to be fast), adaptive spinning does help performance.
Added sysctls:
kern.threads.umtx_dflt_spins
if the sysctl value is non-zero, a zero umutex.m_spincount will
cause the sysctl value to be used as the spin cycle count.
kern.threads.umtx_max_spins
the sysctl sets the upper limit of the spin cycle count.
Tested on: Athlon64 X2 3800+, Dual Xeon 5130
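How the two sysctls interact can be sketched as a small pure function. This is an illustrative reading of the description above, not the committed code: model_effective_spins() and its parameters are hypothetical, and the precondition that the default does not exceed the maximum is an assumption of the sketch.

```c
#include <assert.h>

/*
 * Pick the spin count for one lock attempt.  'm_spincount' models the
 * per-mutex umutex.m_spincount value; 'dflt' and 'max' model the
 * kern.threads.umtx_dflt_spins and kern.threads.umtx_max_spins sysctls.
 */
static unsigned
model_effective_spins(unsigned m_spincount, unsigned dflt, unsigned max)
{
	unsigned spins;

	assert(dflt <= max);	/* assumption of this sketch */
	/* A zero per-mutex count falls back to the default when one is set. */
	spins = (m_spincount == 0 && dflt != 0) ? dflt : m_spincount;
	/* The maximum caps whatever count was selected. */
	return (spins > max ? max : spins);
}
```

With a default of 0 and a zero m_spincount the result is 0, which in this model means no spinning at all.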
- in trying to avoid nested brackets and #ifdef INVARIANTS around i at the
top, I broke booting for INVARIANTS altogether :-(
- the cleanest fix is to simply assign to sq twice if INVARIANTS is enabled
- tested both with and without INVARIANTS :-/
after we perform the operations to delete the export,
call vfs_deleteopt() to delete the "export" mount option from
the linked list of mount options associated with that mount point.
This fixes one scenario:
- put a filesystem in /etc/exports to export it
- remove the filesystem from /etc/exports to delete the export and restart
mountd
- try to do a "mount -u -o ro" or "mount -u -o rw" on that filesystem
now that it is no longer exported.
arguments to fail. The mode field for shmget() appears to have undefined
meaning in the context of an already-present IPC object, but applications
appear to assume any arbitrary passed value will be ignored. I had hoped
to revisit this more quickly, but am removing the change for now to
prevent toe-stubbing.
Reported by: Jaroslav Suchanek <jarda at grisoft dot cz>
PR: kern/106078
- add cnt_hold cnt_lock support for spin mutexes
- make sure contested is initialized to zero to only bump contested when appropriate
- move initialization function to kern_mutex.c to avoid cyclic dependency between
mutex.h and lock_profile.h
behave as expected.
Also:
- Return an error if WD_PASSIVE is passed in to the ioctl as only
WD_ACTIVE is implemented at the moment. See sys/watchdog.h for an
explanation of the difference between WD_ACTIVE and WD_PASSIVE.
- Remove the I_HAVE_TOTALLY_LOST_MY_SENSE_OF_HUMOR define. If you've
lost your sense of humor, then don't add a define.
Specific changes:
i80321_wdog.c
Don't roll your own passive watchdog tickle as this would defeat the
purpose of an active (userland) watchdog tickle.
ichwd.c / ipmi.c:
WD_ACTIVE means active patting of the watchdog by a userland process,
not whether the watchdog is active. See sys/watchdog.h.
kern_clock.c:
(software watchdog) Remove a check for WD_ACTIVE as this does not make
sense here. This reverts r1.181.
by vnode. Allow for md thread and the thread that owns lock on vnode
backing the md device to do the write even when runningbufspace is
exhausted.
Tested by: Peter Holm
Reviewed by: tegge
MFC after: 2 weeks
Make part of John Birrell's KSE patch permanent..
Specifically, remove:
Any reference of the ksegrp structure. This feature was
never fully utilised and made things overly complicated.
All code in the scheduler that tried to make threaded programs
fair to unthreaded programs. Libpthread processes will already
do this to some extent and libthr processes already disable it.
Also:
Since this makes such a big change to the scheduler(s), take the opportunity
to rename some structures and elements that had to be moved anyhow.
This makes the code a lot more readable.
The ULE scheduler compiles again but I have no idea if it works.
The 4bsd scheduler still requires a little cleaning and some functions that now do
ALMOST nothing will go away, but I thought I'd do that as a separate commit.
Tested by David Xu, and Dan Eischen using libthr and libpthread.
if waittime was zero (the lock was uncontested) l->lpo_waittime
in the hash table would not get initialized.
Inspection prompted by questions from: Attilio Rao
pthread_cancel()ed, it is expected that the thread will not
consume a pthread_cond_signal(); therefore, we use thr_wake()
to mark a flag, and the flag tells a thread calling do_cv_wait()
in umtx code to not block on a condition variable.
The thread library is expected to call thr_wake() for the thread
itself in its SIGCANCEL handler once the thread detects that it
is in pthread_cond_wait.
priority mutex implemented, it is time to introduce this stuff,
now we can use umutex and ucond together to implement pthread's
condition wait/signal.
__stop_<section> symbols generated by the static linker for elf
sections. This is done only for the final link, and not for ld -r.
Augment elf_obj in-kernel linker by recognizing such special symbols,
and resolving them to the start and end of the section automatically.
As result, linker sets on amd64 could be used in the same way as on
other architectures, without explicit calls to linker_file_lookup_set().
Requested by: rdivacky
No objections from: peter, jhb
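For readers unfamiliar with the start/stop symbols mentioned above, the following userland sketch shows the general mechanism on an ELF toolchain with GCC or Clang. The section name model_set, the MODEL_SET_ENTRY() macro and the variables are hypothetical; this is not the kernel's linker set macro family, only the same linker feature in miniature.

```c
#include <assert.h>
#include <stddef.h>

/* Place a pointer to 'sym' into the ELF section named "model_set". */
#define MODEL_SET_ENTRY(sym)						\
	static const int *model_set_ptr_##sym				\
	    __attribute__((section("model_set"), used)) = &sym

/* Generated by the static linker at final link for this section name. */
extern const int *__start_model_set[];
extern const int *__stop_model_set[];

static const int model_a = 1;
static const int model_b = 2;
static const int model_c = 3;
MODEL_SET_ENTRY(model_a);
MODEL_SET_ENTRY(model_b);
MODEL_SET_ENTRY(model_c);

/* Walk the set from start to stop, counting entries and summing values. */
static int
model_set_sum(size_t *countp)
{
	int sum = 0;
	size_t n = 0;

	assert(countp != NULL);
	for (const int **p = __start_model_set; p < __stop_model_set; p++) {
		sum += **p;
		n++;
	}
	*countp = n;
	return (sum);
}
```

Because the symbols only exist after the final link, an object produced with ld -r has nothing to resolve them against, which is the gap the elf_obj change described above fills.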
by default for sun4v where it is absolutely required.
This change moves the buffer from struct pcpu to the stack to avoid
using the critical section which created a LOR in a couple of cases
due to interaction with the tty code and kqueue. The LOR can't be
fixed with the critical section and the pcpu buffer can't be used
without the critical section.
Putting the buffer on the stack was my initial solution, but it was
pointed out that the stress on the stack might cause problems
depending on the call path. We don't have a way of creating tests
for those possible cases, so it's best to leave this as an option
for the time being. In time we may get enough data to enable this
option more generally.
listening socket after the pass that cleans those queues. This
results in these connections being orphaned (and leaked). The fix
is to clean up the so queues after detaching the socket from the
protocol. Thanks to ups and jhb for discussions and a thorough code
review.
msgsnd and rechecking resources. This problem was found while I was running
the Linux Test Project test suite (test cases: msgctl08, msgctl09).
Change `msgwait' to `msgsnd' and `msgrcv' to distinguish its sleeping
conditions. A few cosmetic changes to debugging messages.
which allows it to be used with different kinds of locks. For example, it
allows Solaris condition variables, which will be used in the ZFS port, to
be implemented on top of sx(9) locks.
Reviewed by: jhb
channel for tsleep():
- Allow tsleep() on &lbolt without Giant with a timeout 0 since &lbolt has
an implied timeout.
- If &lbolt is used with msleep() pass NULL to sleepq_add() for the lock
object. Unlike other sleepq channels, &lbolt doesn't have an associated
owning lock.
written to the socket). The rewrite in revision 1.240 got confused by the
FreeBSD 4.x bug compatibility code.
For some reason lighttpd, which was used for testing the new sendfile code,
was not affected by the problem but apache and others using headers/trailers
in the sendfile call received incorrect sbytes values after return from non-
blocking sockets. This then led to restarts with wrong offsets and thus
mixed up file contents when the socket was writeable again. All programs
not using headers/trailers, like ftpd, were not affected by the bug.
Reported by: Pawel Worach <pawel.worach-at-gmail.com>
Tested by: Pawel Worach <pawel.worach-at-gmail.com>
wait (time waited to acquire) and hold times for *all* kernel locks. If
the architecture has a system synchronized TSC, the profiling code will
use that - thereby minimizing profiling overhead. Large chunks of profiling
code have been moved out of line, the overhead measured on the T1 for when
it is compiled in but not enabled is < 1%.
Approved by: scottl (standing in for mentor rwatson)
Reviewed by: des and jhb
- Don't drop the lock just to reacquire it again to check rushjob, this
only wastes time.
- Use msleep() to drop the mutex while sleeping instead of explicitly
unlocking around tsleep.
Reviewed by: pjd
label after the sbunlock() part.
This correctly handles calls to sendfile(2) without valid parameters,
which was broken in rev. 1.240.
Coverity error: 272162
specific privilege names to a broad range of privileges. These may
require some future tweaking.
Sponsored by: nCircle Network Security, Inc.
Obtained from: TrustedBSD Project
Discussed on: arch@
Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri,
Alex Lyashkov <umka at sevcity dot net>,
Skip Ford <skip dot ford at verizon dot net>,
Antoine Brodin <antoine dot brodin at laposte dot net>
privilege for threads and credentials. Unlike the existing suser(9)
interface, priv(9) exposes a named privilege identifier to the privilege
checking code, allowing more complex policies regarding the granting of
privilege to be expressed. Two interfaces are provided, replacing the
existing suser(9) interface:
suser(td) -> priv_check(td, priv)
suser_cred(cred, flags) -> priv_check_cred(cred, priv, flags)
A comprehensive list of currently available kernel privileges may be
found in priv.h. New privileges are easily added as required, but the
comments on adding privileges found in priv.h and priv(9) should be read
before doing so.
The new privilege interface exposed sufficient information to the
privilege checking routine that it will now be possible for jail to
determine whether a particular privilege is granted in the check routine,
rather than relying on hints from the calling context via the
SUSER_ALLOWJAIL flag. For now, the flag is maintained, but a new jail
check function, prison_priv_check(), is exposed from kern_jail.c and used
by the privilege check routine to determine if the privilege is permitted
in jail. As a result, a centralized list of privileges permitted in jail
is now present in kern_jail.c.
The MAC Framework is now also able to instrument privilege checks, both
to deny privileges otherwise granted (mac_priv_check()), and to grant
privileges otherwise denied (mac_priv_grant()), permitting MAC Policy
modules to implement privilege models, as well as control a much broader
range of system behavior in order to constrain processes running with
root privilege.
The suser() and suser_cred() functions remain implemented, now in terms
of priv_check() and the PRIV_ROOT privilege, for use during the transition
and possibly continuing use by third party kernel modules that have not
been updated. The PRIV_DRIVER privilege exists to allow device drivers to
check privilege without adopting a more specific privilege identifier.
This change does not modify the actual security policy, rather, it
modifies the interface for privilege checks so changes to the security
policy become more feasible.
Sponsored by: nCircle Network Security, Inc.
Obtained from: TrustedBSD Project
Discussed on: arch@
Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri,
Alex Lyashkov <umka at sevcity dot net>,
Skip Ford <skip dot ford at verizon dot net>,
Antoine Brodin <antoine dot brodin at laposte dot net>
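The layering described above (a named privilege, a centralized jail check, then the root test) can be modelled in a few lines of userland C. Everything here is hypothetical and chosen only for illustration: the MODEL_PRIV_* identifiers, struct model_cred, and in particular which privilege the model permits in a jail. The real identifiers are in priv.h and the real jail list is in kern_jail.c.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical privilege identifiers for this model only. */
enum model_priv {
	MODEL_PRIV_SET_CLOCK,
	MODEL_PRIV_BIND_LOW_PORT,
	MODEL_PRIV_LOAD_MODULE
};

/* Hypothetical credential: just a uid and a jailed flag. */
struct model_cred {
	unsigned uid;
	bool	 jailed;
};

/* Centralized list of privileges this model permits inside a jail. */
static int
model_prison_priv_check(const struct model_cred *cred, enum model_priv priv)
{
	if (!cred->jailed)
		return (0);
	switch (priv) {
	case MODEL_PRIV_BIND_LOW_PORT:	/* arbitrary choice for the sketch */
		return (0);
	default:
		return (EPERM);
	}
}

/* Returns 0 if the privilege is granted, otherwise an errno value. */
static int
model_priv_check_cred(const struct model_cred *cred, enum model_priv priv)
{
	int error;

	assert(cred != NULL);
	error = model_prison_priv_check(cred, priv);
	if (error != 0)
		return (error);
	return (cred->uid == 0 ? 0 : EPERM);
}
```

The point of the model is that the caller names the privilege and no longer passes a hint such as SUSER_ALLOWJAIL; the decision about jails lives in one function.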
the correct syscalls.master's $FreeBSD$ tag record and
a make sysent in sys/compat/freebsd32. Thanks Ruslan
for pointing out the steps I missed :-0
Approved by: gnn
work is not just mine, but it is also the works of Peter Lei
and Michael Tuexen. They both are my two key other developers
working on the project.. and they need ata-boy's too:
****
peterlei@cisco.com
tuexen@fh-muenster.de
****
I did do a make sysent which updated the
syscall's and sysproto.. I hope that is correct... without
it you don't build since we have new syscalls for SCTP :-0
So go out and look at the NOTES, add
option SCTP (make sure inet and inet6 are present too)
and play with SCTP.
I will see about committing some test tools I have after I
figure out where I should place them. I also have a
lib (libsctp.a) that adds some of the missing socketapi
functions that I need to put into lib's.. I will talk
to George about this :-)
There may still be some 64 bit issues in here, none of
us have a 64 bit processor to test with yet.. Michael
may have a MAC but that's another beast too..
If you have a mac and want to use SCTP contact Michael
he maintains a web site with a loadable module with
this code :-)
Reviewed by: gnn
Approved by: gnn
to do the userland to kernel copying in sosend_generic() and sosend_dgram().
sosend_copyin() is retained for ZERO_COPY_SOCKETS which are not yet supported
by m_uiotombuf().
Benchmarking shows significant improvements (95% confidence):
66% less cpu (or 2.9 times better) with new sosend vs. old sosend (non-TSO)
65% less cpu (or 2.8 times better) with new sosend vs. old sosend (TSO)
(Sender AMD Opteron 852 (2.6GHz) with em(4) PCI-X-133 interface and receiver
DELL Poweredge SC1425 P-IV Xeon 3.2GHz with em(4) LOM connected back to back
at 1000Base-TX full duplex.)
Sponsored by: TCP/IP Optimization Fundraise 2005
MFC after: 3 months
mbuf clusters. Add a flags parameter to accept M_PKTHDR and M_EOR mbuf
chain flags. Provide compatibility macro for m_getm() calling m_getm2()
with M_PKTHDR set.
Rewrite m_uiotombuf() to use m_getm2() for mbuf allocation and do the
uiomove() in a tight loop over the mbuf chain. Add a flags parameter to
accept mbuf flags to be passed to m_getm2(). Adjust all callers for the
extra parameter.
Sponsored by: TCP/IP Optimization Fundraise 2005
MFC after: 3 months
VM pages into mbufs as it can -- up to the free send socket buffer space.
The outer loop then drops the whole mbuf chain into the send socket buffer,
calls tcp_output() on it and then waits until 50% of the socket buffer is
free again to repeat the cycle. This way tcp_output() gets the full amount
of data to work with and can issue up to 64K sends for TSO to chop up in
the network adapter without using any CPU cycles. Thus it gets very efficient
especially with the readahead the VM and I/O system do.
The previous sendfile(2) code simply looped over the file, turned each 4K
page into an mbuf and sent it off. This had the effect that TSO could only
generate 2 packets per send instead of up to 44 at its maximum of 64K.
Add experimental SF_MNOWAIT flag to sendfile(2) to return ENOMEM instead of
sleeping on mbuf allocation failures.
Benchmarking shows significant improvements (95% confidence):
45% less cpu (or 1.81 times better) with new sendfile vs. old sendfile (non-TSO)
83% less cpu (or 5.7 times better) with new sendfile vs. old sendfile (TSO)
(Sender AMD Opteron 852 (2.6GHz) with em(4) PCI-X-133 interface and receiver
DELL Poweredge SC1425 P-IV Xeon 3.2GHz with em(4) LOM connected back to back
at 1000Base-TX full duplex.)
Sponsored by: TCP/IP Optimization Fundraise 2005
MFC after: 3 months
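One plausible way to arrive at the "2 packets" versus "up to 44" figures quoted above is to count full-sized segments per send. The sketch below does that arithmetic; model_full_segments() is a hypothetical helper, and the MSS of 1460 bytes used in the note after it (a 1500-byte Ethernet MTU minus 40 bytes of IPv4 and TCP headers) is an assumption, not a figure from the commit message.

```c
#include <assert.h>

/*
 * Number of full-sized TCP segments that fit in one send of 'bytes'
 * when each segment carries 'mss' bytes of payload.
 */
static unsigned
model_full_segments(unsigned bytes, unsigned mss)
{
	assert(mss > 0);
	return (bytes / mss);
}
```

Under that assumption a 4096-byte page yields 2 full segments, while a 65536-byte send yields 44.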
a lock to prevent interspersed strings written from different CPUs
at the same time.
To avoid putting a buffer on the stack or having to malloc one,
space is incorporated in the per-cpu structure. The buffer
size is 128 bytes; chosen because it's the next power of 2 size
up from 80 characters.
String writes to the console are buffered up to the end of the line
or until the buffer fills. Then the buffer is flushed to all
console devices.
Existing low level console output via cnputc() is unaffected by
this change. ithread calls to log() are also unaffected to avoid
blocking those threads.
A minor change to the behaviour in a panic situation is that
console output will still be buffered, but won't be written to
a tty as before. This should prevent interspersed panic output
as a number of CPUs panic before we end up single threaded
running ddb.
Reviewed by: scottl, jhb
MFC after: 2 weeks
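The buffering rule described above (flush at end of line or when the 128-byte buffer fills) can be sketched in userland C. struct model_conbuf and the model_* functions are hypothetical; the flush here only counts bytes, whereas the real code writes to every console device and takes a lock, both omitted from this model.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_BUFSZ 128	/* buffer size quoted in the text */

/* Hypothetical per-CPU console buffer. */
struct model_conbuf {
	char	 buf[MODEL_BUFSZ];
	size_t	 len;		/* bytes currently buffered */
	unsigned flushes;	/* how many times the buffer was flushed */
	size_t	 flushed_bytes;	/* total bytes handed to the consoles */
};

static void
model_flush(struct model_conbuf *cb)
{
	if (cb->len == 0)
		return;
	/* Real code would write buf[0..len) to all console devices. */
	cb->flushed_bytes += cb->len;
	cb->flushes++;
	cb->len = 0;
}

/* Buffer one character; flush at end of line or when the buffer fills. */
static void
model_putc(struct model_conbuf *cb, char c)
{
	assert(cb->len < MODEL_BUFSZ);
	cb->buf[cb->len++] = c;
	if (c == '\n' || cb->len == MODEL_BUFSZ)
		model_flush(cb);
}

static void
model_puts(struct model_conbuf *cb, const char *s)
{
	while (*s != '\0')
		model_putc(cb, *s++);
}
```

Because a whole line reaches the consoles in one flush, two CPUs printing at the same time interleave at line granularity rather than character by character.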
- Add FS_GJOURNAL flag which enables gjournal support on a file system.
- Add cg_unrefs field to the cylinder group structure which holds
number of unreferenced (orphaned) inodes in the given cylinder group.
- Add fs_unrefs field to the super block structure which holds
total number of unreferenced (orphaned) inodes.
- When a file or a directory is orphaned (last reference is removed, but
object is still open), increase the fs_unrefs and cg_unrefs fields,
which hint to fsck which cylinder groups to search for such
(orphaned) objects.
- When the file is last closed, decrease the {fs,cg}_unrefs fields.
- Add VV_DELETED vnode flag which points at orphaned objects.
Sponsored by: home.pl
Call vfs_setdirty_locked_object() from vfs_busy_pages() instead of
vfs_setdirty(), thereby eliminating a second acquisition and release
of the same vm object lock.
queues lock to BIO_READ operations. Recent changes to the implementation
of the per-page flags have eliminated the need for the page queues lock
in the other cases.
to twice unlock the vnode. Check that ni_vp and ni_dvp are different before
doing second unlock.
Reviewed by: rwatson
Approved by: pjd (mentor)
MFC after: 1 week
counters of allocs/frees/use for each malloc type to calculating InUse,
MemUse, and Requests as displayed by the userspace vmstat -m. This is
more useful when debugging malloc(9)-related memory leaks, where the
count of allocs/frees may not usefully reflect the current memory
allocation (i.e., when highly variable size allocations occur with the
same malloc type, such as with contigmalloc).
MFC after: 3 days
Limitations observed by: scottl
to and from struct timespec, to replace the crummy conversion
functions which have been copy&pasted into three different
filesystems already.
Apart from general crummyness as indicated by code like:
for (year = 1970;; year++) {
inc = year & 0x03 ? 365 : 366;
if (days < inc)
break;
days -= inc;
}
They also contain specialized crummyness which tries to compensate
for the general crummyness by caching recent conversion results,
with no regard for locking or consistency.
These replacement functions are smaller, O(1) and handle the Y2.1K
leap-year correctly.
Ideally, these functions should live in a module of their own,
which the three offending filesystems would depend on, but the
size is 877 bytes of code (on i386), so that would be false
economy.
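To show what a loop-free conversion with correct 2100 handling can look like, here is a sketch using a well-known era-based technique (400-year eras of 146097 days, with the year taken to start on March 1). It is not necessarily how the committed functions work; struct model_date and the model_* names are hypothetical.

```c
#include <assert.h>

struct model_date {
	long	 year;
	unsigned month;	/* 1..12 */
	unsigned day;	/* 1..31 */
};

/* Gregorian rule: 2100 is divisible by 100 but not by 400, so not leap. */
static int
model_is_leap(long year)
{
	return ((year % 4 == 0 && year % 100 != 0) || year % 400 == 0);
}

/* Convert days since 1970-01-01 to a civil date without looping over years. */
static struct model_date
model_days_to_date(long days)
{
	struct model_date d;
	long z, era, doe, yoe, doy, mp;

	assert(days >= 0);
	z = days + 719468;		/* shift the epoch to 0000-03-01 */
	era = z / 146097;		/* 400-year era number */
	doe = z - era * 146097;		/* day of era, 0..146096 */
	yoe = (doe - doe / 1460 + doe / 36524 - doe / 146096) / 365;
	doy = doe - (365 * yoe + yoe / 4 - yoe / 100);	/* day from March 1 */
	mp = (5 * doy + 2) / 153;	/* 0 = March ... 11 = February */
	d.day = (unsigned)(doy - (153 * mp + 2) / 5 + 1);
	d.month = (unsigned)(mp < 10 ? mp + 3 : mp - 9);
	d.year = yoe + era * 400 + (d.month <= 2);
	return (d);
}
```

By contrast, the quoted loop treats every fourth year as a leap year, so it would count a February 29 in 2100.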
begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.
This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.
Obtained from: TrustedBSD Project
Sponsored by: SPARTA
processes. It was originally added back when support for Linux threads
(and thus shared sigacts objects) was added, but no one knows why. My
guess is that at some point during the Linux threads patches, the sigacts
object was torn down during exit1(), so this check was added to prevent
a panic for that race. However, the stuff that was actually committed to
the tree doesn't tear down sigacts until wait() making the above race moot.
Re-allowing signals here lets one interrupt a NFS request during process
teardown (such as closing descriptors) on an interruptible mount.
Requested by: kib (long time ago)
MFC after: 1 week
vnode's v_rdev and increment the dev threadcount, as well as clear it
(in devfs_reclaim) under the dev_lock().
Reviewed by: tegge
Approved by: pjd (mentor)
- Count (scheduling of) software interrupts (SWIs) as SWIs, not as
hardware interrupts.
- Don't count (scheduling of) delayed SWIs as interrupts at all, since
in the delayed case it is expected that there are many more scheduling
calls than handling calls. Perhaps all interrupts should be counted
only when they are handled, but it is only counts of delayed SWIs that
should never be combined with the other counts.
subr_trap.c:
- Count (handling of) Asynchronous System Traps (ASTs) as traps, not as
software interrupts.
Before these changes, the counter for SWIs only counted ASTs, and SWIs
weren't counted separately, but a subcounter for ASTs alone is less
needed than for most other exception sources.
4.4BSD-Lite uses the counters for similar things (actually matching
their names) on its main arches (hp300, ..., !i386) where more of the
exceptions are in hardware.
Implement the linux_io_* syscalls (AIO). They are only enabled if the native
AIO code is available (either compiled in to the kernel or as a module) at
the time the functions are used. If the AIO stuff is not available there
will be an ENOSYS.
From the submitter:
---snip---
DESIGN NOTES:
1. Linux permits a process to own multiple AIO queues (distinguished by
"context"), but FreeBSD creates only one single AIO queue per process.
My code maintains a request queue (STAILQ of queue(3)) per "context",
and throws all AIO requests of all contexts owned by a process into
the single FreeBSD per-process AIO queue.
When the process calls io_destroy(2), io_getevents(2), io_submit(2) and
io_cancel(2), my code can pick out requests owned by the specified context
from the single FreeBSD per-process AIO queue according to the per-context
request queues maintained by my code.
2. The request queue maintained by my code stores contrast information between
Linux IO control blocks (struct linux_iocb) and FreeBSD IO control blocks
(struct aiocb). FreeBSD IO control block actually exists in userland memory
space, required by FreeBSD native aio_XXXXXX(2).
3. It is quite troubling that the function io_getevents() of libaio-0.3.105
needs to use Linux-specific "struct aio_ring", which is a partial mirror
of context in user space. I would rather take the address of context in
kernel as the context ID, but the io_getevents() of libaio forces me to
take the address of the "ring" in user space as the context ID.
To my surprise, one comment line in the file "io_getevents.c" of
libaio-0.3.105 reads:
Ben will hate me for this
REFERENCE:
1. Linux kernel source code: http://www.kernel.org/pub/linux/kernel/v2.6/
(include/linux/aio_abi.h, fs/aio.c)
2. Linux manual pages: http://www.kernel.org/pub/linux/docs/manpages/
(io_setup(2), io_destroy(2), io_getevents(2), io_submit(2), io_cancel(2))
3. Linux Scalability Effort: http://lse.sourceforge.net/io/aio.html
The design notes: http://lse.sourceforge.net/io/aionotes.txt
4. The package libaio, both source and binary:
http://rpmfind.net/linux/rpm2html/search.php?query=libaio
Simple transparent interface to Linux AIO system calls.
5. Libaio-oracle: http://oss.oracle.com/projects/libaio-oracle/
POSIX AIO implementation based on Linux AIO system calls (depending on
libaio).
---snip---
Submitted by: Li, Xiao <intron@intron.ac>
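The per-context request queue from design notes 1 and 2 can be sketched with the STAILQ macros from queue(3). This is a userland model under stated assumptions: struct model_req, struct model_ctx and the model_ctx_* functions are hypothetical, the control blocks are represented only by their addresses, and no locking is shown.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical pairing of a Linux control block with its FreeBSD counterpart. */
struct model_req {
	const void	*linux_iocb;	/* address of the struct linux_iocb */
	const void	*bsd_aiocb;	/* address of the struct aiocb */
	STAILQ_ENTRY(model_req) link;
};

/* Hypothetical per-context request queue. */
struct model_ctx {
	STAILQ_HEAD(, model_req) reqs;
};

static void
model_ctx_init(struct model_ctx *ctx)
{
	STAILQ_INIT(&ctx->reqs);
}

/* Record a submitted request on the queue of the context that owns it. */
static void
model_ctx_submit(struct model_ctx *ctx, struct model_req *req)
{
	assert(req->linux_iocb != NULL && req->bsd_aiocb != NULL);
	STAILQ_INSERT_TAIL(&ctx->reqs, req, link);
}

/* Find the request in this context that corresponds to a Linux iocb. */
static struct model_req *
model_ctx_lookup(struct model_ctx *ctx, const void *linux_iocb)
{
	struct model_req *req;

	STAILQ_FOREACH(req, &ctx->reqs, link) {
		if (req->linux_iocb == linux_iocb)
			return (req);
	}
	return (NULL);
}
```

A lookup that only walks one context's queue is what lets calls such as io_cancel(2) ignore requests that other contexts of the same process placed on the shared per-process queue.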
if backward compatibility options are present) from attempting
to free memory that wasn't allocated. This is an old bug, and
previously it would attempt to free a null pointer. I noticed
this bug when working on the previous revision, but forgot to
fix it.
Security: local DoS
Reported by: Peter Holm
MFC after: 3 days
method is defined, to avoid memory being modified after free.
Temporarily increase refcount in destroy_devl() to avoid a double free
if dev_rel() is called while waiting for thread count to reach zero.
removals, including failures, into the callwheel.
XXX: Most of the CTR() macros are called with callout_lock spin mutex
held, thus won't be logged into file, if KTR_ALQ is used. Moving the
CTR() macros out from the spinlocked code would require copying of all
arguments. I'm too lazy to do this.
calls are not used by libthr in RELENG_6 and HEAD; they are only used by
the libthr in RELENG_5. The _umtx_op system call can do more incremental
dirty work than these two system calls without having to introduce new
system calls or throw away old system calls as things change.
unmount when mp structure is reused while waiting for coveredvp lock.
Introduce struct mount generation count, increment it on each reuse and
compare the generations before and after obtaining the coveredvp lock.
Reviewed by: tegge, pjd
Approved by: pjd (mentor)
MFC after: 2 weeks
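The generation-count check described above follows a common pattern: sample a counter, sleep on the lock, then compare. A minimal userland sketch, with hypothetical names (struct model_mount, model_mount_reuse(), model_mount_still_valid()) and the lock itself left out:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mount; only the generation is modelled. */
struct model_mount {
	unsigned gen;	/* bumped each time the structure is reused */
};

/* Called when the structure is recycled for a different mount. */
static void
model_mount_reuse(struct model_mount *mp)
{
	mp->gen++;
}

/*
 * Compare the generation sampled before waiting for a lock with the
 * current one; a mismatch means mp was reused while we waited.
 */
static bool
model_mount_still_valid(const struct model_mount *mp, unsigned gen_before)
{
	assert(mp != NULL);
	return (mp->gen == gen_before);
}
```

A caller would read mp->gen, acquire the coveredvp lock, and bail out of the unmount if model_mount_still_valid() returns false.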
Split subr_clock.c in two parts (by repo-copy):
subr_clock.c contains generic RTC and calendaric stuff, etc.
subr_rtc.c contains the newbus'ified RTC interface.
Centralize the machdep.{adjkerntz,disable_rtc_set,wall_cmos_clock}
sysctls and associated variables into subr_clock.c. They are
not machine dependent and we have generic code that relies on being
present so they are not even optional.
Remove workarounds for tty_refcount being 0; this will be fixed differently
later.
Back out rev 1.145 since we initialize the tty struct from scratch and bad
things can't happen anymore.
ioctls passing integer arguments should use the _IOWINT() macro.
This fixes a lot of ioctls not working on sparc64, most notably
the keyboard/syscons ioctls.
Full ABI compatibility is provided, with the bonus of fixing the
handling of old ioctls on sparc64.
Reviewed by: bde (with contributions)
Tested by: emax, marius
MFC after: 1 week
sync() and sync_fsync() without losing MNT_ASYNC. Add MNTK_ASYNC flag
which is set only when MNT_ASYNC is set and mnt_noasync is zero, and
check that flag instead of MNT_ASYNC before initiating async io.
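
A minimal sketch of the derived-flag idea follows. The flag values,
struct fake_mnt, and the helper names are hypothetical and chosen only
for illustration; they are not the kernel's definitions. The user-visible
async flag is never cleared, while the derived kernel flag tracks whether
async I/O is allowed right now.

```c
#include <assert.h>

/* Illustrative values only; not the real MNT_ASYNC/MNTK_ASYNC bits. */
#define	X_MNT_ASYNC	0x1u
#define	X_MNTK_ASYNC	0x1u

/* Hypothetical stand-in for the relevant fields of struct mount. */
struct fake_mnt {
	unsigned int flag;	/* user-visible flags */
	unsigned int kern_flag;	/* kernel-internal flags */
	int noasync;		/* count of callers that need sync I/O */
};

/* Recompute the derived flag from the user flag and the counter. */
static void
fake_mnt_update_async(struct fake_mnt *mp)
{
	if ((mp->flag & X_MNT_ASYNC) != 0 && mp->noasync == 0)
		mp->kern_flag |= X_MNTK_ASYNC;
	else
		mp->kern_flag &= ~X_MNTK_ASYNC;
}

/* Temporarily forbid async I/O without touching the user flag. */
static void
fake_mnt_noasync_enter(struct fake_mnt *mp)
{
	mp->noasync++;
	fake_mnt_update_async(mp);
}

/* Allow async I/O again once the last such caller is done. */
static void
fake_mnt_noasync_exit(struct fake_mnt *mp)
{
	assert(mp->noasync > 0);
	mp->noasync--;
	fake_mnt_update_async(mp);
}
```

Code that initiates I/O would test only the derived flag, so the
user-visible setting survives a temporary switch to synchronous writes.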
programs to find out exactly which events were registered and which were
returned... This should be lower in kern_kevent, but that would require
special munging due to locks and the functions used to copyin/copyout
kevents...
If someone wants to teach ktrace how to output pretty kevents, I have a
kevent pretty printer that can be used...
you can't call tty_clone afterwards. OpenBSD and NetBSD both fail the
open call in that case, so we should do so as well. This can
be done in ctty_clone by returning with *dev==NULL. Admittedly this
causes open to return ENOENT, instead of ENXIO as on the other BSDs,
but this way requires the least touching of code.
Submitted by: Nate Eldredge <nge@cs.hmc.edu>
PR: 83375
MFC after: 1 week
the entire record when a non-data mbuf is removed in the soreceive() path.
This only triggers a panic directly when compiled with INVARIANTS.
PR: 38495
Submitted by: James Juran
MFC after: 1 week
returns the previous value that the "add" affected (in
this case we are adding -1), after which we compare it
to '0' to see if we should free the mbuf. We should
be comparing it to '1'. Note that this only matters
when there is contention, since the first part of the
comparison checks whether the count is '1'. So
this bug would only crop up if two CPUs are trying
to free the same mbuf refcount at the same time. This
will happen in SCTP, but I doubt it can happen in TCP or
UDP.
PR: N/A
Submitted by: rrs
Reviewed by: gnn,sam
Approved by: gnn,sam
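
The off-by-one described above can be illustrated in user-space C11.
This is a minimal sketch, not the kernel code: struct obj and
ref_release() are hypothetical names, and C11 atomic_fetch_sub() stands
in for the kernel's atomic fetch-and-add of -1. Both return the value
held before the update, so the last reference is detected by comparing
against 1.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical reference-counted object; stands in for an mbuf refcount. */
struct obj {
	atomic_uint refcnt;
};

/*
 * Drop one reference.  atomic_fetch_sub() returns the value held
 * *before* the subtraction, so the caller that sees 1 is the one that
 * dropped the last reference.  Comparing against 0 would never match
 * for an object that still had a reference.
 */
static bool
ref_release(struct obj *o)
{
	unsigned int prev = atomic_fetch_sub(&o->refcnt, 1);

	assert(prev > 0);	/* releasing with no references is a bug */
	return (prev == 1);
}
```

With two references held, the first ref_release() returns false and the
second returns true, at which point the caller may free the object.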
appears to be serving a useful purpose, as it was used during initial
development of MAC support for System V IPC.
MFC after: 1 month
Obtained from: TrustedBSD Project
Suggested by: Christopher dot Vance at SPARTA dot com
other problems while labels were first being added to various kernel
objects. They have outlived their usefulness.
MFC after: 1 month
Suggested by: Christopher dot Vance at SPARTA dot com
Obtained from: TrustedBSD Project
code is still under giant lock, but the session/pgrp release code just used
proctree_locks. This explains why moving the proctree_lock in sys/kern/tty.c
rev. 1.258 did fix the panics in our SMP systems.
This should also fix some race panics with revoked ttys.
Reviewed by: jhb
MFC after: 1 week
be recycled during the sleep, wrap the vn_lock with vhold/vdrop.
Check that coveredvp still points to the same mp after sleep (needed
because sleep dropped Giant).
Move check for user rights for unmount after coveredvp lock is obtained.
Tested by: Peter Holm
Reviewed by: tegge
Approved by: kan (mentor)
MFC after: 2 weeks
with other commonly used sysctl name spaces, rather than declaring them
all over the place.
MFC after: 1 month
Sponsored by: nCircle Network Security, Inc.
unconfigured state of the kernel accounting system. This is used by
the accounting privilege regression test to determine whether
accounting is in use and will be disrupted by the regression test.
Sponsored by: nCircle Network Security, Inc.
Obtained from: TrustedBSD Project
MFC after: 1 month
adds a FS type specific flag indicating that the FS supports shared
vnode lock lookups, adds some logic in vfs_lookup.c to test this flag
and set lock flags appropriately.
- amd on 6.x is a non-starter (without this change). Using amd under
heavy load results in a deadlock (with cascading vnode locks all the
way to the root) very quickly.
- This change should also fix the more general problem of cascading
vnode deadlocks when an NFS server goes down.
Ideally, we wouldn't need these changes, as enabling shared vnode lock
lookups globally would work. Unfortunately, UFS, for example, isn't
ready for shared vnode lock lookups, crashing pretty quickly.
This change is the result of discussions with Stephan Uphoff (ups@).
Reviewed by: ups@
in syscons. This replaces a simple access semaphore that was assumed to be
protected by Giant but often was not. If two threads that were otherwise
SMP-safe called printf at the same time, there was a high likelihood that
the semaphore would get corrupted and result in a permanently frozen video
console. This is similar to what is already done in the serial console
drivers.
protect the vnode, it was present to synchronize access to TTY session
information between exit(2) and the TTY code. While we are here, note that
Giant is required for TTY protection.
Clue from: bde
Discussed with: jhb
MFC after: 1 week
Instead, we want busses to explicitly specify an add_child routine if they
want to support identify routines, but by default disallow having outside
drivers add devices.
- Give smbus(4) an explicit bus_add_child() method.
Requested by: imp
device_add_child_ordered(). Previously, a device driver that wanted to
add a new child device in its identify routine had to know if the parent
driver had a custom bus_add_child method and use BUS_ADD_CHILD() in that
case, otherwise use device_add_child(). Getting it wrong in either
direction would result in panics or failure to add the child device. Now,
BUS_ADD_CHILD() always works, isolating child drivers from having to know
intimate details about the parent driver.
Discussed with: imp
MFC after: 1 week
for overlaps, but more importantly, it collapses adjacent free regions.
This is needed to cope with BIOSen that split up ports for system devices
(like IPMI controllers) across multiple system resource entries.
- Now that rman_manage_region() is not so dumb, remove extra logic in the
x86 nexus drivers to populate the IRQ rman that manually coalesced the
regions.
MFC after: 1 week
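
Overlap checking and collapsing of adjacent regions can be sketched as
follows. This is an illustration only: struct region_list and
region_add() are hypothetical, use a fixed-size sorted array, and do not
reflect how rman_manage_region() is actually implemented. Ranges are
inclusive, and a new range that touches a neighbour is merged into it
instead of being stored separately.

```c
#include <assert.h>
#include <stddef.h>

#define	MAX_REGIONS	16

struct region {
	unsigned long start, end;	/* inclusive bounds */
};

/* Kept sorted by start, with no overlapping or adjacent entries. */
struct region_list {
	struct region r[MAX_REGIONS];
	size_t n;
};

/* Add [start, end]; returns 0 on success, -1 on overlap or if full. */
static int
region_add(struct region_list *l, unsigned long start, unsigned long end)
{
	size_t i, j;

	if (start > end)
		return (-1);
	/* Find the first existing region that ends at or after start. */
	for (i = 0; i < l->n && l->r[i].end < start; i++)
		;
	/* That region overlaps the new one if it begins at or before end. */
	if (i < l->n && l->r[i].start <= end)
		return (-1);
	/* Extend the predecessor if it ends right before start. */
	if (i > 0 && l->r[i - 1].end + 1 == start) {
		l->r[i - 1].end = end;
		/* The new range may also bridge the gap to the successor. */
		if (i < l->n && end + 1 == l->r[i].start) {
			l->r[i - 1].end = l->r[i].end;
			for (j = i; j + 1 < l->n; j++)
				l->r[j] = l->r[j + 1];
			l->n--;
		}
		return (0);
	}
	/* Extend the successor downwards if it starts right after end. */
	if (i < l->n && end + 1 == l->r[i].start) {
		l->r[i].start = start;
		return (0);
	}
	/* No neighbour to merge with: insert a new entry at position i. */
	if (l->n == MAX_REGIONS)
		return (-1);
	for (j = l->n; j > i; j--)
		l->r[j] = l->r[j - 1];
	l->r[i].start = start;
	l->r[i].end = end;
	l->n++;
	return (0);
}
```

Adding 0x60-0x60, then 0x64-0x64, then 0x61-0x63 leaves a single region
covering 0x60-0x64, which is the behaviour needed when firmware splits
one device's ports across several resource entries.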