process. This fixes a problem when attaching to a process in gdb
and the process staying in the STOP'd state after quitting gdb.
This whole process seems a bit suspect, but this seems to work.
Reviewed by: peter
in 4.2-REL which I ripped out in -stable and -current when implementing the
low-memory handling solution. However, maxlaunder turns out to be the saving
grace in certain very heavily loaded systems (e.g. newsreader box). The new
algorithm limits the number of pages laundered in the first pageout daemon
pass. If that is not sufficient then successive passes will be run without any
limit.
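
A rough user-space sketch of the pass structure described above. The
names launder_pass() and pageout_passes() and the plain page counters
are hypothetical; the real pageout daemon walks page queues instead.

```c
#include <assert.h>

/*
 * Hypothetical model: 'dirty' is the number of dirty pages queued,
 * 'target' is how many pages still need to be laundered, and
 * 'maxlaunder' caps a pass (a negative value means no cap).
 * Returns the number of pages laundered in this pass.
 */
static int
launder_pass(int dirty, int target, int maxlaunder)
{
	int n = dirty < target ? dirty : target;

	if (maxlaunder >= 0 && n > maxlaunder)
		n = maxlaunder;
	return (n);
}

/* Run a limited first pass, then unlimited passes until the target is met. */
int
pageout_passes(int dirty, int target, int maxlaunder)
{
	int passes = 0, limit = maxlaunder;

	assert(dirty >= 0 && target >= 0);
	while (target > 0 && dirty > 0) {
		int n = launder_pass(dirty, target, limit);

		dirty -= n;
		target -= n;
		limit = -1;		/* later passes run without a limit */
		passes++;
	}
	return (passes);
}
```

With 100 dirty pages, a target of 50 and a limit of 32, the first pass
launders 32 pages and a second, unlimited pass launders the remaining 18.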
Write I/O is now pipelined using two sysctls, vfs.lorunningspace and
vfs.hirunningspace. This prevents excessive buffered writes in the
disk queues which cause long (multi-second) delays for reads. It leads
to more stable (less jerky) and generally faster I/O streaming to disk
by allowing required read ops (e.g. for indirect blocks and such) to occur
without interrupting the write stream, among other things.
NOTE: eventually, filesystem write I/O pipelining needs to be done on a
per-device basis. At the moment it is globalized.
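
The high/low watermark idea can be sketched in user space as below.
struct wpipe and its functions are hypothetical, and the exact
comparisons (stall above hi, resume at or below lo) are an assumption,
not a statement about the kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for vfs.lorunningspace and vfs.hirunningspace. */
struct wpipe {
	long running;	/* bytes of write I/O currently in flight */
	long lo;	/* resume writers at or below this mark */
	long hi;	/* stall writers above this mark */
	bool stalled;	/* true while new writes must wait */
};

/* Account for a queued write; returns true if the writer must wait. */
bool
wpipe_start(struct wpipe *w, long bytes)
{
	assert(bytes >= 0);
	w->running += bytes;
	if (w->running > w->hi)
		w->stalled = true;
	return (w->stalled);
}

/* Account for a completed write; returns true if waiters may resume. */
bool
wpipe_done(struct wpipe *w, long bytes)
{
	assert(bytes >= 0 && bytes <= w->running);
	w->running -= bytes;
	if (w->stalled && w->running <= w->lo) {
		w->stalled = false;
		return (true);
	}
	return (false);
}
```

The gap between the two marks is what keeps writers from being woken
and stalled again on every completed buffer.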
This is because calls with M_WAIT (now M_TRYWAIT) may not wait
forever when nothing is available for allocation, and may end up
returning NULL. Hopefully we now communicate more of the right thing
to developers and make it very clear that it's necessary to check whether
calls with M_(TRY)WAIT also resulted in a failed allocation.
M_TRYWAIT basically means "try harder, block if necessary, but don't
necessarily wait forever." The time spent blocking is tunable with
the kern.ipc.mbuf_wait sysctl.
M_WAIT is now deprecated but still defined for the next little while.
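
A toy illustration of why callers must check the result even when they
asked to wait. The SK_* flags, struct pool, pool_get() and send_one()
are hypothetical and are not the mbuf API; max_wait only mimics the
spirit of kern.ipc.mbuf_wait.

```c
#include <assert.h>
#include <stddef.h>

#define SK_DONTWAIT	0	/* hypothetical: fail immediately */
#define SK_TRYWAIT	1	/* hypothetical: wait a bounded time, then fail */

struct pool {
	int avail;	/* free buffers right now */
	int max_wait;	/* bound on waiting, in arbitrary ticks */
	char buf[1];	/* storage handed out by this toy allocator */
};

/* Returns a buffer, or NULL if none became available within the bound. */
void *
pool_get(struct pool *p, int how)
{
	int waited = 0;

	assert(p != NULL);
	while (p->avail == 0) {
		if (how == SK_DONTWAIT || waited >= p->max_wait)
			return (NULL);
		waited++;	/* a real kernel would sleep here */
	}
	p->avail--;
	return (p->buf);
}

/* Callers must check the result even when they asked to wait. */
int
send_one(struct pool *p)
{
	void *m = pool_get(p, SK_TRYWAIT);

	if (m == NULL)
		return (-1);	/* allocation failed after the bounded wait */
	return (0);
}
```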
* Fix a typo in a comment in mbuf.h
* Fix some code that was actually passing the mbuf subsystem's M_WAIT to
malloc(). Made it pass M_WAITOK instead. If we were ever to redefine the
value of the M_WAIT flag, this could have become a big problem.
point in calling a function just to set a flag.
Keep better track of the syslog FAC/PRI code and try to DTRT if
they mingle.
Log all writes to /dev/console to syslog with <console.info>
priority. The formatting is not preserved; there is no robust
way of doing it. (Ideas with patches welcome).
the kernel console. Instead, change logwakeup() to set a flag in the
softc. A callout then wakes up every so often and wakes up any processes
selecting on /dev/log (such as syslogd) if the flag is set. By default
this callout fires 5 times a second, but that can be adjusted by the
sysctl kern.log_wakeups_per_second.
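
The flag-plus-periodic-callout pattern can be sketched as follows. All
names are hypothetical, and dividing hz by the rate to get the interval
is an assumption about how such a rate would map to clock ticks.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical miniature of the log softc; field names are illustrative. */
struct log_softc_sketch {
	bool need_wakeup;	/* set by the producer, cleared by the callout */
	int wakeups;		/* times sleepers were actually woken */
};

/* Producer side: only sets a flag, so it is cheap to call anywhere. */
void
logwakeup_sketch(struct log_softc_sketch *sc)
{
	sc->need_wakeup = true;
}

/* Periodic side: would be run by a callout several times per second. */
void
log_tick_sketch(struct log_softc_sketch *sc)
{
	if (!sc->need_wakeup)
		return;
	sc->need_wakeup = false;
	sc->wakeups++;		/* stands in for waking select()ing processes */
}

/* Interval between ticks, in clock ticks, for a given rate. */
int
log_tick_interval(int hz, int wakeups_per_second)
{
	assert(hz > 0 && wakeups_per_second > 0);
	return (hz / wakeups_per_second);
}
```

Any number of logwakeup_sketch() calls between two ticks collapse into
a single wakeup.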
Reviewed by: phk
commands have also been slightly updated as follows:
- Use ktr_idx to find the newest entry rather than walking the buffer
comparing timespecs. Timespecs are not always unique after the change
to use getnanotime(9).
- Add a new verbose setting. When the verbose setting is on, then the
timestamp is printed with each message. If KTR_EXTEND is on, then the
filename and line number are output as well. By default this option is
off. It can be turned on with the 'v' modifier passed to the 'tbuf'
and 'tall' commands. For the 'tnext' command, the 'v' modifier toggles
the verbose mode.
- Only display the cpu number for each message on SMP systems.
- Don't display anything for an empty entry that hasn't been used yet.
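
Finding the newest entry from an index instead of comparing timestamps
might look like the sketch below. It assumes the index names the slot
that will be written next; NENTRIES and newest_entry() are hypothetical.

```c
#include <assert.h>

#define NENTRIES 8	/* hypothetical buffer size */

/*
 * Assumes 'next_idx' is the slot that will be written next, so the
 * newest entry sits one slot behind it, wrapping at the buffer start.
 */
int
newest_entry(int next_idx)
{
	assert(next_idx >= 0 && next_idx < NENTRIES);
	return ((next_idx + NENTRIES - 1) % NENTRIES);
}
```

Because the result depends only on the index, two entries with
identical timestamps can no longer confuse the search.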
functions. If this flag is set, then no KTR log messages are issued.
This is useful for blocking excessive logging, such as with the internal
mutex used by the witness code.
- Use MTX_QUIET on all of the mtx_enter/exit operations on the internal
mutex used by the witness code.
- If we are in a panic, don't do witness checks in witness_enter(),
witness_exit(), and witness_try_enter(), just return.
vm86_trap() to return to the calling program directly. vm86_trap()
doesn't return, thus it was never returning to trap() to release
Giant. Thus, release Giant before calling vm86_trap().
of explicit calls to lockmgr. Also provides macros for the flags
passed to specify shared, exclusive or release which map to the
lockmgr flags. This is so that the use of lockmgr can be easily
replaced with optimized reader-writer locks.
- Add some locking that I missed the first time.
no longer contains kernel specific data structures, but rather
only scalar values and structures that are already part of the
kernel/user interface, specifically rusage and rtprio. It no
longer contains proc, session, pcred, ucred, procsig, vmspace,
pstats, mtx, sigiolst, klist, callout, pasleep, or mdproc. If
any of these changed in size, ps, w, fstat, gcore, systat, and
top would all stop working. The new structure has over 200 bytes
of unassigned space for future values to be added, yet is nearly
100 bytes smaller per entry than the structure that it replaced.
be safely held across an eventhandler function call.
- Fix an instance of the head of an eventhandler list being read without
the lock being held.
- Break down and use a SYSINIT at the new SI_SUB_EVENTHANDLER to initialize
the eventhandler global mutex and the eventhandler list of lists rather
than using a non-MP safe initialization during the first call to
eventhandler_register().
- Add in a KASSERT() to eventhandler_register() to ensure that we don't try
to register an eventhandler before things have been initialized.
held and panic if so (conditional on witness).
- Change witness_list to return the number of locks held so this is easier.
- Add kern/syscalls.c to the kernel build if witness is defined so that the
panic message can contain the name of the offending system call.
- Add assertions that Giant and sched_lock are not held when returning from
a system call, which were missing for alpha and ia64.
can lead to further panics.
- Call getnanotime() instead of nanotime() for the timestamp. nanotime()
is more precise, but it also calls into the timer code, which results
in mutex operations on the i386 arch. If KTR_LOCK is turned on, then
ktr_tracepoint() recurses on itself until it exhausts the kernel stack.
Eventually this should change to use get_cyclecount() instead, but that
can't happen if get_cyclecount() is calling nanotime() instead of
getnanotime().
Deal with excessive dirty buffers when msync() syncs non-contiguous
dirty buffers by checking for the case in UFS *before* checking for
clusterability.
__P() prototypes when an ansi-style static inline is a prototype already.
Since vnode_if.[ch] are generated on the fly, there are no CVS diffs to
mess up.
SMP problem. Compaq, in their infinite wisdom, forgot to put the IO apic
intpin #0 connection to the 8259 PIC into the mptable. This hack is to
look and see if intpin #0 has *no* table entry and adds a fake ExtInt
entry for the remap routines to use. isa/clock.c will still test the
interrupts. This entry is only ever used on an already broken system.
where fork1() could put the process on the run queue where it could be
snatched up by another CPU before kthread_create() had set the proper
fork handler. Instead, we put the new kthread on the runqueue after its
fork handler has been set.
Noticed by: jake
Looked over by: peter
spending, which was unused now that all software interrupts have
their own thread. Make the legacy schednetisr use an atomic op
for setting bits in the netisr mask.
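
Setting a bit in a shared mask with an atomic op, sketched with C11
atomics rather than the kernel's own primitives. The variable and
function names are hypothetical, and the collect step is added only to
make the example complete.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical pending mask; bit n set means interrupt n is pending. */
static atomic_uint netisr_sketch;

/* Mark interrupt 'n' pending without a lock; returns the previous mask. */
unsigned int
schednetisr_sketch(unsigned int n)
{
	assert(n < 32);
	return (atomic_fetch_or(&netisr_sketch, 1u << n));
}

/* Atomically take and clear the whole mask, as a dispatcher might. */
unsigned int
netisr_collect_sketch(void)
{
	return (atomic_exchange(&netisr_sketch, 0u));
}
```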
Reviewed by: jhb
Also, while here, run up to 32 interrupt sources on APIC systems.
Normalize INTREN/INTRDIS so they are the same on both UP and SMP systems
rather than sometimes a macro, and sometimes a function.
Reviewed by: jhb, jakeb
and numvnodes are longs in the kernel. They should remain longs in systat;
what really needs to change is that they should be using SYSCTL_LONG rather
than SYSCTL_INT. I also changed wantfreevnodes to SYSCTL_LONG because I
happened to notice it.
I wish there was a way to find all of these automatically..
Pointed out by: bde
from struct proc, which are now unused (p_nthread already was).
Remove process flag P_KTHREADP which was untested and only set
in vfs_aio.c (it should use kthread_create). Move the yield
system call to kern_synch.c as kern_threads.c has been removed
completely.
moral support from: alfred, jhb
lock. Otherwise, if we block on the backing mutex while releasing the
allproc lock, then when we resume, we will be at SRUN, and we will stay
that way all the way through cpu_exit. As a result, our parent will never
harvest us.
depend on MUTEX_DEBUG. The MUTEX_DEBUG option turns on extra assertions
and checks to verify that mutexes themselves are implemented properly.
The WITNESS option uses extra checks and diagnostics to verify that other
code is using mutexes properly.
passed vnode must be locked; this is the case because of calls
to VOP_GETATTR(), VOP_ACCESS(), and VOP_OPEN(). This becomes
more of an issue when VOP_ACCESS() gets a bit more complicated,
which it does when you introduce ACL, Capability, and MAC
support.
Obtained from: TrustedBSD Project
recently discussed at -hackers. The problem is a null-pointer
dereference that happens in kern/vfs_lookup.c when accessing ".."
with a v_mount entry for the current directory vnode of NULL. This
happens when a volume is forcibly unmounted, and the vnode for a
working directory in the mounted volume is cleared.
PR: 23191
Submitted by: Thomas Moestl <tmoestl@gmx.net>
locking the global hash on each uifree()
make struct uidinfo only visible to the kernel
make uihold() a function rather than a macro to reduce bloat
swap the order of a spl/mutex to maintain consistency
process is on the alternate stack or not. For compatibility
with sigstack(2), the state is still updated where needed.
We now determine whether the process is on the alternate
stack by looking at its stack pointer. This allows a process
to siglongjmp from a signal handler on the alternate stack
to the place of the sigsetjmp on the normal stack. When
maintaining state, this would have invalidated the state
information, causing a subsequent signal to be delivered
on the normal stack instead of the alternate stack.
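
A sketch of deciding "on the alternate stack" purely from the stack
pointer. struct altstack and on_altstack() are hypothetical, and the
half-open range [base, base + size) is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of a registered alternate signal stack. */
struct altstack {
	uintptr_t base;	/* lowest address of the stack */
	size_t size;	/* size in bytes; 0 means none registered */
};

/*
 * True if 'sp' lies within [base, base + size). The unsigned
 * subtraction wraps to a huge value when sp is below base, so a
 * single comparison covers both bounds.
 */
bool
on_altstack(const struct altstack *as, uintptr_t sp)
{
	assert(as != NULL);
	return (sp - as->base < as->size);
}
```

Since nothing is stored, a siglongjmp back to the normal stack cannot
leave stale state behind.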
PR: 22286
- Use a better test for determining when a process is running.
- Convert some checks to assertions.
- Remove unnecessary tests.
- Save the priority before acquiring a mutex rather than in msleep(9).
timeout. If DIAGNOSTIC is turned on, then display a message to the console
with a map of which CPUs failed to stop or restart. This gives an SMP box
at least a fighting chance of getting into DDB if one of the other CPUs has
interrupts disabled.
Alter consumers of this method to conform to the new convention.
Minor cosmetic adjustments to bus.h.
This isn't of concern as this interface isn't in use yet.
1) mpsafe (protect the refcount with a mutex).
2) reduce duplicated code by removing the inlined crdup() from crcopy()
and make crcopy() call crdup().
3) use M_ZERO flag when allocating initial structs instead of calling bzero
after allocation.
4) expand the size of the refcount from a u_short to a u_int; by using
shorts we might have an overflow.
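
The overflow concern in item 4 can be shown with a small sketch. The
structures are hypothetical and the mutex from item 1 is not modeled.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical credential with the narrow counter the old code used. */
struct cred_short {
	unsigned short ref;
};

/* Same idea with the widened counter. */
struct cred_uint {
	unsigned int ref;
};

/* Take a reference; returns the new count. */
unsigned short
hold_short(struct cred_short *c)
{
	return (++c->ref);	/* wraps to 0 after USHRT_MAX holds */
}

/* Take a reference; returns the new count. */
unsigned int
hold_uint(struct cred_uint *c)
{
	return (++c->ref);
}
```

Once the narrow counter wraps to zero, the next release would free a
structure that still has many holders.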
Glanced at by: jake
use a mutex lock when looking up/deleting entries on the hashlist
use a mutex lock on each uidinfo when updating fields
make uifree() a void function rather than 'int' since no one cares
allocate uidinfo structs with the M_ZERO flag and don't explicitly initialize
them
Assisted by: eivind, jhb, jakeb
a kevent upon completion of the I/O. Specifically, introduce a new type
of sigevent notification, SIGEV_EVENT. If sigev_notify is SIGEV_EVENT,
then sigev_notify_kqueue names the kqueue that should receive the event
and sigev_value contains the "void *" that is copied into the kevent's udata
field.
In contrast to the existing interface, this one: 1) works on
the Alpha 2) avoids the extra copyin() call for the kevent because all
of the information needed is in the sigevent and 3) could be
applied to request a single kevent upon completion of an entire lio_listio().
Reviewed by: jlemon
could not compress into clusters. This could result in lots of
wasted clusters while receiving small packets from an interface
that uses clusters for all its packets.
Patch is partially from BSDi (limiting the size of the copy) and
based on a patch for 4.1 by Ian Dowse <iedowse@maths.tcd.ie> and
myself.
Reviewed by: bmilekic
Obtained From: BSDi
Submitted by: iedowse
- Use the mutex in hardclock to ensure no races between it and
softclock.
- Make softclock be INTR_MPSAFE and provide a flag,
CALLOUT_MPSAFE, which specifies that a callout handler does not
need giant. There is still no way to set this flag when
registering a callout.
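
How a per-callout flag could let the dispatcher skip Giant, as a
user-space sketch. The flag value, structure and counter are
hypothetical; the counter merely stands in for taking the lock.

```c
#include <assert.h>

#define CALLOUT_MPSAFE_SK 0x1	/* hypothetical flag value */

struct callout_sketch {
	int flags;
	void (*func)(void *);
	void *arg;
};

static int giant_acquires;	/* counts simulated Giant acquisitions */

/* Run one handler, taking the simulated Giant lock only when required. */
void
run_callout(const struct callout_sketch *c)
{
	int need_giant = !(c->flags & CALLOUT_MPSAFE_SK);

	assert(c->func != 0);
	if (need_giant)
		giant_acquires++;	/* stands in for acquiring Giant */
	c->func(c->arg);
	/* a real implementation would release Giant here if it was taken */
}

/* Example handler: increments the int that 'arg' points to. */
void
bump(void *arg)
{
	++*(int *)arg;
}

/* Number of times the simulated Giant lock was taken so far. */
int
giant_count(void)
{
	return (giant_acquires);
}
```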
Reviewed by: -smp@, jlemon
Removed most of the hacks that were trying to deal with low-memory
situations prior to now.
The new code is based on the concept that I/O must be able to function in
a low memory situation. All major modules related to I/O (except
networking) have been adjusted to allow allocation out of the system
reserve memory pool. These modules now detect a low memory situation but
rather than block they instead continue to operate, then return resources
to the memory pool instead of caching them or leaving them wired.
Code has been added to stall in a low-memory situation prior to a vnode
being locked.
Thus situations where a process blocks in a low-memory condition while
holding a locked vnode have been reduced to near nothing. Not only will
I/O continue to operate, but many prior deadlock conditions simply no
longer exist.
Implement a number of VFS/BIO fixes
(found by Ian): in biodone(), bogus-page replacement code, the loop
was not properly incrementing loop variables prior to a continue
statement. We do not believe this code can be hit anyway but we
aren't taking any chances. We'll turn the whole section into a
panic (as it already is in brelse()) after the release is rolled.
In biodone(), the foff calculation was incorrectly
clamped to the iosize, causing the wrong foff to be calculated
for pages in the case of an I/O error or biodone() called without
initiating I/O. The problem always caused a panic before. Now it
doesn't. The problem is mainly an issue with NFS.
Fixed casts for ~PAGE_MASK. This code worked properly before only
because the calculations use signed arithmetic. Better to properly
extend PAGE_MASK first before inverting it for the 64 bit masking
op.
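
The hazard can be shown with an unsigned 32-bit mask, where inverting
before widening drops the upper half of a 64-bit offset. The macro and
function names are hypothetical, and the sketch assumes a 32-bit
unsigned int.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_MASK_SKETCH 0xfffu	/* hypothetical unsigned 32-bit page mask */

/* Inverting first gives a 32-bit value that is then zero-extended. */
uint64_t
trunc_page_narrow(uint64_t off)
{
	return (off & ~PAGE_MASK_SKETCH);
}

/* Widening first keeps the upper 32 bits of the offset intact. */
uint64_t
trunc_page_wide(uint64_t off)
{
	return (off & ~(uint64_t)PAGE_MASK_SKETCH);
}
```

For an offset of 0x100001234 the narrow form yields 0x1000 while the
wide form yields 0x100001000.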
In brelse(), the bogus_page fixup code was improperly throwing
away the original contents of 'm' when it did the j-loop to
fix the bogus pages. The result was that it would potentially
invalidate parts of the *WRONG* page(!), leading to corruption.
There may still be cases where a background bitmap write is
being duplicated, causing potential corruption. We have identified
a potentially serious bug related to this but the fix is still TBD.
So instead this patch contains a KASSERT to detect the problem
and panic the machine rather than continue to corrupt the filesystem.
The problem does not occur very often.. it is very hard to
reproduce, and it may or may not be the cause of the corruption
people have reported.
Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>
Pre-rfork code assumed inherent locking of a process's file descriptor
array. However, with the advent of rfork() the file descriptor table
could be shared between processes. This patch closes over a dozen
serious race conditions related to one thread manipulating the table
(e.g. closing or dup()ing a descriptor) while another is blocked in
an open(), close(), fcntl(), read(), write(), etc...
PR: kern/11629
Discussed with: Alexander Viro <viro@math.psu.edu>
are in softclock() for a long time. The old code already did an
splx()/splhigh() pair here, I just missed adding in the equivalent mutex
operations on sched_lock earlier.
may block on a mutex while on the sleep queue without corrupting
it.
- Move dropping of Giant to after the acquire of sched_lock.
Tested by: John Hay <jhay@icomtek.csir.co.za>
jhb
untimeout() not being called with Giant in those functions. For now,
use the sched_lock to protect the callout wheel in softclock() and in
the various timeout and callout functions.
Noticed by: tegge
acquire Giant as needed in functions that call mi_switch(). The releases
need to be done outside of the sched_lock to avoid potential deadlocks
from trying to acquire Giant while interrupts are disabled.
Submitted by: witness
idea to be holding the sched_lock while we are calling it. As such,
release sched_lock before calling CURSIG() in msleep() and mawait() and
reacquire it after CURSIG() returns.
Submitted by: witness
tsleep(). Namely, mawait() takes an extra argument which is a mutex
to drop when going to sleep. Just as with msleep(), if the priority
argument includes the PDROP flag, then the mutex will be dropped and will
not be reacquired when the process wakes up.
- Add in a backwards compatible macro await() that passes in NULL as the
mutex argument to mawait().
macros which provide the same functionality and are a bit more
efficient, convert use of CIRCLEQ's in resource manager to TAILQ's.
Approved by: Garrett Wollman <wollman@khavrinen.lcs.mit.edu>
whether the given ext_buf is shared.
* Have the sf_bufs be setup with the mbuf subsystem using MEXTADD() with the
two new arguments.
Note: m_pulldown() is somewhat crotchy; the added comment explains the
situation.
Reviewed by: jlemon
could debug processes after the commit that introduced the typo.
Security is good, but security is not always the same as turning things
off :-).
PR: kern/22711
Obtained from: brooks@one-eyed-alien.net
device tree and resource manager contents. This is the kernel side of
the upcoming libdevinfo, which will expose this information to userspace
applications in a trivial fashion.
Remove the now-obsolete DEVICE_SYSCTLS code.
syscall compare against a variable sv_minsigstksz in struct
sysentvec so as to properly take the size of the machine- and
ABI dependent struct sigframe into account.
The SVR4 and iBCS2 modules continue to have a minsigstksz of
8192 to preserve behavior. The real values (if different) are
not known at this time. Other ABI modules use the real
values.
The native MINSIGSTKSZ is now defined as follows:
Arch MINSIGSTKSZ
---- -----------
alpha 4096
i386 2048
ia64 12288
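
A sketch of a per-ABI minimum size check using the values listed above.
struct sysentvec_sketch is a hypothetical cut-down structure, and
returning ENOMEM for a too-small stack follows the documented
sigaltstack(2) behavior rather than the actual kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical cut-down sysentvec carrying only the field discussed. */
struct sysentvec_sketch {
	const char *sv_name;
	size_t sv_minsigstksz;
};

/* Values from the table above and from the SVR4/iBCS2 note. */
const struct sysentvec_sketch sv_alpha = { "alpha", 4096 };
const struct sysentvec_sketch sv_i386 = { "i386", 2048 };
const struct sysentvec_sketch sv_ia64 = { "ia64", 12288 };
const struct sysentvec_sketch sv_svr4 = { "svr4", 8192 };

/* Returns 0 if the stack is large enough for this ABI, else ENOMEM. */
int
check_altstack_size(const struct sysentvec_sketch *sv, size_t ss_size)
{
	assert(sv != NULL);
	return (ss_size < sv->sv_minsigstksz ? ENOMEM : 0);
}
```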
Reviewed by: mjacob
Suggested by: bde
function declared in kern_ktr.c. The only inline checks left are the
checks that compare KTR_COMPILE with the supplied mask and thus should
be optimized away into either nothing or a direct call to ktr_tracepoint().
- Move several KTR-related options to opt_ktr.h now that they are only
needed by kern_ktr.c and not by ktr.h.
- Add in the ktr_verbose functionality if KTR_EXTEND is turned on. If the
global variable 'ktr_verbose' is non-zero, then KTR messages will be
dumped to the console. This variable can be set by either kernel code
or via the 'debug.ktr_verbose' sysctl. It defaults to off unless the
KTR_VERBOSE kernel option is specified in which case it defaults to on.
This can be useful when the machine locks up spinning in a loop with
interrupts disabled as you might be able to see what it is doing when it
locks up.
Requested by: phk
wakeup all of the sleeping threads when we free only one buffer. This
avoids us having to needlessly try again (and fail, and go back to
sleep) for all the threads sleeping. We will now only wakeup the
thread we know will succeed.
Reviewed by: green
freelist. Should now be thread-friendly, in part.
Note: More work is needed in uipc_syscalls.c, but it will have to wait until
the socket locking issues are at least 80% implemented and committed.
getnewvnode(). Otherwise routines called from VOP_INACTIVE() might
attempt to remove the vnode from a free list the vnode isn't on,
causing corruption.
PR: 18012
This removes support for booting current kernels with very old bootblocks.
Device driver writers: Please remove initializations for the d_bmaj
field in your cdevsw{}.
current implementation, jail neither virtualizes the Sys V IPC namespace,
nor provides inter-jail protections on IPC objects.
o Support for System V IPC can be enabled by setting jail.sysvipc_allowed=1
using sysctl.
o This is not the "real fix" which involves virtualizing the System V
IPC namespace, but prevents processes within jail from influencing those
outside of jail when not approved by the administrator.
Reported by: Paulo Fragoso <paulo@nlink.com.br>
in the p_candebug() function. Synchronize with sef's CHECKIO()
macro from the old procfs, which seems to be a good source of security
checks.
Obtained from: TrustedBSD Project
This allows writing to DVD-RAM, PD and similar drives that probe as CD
devices. Note that these are randomly writeable devices, not
sequential-only devices like CD-R drives, which are supported by cdrecord.
Add a new flag value for dsopen(), DSO_COMPATLABEL. The cd(4) driver now
uses this flag instead of the DSO_NOLABELS flag. The DSO_NOLABELS always
used a "fake" disklabel for the entire disk, provided by the caller.
With the DSO_COMPATLABEL flag, dsopen() will first search the media for a
label, and if it finds a label, it will use that label. Otherwise it will
use the fake disklabel provided by the caller. This provides backwards
compatibility, since we will still have labels for ISO9660 media.
It also provides new functionality, since you can now have a regular BSD
disklabel on read-only media, or on writeable media (e.g. DVD-RAM).
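
The label selection described above, reduced to a sketch. The flag
values, struct label_sketch and pick_label() are hypothetical, and the
behavior with neither flag set is an assumption.

```c
#include <assert.h>
#include <stddef.h>

#define DSO_NOLABELS_SK		0x1	/* hypothetical flag values */
#define DSO_COMPATLABEL_SK	0x2

struct label_sketch {
	const char *origin;
};

/*
 * 'on_media' is the label found on the media, or NULL if none was
 * found; 'fake' is the whole-disk label supplied by the caller.
 */
const struct label_sketch *
pick_label(int flags, const struct label_sketch *on_media,
    const struct label_sketch *fake)
{
	assert(fake != NULL);
	if (flags & DSO_NOLABELS_SK)
		return (fake);		/* always use the caller's label */
	if (flags & DSO_COMPATLABEL_SK)
		return (on_media != NULL ? on_media : fake);
	return (on_media);		/* assumed: only a real label will do */
}
```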
Bruce and I both think that we should eventually (in a few years) get
away from using disklabels for ISO9660 media, and just use the whole disk
device (/dev/cd0). At that point disklabel handling in the cd(4) driver
could follow the "normal" model, as used in the da(4) driver.
Also, clean up the path in a couple of places in cdregister(). (Thanks to
Nick Hibma for catching that bug.)
Reviewed by: bde
Otherwise, aio_read() and aio_write() on sockets are broken if a kevent is
registered. (The code after kevent registration for handling sockets assumes
that the struct file pointer "fp" still refers to the socket, not the kqueue.)