the system when brelse() was called with B_RELBUF set on the buffer. This
could be a problem when the system was low on memory, had many buffers on
QUEUE_EMPTYKVA and started to traverse directories. For each getnewbuf(),
pages were allocated from the system, driving the free reserve downwards.
For each brelse(), the system put the buffer on QUEUE_CLEAN, with B_INVAL
set.
This commit changes the semantics of B_RELBUF to also free pages from
non-VMIO buffers.
Reviewed by: alc
- td_ar to struct thread, which holds the in-progress audit record during
a system call.
- p_au to struct proc, which holds per-process audit state, such as the
audit identifier, audit terminal, and process audit masks.
In the earlier implementation, td_ar was added to the zero'd section of
struct thread. In order to facilitate merging to RELENG_6, it has been
moved to the end of the data structure, requiring explicit
initialization in the thread constructor.
Much help from: wsalamon
Obtained from: TrustedBSD Project
without Giant held. Do this by tracking the vfslocked state for
the directory separately from the child. This is only important
in the case where we cross a mountpoint.
Sponsored by: Isilon Systems, Inc.
MFC After: 3 days
on a lock held the last usecount ref on a vnode and the lock failed we
would not call INACTIVE. Solve this by only holding a holdcnt to prevent
the vnode from disappearing while we wait on vn_lock. Other callers
may now VOP_INACTIVE while we are waiting on the lock; however, this race
is acceptable, while losing INACTIVE is not.
Discussed with: kan, pjd
Tested by: kkenn
Sponsored by: Isilon Systems, Inc.
MFC After: 1 week
directory. vrele() may lock the passed vnode, which in these cases would
give an invalid lock order of child -> parent. These situations are
deadlock-prone, although they do not typically deadlock because the vrele
is usually not releasing the last reference to the vnode. Users of
vrele must consider it as a call to vn_lock() and order it appropriately.
MFC After: 1 week
Sponsored by: Isilon Systems, Inc.
Tested by: kkenn
This logic change was introduced in revision 1.74:
Correct an oversight in jail() that allowed processes in jail to access
ptys in ways that might be unethical, especially towards processes not in
jail, or in other jails.
It should be fine to allow root in the host environment to do this. This
allows for more effective monitoring of prisons from the host environment.
Discussed with: rwatson
MFC after: 1 week
It detects both buffer underflow and buffer overflow bugs at runtime
(on free(9) and realloc(9)) and prints backtraces from where memory was
allocated and from where it was freed.
Tested by: kris
work by yar, thompsa and myself. The checksum offloading part also involves
work done by Mihail Balikov.
The most important changes:
o Instead of a global linked list of all vlan softcs, use a per-trunk
hash. The size of the hash is dynamically adjusted, depending on the
number of entries. This changes struct ifnet, replacing the counter
of vlans with a pointer to the trunk structure. This change is an
improvement for setups with a large number of VLANs, several interfaces
and several CPUs. It is a small regression for a setup with a single
VLAN interface.
An alternative to the dynamic hash is a per-trunk static array with
4096 entries, which is a compile time option - VLAN_ARRAY. In my
experiments the array is not an improvement, probably because such
a big trunk structure doesn't fit into CPU cache.
o Introduce a UMA zone for VLAN tags. Since drivers depend on it,
the zone is declared in kern_mbuf.c, not in the optional vlan(4) driver.
This change is a big improvement for any setup utilizing vlan(4).
o Use rwlock(9) instead of mutex(9) for locking. We are the first
ones to do this! :)
o Some drivers can do hardware VLAN tagging + hardware checksum
offloading. Add an infrastructure for this. Whenever vlan(4) is
attached to a parent or the parent's configuration is changed, the flags
on the vlan(4) interface are updated.
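The per-trunk hash from the first item can be sketched as follows. This is
a minimal illustration and not the vlan(4) code: struct trunk, struct
vlan_entry, trunk_init(), trunk_insert() and trunk_lookup() are hypothetical
names, and the bucket count is fixed here, whereas the commit adjusts the
hash size dynamically.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	TRUNK_BUCKETS	16	/* hypothetical fixed size, a power of two */
#define	TRUNK_HASH(tag)	((tag) & (TRUNK_BUCKETS - 1))

struct vlan_entry {			/* stand-in for a vlan softc */
	uint16_t		 tag;	/* 12-bit VLAN ID */
	struct vlan_entry	*next;	/* chain within one bucket */
};

struct trunk {				/* stand-in for the per-trunk structure */
	struct vlan_entry	*bucket[TRUNK_BUCKETS];
	unsigned int		 count;
};

static void
trunk_init(struct trunk *t)
{
	size_t i;

	for (i = 0; i < TRUNK_BUCKETS; i++)
		t->bucket[i] = NULL;
	t->count = 0;
}

static void
trunk_insert(struct trunk *t, struct vlan_entry *e)
{
	unsigned int i = TRUNK_HASH(e->tag);

	assert(e->tag < 4096);		/* VLAN IDs are 12 bits wide */
	e->next = t->bucket[i];		/* push onto the bucket's chain */
	t->bucket[i] = e;
	t->count++;
}

static struct vlan_entry *
trunk_lookup(const struct trunk *t, uint16_t tag)
{
	struct vlan_entry *e;

	/* Only one bucket is walked, not the list of every vlan. */
	for (e = t->bucket[TRUNK_HASH(tag)]; e != NULL; e = e->next)
		if (e->tag == tag)
			return (e);
	return (NULL);
}
```

With many VLANs per parent, a lookup touches one short chain instead of a
global list, which is where the improvement described above would come from.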
In collaboration with: yar, thompsa
In collaboration with: Mihail Balikov <mihail.balikov interbgc.com>
this is more consistent with the placement of slaves in /dev/pts. The
actual name doesn't matter as it's not part of the exposed API or used by
libc. In some sense, it would be nice if these device nodes didn't have to
have names in devfs at all.
Suggested by: Stephen McKay <smckay at internode dot on dot net>
specially crafted module. There are several hand-rolled solutions to this
problem in the tree already which will be replaced with this. They include
iwi(4), ipw(4), ispfw(4) and digi(4).
No objection from: arch
MFC after: 2 weeks
X-MFC after: some drivers have been converted
implementation is by no means perfect as far as some of the algorithms
that it uses and the fact that it is missing some functionality (try
locks and upgrades/downgrades are not there yet), however it does seem
to work in my local testing. There is more detail in the comments in the
code, but the short version follows.
A reader/writer lock is very much like a regular mutex: it cannot be held
across a voluntary sleep; it can be acquired in an interrupt thread; if
the lock is held by a writer then the priority of any threads that block
on the lock will be lent to the owner; the simple-case lock operations are
all done in a single atomic op. It also shares some similarities
with sx locks: it supports reader/writer semantics (multiple readers,
but single writers); readers are allowed to recurse, but writers are not.
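The reader/writer semantics above (multiple readers, a single writer,
recursion for readers only) can be modelled in a few lines of userland C.
This is only a single-threaded state model with hypothetical rw_model_*
names. It does not block, lend priority or use atomics, and the "try"
functions merely report whether an acquisition would succeed immediately;
they are not the try-lock operations that the implementation still lacks.

```c
#include <assert.h>
#include <stdbool.h>

struct rw_model {
	unsigned int	readers;	/* read holds, counting recursion */
	bool		writer;		/* true while a writer holds the lock */
};

/* A read hold succeeds unless a writer holds the lock. */
static bool
rw_model_try_rlock(struct rw_model *rw)
{
	if (rw->writer)
		return (false);
	rw->readers++;
	return (true);
}

/* A write hold needs the lock to be completely free; no recursion. */
static bool
rw_model_try_wlock(struct rw_model *rw)
{
	if (rw->writer || rw->readers > 0)
		return (false);
	rw->writer = true;
	return (true);
}

static void
rw_model_runlock(struct rw_model *rw)
{
	assert(rw->readers > 0);
	rw->readers--;
}

static void
rw_model_wunlock(struct rw_model *rw)
{
	assert(rw->writer);
	rw->writer = false;
}
```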
We can extend this implementation further by either improving algorithms
or adding new functionality, but this should at least give us a base to
work with now.
Reviewed by: arch (in theory)
Tested on: i386 (4 cpu box with a kernel module that used 4 threads
that randomly chose between read locks and write locks
that ran w/o panicking for over a day solid. It usually
panic'd within a few seconds when there were bugs during
testing. :) The kernel module source is available on
request.)
each turnstile. Also, allow for the owner thread pointer of a turnstile
to be NULL. This is needed for the upcoming reader/writer lock
implementation.
- Add a new ddb command 'show turnstile' that will look up the turnstile
associated with the given lock argument and display useful information
like the list of threads blocked on each queue, etc. If there isn't an
active turnstile for a lock at the specified address, then the function
will see if there is an active turnstile at the specified address and
display info about it if so.
- Adjust the mutex code to handle the turnstile API changes.
Tested on: i386 (all), alpha, amd64, sparc64 (1 and 3)
argument and looks for a sleep queue associated with that wait channel.
If it finds one it will display information such as the list of threads
sleeping on that queue. If it can't find a sleep queue for that wait
channel, then it will see if that address matches any of the active
sleep queues. If so, it will display information about the sleepq at the
specified address.
sysctl then it will clear the KTR buffer. Note that if you have active
KTR traces at the same time as a clear operation the behavior is undefined,
though it shouldn't panic.
It should play nicely with the existing BSD ptys.
By default, the system will use the BSD ptys; one can set the sysctl
kern.pts.enable to 1 to make it use the new pts system.
The maximum number of ptys that can be allocated on a system can be changed
with the sysctl kern.pts.max. It defaults to 1000 and can be increased, but
that is not recommended, as any pty with a number > 999 won't be handled by
whatever uses utmp(5).
modified bit emulation traps on Alpha while holding locks in the
sysctl handler.
A better solution would be to pass a hint to the Alpha pmap code to
tell it to mark these pages as modified as they are being wired,
but that appears to be more difficult to implement.
Suggested by: jhb
MFC after: 3 days
placeholder similar to KTR_DEV. Explain the use of KTR_DEV and
KTR_SUBSYS in a comment as well.
- Retire KTR_WITNESS and instead have KTR_WITNESS default to off but use
KTR_SUBSYS if it is enabled.
1) unregister kqueue filter for EVFILT_LIO.
2) free uma_zones.
3) call setsid directly to enter another session rather than
implementing it by hand.
Submitted by: jhb
The success of the cluster allocation is checked through a field in the
mbuf structure. This change is non-functional but properly silences code
inspection tools.
Found by: Coverity Prevent(tm)
Coverity ID: CID807
Sponsored by: TCP/IP Optimization Fundraise 2005
lookup() instead of EPERM when a DELETE or RENAME operation is
attempted on "..".
In kern_unlink(), remap EINVAL errors returned from namei() to EPERM
to match existing (and POSIX required) behaviour.
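The remapping in kern_unlink() amounts to a one-line translation of the
error code. A minimal sketch, assuming a hypothetical helper name
unlink_remap_error() (the actual code may well do this inline):

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical helper: translate the error returned by the lookup so
 * that callers of unlink keep seeing EPERM where the lookup now
 * reports EINVAL.  All other values pass through unchanged.
 */
static int
unlink_remap_error(int error)
{
	return (error == EINVAL ? EPERM : error);
}
```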
Discussed with: bde
MFC after: 3 days
used by utilities to reset moused(8), for example. The syntax is:
!system=kern subsystem=power type=resume
Note that it would be nice to have notification of suspend, but it's more
difficult since there would have to be a method of doing request/ack
to userland and automatically timing out if there is no response. apm(4) has a
similar mechanism.
MFC after: 2 weeks
An executable contains at most one PT_INTERP program header. Therefore,
the loop that searches for it can terminate after it is found rather than
iterating over the entire set of program headers.
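A minimal sketch of the early-exit search, assuming a simplified program
header type. struct phdr_model and find_interp() are hypothetical names;
the PT_* values are those of the ELF specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	PT_LOAD		1	/* values from the ELF specification */
#define	PT_INTERP	3

struct phdr_model {		/* simplified stand-in for a program header */
	uint32_t	p_type;
	uint64_t	p_offset;
};

static const struct phdr_model *
find_interp(const struct phdr_model *phdr, size_t phnum)
{
	const struct phdr_model *interp = NULL;
	size_t i;

	for (i = 0; i < phnum; i++) {
		if (phdr[i].p_type == PT_INTERP) {
			interp = &phdr[i];
			break;		/* at most one PT_INTERP: stop here */
		}
	}
	return (interp);
}
```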
Eliminate an unneeded initialization.
Reviewed by: tegge
the last component of the path name is "..". This keeps VOP_LOOKUP()
from locking vnodes in reverse order.
Tested by: Denis Shaposhnikov <dsh AT vlink DOT ru>
MFC after: 3 days
prototypes, as the majority of new functions added have been in this
style. Changing prototype style now results in gcc noticing that the
implementation of vn_pollrecord() has a 'short' argument instead of
'int' as prototyped in vnode.h, so correct that definition. In practice
this didn't matter as only poll flags in the lower 16 bits are used.
MFC after: 1 week
devclass's parent pointer if the two drivers share the same devclass. This
can happen if the drivers use the same new-bus name. For example, we
currently have 3 drivers that use the name "pci": the generic PCI bus
driver, the ACPI PCI bus driver, and the OpenFirmware PCI bus driver. If
the ACPI PCI bus driver was defined as a subclass of the generic PCI bus
driver, then without this check the "pci" devclass would point to itself
as its parent and device_probe_child() would spin forever when it
encountered the first PCI device that did not have a matching driver.
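The check can be illustrated with a small model. The devclass_model type
and both functions are hypothetical and are not the new-bus code; the
point is only that a devclass is never recorded as its own parent, so a
walk up the parent chain always terminates.

```c
#include <assert.h>
#include <stddef.h>

struct devclass_model {
	const char		*name;
	struct devclass_model	*parent;
};

/* Record the parent unless both drivers share the same devclass. */
static void
devclass_model_set_parent(struct devclass_model *dc,
    struct devclass_model *parent)
{
	if (dc != parent)
		dc->parent = parent;
}

/* Walk the parent chain; this only terminates if there is no cycle. */
static unsigned int
devclass_model_depth(const struct devclass_model *dc)
{
	unsigned int n = 0;

	for (dc = dc->parent; dc != NULL; dc = dc->parent)
		n++;
	return (n);
}
```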
Reviewed by: dfr, imp, new-bus@
equal to NULL several times later. p_ucred "should probably not" be NULL
if the process isn't PRS_NEW anyway. This is strongly reinforced by the fact
that we don't see frequent crashes here. Remove the checks after p_cansee and
add a KASSERT right before it.
Found by: Coverity Prevent (tm)
Also trim one nearby trailing space.
lock_obj objects:
- Add new lock_init() and lock_destroy() functions to set up and tear down
lock_object objects including KTR logging and registering with WITNESS.
- Move all the handling of LO_INITIALIZED out of witness and the various
lock init functions into lock_init() and lock_destroy().
- Remove the constants for static indices into the lock_classes[] array
and change the code outside of subr_lock.c to use LOCK_CLASS to compare
against a known lock class.
- Move the 'show lock' ddb function and lock_classes[] array out of
kern_mutex.c over to subr_lock.c.
Since we are using vfs_busy() on a freshly allocated mount structure, use
(void) to show that we do not care about the return value.
Found with: Coverity Prevent (tm)
MFC after: 2 weeks
taskqueue_start_threads(struct taskqueue **, int count, int pri,
const char *name, ...);
This allows the creation of 1 or more threads that will service a single
taskqueue. Also rework the taskqueue_create() API to remove the API change
that was introduced a while back. Creating a taskqueue doesn't rely on
the presence of a process structure, and the proc mechanics are much better
encapsulated in taskqueue_start_threads(). Also clean up the
taskqueue_terminate() and taskqueue_free() functions to safely drain
pending tasks and remove all associated threads.
The TASKQUEUE_DEFINE and TASKQUEUE_DEFINE_THREAD macros have been changed
to use the new API, but drivers compiled against the old definitions will
still work. Thus, recompiling drivers is not a strict requirement.
intended for use solely with atomic datagram socket types, and relies
on the previous break-out of sosend_copyin(). Changes to allow UDP to
optionally use this instead of sosend() will be committed as a
follow-up.
to COMPAT_43TTY.
Add COMPAT_43TTY to NOTES and */conf/GENERIC
Compile tty_compat.c only under the new option.
Spit out
#warning "Old BSD tty API used, please upgrade."
if ioctl_compat.h gets #included from userland.
fast taskqueues. The following have been added:
TASKQUEUE_FAST_DEFINE() - create a global task queue.
an arbitrary execution context.
TASKQUEUE_FAST_DEFINE_THREAD() - create a global taskqueue that uses a
dedicated kthread.
taskqueue_create_fast() - create a local/private taskqueue.
These are all complementary to the standard taskqueue functions. They are
primarily useful for fast interrupt handlers that can only use spin locks for
synchronization.
I personally think that the taskqueue API is starting to get too narrow and
hairy, but fixing it will require a major redesign on the API. Such a
redesign would be good but would break compatibility with FreeBSD 6.x, so
it really isn't desirable at this time.
Submitted by: sam
and subsequently broke the build. This change is supposed to fix the
case where doing a mtx_destroy() of a spin mutex while you hold it fails.
If it had been tested I would just leave it in, but it hasn't been tested
yet, so it will have to wait until later.
struct sx). Instead of storing a direct pointer to our lock_class
struct in lock_object, reserve 4 bits in the lo_flags field to serve as an
index into a global lock_classes array that contains pointers to the lock
classes. Only debugging code such as WITNESS or INVARIANTS checks and KTR
logging need to access the lock_class member, so this shouldn't add any
overhead to production kernels. It might add some slight overhead to
kernels using those debug options however.
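The 4-bit class index can be sketched as below. The shift value, the
class table contents and lock_set_class() are assumptions made for
illustration; only the idea of indexing a global lock_classes array from
bits in lo_flags comes from the description above.

```c
#include <assert.h>
#include <stdint.h>

#define	LO_CLASSSHIFT	24	/* assumed position of the 4-bit field */
#define	LO_CLASSMASK	((uint32_t)0xf << LO_CLASSSHIFT)

struct lock_class {
	const char	*lc_name;
};

struct lock_object {
	const char	*lo_name;
	uint32_t	 lo_flags;	/* other flags plus the class index */
};

static struct lock_class lock_class_mtx_spin = { "spin mutex" };
static struct lock_class lock_class_mtx_sleep = { "sleep mutex" };
static struct lock_class lock_class_sx = { "sx" };

/* Four bits allow up to 16 classes; unused slots stay NULL. */
static struct lock_class *lock_classes[16] = {
	&lock_class_mtx_spin,
	&lock_class_mtx_sleep,
	&lock_class_sx,
};

#define	LO_CLASSINDEX(lo) \
	(((lo)->lo_flags & LO_CLASSMASK) >> LO_CLASSSHIFT)
#define	LOCK_CLASS(lo)	(lock_classes[LO_CLASSINDEX(lo)])

/* Hypothetical helper: store a class index without touching other flags. */
static void
lock_set_class(struct lock_object *lo, unsigned int idx)
{
	assert(idx < 16);
	lo->lo_flags = (lo->lo_flags & ~LO_CLASSMASK) |
	    ((uint32_t)idx << LO_CLASSSHIFT);
}
```

Only code that needs the class (debugging checks, logging) pays for the
table lookup; lock_object itself shrinks by one pointer.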
As with the previous set of changes to lock_object, this is going to
completely obliterate the kernel ABI, so be sure to recompile all your
modules.
returns EBADF. That errno is correct and is mandated by POSIX. It also
goes back to revision 1.1 of our CVS history (i.e. 4.4BSD).
The _fget() function should probably also be updated as it currently returns
EINVAL in that case rather than EBADF. (It does return EBADF for reads
on a write-only descriptor without any XXX comments oddly enough.)
Discussed with: scottl, grog, mjacob, bde
that a file's atime and mtime are only set to correct fractional
second values (0-999999000ns with the current interface).
Prior to this change users could create files with values outside
that range. Moreover, on 32-bit machines tv_usec offsets larger than
4.3s would result in an unnormalized AND wrong timestamp value,
due to overflow.
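A range check of this kind might look like the sketch below. The helper
name timeval_to_timespec_checked() and the choice of EINVAL as the error
are assumptions; the text above only states that out-of-range fractional
seconds are no longer accepted.

```c
#include <assert.h>
#include <errno.h>
#include <sys/time.h>
#include <time.h>

/*
 * Convert a timeval to a timespec, rejecting microsecond values outside
 * 0..999999 instead of producing an unnormalized timestamp.
 */
static int
timeval_to_timespec_checked(const struct timeval *tv, struct timespec *ts)
{
	if (tv->tv_usec < 0 || tv->tv_usec >= 1000000)
		return (EINVAL);	/* assumed error code */
	ts->tv_sec = tv->tv_sec;
	ts->tv_nsec = tv->tv_usec * 1000;	/* at most 999999000 ns */
	return (0);
}
```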
MFC after: 1 week
- provide an interface (macros) to the page coloring part of the VM system,
which allows trying different coloring algorithms without the need to
touch every file [1]
- make the page queue tuning values readable: sysctl vm.stats.pagequeue
- autotuning of the page coloring values based upon the cache size instead
of options in the kernel config (disabling of the page coloring as a
kernel option is still possible)
MD changes:
- detection of the cache size: only IA32 and AMD64 (untested) contain
cache size detection code, every other arch just comes with a dummy
function (this results in the use of default values like it was the
case without the autotuning of the page coloring)
- print some more info on Intel CPUs (like we do on AMD and Transmeta
CPUs)
Note to AMD owners (IA32 and AMD64): please run "sysctl vm.stats.pagequeue"
and report if the cache* values are zero (= bug in the cache detection code)
or not.
Based upon work by: Chad David <davidc@acns.ab.ca> [1]
Reviewed by: alc, arch (in 2004)
Discussed with: alc, Chad David, arch (in 2004)
- Provide tunable vm.memguard.desc, so one can specify memory type without
changing the code and recompiling the kernel.
- Allow memguard to be used for kernel modules by providing the sysctl
vm.memguard.desc, which can be changed to the short description of a memory
type before the module is loaded.
- Move as much memguard code as possible to memguard.c.
- Add sysctl node vm.memguard. and move memguard-specific sysctls there.
- Add malloc_desc2type() function for finding memory type based on its
short description (ks_shortdesc field).
- Memory type can be changed (via vm.memguard.desc sysctl) only if it
doesn't exist (will be loaded later) or when no memory is allocated yet.
If there is allocated memory for the given memory type, return EBUSY.
- Implement two ways of comparing memory types and make the safer, slower
one the default.
The race is very real, but conditions needed for triggering it are rather
hard to meet now.
When gjournal is committed (where it is quite easy to trigger) we need
to fix it.
For now, verify whether it is really hard to trigger.
Discussed with: kan
of msleep(). msleep_spin() doesn't support changing the priority of the
thread while it is asleep nor does it support interruptible sleeps (PCATCH)
or the PDROP flag. It does support timeouts however. It differs from
msleep() in that the passed in mutex is a spin mutex. This means one can
use msleep_spin() and wakeup() with a spin mutex similar to msleep() and
wakeup() with a regular mutex. Note that the spin mutex in question needs
to come before sched_lock and the sleepq locks in lock order.
spin locks that are not in the static order list. It is not safe to call
printf while holding the witness spin mutex since the console drivers that
back printf may need to use their own spin locks which would try to talk
to witness when they were locked. Given this, it is possible for one
CPU to lock a console driver lock (such as sio) which then tries to lock
the witness lock while another CPU is doing the printf while holding the
witness lock. Fix this by moving the printf outside of the witness lock.
All other printf's in witness are already correct.
MFC after: 3 days
UMA_SLAB_MALLOC flag.
In some circumstances (I observed it when I was doing a lot of reallocs)
UMA_SLAB_MALLOC can be set even if us_keg != NULL.
If this is the case we have wonderful, silent data corruption, because less
data is copied to the newly allocated region than should be.
I'm not sure when this bug was introduced; it could have been there
undetected for years, as we don't have a lot of realloc(9) consumers and it
was hard to reproduce...
...but what I know for sure is that I don't want to know who introduced
the bug :) It took me two or three days to track it down (of course most of
the time I was looking for the bug in my own code).
with a flags bitfield and set the BI_CAN_EXEC_DYN flag for all brands that
usually allow executing ELF dynamic binaries (aka shared libraries). When
asked to execute an ET_DYN ELF image, check whether this flag is set once the
ELF brand is known, and allow execution if so.
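The flag test can be sketched as follows. struct brand_model,
brand_allows() and the bit value chosen for BI_CAN_EXEC_DYN are
assumptions for illustration; the ET_* values are those of the ELF
specification.

```c
#include <assert.h>

#define	ET_EXEC		2	/* values from the ELF specification */
#define	ET_DYN		3

#define	BI_CAN_EXEC_DYN	0x0001	/* assumed bit value */

struct brand_model {		/* simplified stand-in for a brand entry */
	const char	*name;
	int		 flags;
};

/* Return nonzero if an image of type e_type may run under this brand. */
static int
brand_allows(const struct brand_model *b, int e_type)
{
	if (e_type == ET_DYN)
		return ((b->flags & BI_CAN_EXEC_DYN) != 0);
	return (e_type == ET_EXEC);
}
```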
PR: kern/87615
Submitted by: Marcin Koziej <creep@desk.pl>
Specifically, it is required for the I/O that may be performed by
elfN_load_section().
Avoid an obscure deadlock in the a.out, elf, and gzip image
activators. Add a comment describing why the deadlock does not occur
in the common case and how it might occur in less usual circumstances.
Eliminate an unused variable from exec_aout_imgact().
In collaboration with: tegge
by debugger, e.g. the process is dumping core. Only access p_xthread if
P_STOPPED_TRACE is set, which means the thread is ready to exchange signals
with the debugger. Print a warning if P_STOPPED_TRACE is not set because of
a bug elsewhere in the code, if there is one.
The patch has been tested by Anish Mistry mistry.7 at osu dot edu, and
is slightly adjusted.
passing a pointer to an opaque clockframe structure and requiring the
MD code to supply CLKF_FOO() macros to extract needed values out of the
opaque structure, just pass the needed values directly. In practice this
means passing the pair (usermode, pc) to hardclock() and profclock() and
passing the boolean (usermode) to hardclock_cpu() and hardclock_process().
Other details:
- Axe clockframe and CLKF_FOO() macros on all architectures. Basically,
all the archs were taking a trapframe and converting it into a clockframe
one way or another. Now they can just extract the PC and usermode values
directly out of the trapframe and pass it to fooclock().
- Renamed hardclock_process() to hardclock_cpu() as the latter is more
accurate.
- On Alpha, we now run profclock() at hz (profhz == hz) rather than at
the slower stathz.
- On Alpha, for the TurboLaser machines that don't have an 8254
timecounter, call hardclock() directly. This removes an extra
conditional check from every clock interrupt on Alpha on the BSP.
There is probably room for even further pruning here by changing Alpha
to use the simplified timecounter we use on x86 with the lapic timer
since we don't get interrupts from the 8254 on Alpha anyway.
- On x86, clkintr() shouldn't ever be called now unless using_lapic_timer
is false, so add a KASSERT() to that effect and remove a condition
to slightly optimize the non-lapic case.
- Change the prototype of arm_handler_execute() so that its first arg is a
trapframe pointer rather than a void pointer for clarity.
- Use KCOUNT macro in profclock() to look up the kernel profiling bucket.
Tested on: alpha, amd64, arm, i386, ia64, sparc64
Reviewed by: bde (mostly)
it and reacquiring it in vrele(). Consequently, there is no reason to
increase the reference count on the vm object caching the file's pages.
Reviewed by: tegge
Eliminate unused parameters to elfN_load_file().
The purpose of this change is consistency (not performance improvement:)),
as it was hard to tell if fdrop() is MPSAFE or not when I saw it sometimes
under the Giant and sometimes without it.
Glanced at by: ssouhlal, kan
means:
o Remove Elf64_Quarter,
o Redefine Elf64_Half to be 16-bit,
o Redefine Elf64_Word to be 32-bit,
o Add Elf64_Xword and Elf64_Sxword for 64-bit entities,
o Use Elf_Size in MI code to abstract the difference between
Elf32_Word and Elf64_Word.
o Add Elf_Ssize as the signed counterpart of Elf_Size.
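The resulting type layout can be sketched with fixed-width integers. The
mapping of Elf_Size to a 64-bit type on 64-bit platforms and the use of
UINTPTR_MAX to select it are assumptions made for this illustration; the
real headers choose the width per architecture.

```c
#include <assert.h>
#include <stdint.h>

typedef uint16_t	Elf64_Half;	/* now 16-bit */
typedef uint32_t	Elf64_Word;	/* now 32-bit */
typedef uint64_t	Elf64_Xword;	/* 64-bit unsigned entities */
typedef int64_t		Elf64_Sxword;	/* 64-bit signed entities */
typedef uint32_t	Elf32_Word;
typedef int32_t		Elf32_Sword;

/* Assumed selection: native-width size types for MI code. */
#if UINTPTR_MAX > 0xffffffffu
typedef Elf64_Xword	Elf_Size;
typedef Elf64_Sxword	Elf_Ssize;
#else
typedef Elf32_Word	Elf_Size;
typedef Elf32_Sword	Elf_Ssize;
#endif

static_assert(sizeof(Elf64_Half) == 2, "Elf64_Half must be 16 bits");
static_assert(sizeof(Elf64_Word) == 4, "Elf64_Word must be 32 bits");
static_assert(sizeof(Elf64_Xword) == 8, "Elf64_Xword must be 64 bits");
```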
MFC after: 2 weeks
and KTR_IO as they were never used. Remove KTR_CLK since it was only
used for hardclock firing and use KTR_INTR there instead. Remove
KTR_CRITICAL since it was only used for crit enter/exit and use
KTR_CONTENTION instead.
really should be a fptrdiff_t if we had that) in profclock().
- Don't try to profile kernel pc's that are < the kernel lowpc to avoid
underflows when computing a profiling index.
- Use the PC_TO_I() macro to compute the kernel profiling index rather than
doing it inline.
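The underflow guard can be modelled in userland as below. With an unsigned
index, a pc below lowpc would make the subtraction wrap around to a huge
value, so such a pc is rejected before the index is computed. struct
kprof_model, MODEL_PC_TO_I() and kprof_record() are hypothetical stand-ins
and do not reproduce the kernel's PC_TO_I() or KCOUNT() definitions.

```c
#include <assert.h>
#include <stdint.h>

#define	KPROF_BUCKETS	8

struct kprof_model {
	uintptr_t	lowpc;		/* lowest profiled address */
	uintptr_t	textsize;	/* number of profiled bytes */
	unsigned int	count[KPROF_BUCKETS];
};

/* Hypothetical stand-in for PC_TO_I(): offset of pc from lowpc. */
#define	MODEL_PC_TO_I(g, pc)	((uintptr_t)(pc) - (g)->lowpc)

/* Count one sample; return 1 if it was recorded, 0 if out of range. */
static int
kprof_record(struct kprof_model *g, uintptr_t pc)
{
	uintptr_t i;

	if (pc < g->lowpc)
		return (0);	/* the subtraction below would wrap */
	i = MODEL_PC_TO_I(g, pc);
	if (i >= g->textsize)
		return (0);	/* above the profiled range */
	g->count[i * KPROF_BUCKETS / g->textsize]++;
	return (1);
}
```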
Discussed with: bde
ephemeral mappings that are used as the source for three copy
operations from kernel space to user space. There are two reasons for
making this change: (1) Under heavy load exec_map can fill up causing
vm_map_find() to fail. When it fails, the nascent process is aborted
(SIGABRT), whereas this reimplementation using sf_buf_alloc()
sleeps. (2) Although it is possible to sleep on vm_map_find()'s
failure until address space becomes available (see kmem_alloc_wait()),
using sf_buf_alloc() is faster. Furthermore, the reimplementation
uses a CPU private mapping, avoiding a TLB shootdown on
multiprocessors.
Problem uncovered by: kris@
Reviewed by: tegge@
MFC after: 3 weeks
mbuf chain that starts with a cluster containing just MHLEN bytes. This
happened because m_dup called m_get or m_getcl depending on the amount of
data to copy, but then always set the size available in the first mbuf to
MHLEN.
Submitted by: Matt Koivisto <mkoivisto at sandvine dot com>
Approved by: jmg
Silence from: rwatson (mentor)