o Allow callers of m_extadd() to allocate their own reference
m_ext.ref_cnt pointer, rather than having the mbuf system allocate it
with a malloc() in the critical path. This speeds m_extadd() up, and
also simplifies locking (malloc() may need Giant).
A driver or subsystem wishing to take use its own ref counter must
initialize m_ext.ref_cnt to point to its ref counter prior to
calling m_extadd(), and it must use EXT_EXTREF as its external type.
Eg:
m->m_ext.ref_cnt = my_ref_cnt_ptr;
m_extadd(.....,EXT_EXTREF);
Reviewed by: bosko
this was causing filedesc work to be very painful.
In order to make this work split out sigio definitions to thier own header
(sigio.h) which is included from proc.h for the time being.
take pointers to filedesc structures instead of threads. This makes
it more clear that they do not do any voodoo with the thread/proc
or anything other than the filedesc passed in or returned.
Remove some XXX KSE's as this resolves the issue.
calling getmicrouptime (but maintain the struct timeval-based calling
convention for compatibility)
o eliminate the use of timersub in ratecheck
Note that flood ping tests indicate ppsratecheck is inaccurate (but on the
conservative side) with this revised implementation. If more accuracy is
needed we'll have to introduce an alternate interface or increase the
overhead.
Reviewed by: silby, dillon, bde
were sometimes propagated using M_COPY_PKTHDR which actually did
something between a "move" and a "copy" operation. This is replaced
by M_MOVE_PKTHDR (which copies the pkthdr contents and "removes" it
from the source mbuf) and m_dup_pkthdr which copies the packet
header contents including any m_tag chain. This corrects numerous
problems whereby mbuf tags could be lost during packet manipulations.
These changes also introduce arguments to m_tag_copy and m_tag_copy_chain
to specify if the tag copy work should potentially block. This
introduces an incompatibility with openbsd which we may want to revisit.
Note that move/dup of packet headers does not handle target mbufs
that have a cluster bound to them. We may want to support this;
for now we watch for it with an assert.
Finally, M_COPYFLAGS was updated to include M_FIRSTFRAG|M_LASTFRAG.
Supported by: Vernier Networks
Reviewed by: Robert Watson <rwatson@FreeBSD.org>
__acl_get_link(), __acl_set_link(), acl_delete_link(), and
__acl_aclcheck_link(), with almost identical implementations to
the existing __acl_*_file() variants on these calls. Update
copyright.
Obtained from: TrustedBSD Project
__acl_get_link() Retrieve an ACL by name without following
symbolic links.
__acl_set_link() Set an ACL by name without following
symbolic links.
__acl_delete_link() Delete an ACL by name without following
symbolic links.
__acl_aclcheck_link() Check an ACL against a file by name without
following symbolic links.
These calls are similar in spirit to lstat(), lchown(), lchmod(), etc,
and will be used under similar circumstances.
Obtained from: TrustedBSD Project
call is in progress on the vnode. When vput() or vrele() sees a
1->0 reference count transition, it now return without any further
action if this flag is set. This flag is necessary to avoid recursion
into VOP_INACTIVE if the filesystem inactive routine causes the
reference count to increase and then drop back to zero. It is also
used to guarantee that an unlocked vnode will not be recycled while
blocked in VOP_INACTIVE().
There are at least two cases where the recursion can occur: one is
that the softupdates code called by ufs_inactive() via ffs_truncate()
can call vput() on the vnode. This has been reported by many people
as "lockmgr: draining against myself" panics. The other case is
that nfs_inactive() can call vget() and then vrele() on the vnode
to clean up a sillyrename file.
Reviewed by: mckusick (an older version of the patch)
to treat desiredvnodes much more like a limit than as a vague concept.
On a 2GB RAM machine where desired vnodes is 130k, we run out of
kmem_map space when we hit about 190k vnodes.
If we wake up the vnode washer in getnewvnode(), sleep until it is done,
so that it has a chance to offer us a washed vnode. If we don't sleep
here we'll just race ahead and allocate yet a vnode which will never
get freed.
In the vnodewasher, instead of doing 10 vnodes per mountpoint per
rotation, do 10% of the vnodes distributed evenly across the
mountpoints.
(show thread {address})
Remove the IDLE kse state and replace it with a change in
the way threads sahre KSEs. Every KSE now has a thread, which is
considered its "owner" however a KSE may also be lent to other
threads in the same group to allow completion of in-kernel work.
n this case the owner remains the same and the KSE will revert to the
owner when the other work has been completed.
All creations of upcalls etc. is now done from
kse_reassign() which in turn is called from mi_switch or
thread_exit(). This means that special code can be removed from
msleep() and cv_wait().
kse_release() does not leave a KSE with no thread any more but
converts the existing thread into teh KSE's owner, and sets it up
for doing an upcall. It is just inhibitted from being scheduled until
there is some reason to do an upcall.
Remove all trace of the kse_idle queue since it is no-longer needed.
"Idle" KSEs are now on the loanable queue.
The duplication is caused by the fact that imgact_elf.c is included
by both imgact_elf32.c and imgact_elf64.c and both are compiled by
default on ia64. Consequently, we have two seperate copies of the
elf_legacy_coredump variable due to them being declared static, and
two entries for the same sysctl in the linker set, both referencing
the unique copy of the elf_legacy_coredump variable. Since the second
sysctl cannot be registered, one of the elf_legacy_coredump variables
can not be tuned (if ordering still holds, it's the ELF64 related one).
The only solution is to create two different sysctl variables, just
like the elf<32|64>_trace sysctl variables. This unfortunately is an
(user) interface change, but unavoidable. Thus, on ELF32 platforms
the sysctl variable is called elf32_legacy_coredump and on ELF64
platforms it is called elf64_legacy_coredump. Platforms that have
both ELF formats have both sysctl variables.
These variables should probably be retired sooner rather than later.
skipping read-only pages, which can result in valuable non-text-related
data not getting dumped, the ELF loader and the dynamic loader now mark
read-only text pages NOCORE and the coredump code only checks (primarily) for
complete inaccessibility of the page or NOCORE being set.
Certain applications which map large amounts of read-only data will
produce much larger cores. A new sysctl has been added,
debug.elf_legacy_coredump, which will revert to the old behavior.
This commit represents collaborative work by all parties involved.
The PR contains a program demonstrating the problem.
PR: kern/45994
Submitted by: "Peter Edwards" <pmedwards@eircom.net>, Archie Cobbs <archie@dellroad.org>
Reviewed by: jdp, dillon
MFC after: 7 days
_KERNEL scope from "src/sys/sys/mchain.h".
Replace each occurrence of the above in _KERNEL scope with the
appropriate macro from the set of hto(be|le)(16|32|64) and
(be|le)toh(16|32|64) from "src/sys/sys/endian.h".
Tested by: tjr
Requested by: comment marked with XXX
resource starvation we clean-up as much of the vmspace structure as we
can when the last process using it exits. The rest of the structure
is cleaned up when it is reaped. But since exit1() decrements the ref
count it is possible for a double-free to occur if someone else, such as
the process swapout code, references and then dereferences the structure.
Additionally, the final cleanup of the structure should not occur until
the last process referencing it is reaped.
This commit solves the problem by introducing a secondary reference count,
calling 'vm_exitingcnt'. The normal reference count is decremented on exit
and vm_exitingcnt is incremented. vm_exitingcnt is decremented when the
process is reaped. When both vm_exitingcnt and vm_refcnt are 0, the
structure is freed for real.
MFC after: 3 weeks
they may be the only viable ones to flush. Thus it will now wait for
an inode lock if the other alternatives will result in rollbacks (and
immediate redirtying of the buffer). If only buffers with rollbacks
are available, one will be flushed, but then the buffer daemon will
wait briefly before proceeding. Failing to wait briefly effectively
deadlocks a uniprocessor since every other process writing to that
filesystem will wait for the buffer daemon to clean up which takes
close enough to forever to feel like a deadlock.
Reported by: Archie Cobbs <archie@dellroad.org>
Sponsored by: DARPA & NAI Labs.
Approved by: re
These call uma_large_malloc() and uma_large_free() which require Giant.
Fixes panic when descriptor table is larger than KMEM_ZMAX bytes
noticed by kkenn.
Reviewed by: jhb
unused. Replace it with a dm_mount back-pointer to the struct mount
that the devfs_mount is associated with. Export that pointer to MAC
Framework entry points, where all current policies don't use the
pointer. This permits the SEBSD port of SELinux's FLASK/TE to compile
out-of-the-box on 5.0-CURRENT with full file system labeling support.
Approved by: re (murray)
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
converting from individual vnode locks to the snapshot
lock, be sure to pass any waiting processes along to the
new lock as well. This transfer is done by a new function
in the lock manager, transferlockers(from_lock, to_lock);
Thanks to Lamont Granquist <lamont@scriptkiddie.org> for
his help in pounding on snapshots beyond all reason and
finding this deadlock.
Sponsored by: DARPA & NAI Labs.
1) Record all device events when devctl is enabled, rather than just when
devd has devctl open. This is necessary to prevent races between when
a device arrives, and when devd starts.
2) Add hw.bus.devctl_disable to disable devctl, this can also be set as a
tunable.
3) Fix async support. Reset nonblocking and async_td in open. remove
async flags.
4) Free all memory when devctl is disabled.
Approved by: re (blanket)
on this.
o Update the `cur' pointer in the cluster loop in m_getm() to avoid
incorrect truncation and leaked mbufs.
Reviewed by: bmilekic
Approved by: re
create an ABI that encodes offsets and sizes of structures into client
drivers. The functions isolate the ABI from changes to the resource
structure. Since these are used very rarely (once at startup), the
speed penalty will be down in the noise.
Also, add r_rid to the structure so that clients can save the 'rid' of
the resource in the struct resource, plus accessor functions. Future
additions to newbus will make use of this to present a simplified
interface for resource specification.
Approved by: re (jhb)
Reviewed by: jhb, jake
problem was a locked directory vnode), do not give the process a chance
to sleep in state "stopevent" (depends on the S_EXEC bit being set in
p_stops) until most resources have been released again.
Approved by: re
instead of panicing. Also, perform some of the simpler sanity checks on
the fds before acquiring the filedesc lock.
Approved by: re
Reported by: Dan Nelson <dan@emsphone.com> and others
by policy modules making use of downgrades in the MAC AST event. This
is required by the mac_lomac port of LOMAC to the MAC Framework.
Approved by: re
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
i386 cpu_thread_exit(). This resulted in a panic with WITNESS
since we need to hold Giant to call kmem_free(), and we weren't
helding it anymore in cpu_thread_exit(). We now do this from a
new MD function, cpu_thread_dtor(), called by thread_dtor().
Approved by: re@
Suggested by: jhb
- Provide a routine in sched_4bsd to add this functionality.
- Use sched_pctcpu() in kern_proc, which is the one place outside of
sched_4bsd where the old pctcpu value was accessed directly.
Approved by: re
data in the scheduler independant structures (proc, ksegrp, kse, thread).
- Implement unused stubs for this mechanism in sched_4bsd.
Approved by: re
Reviewed by: luigi, trb
Tested on: x86, alpha
in struct proc. While the process label is actually stored in the
struct ucred pointed to by p_ucred, there is a need for transient
storage that may be used when asynchronous (deferred) updates need to
be performed on the "real" label for locking reasons. Unlike other
label storage, this label has no locking semantics, relying on policies
to provide their own protection for the label contents, meaning that
a policy leaf mutex may be used, avoiding lock order issues. This
permits policies that act based on historical process behavior (such
as audit policies, the MAC Framework port of LOMAC, etc) can update
process properties even when many existing locks are held without
violating the lock order. No currently committed policies implement use
of this label storage.
Approved by: re
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
checks permit policy modules to augment the system policy for permitting
kld operations. This permits policies to limit access to kld operations
based on credential (and other) properties, as well as to perform checks
on the kld being loaded (integrity, etc).
Approved by: re
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
leader wasn't exiting during a fork; instead, do remember to release
the lock avoiding lock order reversals and recursion panic.
Reported by: "Joel M. Baldwin" <qumqats@outel.org>
also add rusage time in thread mailbox.
2. Minor change for thread limit code in thread_user_enter(),
fix typo in kse_release() last I committed.
Reviewed by: deischen, mini
kern.threads.max_threads_per_proc
kern.threads.max_groups_per_proc
2.Temporary disable borrower thread stash itself as
owner thread's spare thread in thread_exit(). there
is a race between owner thread and borrow thread:
an owner thread may allocate a spare thread as this:
if (td->td_standin == NULL)
td->standin = thread_alloc();
but thread_alloc() can block the thread, then a borrower
thread would possible stash it self as owner's spare
thread in thread_exit(), after owner is resumed, result
is a thread leak in kernel, double check in owner can
avoid the race, but it may be ugly and not worth to do.
sysconf.c:
Use 'break' rather than 'goto yesno' in sysconf.c so that we report a '0'
return value from the kernel sysctl.
vfs_aio.c:
Make aio reset its configuration parameters to -1 after unloading
instead of 0.
posix4_mib.c:
Initialize the aio configuration parameters to -1
to indicate that it is not loaded.
Add a facility (p31b_iscfg()) to determine if a posix4 facility has been
initialized to avoid having to re-order the SYSINITs.
Use p31b_iscfg() to determine if aio has had a chance to run yet which
is likely if it is compiled into the kernel and avoid spamming its
values.
Introduce a macro P31B_VALID() instead of doing the same comparison over
and over.
posix4.h:
Prototype p31b_iscfg().