(show thread {address})
Remove the IDLE kse state and replace it with a change in
the way threads sahre KSEs. Every KSE now has a thread, which is
considered its "owner" however a KSE may also be lent to other
threads in the same group to allow completion of in-kernel work.
n this case the owner remains the same and the KSE will revert to the
owner when the other work has been completed.
All creations of upcalls etc. is now done from
kse_reassign() which in turn is called from mi_switch or
thread_exit(). This means that special code can be removed from
msleep() and cv_wait().
kse_release() does not leave a KSE with no thread any more but
converts the existing thread into teh KSE's owner, and sets it up
for doing an upcall. It is just inhibitted from being scheduled until
there is some reason to do an upcall.
Remove all trace of the kse_idle queue since it is no-longer needed.
"Idle" KSEs are now on the loanable queue.
The duplication is caused by the fact that imgact_elf.c is included
by both imgact_elf32.c and imgact_elf64.c and both are compiled by
default on ia64. Consequently, we have two seperate copies of the
elf_legacy_coredump variable due to them being declared static, and
two entries for the same sysctl in the linker set, both referencing
the unique copy of the elf_legacy_coredump variable. Since the second
sysctl cannot be registered, one of the elf_legacy_coredump variables
can not be tuned (if ordering still holds, it's the ELF64 related one).
The only solution is to create two different sysctl variables, just
like the elf<32|64>_trace sysctl variables. This unfortunately is an
(user) interface change, but unavoidable. Thus, on ELF32 platforms
the sysctl variable is called elf32_legacy_coredump and on ELF64
platforms it is called elf64_legacy_coredump. Platforms that have
both ELF formats have both sysctl variables.
These variables should probably be retired sooner rather than later.
skipping read-only pages, which can result in valuable non-text-related
data not getting dumped, the ELF loader and the dynamic loader now mark
read-only text pages NOCORE and the coredump code only checks (primarily) for
complete inaccessibility of the page or NOCORE being set.
Certain applications which map large amounts of read-only data will
produce much larger cores. A new sysctl has been added,
debug.elf_legacy_coredump, which will revert to the old behavior.
This commit represents collaborative work by all parties involved.
The PR contains a program demonstrating the problem.
PR: kern/45994
Submitted by: "Peter Edwards" <pmedwards@eircom.net>, Archie Cobbs <archie@dellroad.org>
Reviewed by: jdp, dillon
MFC after: 7 days
_KERNEL scope from "src/sys/sys/mchain.h".
Replace each occurrence of the above in _KERNEL scope with the
appropriate macro from the set of hto(be|le)(16|32|64) and
(be|le)toh(16|32|64) from "src/sys/sys/endian.h".
Tested by: tjr
Requested by: comment marked with XXX
resource starvation we clean-up as much of the vmspace structure as we
can when the last process using it exits. The rest of the structure
is cleaned up when it is reaped. But since exit1() decrements the ref
count it is possible for a double-free to occur if someone else, such as
the process swapout code, references and then dereferences the structure.
Additionally, the final cleanup of the structure should not occur until
the last process referencing it is reaped.
This commit solves the problem by introducing a secondary reference count,
calling 'vm_exitingcnt'. The normal reference count is decremented on exit
and vm_exitingcnt is incremented. vm_exitingcnt is decremented when the
process is reaped. When both vm_exitingcnt and vm_refcnt are 0, the
structure is freed for real.
MFC after: 3 weeks
they may be the only viable ones to flush. Thus it will now wait for
an inode lock if the other alternatives will result in rollbacks (and
immediate redirtying of the buffer). If only buffers with rollbacks
are available, one will be flushed, but then the buffer daemon will
wait briefly before proceeding. Failing to wait briefly effectively
deadlocks a uniprocessor since every other process writing to that
filesystem will wait for the buffer daemon to clean up which takes
close enough to forever to feel like a deadlock.
Reported by: Archie Cobbs <archie@dellroad.org>
Sponsored by: DARPA & NAI Labs.
Approved by: re
These call uma_large_malloc() and uma_large_free() which require Giant.
Fixes panic when descriptor table is larger than KMEM_ZMAX bytes
noticed by kkenn.
Reviewed by: jhb
unused. Replace it with a dm_mount back-pointer to the struct mount
that the devfs_mount is associated with. Export that pointer to MAC
Framework entry points, where all current policies don't use the
pointer. This permits the SEBSD port of SELinux's FLASK/TE to compile
out-of-the-box on 5.0-CURRENT with full file system labeling support.
Approved by: re (murray)
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
converting from individual vnode locks to the snapshot
lock, be sure to pass any waiting processes along to the
new lock as well. This transfer is done by a new function
in the lock manager, transferlockers(from_lock, to_lock);
Thanks to Lamont Granquist <lamont@scriptkiddie.org> for
his help in pounding on snapshots beyond all reason and
finding this deadlock.
Sponsored by: DARPA & NAI Labs.
1) Record all device events when devctl is enabled, rather than just when
devd has devctl open. This is necessary to prevent races between when
a device arrives, and when devd starts.
2) Add hw.bus.devctl_disable to disable devctl, this can also be set as a
tunable.
3) Fix async support. Reset nonblocking and async_td in open. remove
async flags.
4) Free all memory when devctl is disabled.
Approved by: re (blanket)
on this.
o Update the `cur' pointer in the cluster loop in m_getm() to avoid
incorrect truncation and leaked mbufs.
Reviewed by: bmilekic
Approved by: re
create an ABI that encodes offsets and sizes of structures into client
drivers. The functions isolate the ABI from changes to the resource
structure. Since these are used very rarely (once at startup), the
speed penalty will be down in the noise.
Also, add r_rid to the structure so that clients can save the 'rid' of
the resource in the struct resource, plus accessor functions. Future
additions to newbus will make use of this to present a simplified
interface for resource specification.
Approved by: re (jhb)
Reviewed by: jhb, jake
problem was a locked directory vnode), do not give the process a chance
to sleep in state "stopevent" (depends on the S_EXEC bit being set in
p_stops) until most resources have been released again.
Approved by: re
instead of panicing. Also, perform some of the simpler sanity checks on
the fds before acquiring the filedesc lock.
Approved by: re
Reported by: Dan Nelson <dan@emsphone.com> and others
by policy modules making use of downgrades in the MAC AST event. This
is required by the mac_lomac port of LOMAC to the MAC Framework.
Approved by: re
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
i386 cpu_thread_exit(). This resulted in a panic with WITNESS
since we need to hold Giant to call kmem_free(), and we weren't
helding it anymore in cpu_thread_exit(). We now do this from a
new MD function, cpu_thread_dtor(), called by thread_dtor().
Approved by: re@
Suggested by: jhb
- Provide a routine in sched_4bsd to add this functionality.
- Use sched_pctcpu() in kern_proc, which is the one place outside of
sched_4bsd where the old pctcpu value was accessed directly.
Approved by: re
data in the scheduler independant structures (proc, ksegrp, kse, thread).
- Implement unused stubs for this mechanism in sched_4bsd.
Approved by: re
Reviewed by: luigi, trb
Tested on: x86, alpha
in struct proc. While the process label is actually stored in the
struct ucred pointed to by p_ucred, there is a need for transient
storage that may be used when asynchronous (deferred) updates need to
be performed on the "real" label for locking reasons. Unlike other
label storage, this label has no locking semantics, relying on policies
to provide their own protection for the label contents, meaning that
a policy leaf mutex may be used, avoiding lock order issues. This
permits policies that act based on historical process behavior (such
as audit policies, the MAC Framework port of LOMAC, etc) can update
process properties even when many existing locks are held without
violating the lock order. No currently committed policies implement use
of this label storage.
Approved by: re
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
checks permit policy modules to augment the system policy for permitting
kld operations. This permits policies to limit access to kld operations
based on credential (and other) properties, as well as to perform checks
on the kld being loaded (integrity, etc).
Approved by: re
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
leader wasn't exiting during a fork; instead, do remember to release
the lock avoiding lock order reversals and recursion panic.
Reported by: "Joel M. Baldwin" <qumqats@outel.org>
also add rusage time in thread mailbox.
2. Minor change for thread limit code in thread_user_enter(),
fix typo in kse_release() last I committed.
Reviewed by: deischen, mini
kern.threads.max_threads_per_proc
kern.threads.max_groups_per_proc
2.Temporary disable borrower thread stash itself as
owner thread's spare thread in thread_exit(). there
is a race between owner thread and borrow thread:
an owner thread may allocate a spare thread as this:
if (td->td_standin == NULL)
td->standin = thread_alloc();
but thread_alloc() can block the thread, then a borrower
thread would possible stash it self as owner's spare
thread in thread_exit(), after owner is resumed, result
is a thread leak in kernel, double check in owner can
avoid the race, but it may be ugly and not worth to do.
sysconf.c:
Use 'break' rather than 'goto yesno' in sysconf.c so that we report a '0'
return value from the kernel sysctl.
vfs_aio.c:
Make aio reset its configuration parameters to -1 after unloading
instead of 0.
posix4_mib.c:
Initialize the aio configuration parameters to -1
to indicate that it is not loaded.
Add a facility (p31b_iscfg()) to determine if a posix4 facility has been
initialized to avoid having to re-order the SYSINITs.
Use p31b_iscfg() to determine if aio has had a chance to run yet which
is likely if it is compiled into the kernel and avoid spamming its
values.
Introduce a macro P31B_VALID() instead of doing the same comparison over
and over.
posix4.h:
Prototype p31b_iscfg().
Previously these were libc functions but were requested to
be made into system calls for atomicity and to coalesce what
might be two entrances into the kernel (signal mask setting
and floating point trap) into one.
A few style nits and comments from bde are also included.
Tested on alpha by: gallatin
signed, since they describe a ring buffer and signed arithmetic is
performed on them. This avoids some evilish casts.
Since this changes all but two members of this structure, style(9)
those remaining ones, too.
Requested by: bde
Reviewed by: bde (earlier version)
the MAC policy list is busy during a load or unload attempt.
We assert no locks held during the cv wait, meaning we should
be fairly deadlock-safe. Because of the cv model and busy
count, it's possible for a cv waiter waiting for exclusive
access to the policy list to be starved by active and
long-lived access control/labeling events. For now, we
accept that as a necessary tradeoff.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
we brought in the new cache and locking model for vnode labels. We
now rely on mac_associate_devfs_vnode().
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
earlier acquired lock with the same witness as the lock currently being
acquired. If we had released several earlier acquired locks after
acquiring enough locks to require another lock_list_entry bucket in the
lock list, then subsequent lock_list_entry buckets could contain only one
lock instance in which case i would be zero.
Reported by: Joel M. Baldwin <qumqats@outel.org>
dynamic mapping of an operation vector into an operation structure,
rather, we rely on C99 sparse structure initialization.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
indirectly through vm_page_protect(). The one remaining page flag that
is updated by vm_page_protect() is already being updated by our various
pmap implementations.
Note: A later commit will similarly change the VM_PROT_READ case and
eliminate vm_page_protect().
in the ELF code. Missed in earlier merge from the MAC tree.
Approved by: re
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
mac_thread_userret() only if PS_MACPEND is set in the process AST mask.
This avoids the cost of the entry point in the common case, but
requires policies interested in the userret event to set the flag
(protected by the scheduler lock) if they do want the event. Since
all the policies that we're working with which use mac_thread_userret()
use the entry point only selectively to perform operations deferred
for locking reasons, this maintains the desired semantics.
Approved by: re
Requested by: bde
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
points, rather than relying on policies to grub around in the
image activator instance structure.
Approved by: re
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
sysctls to MI code; this reduces code duplication and makes all of them
available on sparc64, and the latter two on powerpc.
The semantics by the i386 and pc98 hw.availpages is slightly changed:
previously, holes between ranges of available pages would be included,
while they are excluded now. The new behaviour should be more correct
and brings i386 in line with the other architectures.
Move physmem to vm/vm_init.c, where this variable is used in MI code.
- Remove the comments which were justifying this by the fact
that we don't have %q in the kernel, this was probably right
back in time, but we now have %q, and we even have better to
print those types (%j).
of the original AIO request: save and restore the active thread credential
as well as using the file credential, since MAC (and some other bits of
the system) rely on the thread credential instead of/as well as the
file credential. In brief: cache td->td_ucred when the AIO operation
is queued, temporarily set and restore the kernel thread credential,
and release the credential when done. Similar to ktrace credential
management.
Reviewed by: alc
Approved by: re
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
manipulated directly (rather than using sballoc()/sbfree()); update them
to tweak the new sb_ctl field too.
Sponsored by: NTT Multimedia Communications Labs
(1) Permit userland applications to request a change of label atomic
with an execve() via mac_execve(). This is required for the
SEBSD port of SELinux/FLASK. Attempts to invoke this without
MAC compiled in result in ENOSYS, as with all other MAC system
calls. Complexity, if desired, is present in policy modules,
rather than the framework.
(2) Permit policies to have access to both the label of the vnode
being executed as well as the interpreter if it's a shell
script or related UNIX nonsense. Because we can't hold both
vnode locks at the same time, cache the interpreter label.
SEBSD relies on this because it supports secure transitioning
via shell script executables. Other policies might want to
take both labels into account during an integrity or
confidentiality decision at execve()-time.
Approved by: re
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
Allow transitioning to be twiddled off using the process and fs enforcement
flags, although at some point this should probably be its own flag.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
entrypoints, #ifdef MAC. The supporting logic already existed in
kern_mac.c, so no change there. This permits MAC policies to cause
a process label change as the result of executing a binary --
typically, as a result of executing a specially labeled binary.
For example, the SEBSD port of SELinux/FLASK uses this functionality
to implement TE type transitions on processes using transitioning
binaries, in a manner similar to setuid. Policies not implementing
a notion of transition (all the ones in the tree right now) require
no changes, since the old label data is copied to the new label
via mac_create_cred() even if a transition does occur.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
describes an image activation instance. Instead, make use of the
existing fname structure entry, and introduce two new entries,
userspace_argv, and userspace_envv. With the addition of
mac_execve(), this divorces the image structure from the specifics
of the execve() system call, removes a redundant pointer, etc.
No semantic change from current behavior, but it means that the
structure doesn't depend on syscalls.master-generated includes.
There seems to be some redundant initialization of imgact entries,
which I have maintained, but which could probably use some cleaning
up at some point.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
system accounting configuration and for nfsd server thread attach.
Policies might use this to protect the integrity or confidentiality
of accounting data, limit the ability to turn on or off accounting,
as well as to prevent inappropriately labeled threads from becoming nfs
server threads.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
the data value returned by kevent()'s EVFILT_READ filter on non-TCP
sockets accurately reflects the amount of data that can be read from the
sockets by applications.
PR: 30634
Reviewed by: -net, -arch
Sponsored by: NTT Multimedia Communications Labs
MFC after: 2 weeks
permitting MAC policies to limit access to the kernel environment.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
malloc(9) failed last time. This is intended to help code adjust
memory usage to the current circumstances.
A typical use could be:
if (malloc_last_fail() < 60)
reduce_cache_by_one();
structure definition, rather than using an operation vector
we translate into the structure. Originally, we used a vector
for two reasons:
(1) We wanted to define the structure sparsely, which wasn't
supported by the C compiler for structures. For a policy
with five entry points, you don't want to have to stick in
a few hundred NULL function pointers.
(2) We thought it would improve ABI compatibility allowing modules
to work with kernels that had a superset of the entry points
defined in the module, even if the kernel had changed its
entry point set.
Both of these no longer apply:
(1) C99 gives us a way to sparsely define a static structure.
(2) The ABI problems existed anyway, due to enumeration numbers,
argument changes, and semantic mismatches. Since the going
rule for FreeBSD is that you really need your modules to
pretty closely match your kernel, it's not worth the
complexity.
This submit eliminates the operation vector, dynamic allocation
of the operation structure, copying of the vector to the
structure, and redoes the vectors in each policy to direct
structure definitions. One enourmous benefit of this change
is that we now get decent type checking on policy entry point
implementation arguments.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
MAC access() and open() checks, the argument actually has an int type
where it becomes available. Switch to using 'int' for the mode argument
throughout the MAC Framework and policy modules.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
missed. This bug has been present since the vn_start_write() and
vn_finished_write() calls were first added in revision 1.159. When
the case is triggered, any attempts to create snapshots on the
filesystem will deadlock and also prevent further write activity
on that filesystem.
to conform to 1003.1-2001. Make it possible for applications to actually
tell whether or not asynchronous I/O is supported.
Since FreeBSD's aio implementation works on all descriptor types, don't
call down into file or vnode ops when [f]pathconf() is asked about
_PC_ASYNC_IO; this avoids the need for every file and vnode op to know about
it.
mac_enforce_system toggle, rather than several separate toggles.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
permit MAC policies to augment the security protections on sysctl()
operations. This is not really a wonderful entry point, as we
only have access to the MIB of the target sysctl entry, rather than
the more useful entry name, but this is sufficient for policies
like Biba that wish to use their notions of privilege or integrity
to prevent inappropriate sysctl modification. Affects MAC kernels
only. Since SYSCTL_LOCK isn't in sysctl.h, just kern_sysctl.c,
we can't assert the SYSCTL subsystem lockin the MAC Framework.
Approved by: re
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
permits MAC modules to augment system security decisions regarding
the reboot() system call, if MAC is compiled into the kernel.
Approved by: re
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
mac_check_system_swapon(), to reflect the fact that the primary
object of this change is the running kernel as a whole, rather
than just the vnode. We'll drop additional checks of this
class into the same check namespace, including reboot(),
sysctl(), et al.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
"refreshing" the label on the vnode before use, just get the label
right from inception. For single-label file systems, set the label
in the generic VFS getnewvnode() code; for multi-label file systems,
leave the labeling up to the file system. With UFS1/2, this means
reading the extended attribute during vfs_vget() as the inode is
pulled off disk, rather than hitting the extended attributes
frequently during operations later, improving performance. This
also corrects sematics for shared vnode locks, which were not
previously present in the system. This chances the cache
coherrency properties WRT out-of-band access to label data, but in
an acceptable form. With UFS1, there is a small race condition
during automatic extended attribute start -- this is not present
with UFS2, and occurs because EAs aren't available at vnode
inception. We'll introduce a work around for this shortly.
Approved by: re
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
- Make DDB use %y instead of %z.
- Teach GCC about %y.
- Implement support for the C99 %z format modifier.
Approved by: re@
Reviewed by: peter
Tested on: i386, sparc64
handling clean and functional as 5.x evolves. This allows some of the
nasty bandaids in the 5.x codepaths to be unwound.
Encapsulate 4.x signal handling under COMPAT_FREEBSD4 (there is an
anti-foot-shooting measure in place, 5.x folks need this for a while) and
finish encapsulating the older stuff under COMPAT_43. Since the ancient
stuff is required on alpha (longjmp(3) passes a 'struct osigcontext *'
to the current sigreturn(2), instead of the 'ucontext_t *' that sigreturn
is supposed to take), add a compile time check to prevent foot shooting
there too. Add uniform COMPAT_43 stubs for ia64/sparc64/powerpc.
Tested on: i386, alpha, ia64. Compiled on sparc64 (a few days ago).
Approved by: re
seem to have all the prerequisites already.
Call g_waitidle() as the first thing in vfs_mountroot() so that we have
it out of the way before we even decide if we should call .._ask() or
.._try().
Call the g_dev_print() function to provide better guidance for the
root-mount prompt.
does not require Giant.
This means that we may miss panics on a class of mutex programming bugs,
but only if running with a Chernobyl setting of debug-flags.
Spotted by: Pete Carah <pete@ns.altadena.net>