kernels exposed by the recent fixes to resource limits for 32-bit processes
on 64-bit kernels:
- Let ABIs expose their maximum stack size via a new pointer in sysentvec
and use that in preference to maxssiz during exec() rather than always
using maxssiz for all processses.
- Apply the ABI's limit fixup to the previous stack size when adjusting
RLIMIT_STACK to determine if the existing mapping for the stack needs to
be grown or shrunk (as well as how much it should be grown or shrunk).
Approved by: re (kensmith)
during execve() when turning off tracing due to executing a setuid binary
as non-root. Previously this could fail to acquire Giant and fail an
assertion if the ktrace file was on a non-MPSAFE filesystem and the
executable was on an MPSAFE filesystem.
MFC after: 3 days
Reported by: kris
some cases, move to priv_check() if it was an operation on a thread and
no other flags were present.
Eliminate caller-side jail exception checking (also now-unused); jail
privilege exception code now goes solely in kern_jail.c.
We can't yet eliminate suser() due to some cases in the KAME code where
a privilege check is performed and then used in many different deferred
paths. Do, however, move those prototypes to priv.h.
Reviewed by: csjp
Obtained from: TrustedBSD Project
Change the VOP_OPEN(), vn_open() vnode operation and d_fdopen() cdev operation
argument from being file descriptor index into the pointer to struct file.
Proposed and reviewed by: jhb
Reviewed by: daichi (unionfs)
Approved by: re (kensmith)
processes under 64-bit kernels). Previously, each 32-bit process overwrote
its resource limits at exec() time. The problem with this approach is that
the new limits affect all child processes of the 32-bit process, including
if the child process forks and execs a 64-bit process. To fix this, don't
ovewrite the resource limits during exec(). Instead, sv_fixlimits() is
now replaced with a different function sv_fixlimit() which asks the ABI to
sanitize a single resource limit. We then use this when querying and
setting resource limits. Thus, if a 32-bit process sets a limit, then
that new limit will be inherited by future children. However, if the
32-bit process doesn't change a limit, then a future 64-bit child will
see the "full" 64-bit limit rather than the 32-bit limit.
MFC is tentative since it will break the ABI of old linux.ko modules (no
other modules are affected).
MFC after: 1 week
- Remove also "MP SAFE" after prior "MPSAFE" pass. (suggested by bde)
- Remove extra blank lines in some cases.
- Add extra blank lines in some cases.
- Remove no-op comments consisting solely of the function name, the word
"syscall", or the system call name.
- Add punctuation.
- Re-wrap some comments.
system calls now enter without Giant held, and then in some cases, acquire
Giant explicitly.
Remove a number of other MPSAFE annotations in the credential code and
tweak one or two other adjacent comments.
specific privilege names to a broad range of privileges. These may
require some future tweaking.
Sponsored by: nCircle Network Security, Inc.
Obtained from: TrustedBSD Project
Discussed on: arch@
Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri,
Alex Lyashkov <umka at sevcity dot net>,
Skip Ford <skip dot ford at verizon dot net>,
Antoine Brodin <antoine dot brodin at laposte dot net>
begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.
This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.
Obtained from: TrustedBSD Project
Sponsored by: SPARTA
Add the argument auditing functions for argv and env.
Add kernel-specific versions of the tokenizer functions for the
arg and env represented as a char array.
Implement the AUDIT_ARGV and AUDIT_ARGE audit policy commands to
enable/disable argv/env auditing.
Call the argument auditing from the exec system calls.
Obtained from: TrustedBSD Project
Approved by: rwatson (mentor)
image_params arg.
- Change struct image_params to include struct sysentvec pointer and
initialize it.
- Change all consumers of process_exit/process_exec eventhandlers to
new prototypes (includes splitting up into distinct exec/exit functions).
- Add eventhandler to userret.
Sponsored by: Google SoC 2006
Submitted by: rdivacky
Parts suggested by: jhb (on hackers@)
the callers if the exec either succeeds or fails early.
- Move the code to call exit1() if the exec fails after the vmspace is
gone to the bottom of kern_execve() to cut down on some code duplication.
both proc pointer and thread pointer, if thread pointer is NULL,
tdsignal automatically finds a thread, otherwise it sends signal
to given thread.
Add utility function psignal_event to send a realtime sigevent
to a process according to the delivery requirement specified in
struct sigevent.
clock are supported. I have plan to merge XSI timer ITIMER_REAL and other
two CPU timers into the new code, current three slots are available for
the XSI timers.
The SIGEV_THREAD notification type is not supported yet because our
sigevent struct lacks of two member fields:
sigev_notify_function
sigev_notify_attributes
I have found the sigevent is used in AIO, so I won't add the two members
unless the AIO code is adjusted.
changes in MD code are trivial, before this change, trapsignal and
sendsig use discrete parameters, now they uses member fields of
ksiginfo_t structure. For sendsig, this change allows us to pass
POSIX realtime signal value to user code.
2. Remove cpu_thread_siginfo, it is no longer needed because we now always
generate ksiginfo_t data and feed it to libpthread.
3. Add p_sigqueue to proc structure to hold shared signals which were
blocked by all threads in the proc.
4. Add td_sigqueue to thread structure to hold all signals delivered to
thread.
5. i386 and amd64 now return POSIX standard si_code, other arches will
be fixed.
6. In this sigqueue implementation, pending signal set is kept as before,
an extra siginfo list holds additional siginfo_t data for signals.
kernel code uses psignal() still behavior as before, it won't be failed
even under memory pressure, only exception is when deleting a signal,
we should call sigqueue_delete to remove signal from sigqueue but
not SIGDELSET. Current there is no kernel code will deliver a signal
with additional data, so kernel should be as stable as before,
a ksiginfo can carry more information, for example, allow signal to
be delivered but throw away siginfo data if memory is not enough.
SIGKILL and SIGSTOP have fast path in sigqueue_add, because they can
not be caught or masked.
The sigqueue() syscall allows user code to queue a signal to target
process, if resource is unavailable, EAGAIN will be returned as
specification said.
Just before thread exits, signal queue memory will be freed by
sigqueue_flush.
Current, all signals are allowed to be queued, not only realtime signals.
Earlier patch reviewed by: jhb, deischen
Tested on: i386, amd64
possible for do_execve() to call exit1() rather than returning. As a
result, the sequence "allocate memory; call kern_execve; free memory"
can end up leaking memory.
This commit documents this astonishing behaviour and adds a call to
exec_free_args() before the exit1() call in do_execve(). Since all
the users of kern_execve() in the tree use exec_free_args() to free
the command-line arguments after kern_execve() returns, this should
be safe, and it fixes the memory leak which can otherwise occur.
Submitted by: Peter Holm
MFC after: 3 days
Security: Local denial of service
to avoid touching pageable memory while holding a mutex.
Simplify argument list replacement logic.
PR: kern/84935
Submitted by: "Antoine Pelisse" apelisse AT gmail.com (in a different form)
MFC after: 3 days
- pmcstat(8) gprof output mode fixes:
lib/libpmc/pmclog.{c,h}, sys/sys/pmclog.h:
+ Add a 'is_usermode' field to the PMCLOG_PCSAMPLE event
+ Add an 'entryaddr' field to the PMCLOG_PROCEXEC event,
so that pmcstat(8) can determine where the runtime loader
/libexec/ld-elf.so.1 is getting loaded.
sys/kern/kern_exec.c:
+ Use a local struct to group the entry address of the image being
exec()'ed and the process credential changed flag to the exec
handling hook inside hwpmc(4).
usr.sbin/pmcstat/*:
+ Support "-k kernelpath", "-D sampledir".
+ Implement the ELF bits of 'gmon.out' profile generation in a new
file "pmcstat_log.c". Move all log related functions to this
file.
+ Move local definitions and prototypes to "pmcstat.h"
- Other bug fixes:
+ lib/libpmc/pmclog.c: correctly handle EOF in pmclog_read().
+ sys/dev/hwpmc_mod.c: unconditionally log a PROCEXIT event to all
attached PMCs when a process exits.
+ sys/sys/pmc.h: correct a function prototype.
+ Improve usage checks in pmcstat(8).
Approved by: re (blanket hwpmc)
- Implement sampling modes and logging support in hwpmc(4).
- Separate MI and MD parts of hwpmc(4) and allow sharing of
PMC implementations across different architectures.
Add support for P4 (EMT64) style PMCs to the amd64 code.
- New pmcstat(8) options: -E (exit time counts) -W (counts
every context switch), -R (print log file).
- pmc(3) API changes, improve our ability to keep ABI compatibility
in the future. Add more 'alias' names for commonly used events.
- bug fixes & documentation.
file's access time should be updated when it gets executed. A while
ago the mechanism used to exec was changed to use a more mmap based
mechanism and this behavior was broken as a side-effect of that.
A new vnode flag is added that gets set when the file gets executed,
and the VOP_SETATTR() vnode operation gets called. The underlying
filesystem is expected to handle it based on its own semantics, some
filesystems don't support access time at all. Those that do should
handle it in a way that does not block, does not generate I/O if possible,
etc. In particular vn_start_write() has not been called. The UFS code
handles it the same way as it would normally handle the access time if
a file was read - the IN_ACCESS flag gets set in the inode but no other
action happens at this point. The actual time update will happen later
during a sync (which handles all the necessary locking).
Got me into this: cperciva
Discussed with: a lot with bde, a little with kan
Showed patches to: phk, jeffr, standards@, arch@
Minor discussion on: arch@
PAGE_SIZE.
Unlike originator of the PR suggests retain MAXSHELLCMDLEN definition
(he has been proposing to replace it with PAGE_SIZE everywhere), not only
this reduced the diff significantly, but prevents code obfuscation and also
allows to increase/decrease this parameter easily if needed.
PR: kern/64196
Submitted by: Magnus Bäckström <b@etek.chalmers.se>
copies arguments into the kernel space and one that operates
completely in the kernel space;
o use kernel-only version of execve(2) to kill another stackgap in
linuxlator/i386.
Obtained from: DragonFlyBSD (partially)
MFC after: 2 weeks
for ensuring that a process' filedesc is not shared with anybody.
Use it in the two places which previously had private implmentations.
This collects all fd_refcnt handling in kern_descrip.c
Use this in all the places where sleeping with the lock held is not
an issue.
The distinction will become significant once we finalize the exact
lock-type to use for this kind of case.
sysctl routines and state. Add some code to use it for signalling the need
to downconvert a data structure to 32 bits on a 64 bit OS when requested by
a 32 bit app.
I tried to do this in a generic abi wrapper that intercepted the sysctl
oid's, or looked up the format string etc, but it was a real can of worms
that turned into a fragile mess before I even got it partially working.
With this, we can now run 'sysctl -a' on a 32 bit sysctl binary and have
it not abort. Things like netstat, ps, etc have a long way to go.
This also fixes a bug in the kern.ps_strings and kern.usrstack hacks.
These do matter very much because they are used by libc_r and other things.
all other threads to suicide, problem is execve() could be failed, and
a failed execve() would change threaded process to unthreaded, this side
effect is unexpected.
The new code introduces a new single threading mode SINGLE_BOUNDARY, in
the mode, all threads should suspend themself at user boundary except
the singler. we can not use SINGLE_NO_EXIT because we want to start from
a clean state if execve() is successful, suspending other threads at unknown
point and later resuming them from there and forcing them to exit at user
boundary may cause the process to start from a dirty state. If execve() is
successful, current thread upgrades to SINGLE_EXIT mode and forces other
threads to suicide at user boundary, otherwise, other threads will be resumed
and their interrupted syscall will be restarted.
Reviewed by: julian
Better to kill all other threads than to panic the system if 2 threads call
execve() at the same time. A better fix will be committed later.
Note that this only affects the case where the execve fails.
but with slightly cleaned up interfaces.
The KSE structure has become the same as the "per thread scheduler
private data" structure. In order to not make the diffs too great
one is #defined as the other at this time.
The KSE (or td_sched) structure is now allocated per thread and has no
allocation code of its own.
Concurrency for a KSEGRP is now kept track of via a simple pair of counters
rather than using KSE structures as tokens.
Since the KSE structure is different in each scheduler, kern_switch.c
is now included at the end of each scheduler. Nothing outside the
scheduler knows the contents of the KSE (aka td_sched) structure.
The fields in the ksegrp structure that are to do with the scheduler's
queueing mechanisms are now moved to the kg_sched structure.
(per ksegrp scheduler private data structure). In other words how the
scheduler queues and keeps track of threads is no-one's business except
the scheduler's. This should allow people to write experimental
schedulers with completely different internal structuring.
A scheduler call sched_set_concurrency(kg, N) has been added that
notifies teh scheduler that no more than N threads from that ksegrp
should be allowed to be on concurrently scheduled. This is also
used to enforce 'fainess' at this time so that a ksegrp with
10000 threads can not swamp a the run queue and force out a process
with 1 thread, since the current code will not set the concurrency above
NCPU, and both schedulers will not allow more than that many
onto the system run queue at a time. Each scheduler should eventualy develop
their own methods to do this now that they are effectively separated.
Rejig libthr's kernel interface to follow the same code paths as
linkse for scope system threads. This has slightly hurt libthr's performance
but I will work to recover as much of it as I can.
Thread exit code has been cleaned up greatly.
exit and exec code now transitions a process back to
'standard non-threaded mode' before taking the next step.
Reviewed by: scottl, peter
MFC after: 1 week
a more complete subsystem, and removes the knowlege of how things are
implemented from the drivers. Include locking around filter ops, so a
module like aio will know when not to be unloaded if there are outstanding
knotes using it's filter ops.
Currently, it uses the MTX_DUPOK even though it is not always safe to
aquire duplicate locks. Witness currently doesn't support the ability
to discover if a dup lock is ok (in some cases).
Reviewed by: green, rwatson (both earlier versions)
somewhat clearer, but more importantly allows for a consistent naming
scheme for suser_cred flags.
The old name is still defined, but will be removed in a few days (unless I
hear any complaints...)
Discussed with: rwatson, scottl
Requested by: jhb
pmap_remove_pages(). (The implementation of pmap_remove_pages() is
optional. If pmap_remove_pages() is unimplemented, the acquisition and
release of the page queues lock is unnecessary.)
Remove spl calls from the alpha, arm, and ia64 pmap_remove_pages().
of not clearing the flags for execv() syscall will result that a new
program runs in KSE thread mode without enabling it.
Submitted by: tjr
Modified by: davidxu
in the two consumers that need it.. processes using AIO and netncp.
Update docs. Say that process_exec is called with Giant, but not to
depend on it. All our consumers can handle it without Giant.
- no longer serialize on Giant for thread_single*() and family in fork,
exit and exec
- thread_wait() is mpsafe, assert no Giant
- reduce scope of Giant in exit to not cover thread_wait and just do
vm_waitproc().
- assert that thread_single() family are not called with Giant
- remove the DROP/PICKUP_GIANT macros from thread_single() family
- assert that thread_suspend_check() s not called with Giant
- remove manual drop_giant hack in thread_suspend_check since we know it
isn't held.
- remove the DROP/PICKUP_GIANT macros from thread_suspend_check() family
- mark kse_create() mpsafe
Conforming POSIX application should do by disallowing the argv
argument to be NULL.
PR: kern/33738
Submitted by: Marc Olzheim, Serge van den Boom
OK'ed by: nectar
kernel. I'm not happy with it yet - refinements are to come.
This hack allows the kern.ps_strings and kern.usrstack sysctls to respond
to a 32 bit request, such as those coming from emulated i386 binaries.
function back to near the beginning of the file. Rev.1.194 moved it into
the middle of auxiliary functions following kern_execve(). Moved the
__mac_execve() syscall function up together with execve(). It was new in
rev1.1.196 and perfectly misplaced after execve().
in various kernel objects to represent security data, we embed a
(struct label *) pointer, which now references labels allocated using
a UMA zone (mac_label.c). This allows the size and shape of struct
label to be varied without changing the size and shape of these kernel
objects, which become part of the frozen ABI with 5-STABLE. This opens
the door for boot-time selection of the number of label slots, and hence
changes to the bound on the number of simultaneous labeled policies
at boot-time instead of compile-time. This also makes it easier to
embed label references in new objects as required for locking/caching
with fine-grained network stack locking, such as inpcb structures.
This change also moves us further in the direction of hiding the
structure of kernel objects from MAC policy modules, not to mention
dramatically reducing the number of '&' symbols appearing in both the
MAC Framework and MAC policy modules, and improving readability.
While this results in minimal performance change with MAC enabled, it
will observably shrink the size of a number of critical kernel data
structures for the !MAC case, and should have a small (but measurable)
performance benefit (i.e., struct vnode, struct socket) do to memory
conservation and reduced cost of zeroing memory.
NOTE: Users of MAC must recompile their kernel and all MAC modules as a
result of this change. Because this is an API change, third party
MAC modules will also need to be updated to make less use of the '&'
symbol.
Suggestions from: bmilekic
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
by libguile that needs to know the base of the RSE backing store. We
currently do not export the fixed address to userland by means of a
sysctl so user code needs to hardcode it for now. This will be revisited
later.
The RSE backing store is now at the bottom of region 4. The memory stack
is at the top of region 4. This means that the whole region is usable
for the stacks, giving a 61-bit stack space.
Port: lang/guile (depended of x11/gnome2)
use the ability on ia64 to map the register stack. The orientation of
the stack (i.e. its grow direction) is passed to vm_map_stack() in the
overloaded cow argument. Since the grow direction is represented by
bits, it is possible and allowed to create bi-directional stacks.
This is not an advertised feature, more of a side-effect.
Fix a bug in vm_map_growstack() that's specific to rstacks and which
we could only find by having the ability to create rstacks: when
the mapped stack ends at the faulting address, we have not actually
mapped the faulting address. we need to include or cover the faulting
address.
Note that at this time mmap(2) has not been extended to allow the
creation of rstacks by processes. If such a need arises, this can
be done.
Tested on: alpha, i386, ia64, sparc64
systems where the data/stack/etc limits are too big for a 32 bit process.
Move the 5 or so identical instances of ELF_RTLD_ADDR() into imgact_elf.c.
Supply an ia32_fixlimits function. Export the clip/default values to
sysctl under the compat.ia32 heirarchy.
Have mmap(0, ...) respect the current p->p_limits[RLIMIT_DATA].rlim_max
value rather than the sysctl tweakable variable. This allows mmap to
place mappings at sensible locations when limits have been reduced.
Have the imgact_elf.c ld-elf.so.1 placement algorithm use the same
method as mmap(0, ...) now does.
Note that we cannot remove all references to the sysctl tweakable
maxdsiz etc variables because /etc/login.conf specifies a datasize
of 'unlimited'. And that causes exec etc to fail since it can no
longer find space to mmap things.
contain the filedescriptor number on opens from userland.
The index is used rather than a "struct file *" since it conveys a bit
more information, which may be useful to in particular fdescfs and /dev/fd/*
For now pass -1 all over the place.
- Move struct sigacts out of the u-area and malloc() it using the
M_SUBPROC malloc bucket.
- Add a small sigacts_*() API for managing sigacts structures: sigacts_alloc(),
sigacts_free(), sigacts_copy(), sigacts_share(), and sigacts_shared().
- Remove the p_sigignore, p_sigacts, and p_sigcatch macros.
- Add a mutex to struct sigacts that protects all the members of the struct.
- Add sigacts locking.
- Remove Giant from nosys(), kill(), killpg(), and kern_sigaction() now
that sigacts is locked.
- Several in-kernel functions such as psignal(), tdsignal(), trapsignal(),
and thread_stopped() are now MP safe.
Reviewed by: arch@
Approved by: re (rwatson)
if (p->p_numthreads > 1) and not a flag because action is only necessary
if there are other threads. The rest of the system has no need to
identify thr threaded processes.
- In kern_thread.c use thr_exit1() instead of thread_exit() if P_THREADED
is not set.
flexible process_fork, process_exec, and process_exit eventhandlers. This
reduces code duplication and also means that I don't have to go duplicate
the eventhandler locking three more times for each of at_fork, at_exec, and
at_exit.
Reviewed by: phk, jake, almost complete silence on arch@
struct proc as p_tracecred alongside the current cache of the vnode in
p_tracep. This credential is then used for all later ktrace operations on
this file rather than using the credential of the current thread at the
time of each ktrace event.
- Now that we have multiple ktrace-related items in struct proc that are
pointers, rename p_tracep to p_tracevp to make it less ambiguous.
Requested by: rwatson (1)
I'm not convinced there is anything major wrong with the patch but
them's the rules..
I am using my "David's mentor" hat to revert this as he's
offline for a while.
data structure called kse_upcall to manage UPCALL. All KSE binding
and loaning code are gone.
A thread owns an upcall can collect all completed syscall contexts in
its ksegrp, turn itself into UPCALL mode, and takes those contexts back
to userland. Any thread without upcall structure has to export their
contexts and exit at user boundary.
Any thread running in user mode owns an upcall structure, when it enters
kernel, if the kse mailbox's current thread pointer is not NULL, then
when the thread is blocked in kernel, a new UPCALL thread is created and
the upcall structure is transfered to the new UPCALL thread. if the kse
mailbox's current thread pointer is NULL, then when a thread is blocked
in kernel, no UPCALL thread will be created.
Each upcall always has an owner thread. Userland can remove an upcall by
calling kse_exit, when all upcalls in ksegrp are removed, the group is
atomatically shutdown. An upcall owner thread also exits when process is
in exiting state. when an owner thread exits, the upcall it owns is also
removed.
KSE is a pure scheduler entity. it represents a virtual cpu. when a thread
is running, it always has a KSE associated with it. scheduler is free to
assign a KSE to thread according thread priority, if thread priority is changed,
KSE can be moved from one thread to another.
When a ksegrp is created, there is always N KSEs created in the group. the
N is the number of physical cpu in the current system. This makes it is
possible that even an userland UTS is single CPU safe, threads in kernel still
can execute on different cpu in parallel. Userland calls kse_create to add more
upcall structures into ksegrp to increase concurrent in userland itself, kernel
is not restricted by number of upcalls userland provides.
The code hasn't been tested under SMP by author due to lack of hardware.
Reviewed by: julian
dereferenced when a process exits due to the vmspace ref-count being
bumped. Change shmexit() and shmexit_myhook() to take a vmspace instead
of a process and call it in vmspace_dofree(). This way if it is missed
in exit1()'s early-resource-free it will still be caught when the zombie is
reaped.
Also fix a potential race in shmexit_myhook() by NULLing out
vmspace->vm_shm prior to calling shm_delete_mapping() and free().
MFC after: 7 days
On architectures with a non-executable stack, eg sparc64, this is used by
libgcc to determine at runtime if its necessary to enable execute permissions
on a region of the stack which will be used to execute code, allowing the
call to mprotect to be avoided if the kernel is configured to map the stack
executable.
problem was a locked directory vnode), do not give the process a chance
to sleep in state "stopevent" (depends on the S_EXEC bit being set in
p_stops) until most resources have been released again.
Approved by: re
indirectly through vm_page_protect(). The one remaining page flag that
is updated by vm_page_protect() is already being updated by our various
pmap implementations.
Note: A later commit will similarly change the VM_PROT_READ case and
eliminate vm_page_protect().
(1) Permit userland applications to request a change of label atomic
with an execve() via mac_execve(). This is required for the
SEBSD port of SELinux/FLASK. Attempts to invoke this without
MAC compiled in result in ENOSYS, as with all other MAC system
calls. Complexity, if desired, is present in policy modules,
rather than the framework.
(2) Permit policies to have access to both the label of the vnode
being executed as well as the interpreter if it's a shell
script or related UNIX nonsense. Because we can't hold both
vnode locks at the same time, cache the interpreter label.
SEBSD relies on this because it supports secure transitioning
via shell script executables. Other policies might want to
take both labels into account during an integrity or
confidentiality decision at execve()-time.
Approved by: re
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
entrypoints, #ifdef MAC. The supporting logic already existed in
kern_mac.c, so no change there. This permits MAC policies to cause
a process label change as the result of executing a binary --
typically, as a result of executing a specially labeled binary.
For example, the SEBSD port of SELinux/FLASK uses this functionality
to implement TE type transitions on processes using transitioning
binaries, in a manner similar to setuid. Policies not implementing
a notion of transition (all the ones in the tree right now) require
no changes, since the old label data is copied to the new label
via mac_create_cred() even if a transition does occur.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
describes an image activation instance. Instead, make use of the
existing fname structure entry, and introduce two new entries,
userspace_argv, and userspace_envv. With the addition of
mac_execve(), this divorces the image structure from the specifics
of the execve() system call, removes a redundant pointer, etc.
No semantic change from current behavior, but it means that the
structure doesn't depend on syscalls.master-generated includes.
There seems to be some redundant initialization of imgact entries,
which I have maintained, but which could probably use some cleaning
up at some point.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
the locking of the proc lock after the goto to done1 to avoid locking
the lock in an error case just so we can turn around and unlock it.
- Move the exec_setregs() stuff out from under the proc lock and after
the p_args stuff. This allows exec_setregs() to be able to sleep or
write things out to userland, etc. which ia64 does.
Tested by: peter
constants VM_MIN_ADDRESS, VM_MAXUSER_ADDRESS, USRSTACK and PS_STRINGS.
This is mainly so that they can be variable even for the native abi, based
on different machine types. Get stack protections from the sysentvec too.
This makes it trivial to map the stack non-executable for certain abis, on
machines that support it.
recursion when closef() calls pfind() which also wants the proc lock.
This case only occurred when setugidsafety() needed to close unsafe files.
Reviewed by: truckman
s/SNGL/SINGLE/
s/SNGLE/SINGLE/
Fix abbreviation for P_STOPPED_* etc flags, in original code they were
inconsistent and difficult to distinguish between them.
Approved by: julian (mentor)
sysentvec. Initialized all fields of all sysentvecs, which will allow
them to be used instead of constants in more places. Provided stack
fixup routines for emulations that previously used the default.
kernel access control.
Invoke an appropriate MAC entry point to authorize execution of
a file by a process. The check is placed slightly differently
than it appears in the trustedbsd_mac tree so that it prevents
a little more information leakage about the target of the execve()
operation.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
special actions for safety. One of these is to make sure that file descriptors
0..2 are in use, by opening /dev/null for those that are not already open.
Another is to close any file descriptors 0..2 that reference procfs. However,
these checks were made out of order, so that it was still possible for a
set-user-ID or set-group-ID process to be started with some of the file
descriptors 0..2 unused.
Submitted by: Georgi Guninski <guninski@guninski.com>
during execve() to use a 'credential_changing' variable. This makes it
easier to have outstanding patchsets against this code, as well as to
add conditionally defined clauses.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
handler in the kernel at the same time. Also, allow for the
exec_new_vmspace() code to build a different sized vmspace depending on
the executable environment. This is a big help for execing i386 binaries
on ia64. The ELF exec code grows the ability to map partial pages when
there is a page size difference, eg: emulating 4K pages on 8K or 16K
hardware pages.
Flesh out the i386 emulation support for ia64. At this point, the only
binary that I know of that fails is cvsup, because the cvsup runtime
tries to execute code in pages not marked executable.
Obtained from: dfr (mostly, many tweaks from me).
- Grab the vnode object early in exec when we still have the vnode lock.
- Cache the object in the image_params.
- Make use of the cached object in imgact_*.c
The ability to schedule multiple threads per process
(one one cpu) by making ALL system calls optionally asynchronous.
to come: ia64 and power-pc patches, patches for gdb, test program (in tools)
Reviewed by: Almost everyone who counts
(at various times, peter, jhb, matt, alfred, mini, bernd,
and a cast of thousands)
NOTE: this is still Beta code, and contains lots of debugging stuff.
expect slight instability in signals..
uifind() with a proc lock held.
change_ruid() and change_euid() have been modified to take a uidinfo
structure which will be pre-allocated by callers, they will then
call uihold() on the uidinfo structure so that the caller's logic
is simplified.
This allows one to call uifind() before locking the proc struct and
thereby avoid a potential blocking allocation with the proc lock
held.
This may need revisiting, perhaps keeping a spare uidinfo allocated
per process to handle this situation or re-examining if the proc
lock needs to be held over the entire operation of changing real
or effective user id.
Submitted by: Don Lewis <dl-freebsd@catspoiler.org>
locks the process.
- Defer other blocking operations such as vrele()'s until after we
release locks.
- execsigs() now requires the proc lock to be held when it is called
rather than locking the process internally.
that we can compile gcc. This is a hack because it adds a fixed 2MB to
each process's VSIZE regardless of how much is really being used since
there is no grow-up stack support. At least it isn't physical memory.
Sigh.
Add a sysctl to enable tweaking it for new processes.
general cleanup of the API. The entire API now consists of two functions
similar to the pre-KSE API. The suser() function takes a thread pointer
as its only argument. The td_ucred member of this thread must be valid
so the only valid thread pointers are curthread and a few kernel threads
such as thread0. The suser_cred() function takes a pointer to a struct
ucred as its first argument and an integer flag as its second argument.
The flag is currently only used for the PRISON_ROOT flag.
Discussed on: smp@
There is still some locations where the PROC lock should be held
in order to prevent inconsistent views from outside (like the
proc->p_fd fix for kern/vfs_syscalls.c:checkdirs()) that can be
fixed later.
Submitted by: Jonathan Mini <mini@haikugeek.com>
pmap_qremove. pmap_kenter is not safe to use in MI code because it is not
guaranteed to flush the mapping from the tlb on all cpus. If the process
in question is preempted and migrates cpus between the call to pmap_kenter
and pmap_kremove, the original cpu will be left with stale mappings in its
tlb. This is currently not a problem for i386 because we do not use PG_G on
SMP, and thus all mappings are flushed from the tlb on context switches, not
just user mappings. This is not the case on all architectures, and if PG_G
is to be used with SMP on i386 it will be a problem. This was committed by
peter earlier as part of his fine grained tlb shootdown work for i386, which
was backed out for other reasons.
Reviewed by: peter
There is some unresolved badness that has been eluding me, particularly
affecting uniprocessor kernels. Turning off PG_G helped (which is a bad
sign) but didn't solve it entirely. Userland programs still crashed.
shootdowns in a couple of key places. Do the same for i386. This also
hides some physical addresses from higher levels and has it use the
generic vm_page_t's instead. This will help for PAE down the road.
Obtained from: jake (MI code, suggestions for MD part)
this is a low-functionality change that changes the kernel to access the main
thread of a process via the linked list of threads rather than
assuming that it is embedded in the process. It IS still embeded there
but remove all teh code that assumes that in preparation for the next commit
which will actually move it out.
Reviewed by: peter@freebsd.org, gallatin@cs.duke.edu, benno rice,
Seigo Tanimura (tanimura) posted the initial delta.
I've polished it quite a bit reducing the need for locking and
adapting it for KSE.
Locks:
1 mutex in each filedesc
protects all the fields.
protects "struct file" initialization, while a struct file
is being changed from &badfileops -> &pipeops or something
the filedesc should be locked.
1 mutex in each struct file
protects the refcount fields.
doesn't protect anything else.
the flags used for garbage collection have been moved to
f_gcflag which was the FILLER short, this doesn't need
locking because the garbage collection is a single threaded
container.
could likely be made to use a pool mutex.
1 sx lock for the global filelist.
struct file * fhold(struct file *fp);
/* increments reference count on a file */
struct file * fhold_locked(struct file *fp);
/* like fhold but expects file to locked */
struct file * ffind_hold(struct thread *, int fd);
/* finds the struct file in thread, adds one reference and
returns it unlocked */
struct file * ffind_lock(struct thread *, int fd);
/* ffind_hold, but returns file locked */
I still have to smp-safe the fget cruft, I'll get to that asap.
Remove the explicit call to aio_proc_rundown() from exit1(), instead AIO
will use at_exit(9).
Add functions at_exec(9), rm_at_exec(9) which function nearly the
same as at_exec(9) and rm_at_exec(9), these functions are called
on behalf of modules at the time of execve(2) after the image
activator has run.
Use a modified version of tegge's suggestion via at_exec(9) to close
an exploitable race in AIO.
Fix SYSCALL_MODULE_HELPER such that it's archetecuterally neutral,
the problem was that one had to pass it a paramater indicating the
number of arguments which were actually the number of "int". Fix
it by using an inline version of the AS macro against the syscall
arguments. (AS should be available globally but we'll get to that
later.)
Add a primative system for dynamically adding kqueue ops, it's really
not as sophisticated as it should be, but I'll discuss with jlemon when
he's around.
In this case, C99's __func__ is properly defined as:
static const char __func__[] = "function-name";
and GCC 3.1 will not allow it to be used in bogus string concatenation.
it has not yet returned. Use this flag to deny debugging requests while
the process is execve()ing, and close once and for all any race conditions
that might occur between execve() and various debugging interfaces.
Reviewed by: jhb, rwatson
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.
Sorry john! (your next MFC will be a doosie!)
Reviewed by: peter@freebsd.org, dillon@freebsd.org
X-MFC after: ha ha ha ha
(this commit is just the first stage). Also add various GIANT_ macros to
formalize the removal of Giant, making it easy to test in a more piecemeal
fashion. These macros will allow us to test fine-grained locks to a degree
before removing Giant, and also after, and to remove Giant in a piecemeal
fashion via sysctl's on those subsystems which the authors believe can
operate without Giant.
lock. We now use temporary variables to save the process argument pointer
and just update the pointer while holding the lock. We then perform the
free on the cached pointer after releasing the lock.
the saved uid and gid during execve(). Unfortunately, the optimizations
were incorrect in the case where the credential was updated, skipping
the setting of the saved uid and gid when new credentials were generated.
This change corrects that problem by handling the newcred!=NULL case
correctly.
Reported/tested by: David Malone <dwmalone@maths.tcd.ie>
Obtained from: TrustedBSD Project
real uid, saved uid, real gid, and saved gid to ucred, as well as the
pcred->pc_uidinfo, which was associated with the real uid, only rename
it to cr_ruidinfo so as not to conflict with cr_uidinfo, which
corresponds to the effective uid.
o Remove p_cred from struct proc; add p_ucred to struct proc, replacing
original macro that pointed.
p->p_ucred to p->p_cred->pc_ucred.
o Universally update code so that it makes use of ucred instead of pcred,
p->p_ucred instead of p->p_pcred, cr_ruidinfo instead of p_uidinfo,
cr_{r,sv}{u,g}id instead of p_*, etc.
o Remove pcred0 and its initialization from init_main.c; initialize
cr_ruidinfo there.
o Restruction many credential modification chunks to always crdup while
we figure out locking and optimizations; generally speaking, this
means moving to a structure like this:
newcred = crdup(oldcred);
...
p->p_ucred = newcred;
crfree(oldcred);
It's not race-free, but better than nothing. There are also races
in sys_process.c, all inter-process authorization, fork, exec, and
exit.
o Remove sigio->sio_ruid since sigio->sio_ucred now contains the ruid;
remove comments indicating that the old arrangement was a problem.
o Restructure exec1() a little to use newcred/oldcred arrangement, and
use improved uid management primitives.
o Clean up exit1() so as to do less work in credential cleanup due to
pcred removal.
o Clean up fork1() so as to do less work in credential cleanup and
allocation.
o Clean up ktrcanset() to take into account changes, and move to using
suser_xxx() instead of performing a direct uid==0 comparision.
o Improve commenting in various kern_prot.c credential modification
calls to better document current behavior. In a couple of places,
current behavior is a little questionable and we need to check
POSIX.1 to make sure it's "right". More commenting work still
remains to be done.
o Update credential management calls, such as crfree(), to take into
account new ruidinfo reference.
o Modify or add the following uid and gid helper routines:
change_euid()
change_egid()
change_ruid()
change_rgid()
change_svuid()
change_svgid()
In each case, the call now acts on a credential not a process, and as
such no longer requires more complicated process locking/etc. They
now assume the caller will do any necessary allocation of an
exclusive credential reference. Each is commented to document its
reference requirements.
o CANSIGIO() is simplified to require only credentials, not processes
and pcreds.
o Remove lots of (p_pcred==NULL) checks.
o Add an XXX to authorization code in nfs_lock.c, since it's
questionable, and needs to be considered carefully.
o Simplify posix4 authorization code to require only credentials, not
processes and pcreds. Note that this authorization, as well as
CANSIGIO(), needs to be updated to use the p_cansignal() and
p_cansched() centralized authorization routines, as they currently
do not take into account some desirable restrictions that are handled
by the centralized routines, as well as being inconsistent with other
similar authorization instances.
o Update libkvm to take these changes into account.
Obtained from: TrustedBSD Project
Reviewed by: green, bde, jhb, freebsd-arch, freebsd-audit
vm_mtx does not recurse and is required for most low level
vm operations.
faults can not be taken without holding Giant.
Memory subsystems can now call the base page allocators safely.
Almost all atomic ops were removed as they are covered under the
vm mutex.
Alpha and ia64 now need to catch up to i386's trap handlers.
FFS and NFS have been tested, other filesystems will need minor
changes (grabbing the vm lock when twiddling page properties).
Reviewed (partially) by: jake, jhb
other "system" header files.
Also help the deprecation of lockmgr.h by making it a sub-include of
sys/lock.h and removing sys/lockmgr.h form kernel .c files.
Sort sys/*.h includes where possible in affected files.
OK'ed by: bde (with reservations)
passed vnode must be locked; this is the case because of calls
to VOP_GETATTR(), VOP_ACCESS(), and VOP_OPEN(). This becomes
more of an issue when VOP_ACCESS() gets a bit more complicated,
which it does when you introduce ACL, Capability, and MAC
support.
Obtained from: TrustedBSD Project
chgsbsize(), which are called rather frequently and may be called from an
interrupt context in the case of chgsbsize(). Instead, do the hash table
lookup and maintenance when credentials are changed, which is a lot less
frequent. Add pointers to the uidinfo structures to the ucred and pcred
structures for fast access. Pass a pointer to the credential to chgproccnt()
and chgsbsize() instead of passing the uid. Add a reference count to the
uidinfo structure and use it to decide when to free the structure rather
than freeing the structure when the resource consumption drops to zero.
Move the resource tracking code from kern_proc.c to kern_resource.c. Move
some duplicate code sequences in kern_prot.c to separate helper functions.
Change KASSERTs in this code to unconditional tests and calls to panic().
SYSCTL_LONG macro to be consistent with other integer sysctl variables
and require an initial value instead of assuming 0. Update several
sysctl variables to use the unsigned types.
PR: 15251
Submitted by: Kelly Yancey <kbyanc@posi.net>
program running under linux emulation, the script binary is checked for
in /compat/linux first. Without this patch the wrong script binary
(i.e. the FreeBSD binary) will be run instead of the linux binary.
For example, #!/bin/sh, thus breaking out of linux compatibility mode.
This solves a number of problems people have had installing linux
software on FreeBSD boxes.