readability.
o Conditionalize only the SYSCTL definitions for the regression
tree, not the variables itself, decreasing the number of #ifdef
REGRESSIONs scattered in kern_mib.c, and making the code more
readable.
Sponsored by: DARPA, NAI Labs
For an object type, we maintain a variable mb_mapfull. It is 0 by default
and is only raised to 1 in one place: when an mb_pop_cont() fails for
the first time, on the assumption that the reason for the failure is
due to the underlying map for the object (e.g. clust_map, mbuf_map) being
exhausted.
Problem and Changes:
Change how we define "mb_mapfull." It now means: "set to 1 when the first
mb_pop_cont() fails only in the kmem_malloc()-ing of the object, and
only if the call was with the M_TRYWAIT flag." This is a more conservative
definition and should avoid odd [but theoretically possible] situations
from occuring. i.e. we had set mb_mapfull to 1 thinking the map for the
object was actually exhausted when we _actually_ failed in malloc()ing
the space for the bucket structure managing the objects in the page
we're allocating.
The kernel certainly doesn't use _PATH_DEV or even /dev/ to find the device.
It cannot, since "/" has not been mounted. Maybe the only affect of using
/dev/ is that it gets put in the mounted-from name for "/", so that mount(8),
etc., display an absolute path before "/" has been remounted. Many have
never bothered typing the full path, and code that constructs a path in
rootdevnames[] never bothered to construct a full path, so the example
shouldn't have it.
Submitted by: bde
vnodes. This will hopefully serve as a base from which we can
expand the MP code. We currently do not attempt to obtain any
mutex or SX locks, but the door is open to add them when we nail
down exactly how that part of it is going to work.
Replace uses of holdfp() with fget*() or fgetvp*() calls as appropriate
introduce fget(), fget_read(), fget_write() - these functions will take
a thread and file descriptor and return a file pointer with its ref
count bumped.
introduce fgetvp(), fgetvp_read(), fgetvp_write() - these functions will
take a thread and file descriptor and return a vref()'d vnode.
*_read() requires that the file pointer be FREAD, *_write that it be
FWRITE.
This continues the cleanup of struct filedesc and struct file access
routines which, when are all through with it, will allow us to then
make the API calls MP safe and be able to move Giant down into the fo_*
functions.
- Restore inferior() to being iterative rather than recursive.
- Assert that the proctree_lock is held in inferior() and change the one
caller to get a shared lock of it. This also ensures that we hold the
lock after performing the check so the check can't be made invalid out
from under us after the check but before we act on it.
Requested by: bde
one to perform a vn_open using temporary/other/fake credentials.
Modify the nfs client side locking code to use vn_open_cred() passing
proc0's ucred instead of the old way which was to temporary raise
privs while running vn_open(). This should close the race hopefully.
as being valid. Previously only the magic number and the virtual
address were checked, but it makes little sense to require that
the virtual address is the same (the message buffer is located at
the end of physical memory), and checks on the msg_bufx and msg_bufr
indices were missing.
Submitted by: Bodo Rueskamp <br@clabsms.de>
Tripped over during a kernel debugging tutorial given by: grog
Reviewed by: grog, dwmalone
MFC after: 1 week
sysctl_req', which describes in-progress sysctl requests. This permits
sysctl handlers to have access to the current thread, permitting work
on implementing td->td_ucred, migration of suser() to using struct
thread to derive the appropriate ucred, and allowing struct thread to be
passed down to other code, such as network code where td is not currently
available (and curproc is used).
o Note: netncp and netsmb are not updated to reflect this change, as they
are not currently KSE-adapted.
Reviewed by: julian
Obtained from: TrustedBSD Project
consistent with the one other function in the file, and prevents long
lines in up-coming changes. This nominally pulls kern_mib.c a little
further down the long path to style(9) compliance.
in wdrain during a write. This flag needs to be used in devices whos
strategy routines turn-around and issue another high level I/O, such as
when MD turns around and issues a VOP_WRITE to vnode backing store, in order
to avoid deadlocking the dirty buffer draining code.
Remove a vprintf() warning from MD when the backing vnode is found to be
in-use. The syncer of buf_daemon could be flushing the backing vnode at
the time of an MD operation so the warning is not correct.
MFC after: 1 week
structure changes now rather then piecemeal later on. mnt_nvnodelist
currently holds all the vnodes under the mount point. This will eventually
be split into a 'dirty' and 'clean' list. This way we only break kld's once
rather then twice. nvnodelist will eventually turn into the dirty list
and should remain compatible with the klds.
by one - see _SIG_IDX(). Revert part of my mis-correction in kern_sig.c
(but signal 0 still has to be allowed) and fix _SIG_VALID() (it was
rejecting ignal 128).
argument of 0. You cannot return EINVAL for signal 0. This broke
(in 5 minutes of testing) at least ssh-agent and screen.
However, there was a bug in the original code. Signal 128 is not
valid.
Pointy-hat to: des, jhb
as suser_td(td) works as well as suser_xxx(NULL, p->p_ucred, 0);
This simplifies upcoming changes to suser(), and causes this code
to use the right credential (well, largely) once the td->td_ucred
changes are complete. There remains some redundancy and oddness
in this code, which should be rethought after the next batch of
suser and credential changes.
in vfs_syscalls.c. Although it did save some indirection, many of
those savings will be obscured with the impending commit of suser()
changes, and the result is increased code complexity. Also, once
p->p_ucred and td->td_ucred are distinguished, this will make
vfs_mount() use the correct thread credential, rather than the
process credential.
debug another process based on their respective {effective,additional,
saved,real} gid's. p1 is only permitted to debug p2 if its effective
gids (egid + additional groups) are a strict superset of the gids of
p2. This implements properly the security test previously incorrectly
implemented in kern_ktrace.c, and is consistent with the kernel
security policy (although might be slightly confusing for those more
familiar with the userland policy).
o Restructure p_candebug() logic so that various results are generated
comparing uids, gids, credential changes, and then composed in a
single check before testing for privilege. These tests encapsulate
the "BSD" inter-process debugging policy. Other non-BSD checks remain
seperate. Additional comments are added.
Submitted by: tmm, rwatson
Obtained from: TrustedBSD Project
Reviewed by: petef, tmm, rwatson
really be moved elsewhere: p_candebug() encapsulates the security
policy decision, whereas the P_INEXEC check has to do with "correctness"
regarding race conditions, rather than security policy.
Example: even if no security protections were enforced (the "uids are
advisory" model), removing P_INEXEC could result in incorrect operation
due to races on credential evaluation and modification during execve().
Obtained from: TrustedBSD Project
o Reorder and synchronize #include's, including moving "opt_cap.h" to
above system includes.
o Introduce #ifdef'd kern.security.capabilities sysctl tree, including
kern.security.capabilities.enabled, which defaults to 0.
The rest of the file remains stubs for the time being.
Obtained from: TrustedBSD Project
o POSIX.1e capabilities authorize overriding of VEXEC for VDIR based
on CAP_DAC_READ_SEARCH, but of !VDIR based on CAP_DAC_EXECUTE. Add
appropriate conditionals to vaccess() to take that into account.
o Synchronization cap_check_xxx() -> cap_check() change.
Obtained from: TrustedBSD Project
o Add reference to TrustedBSD Project in license header.
o Update dated comments, including comment in extattr.h claiming that
no file systems support extended attributes.
o Improve comment consistency.
credential selection, rather than reference via a thread or process
pointer. This is part of a gradual migration to suser() accepting
a struct ucred instead of a struct proc, simplifying the reference
and locking semantics of suser().
Obtained from: TrustedBSD Project
- Now that apm loadable module can inform its existence to other kernel
components (e.g. i386/isa/clock.c:startrtclock()'s TCS hack).
- Exchange priority of SI_SUB_CPU and SI_SUB_KLD for above purpose.
- Add simple arbitration mechanism for APM vs. ACPI. This prevents
the kernel enables both of them.
- Remove obsolete `#ifdef DEV_APM' related code.
- Add abstracted interface for Powermanagement operations. Public apm(4)
functions, such as apm_suspend(), should be replaced new interfaces.
Currently only power_pm_suspend (successor of apm_suspend) is implemented.
Reviewed by: peter, arch@ and audit@
function symbols in the kernel in a list of C strings, with an extra
nul-termination at the end.
This sysctl requires addition of a new linker operation. Now,
linker_file_t's need to respond to "each_function_name" to export
their function symbols.
Note that the sysctl doesn't currently allow distinguishing multiple
symbols with the same name from different modules, but could quite
easily without a change to the linker operation. This will be a nicety
to have when it can be used.
Obtained from: NAI Labs CBOSS project
Funded by: DARPA
by the profiler on a running system. This is not done sparsely, as
memory is cheaper than processor speed and each gprof mcount() and
mexitcount() operation is already very expensive.
Obtained from: NAI Labs CBOSS project
Funded by: DARPA
number) instead of allocating next free unit for them. If someone needs
fixed place, he must specify it correctly. "Allocating next" is especially bad
because leads to double device detection and to "repeat make_dev panic" as
result. This can happens if the same devices present somewhere on PCI bus,
hints and ACPI. Making them present in one place only not always
possible, "sc" f.e. can't be removed from hints, it results to no console at
all.
2) In make_device(), detect when devclass_add_device() fails, free dev and
return. I.e. add missing error checking. This part needed to finish fix in 1),
but must be done this way in anycase, with old variant too.
non-existent disk in a legacy /dev on a DEVFS system would panic the system
if stat(2)'ed.
Do not whine about anonymous device nodes not having a si_devsw, they're
not supposed to.
Make it a panic to repeat make_dev() or destroy_dev(), this check
should maybe be neutered when -current goes -stable.
Whine if devsw() is called on anon dev_t's in a devfs system.
Make a hack to avoid our lazy-eval disk code triggering the above whine.
Fix the multiple make_dev() in disk code by making ${disk}${unit}s${slice}
an alias/symlink to ${disk}${unit}s${slice}c
it has not yet returned. Use this flag to deny debugging requests while
the process is execve()ing, and close once and for all any race conditions
that might occur between execve() and various debugging interfaces.
Reviewed by: jhb, rwatson
strip the space from '( struct thread *...', wrap long lines.
o Remove an unneeded comment on the topic of no lock being required as
part of the NDINIT() in __acl_get_file(), as it's really not required
there.
Obtained from: TrustedBSD Project
of Giant during the Giant unwinding phase, and start work on instrumenting
Giant for the file and proc mutexes.
These wrappers allow developers to turn on and off Giant around various
subsystems. DEVELOPERS SHOULD NEVER TURN OFF GIANT AROUND A SUBSYSTEM JUST
BECAUSE THE SYSCTL EXISTS! General developers should only considering
turning on Giant for a subsystem whos default is off (to help track down
bugs). Only developers working on particular subsystems who know what
they are doing should consider turning off Giant.
These wrappers will greatly improve our ability to unwind Giant and test
the kernel on a (mostly) subsystem by subsystem basis. They allow Giant
unwinding developers (GUDs) to emplace appropriate subsystem and structural
mutexes in the main tree and then request that the larger community test
the work by turning off Giant around the subsystem(s), without the larger
community having to mess around with patches. These wrappers also allow
GUDs to boot into a (more likely to be) working system in the midst of
their unwinding work and to test that work under more controlled
circumstances.
There is a master sysctl, kern.giant.all, which defaults to 0 (off). If
turned on it overrides *ALL* other kern.giant sysctls and forces Giant to
be turned on for all wrapped subsystems. If turned off then Giant around
individual subsystems are controlled by various other kern.giant.XXX sysctls.
Code which overlaps multiple subsystems must have all related subsystem Giant
sysctls turned off in order to run without Giant.
while it is on a queue with the queue lock and remove the per-task locks.
- Remove TASK_DESTROY now that it is no longer needed.
- Go back to inlining TASK_INIT now that it is short again.
Inspired by: dfr
userland. The per thread ucred reference is immutable and thus needs no
locks to be read. However, until all the proc locking associated with
writes to p_ucred are completed, it is still not safe to use the per-thread
reference.
Tested on: x86 (SMP), alpha, sparc64
queue, and a mutex to protect the global list of taskqueues. The only
visible change is that a TASK_DESTROY() macro has been added to mirror
the TASK_INIT() macro to destroy a task before it is free'd.
Submitted by: Andrew Reiter <awr@watson.org>
splhigh() before the mtx_unlock and tsleep(). The splhigh() was probably
correct in the original code using simplelocks but is not correct in
5.0-current.
Noticed by: Andrew Reiter <awr@FreeBSD.org>
real effect.
Optimize vfs_msync(). Avoid having to continually drop and re-obtain
mutexes when scanning the vnode list. Improves looping case by 500%.
Optimize ffs_sync(). Avoid having to continually drop and re-obtain
mutexes when scanning the vnode list. This makes a couple of assumptions,
which I believe are ok, in regards to vnode stability when the mount list
mutex is held. Improves looping case by 500%.
(more optimization work is needed on top of these fixes)
MFC after: 1 week
or the cluster will not be properly merged. Dup the code from
cluster_wbuild() and add some printf()s to see if bad cases are present.
MFC after: 2 weeks
returns an success/failure code rather than the actual value.
- Add getenv_string() which copies a string from the environment to another
string and returns true on success.
all successful calls to VOP_OPEN() might be reflected in a call to
VOP_CLOSE(). For now, simply add a comment reflecting this problem;
this should be fixed at some point.
terminated and flushes pending dirty pages it is possible for the
object to be ref'd (0->1) and then deref'd (1->0) during termination.
We do not terminate the object a second time.
Document vop_stdgetvobject() to explicitly allow it to be called without
the vnode interlock held (for upcoming sync_msync() and ffs_sync()
performance optimizations)
MFC after: 3 days
the system load average. Previously, the load average measurement
was susceptible to synchronisation with processes that run at
regular intervals such as the system bufdaemon process.
Each interval is now chosen at random within the range of 4 to 6
seconds. This large variation is chosen so that over the shorter
5-minute load average timescale there is a good dispersion of
samples across the 5-second sample period (the time to perform 60
5-second samples now has a standard deviation of approx 4.5 seconds).
to kern_synch.c in preparation for adding some jitter to the
inter-sample time.
Note that the "vm.loadavg" sysctl still lives in vm_meter.c which
isn't the right place, but it is appropriate for the current (bad)
name of that sysctl.
Suggested by: jhb (some time ago)
Reviewed by: bde
off to witness_init() making the check for double intializating a lock by
testing the LO_INITIALIZED flag moot. Workaround this by checking the
LO_INITIALIZED flag ourself before we bzero the lock structure.
be so dangerous it isn't funny. eg: if you panic inside NFS or softdep,
and then try and sync you run into held locks and cause either deadlocks,
recursive panics or other interesting chaos. Default is unchanged.
number, portable OpenAFS applications don't have to attempt to determine
what system call number was dynamically allocated. No system call
prototype or implementation is defined.
Requested by: Tom Maher <tardis@watson.org>
This stops panics on unloading modules which define their own sysctl sets.
However, this also removes the protection against somebody actually
defining a static sysctl with an oid in the range of the dynamic ones,
which would break badly if there is already a dynamic sysctl with
the requested oid.
Apparently, the algorithm for removing sysctl sets needs a bit more work.
For the present, the panic I introduced only leads to Bad Things (tm).
Submitted by: many users of -current :(
Pointy hat to: roam (myself) for not testing rev. 1.112 enough.
- Add proc locking to the jail() syscall. This mostly involved shuffling
a few things around so that blockable things like malloc and copyin
were performed before acquiring the lock and checking the existing
ucred and then updating the ucred as one "atomic" change under the proc
lock.
- crhold() returns a reference to the ucred whose refcount it bumps.
- crcopy() now simply copies the credentials from one credential to
another and has no return value.
- a new crshared() primitive is added which returns true if a ucred's
refcount is > 1 and false (0) otherwise.
Updated by peter following KSE and Giant pushdown.
I've running with this patch for two week with no ill side effects.
PR: kern/12014: Fix SysV Semaphore handling
Submitted by: Peter Jeremy <peter.jeremy@alcatel.com.au>
a single kern.security.seeotheruids_permitted, describes as:
"Unprivileged processes may see subjects/objects with different real uid"
NOTE: kern.ps_showallprocs exists in -STABLE, and therefore there is
an API change. kern.ipc.showallsockets does not.
- Check kern.security.seeotheruids_permitted in cr_cansee().
- Replace visibility calls to socheckuid() with cr_cansee() (retain
the change to socheckuid() in ipfw, where it is used for rule-matching).
- Remove prison_unpcb() and make use of cr_cansee() against the UNIX
domain socket credential instead of comparing root vnodes for the
UDS and the process. This allows multiple jails to share the same
chroot() and not see each others UNIX domain sockets.
- Remove unused socheckproc().
Now that cr_cansee() is used universally for socket visibility, a variety
of policies are more consistently enforced, including uid-based
restrictions and jail-based restrictions. This also better-supports
the introduction of additional MAC models.
Reviewed by: ps, billf
Obtained from: TrustedBSD Project
processes to attach debugging to themselves even though the
global kern_unprivileged_procdebug_permitted policy might disallow
this.
o Move the kern_unprivileged_procdebug_permitted check above the
(p1==p2) check.
Reviewed by: des