abstract the securelevel implementation details from the checking
code. The call in -CURRENT accepts a struct ucred--in -STABLE, it
will accept struct proc. This facilitates the upcoming commit of
per-jail securelevel support. The calls will also generate a
kernel printf if the calls are made with NULL ucred/proc pointers:
generally speaking, there are few instances of this, and they should
be fixed.
o Update p_candebug() to use securelevel_gt(); future updates to the
remainder of the kernel tree will be committed soon.
Obtained from: TrustedBSD Project
transcription during the (pcred,ucred) merge; this was not used for
the kill() system call, so does not affect direct explicit process
signalling.
Pointed out by: fenner
This works if /dev exists, or if / is read/write (nfsroot). If it is
too hard, leave it up to init -d (which will probably fail if /dev does
not exist, but there isn't much else we can do short of making a union
mount on /).
This means we get a proper /dev if you boot a 5.x kernel on a 4.x world,
which I happen to do often (the ramdisks on our install netboot servers
have 4.x userland worlds on them).
Reviewed by: audit
Add tunables for the sem* and shm* syscontrols for tuning on boottime
until they become dynamic.
SAP R/3 doesn't like the compiled in defaults.
automatically change the code to add
struct proc *p = td->td_proc;
because now 'td' is probably capable of being NULL too.
I expect to see more of this kind of error during the 'weeding'
process. It's too easy to make. (junior hacker project.. look for these :-)
Submitted by: mark Peek <mp@freebsd.org>
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.
Sorry john! (your next MFC will be a doosie!)
Reviewed by: peter@freebsd.org, dillon@freebsd.org
X-MFC after: ha ha ha ha
dillon in an earlier e-mail.
- We don't need to test the console right before we vfprintf() the panicstr
message. The printing of the panic message is a fine console test by
itself and doesn't make useful messages scroll off the screen or tick
developers off in quite the same.
Requested by: jlemon, imp, bmilekic, chris, gsutter, jake (2)
'struct tty' was out of sync in user and kernel due to dev_t/udev_t
mixups. This takes advantage of the fact that dev_t changes type in
userland, so it isn't too pretty.
overflow if uap->nsops (which is already unsigned) is over INT_MAX;
consequently, the bounds check below becomes valid. Previously, if a
value over INT_MAX was passed in uap->nsops, the bounds check wouldn't
catch it, and the value would be used to compute copyin()'s third
argument.
Obtained from: NetBSD
it to the MI area. KSE touched cpu_wait() which had the same change
replicated five ways for each platform. Now it can just do it once.
The only MD parts seemed to be dealing with fpu state cleanup and things
like vm86 cleanup on x86. The rest was identical.
XXX: ia64 and powerpc did not have cpu_throw(), so I've put a functional
stub in place.
Reviewed by: jake, tmm, dillon
me (addition of vn_rdwr_inchunks). The problem Yahoo is solving is that
if you have large process images core dumping, or you have a large number of
forked processes all core dumping at the same time, the original coredump code
would leave the vnode locked throughout. This can cause the directory vnode
to get locked up, which can cause the parent directory vnode to get locked
up, and so on all the way to the root node, locking the entire machine up
for extremely long periods of time.
This patch solves the problem in two ways. First it uses an advisory
non-blocking lock to abort multiple processes trying to core to the same
file. Second (my contribution) it chunks up the writes and uses bwillwrite()
to avoid holding the vnode locked while blocking in the buffer cache.
Submitted by: ps
Reviewed by: dillon
MFC after: 2 weeks
has been forcibly unmounted. If the filesystem root vnode is reached
and it has no associated mountpoint (vp->v_mount == NULL), __getcwd
would return without freeing 'buf'. Add the missing free() call.
PR: kern/30306
Submitted by: Mike Potanin <potanin@mccme.ru>
MFC after: 1 week
and I still dont know why, this was not failing on the non-kse kernel.
It certainly should have since things were using linker_kernel_file
unconditionally. This has highlighted a different problem though that
means that trying to do a kldload on a non-dynamic kernel will implode.
Synchronize syscalls.master with all MPSAFE changes to date. Synchronize
new syscall generation follows because yield() will panic if it is out
of sync with syscalls.master.
by renaming it to kern.security.suser_enabled. This makes the name
consistent with other use: "permitted" now refers to a specific right
or privilege, whereas "enabled" refers to a feature. As this hasn't
been MFC'd, and using this destroys a running system currently, I believe
the user base of the sysctl will not be too unhappy.
o While I'm at it, un-staticize and export the supporting variable, as it
will be used by kern_cap.c shortly.
Obtained from: TrustedBSD Project
Instead introduce the [M] prefix to existing keywords. e.g.
MSTD is the MP SAFE version of STD. This is prepatory for a
massive Giant lock pushdown. The old MPSAFE keyword made
syscalls.master too messy.
Begin comments MP-Safe procedures with the comment:
/*
* MPSAFE
*/
This comments means that the procedure may be called without
Giant held (The procedure itself may still need to obtain
Giant temporarily to do its thing).
sv_prepsyscall() is now MP SAFE and assumed to be MP SAFE
sv_transtrap() is now MP SAFE and assumed to be MP SAFE
ktrsyscall() and ktrsysret() are now MP SAFE (Giant Pushdown)
trapsignal() is now MP SAFE (Giant Pushdown)
Places which used to do the if (mtx_owned(&Giant)) mtx_unlock(&Giant)
test in syscall[2]() in */*/trap.c now do not. Instead they
explicitly unlock Giant if they previously obtained it, and then
assert that it is no longer held to catch broken system calls.
Rebuild syscall tables.
KINFO_BSDI_SYSINFO. This supposedly fixes Netscape 3.0.4 (bsdi binary)
on -current. (and is also applicable to RELENG_4)
PR: 25476
Submitted by: Philipp Mergenthaler <un1i@rz.uni-karlsruhe.de>
file. ffs will discard any pending dirty pages when it is closed,
so we may as well not waste time trying to clean them. This doesn't
stop other things from writing it out, eg: pageout, fsync(2) etc.
level implementation stuff out of machine/globaldata.h to avoid exposing
UPAGES to lots more places. The end result is that we can double
the kernel stack size with 'options UPAGES=4' etc.
This is mainly being done for the benefit of a MFC to RELENG_4 at some
point. -current doesn't really need this so much since each interrupt
runs on its own kstack.
I'm at it also add a comment in mtx_validate() explaining the purpose
of the last change.
Basically, this fixes booting kernels compiled with MUTEX_DEBUG. What used
to happen is before we setidt from init386() [still using BTX idt], we
called mtx_init() on several mutex locks, notably Giant and some others.
This is a problem for MUTEX_DEBUG because it enables mtx_validate() which
calls kernacc(), some of which in turn requires Giant.
Fix by calling kernacc() from mtx_validate() only if (!cold).
then one can restart from a panic by resetting the panicstr variable to
NULL. This commit conditionalizes the previously committed functionality
on this variable. It also removes the __dead2 attribute from the panic()
function so that when one continues from a panic() the behavior will
be predictable.
sysproto.h in addition to the existing padding afterwards.
This is needed to support big-endian architectures like sparc64.
Reviewed by: bde
Tested on alpha by: jhb
timeout callwheel and buffer cache, out of the platform specific areas
and into the machine independant area. i386 and alpha adjusted here.
Other cpus can be fixed piecemeal.
Reviewed by: freebsd-smp, jake
callout_stop() would fail in two cases:
1) The timeout was currently executing, and
2) The timeout had already executed.
We only needed to work around the race for 1). We caught some instances
of 2) via the PS_TIMEOUT flag, however, if endtsleep() fired after the
process had been woken up but before it had resumed execution,
PS_TIMEOUT would not be set, but callout_stop() would fail, so we
would block the process until endtsleep() resumed it. Except that
endtsleep() had already run and couldn't resume it. This adds a new flag
PS_TIMOFAIL to indicate the case of 2) when PS_TIMEOUT isn't set.
- Implement this race fix for condition variables as well.
Tested by: sos
has existed for a long time, but I made it worse a few months ago
by by adding calls to VFS_ROOT() and checkdirs() in revision 1.179.
Also, remove the LK_REENABLE flag in the lockmgr() call; this flag
has been ignored by the lockmgr code for 4 years. This was the only
remaining mention of it apart from its definition.
Reviewed by: jhb
information. The default limits only effect machines with > 1GB of ram
and can be overriden with two new kernel conf variables VM_SWZONE_SIZE_MAX
and VM_BCACHE_SIZE_MAX, or with loader variables kern.maxswzone and
kern.maxbcache. This has the effect of leaving more KVM available for
sizing NMBCLUSTERS and 'maxusers' and should avoid tripups where a sysad
adds memory to a machine and then sees the kernel panic on boot due to
running out of KVM.
Also change the default swap-meta auto-sizing calculation to allocate half
of what it was previously allocating. The prior defaults were way too high.
Note that we cannot afford to run out of swap-meta structures so we still
stay somewhat conservative here.
`struct xucred` with the credentials of the connected peer.
Obviously this only works (and makes sense) on SOCK_STREAM
sockets. This works for both the connect(2) and listen(2)
callers.
There is precise documentation of the semantics in unix(4).
Reviewed by: dwmalone (eyeballed)
unnecessary breakage.
While here, use explicit sizes for the string fields so that we dont
have unintentional changes again in the future when key tunables change.
This still is not quite right, but a june userland is happy with
a -current kernel with these tweaks.
label if the dump device overflaps the label (which is a slight
misconfiguration). Dump routines don't use dscheck(), so the normal
write protection of the label doesn't help.
Reduced some nearby overflow bugs. In disk_dumpcheck(), there was
(fatal but fail-safe) overflow on i386's with 4GB of memory, at least
if Maxmem was the top page (can this happen?). The fix assumes that
the sector size divides PAGE_SIZE (dump routines already assume this).
In setdumpdev(), the corresponding overflow occurred with only about
2GB of memory on all machines with 32-bit ints. This allowed setdumpdev()
to succeed when it shouldn't have, but then disk_dumpcheck() failed
safe later. Except in old versions of FreeBSD like RELENG_3 where
there is no disk_dumpcheck().
PR: 28164 (label clobbering part)
MFC after: 1 week
structure is always free()ed yet only sometimes malloc()ed. In particular,
it was simply set to point to l_filename from the a linker_file_t in
link_elf_link_preload_finish(). The l_filename had been malloc()ed inside
the kern_linker.c module and was being free()ed twice: once by
link_elf_unload_file() and again by linker_file_unload(), leading to
a panic.
How to duplicate the problem:
- Pre-load a kernel module from the loader, i.e. if_sis.ko
- Boot system
- Attempt to unload module with kldunload if_sis
- Bewm
The problem here is that the case where the module was loaded with kldload
after system boot would work correctly, so this bug went unnoticed until
I stubbed my toe on it just now. (Also, you can only trip this bug if
you compile a kernel with options DDB, but that's the default now.)
Fix: remember to malloc() a separate copy of the module name for the
l_name member of the gdb linkage structure in three places where the
linkage structure can be initialized.
the process of exiting the kernel. The ast() function now loops as long
as the PS_ASTPENDING or PS_NEEDRESCHED flags are set. It returns with
preemption disabled so that any further AST's that arrive via an
interrupt will be delayed until the low-level MD code returns to user
mode.
- Use u_int's to store the tick counts for profiling purposes so that we
do not need sched_lock just to read p_sticks. This also closes a
problem where the call to addupc_task() could screw up the arithmetic
due to non-atomic reads of p_sticks.
- Axe need_proftick(), aston(), astoff(), astpending(), need_resched(),
clear_resched(), and resched_wanted() in favor of direct bit operations
on p_sflag.
- Fix up locking with sched_lock some. In addupc_intr(), use sched_lock
to ensure pr_addr and pr_ticks are updated atomically with setting
PS_OWEUPC. In ast() we clear pr_ticks atomically with clearing
PS_OWEUPC. We also do not grab the lock just to test a flag.
- Simplify the handling of Giant in ast() slightly.
Reviewed by: bde (mostly)
a time using the ogetdirentries() compatibility syscall. This is a
hack to ensure that rediculous values don't get passed to MALLOC().
Reviewed by: kris
for endtsleep() to be executing when msleep() resumed, for endtsleep()
to spin on sched_lock long enough for the other process to loop on
msleep() and sleep again resulting in endtsleep() waking up the "wrong"
msleep.
Obtained from: BSD/OS
removing the callout entry, return 1. If callout_stop() fails to remove
the callout entry because it is currently executing or has already been
executed, then the function returns 0. The idea was obtained from BSD/OS,
however, BSD/OS changed untimeout(), and I've just changed callout_stop()
to be more conservative.
Obtained from: BSD/OS
- Callers of asleep() and await() have been converted to calling tsleep().
The only caller outside of M_ASLEEP was the ata driver, which called both
asleep() and await() with spl-raised, so there was no need for the
asleep() and await() pair. M_ASLEEP was unused.
Reviewed by: jasone, peter
- Callers of asleep() and await() have been converted to calling tsleep().
The only caller outside of M_ASLEEP was the ata driver, which called both
asleep() and await() with spl-raised, so there was no need for the
asleep() and await() pair. M_ASLEEP was unused.
Reviewed by: jasone, peter
are a really nasty interface that should have been killed long ago
when 'ptrace(PT_[SG]ETREGS' etc came along. The entity that they
operate on (struct user) will not be around much longer since it
is part-per-process and part-per-thread in a post-KSE world.
gdb does not actually use this except for the obscure 'info udot'
command which does a hexdump of as much of the child's 'struct user'
as it can get. It carries its own #defines so it doesn't break
compiles.
filename passed in via the module loader functions in the GDB
"sharedlibrary" support structures. This isn't good, since the pointer
would become stale in almost every case (not the pre-loaded case, of
course).
Change this to malloc()ed copy of the string and finally fix the reason
that gdb -k's "sharedlibrary" command stopped working.
Obtained from: LOMAC/FreeBSD (cf. NAI Labs)
It didn't implement the proper /dev/fd functionality (which would be to
include in the directory listing /dev/fd/n if the process has fd n open)
anyway.
Anything needing access to /dev/fd/n where n > 2 can use the optional
fdescfs module, which implements this properly and does not cause any
trouble with devfs.
Discussed with: phk
bind() call on IPv4 sockets:
Currently, if one tries to bind a socket using INADDR_LOOPBACK inside a
jail, it will fail because prison_ip() does not take this possibility
into account. On the other hand, when one tries to connect(), for
example, to localhost, prison_remote_ip() will silently convert
INADDR_LOOPBACK to the jail's IP address. Therefore, it is desirable to
make bind() to do this implicit conversion as well.
Apart from this, the patch also replaces 0x7f000001 in
prison_remote_ip() to a more correct INADDR_LOOPBACK.
This is a 4.4-RELEASE "during the freeze, thanks" MFC candidate.
Submitted by: Anton Berezin <tobez@FreeBSD.org>
Discussed with at some point: phk
MFC after: 3 days
order to avoid namespace collision with subr_mchain.c's mb_init(). This
wasn't "fatal" as the mbuf initialization routine mb_init() was local to
subr_mbuf.c which in turn didn't pull in subr_mchain.c's mb_init()
declaration, but it should deffinately be changed now before it creates
headache.
This paniced my one of my machines one time too many :-( and there is
no sign of a solution in the pipeline. The deltas are still easily
available in cvs. The problem is that if the parent has been swapped
out, the child process cannot grope around in the parent's UPAGES to
see the sigact[] array or it will fault. This probably is a showstopper
for this implementation anyway.
defined to 0 in the non-SMP case, which very much makes sense as it
permits its usage in per-CPU initialization loops (for an example, check
out subr_mbuf.c).
Further, on a UP system, make mb_alloc always use the first per-CPU
container, regardless of cpuid (i.e. remove reliability on cpuid in the
UP case).
Requested by: alfred
asleep() and await() functions split the functionality of msleep() up into
two halves. Only the asleep() half (which is what puts the process on the
sleep queue) actually needs the lock usually passed to msleep() held to
prevent lost wakeups. await() does not need the lock held, so the lock
can be released prior to calling await() and does not need to be passed in
to the await() function. Typical usage of these functions would be as
follows:
mtx_lock(&foo_mtx);
... do stuff ...
asleep(&foo_cond, PRIxx, "foowt", hz);
...
mtx_unlock&foo_mtx);
...
await(-1, -1);
Inspired by: dillon on the couch at Usenix
of debugging the current process when that is in conflict with other
restrictions (such as jail, unprivileged_procdebug_permitted, etc).
o This corrects anomolies in the behavior of
kern.security.unprivileged_procdebug_permitted when using truss and
ktrace. The theory goes that this is now safe to use.
Obtained from: TrustedBSD Project
MIB entries.
o Relocate kern.suser_permitted to kern.security.suser_permitted.
o Introduce new kern.security.unprivileged_procdebug_permitted, which
(when set to 0) prevents processes without privilege from performing
a variety of inter-process debugging activities. The default is 1,
to provide current behavior.
This feature allows "hardened" systems to disable access to debugging
facilities, which have been associated with a number of past security
vulnerabilities. Previously, while procfs could be unmounted, other
in-kernel facilities (such as ptrace()) were still available. This
setting should not be modified on normal development systems, as it
will result in frustration. Some utilities respond poorly to
failing to get the debugging access they require, and error response
by these utilities may be improved in the future in the name of
beautification.
Note that there are currently some odd interactions with some
facilities, which will need to be resolved before this should be used
in production, including odd interactions with truss and ktrace.
Note also that currently, tracing is permitted on the current process
regardless of this flag, for compatibility with previous
authorization code in various facilities, but that will probably
change (and resolve the odd interactions).
Obtained from: TrustedBSD Project
dynamic symbol table buckets and chains. The sparc64 toolchain uses 32
bit .hash entries, unlike other 64 bits architectures (alpha), which use
64 bit entries.
Discussed with: dfr, jdp
FreeBSD _does_ define ENOMSG, so no need for checking if we support it.
Inspired by PR: 22470
Which was submitted by: Bjorn Tornqvist <bjorn@west.se>
MFC after: 1 week
VM caching of disks through mmap() and stopping syncing of open files
that had their last reference in the fs removed (ie: their unsync'ed
pages get discarded on close already, so I made it stop syncing too).
initialize in the right order to make derivative settings work right.
eg: at compile time, nmbufs was double nmbclusters. For POLA this should
work the same at runtime.
Tunables are now derived at boot time from maxusers. ie: change maxusers
via a tunable and all the derivative settings change. You can change
the other tunables individually as well. Even hz etc is tunable.
were indices in a dense array. The cpuids are a sparse set and treat
them as such, setting up containers only for CPUs activated during
mb_init().
- Fix netstat(1) and systat(1) to treat the per-CPU stats area as a sparse
map, in accordance with the above.
This allows us to properly boot with certain CPUs disactivated. However, if
we later decide to re-activate said CPUs, we will barf until we decide to
implement CPU spinon/spinoff callback hooks to allow for said CPUs' per-CPU
containers to get configured on their activation.
Reported by: mjacob
Partially (sys/ diffs) Submitted by: mjacob
static entries with oid's over 100, and defining enough dynamic entries
causes an overlap.
Move the "magic" value 0x100 into <sys/sysctl.h> where it belongs.
PR: 29131
Submitted by: "Alexander N. Kabaev" <kabaev@mail.ru>
Reviewed by: -arch, -audit
MFC after: 2 weeks
an unexpected user-visible side effect with the sigaction flags. Also cleanup
a minor union issue.
Submitted by: Rudolf Cejka <cejkar@dcse.fee.vutbr.cz>
MFC addendum: MFC will be combined w/ original commit
MFC after: 3 days
support. Trying to fix the merged set where dynamic overrode
static was getting more and more complicated by the day.
This should fix the duplicate atkbd, psm, fd* etc in GENERIC. (which
paniced the alpha, but not the i386)
The p_can(...) construct was a premature (and, it turns out,
awkward) abstraction. The individual calls to p_canxxx() better
reflect differences between the inter-process authorization checks,
such as differing checks based on the type of signal. This has
a side effect of improving code readability.
o Replace direct credential authorization checks in ktrace() with
invocation of p_candebug(), while maintaining the special case
check of KTR_ROOT. This allows ktrace() to "play more nicely"
with new mandatory access control schemes, as well as making its
authorization checks consistent with other "debugging class"
checks.
o Eliminate "privused" construct for p_can*() calls which allowed the
caller to determine if privilege was required for successful
evaluation of the access control check. This primitive is currently
unused, and as such, serves only to complicate the API.
Approved by: ({procfs,linprocfs} changes) des
Obtained from: TrustedBSD Project
(this commit is just the first stage). Also add various GIANT_ macros to
formalize the removal of Giant, making it easy to test in a more piecemeal
fashion. These macros will allow us to test fine-grained locks to a degree
before removing Giant, and also after, and to remove Giant in a piecemeal
fashion via sysctl's on those subsystems which the authors believe can
operate without Giant.
These take an additional mutex argument, which is dropped before any
processes are made runnable. This can avoid contention on the mutex
if the processes would immediately acquire it, and is done in such a
way that wakeups will not be lost.
Reviewed by: jhb
We already did this in the SMP case, and it is now maintained in the UP
case as well, and makes the code slightly more readable. Note that
curproc is always executing, thus the p != curproc test does not need to
be performed if the p_oncpu check is made.
We don't actually need to force a context switch of the current process.
The act of firing the event triggers a context switch to softclock() and
then switching back out again which is equivalent to a preemption, thus
no further work is needed on the local CPU.
allow call-specific authorization.
o Modify the authorization model so that p_can() is used to check
scheduling get/set events, using P_CAN_SEE for gets, and P_CAN_SCHED
for sets. This brings the checks in line with get/setpriority().
Obtained from: TrustedBSD Project