support routines in kern_acl.c:
- Define ACL_OVERRIDE_MASK and ACL_PRESERVE_MASK centrally in acl.h: the
mode bits that are (and aren't) stored in the ACL.
- Add acl_posix1e_acl_to_mode(): given a POSIX.1e extended ACL, generate
a compatibility mode (only the bits supported by the POSIX.1e ACL).
- acl_posix1e_newfilemode(): Given a requested creation mode and default
ACL, calculate the mode for the new file system object (only the bits
supported by the POSIX.1e ACL).
PR: 50148
Reported by: Ritz, Bruno <bruno_ritz@gmx.ch>
Obtained from: TrustedBSD Project
another thread. We use the td_oncpu member of the other field to locate
it's associated CPU and then search the that CPU's list of spin locks
contained in its per-CPU data. This is not always safe and may in fact
panic or just not work, but it is useful in at least one case.
on a non-blocking pipe in cases where select(2) returns the file
descriptor as ready for write. This in turns causes libc_r, for
one, to busy wait in such cases.
Note: it is a quick performance fix, a more complex fix might be
required in case this turns out to have unexpected side effects.
Reviewed by: silby
MFC after: 3 days
thread's pid to make debugging easier for people who don't want to have to
use the intended tool for these panics (witness).
Indirectly prodded by: kris
a long-standing mistake in the way a portion of a pipe's KVA is
allocated. Specifically, kmem_alloc_pageable() is inappropriate
for use in the "direct" case because it allows a preceding vm map entry
and vm object to be extended to support the new KVA allocation.
However, the direct case KVA allocation should not have a backing
vm object. This is corrected by using kmem_alloc_nofault().
Submitted by: tegge (with the above explanation by me)
kernel ACL interfaces and system call names.
Break out UFS2 and FFS extattr delete and list vnode operations from
setextattr and getextattr to deleteextattr and listextattr, which
cleans up the implementations, and makes the results more readable,
and makes the APIs more clear.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
kern.file sysctl, don't return information about processes that
fail p_cansee(td, p). This prevents sockstat and related
programs from seeing file descriptors owned by processes not
in the same jail as the thread, as well as having implications
for MAC, etc.
This is a partial solution: it permits an information leak about
the number of descriptors in the sizing calculation (but this is
not new information, you can also get it from kern.openfiles),
and doesn't attempt to mask file descriptors based on the
properties of the descriptor, only the process referencing it.
However, it provides most of what you want under most
circumstances, without complicating the locking.
PR: 54211
Based on a patch submitted by: Pawel Jakub Dawidek <nick@garage.freebsd.pl>
contain the filedescriptor number on opens from userland.
The index is used rather than a "struct file *" since it conveys a bit
more information, which may be useful to in particular fdescfs and /dev/fd/*
For now pass -1 all over the place.
comes across it, it will turn into a core dump in userland instead of
a kernel panic. I had also inverted the sense of the test, so
Double pointy hat to: mtm
- Make m_prepend use m_gethdr instead of m_get where
appropriate
- Make m_copym use m_gethdr instead of m_get where
appropriate
- Add a call to m_fixhdr in m_defrag; m_defrag can't
deal with corrupted pkthdr.len counts.
MFC after: 3 days
do not clear the `sb_sel' member of the sockbuf structure
while invalidating the receive sockbuf in sorflush(), called
from soshutdown().
The panic was reproduceable from user land by attaching a knote
with EVFILT_READ filters to a socket, disabling further reads
from it using shutdown(2), and then closing it. knote_remove()
was called to remove all knotes from the socket file descriptor
by detaching each using its associated filterops' detach call-
back function, sordetach() in this case, which tried to remove
itself from the invalidated sockbuf's klist (sb_sel.si_note).
PR: kern/54331
When a signal is being delivered to process, first find a sigwait
thread to deliver, POSIX's argument is speed of delivering signal
to sigwait thread is faster than other ways. A signal in its wait
set will cause sigwait to return the signal number, a signal not
in its wait set but in not blocked by the thread also causes sigwait
to return, but sigwait returns EINTR, sigwait is oneshot operation,
only one signal can be delivered to its wait set, when a signal is
delivered to the sigwait thread, the thread's sigwait state is canceled.
an appropriate error number after a failure condition.
In particular, three of the changed statements return ESRCH for a
failed pfind(), and in also three places a non-zero return
from p_cansee() will be passed back,
Also noticed by: rwatson
1. There was a race condition between a thread unlocking
a umtx and the thread contesting it. If the unlocking
thread won the race it may try to wakeup a thread that
was not yet in msleep(). The contesting thread would then
go to sleep to await a wakeup that would never come. It's
not possible to close the race by using a lock because
calls to casuptr() may have to fault a page in from swap.
Instead, the race was closed by introducing a flag that
the unlocking thread will set when waking up a thread.
The contesting thread will check for this flag before
going to sleep. For now the flag is kept in td_flags,
but it may be better to use some other member or create
a new one because of the possible performance/contention
issues of having to own sched_lock. Thanks to jhb for
pointing me in the right direction on this one.
2. Once a umtx was contested all future locks and unlocks
were happening in the kernel, regardless of whether it
was contested or not. To prevent this from happening,
when a thread locks a umtx it checks the queue for that
umtx and unsets the contested bit if there are no other
threads waiting on it. Again, this is slightly more
complicated than it needs to be because we can't hold
a lock across casuptr(). So, the thread has to check
the queue again after unseting the bit, and reset the
contested bit if it finds that another thread has put
itself on the queue in the mean time.
3. Remove the if... block for unlocking an uncontested
umtx, and replace it with a KASSERT. The _only_ time
a thread should be unlocking a umtx in the kernel is
if it is contested.
large to huge amounts of small or medium sized receive buffers. The problem
with these situations is that they eat up the available DMA address space
very quickly when using mbufs or even mbuf clusters. Additionally this
facility provides a direct mapping between 32-bit integers and these buffers.
This is needed for devices originally designed for 32-bit systems. Ususally
the virtual address of the buffer is used as a handle to find the buffer as
soon as it is returned by the card. This does not work for 64-bit machines
and hence this mapping is needed.
multiple mutex pools with different options and sizes. Mutex pools can
be created with either the default sleep mutexes or with spin mutexes.
A dynamically created mutex pool can now be destroyed if it is no longer
needed.
Create two pools by default, one that matches the existing pool that
uses the MTX_NOWITNESS option that should be used for building higher
level locks, and a new pool with witness checking enabled.
Modify the users of the existing mutex pool to use the appropriate pool
in the new implementation.
Reviewed by: jhb
immediately after the kernel map has been sized, and is
the optimal place for the autosizing of memory allocations
which occur within the kernel map to occur.
Suggested by: bde
- Use atomic ops to update the bigpipe count
- Make the bigpipe count sysctl readable
- Remove a duplicate comparison in an if statement
- Comment two SYSCTLs.
than the shortcircuited version I had been using, which only worked
properly on i386 & amd64.
Also, change an autoscale constant to account for the more correct
kmem_map size.
Problem noticed by: mux
- Limit the total number of pipes so that we do not
exhaust all vm objects in the kernel map. When
this limit is reached, a ratelimited message will
be printed to the console.
- Put a soft limit on the amount of memory consumable
by pipes. Once the limit has been reached, all new
pipes will be limited to 4K in size, rather than the
default of 16K.
- Put a limit on the number of pages that may be used
for high speed page flipping in order to reduce the
amount of wired memory. Pipe writes that occur
while this limit is exceeded will fall back to
non-page flipping mode.
The above values are auto-tuned in subr_param.c and
are scaled to take into account both the size of
physical memory and the size of the kernel map.
These limits help to reduce the "kernel resources exhausted"
panics that could be caused by opening a large
number of pipes. (Pipes alone are no longer able
to exhaust all resources, but other kernel memory hogs
in league with pipes may still be able to do so.)
PR: 53627
Ideas / comments from: hsu, tjr, dillon@apollo.backplane.com
MFC after: 1 week
notice another typo in the same line. This typo makes libthr unuseable,
but it's effects where counter-balanced by the extra semicolon, which
made libthr remarkably useable for the past several months.
- Associate logical CPUs on the same physical core with the same kseq.
- Adjust code that assumed there would only be one running thread in any
kseq.
- Wrap the HTT code with a ULE_HTT_EXPERIMENTAL ifdef. This is a start
towards HyperThreading support but it isn't quite there yet.
as the target process' pid, it may exist if the process forked before leaving
the pgrp.
Thix fixes a panic that happens when calling setpgid to make a process
re-enter the pgrp with the same pgid as its pid if the pgrp still exists.
be delivered to that thread, regardless of whether it
has it masked or not.
Previously, if the targeted thread had the signal masked,
it would be put on the processes' siglist. If
another thread has the signal umasked or unmasks it before
the target, then the thread it was intended for would never
receive it.
This patch attempts to solve the problem by requiring callers
of tdsignal() to say whether the signal is for the thread or
for the process. If it is for the process, then normal processing
occurs and any thread that has it unmasked can receive it.
But if it is destined for a specific thread, it is put on
that thread's pending list regardless of whether it is currently
masked or not.
The new behaviour still needs more work, though. If the signal
is reposted for some reason it is always posted back to the
thread that handled it because the information regarding the
target of the signal has been lost by then.
Reviewed by: jdp, jeff, bde (style)
locks held by each thread.
- Fix a bug in the original BSD/OS code where a contested lock was not
properly handed off from the old thread to the new thread when a
contested lock with more than one blocked thread was transferred from
one thread to another.
- Don't use an atomic operation to write the MTX_CONTESTED value to
mtx_lock in the aforementioned special case. The memory barriers and
exclusion provided by sched_lock are sufficient.
Spotted by: alc (2)
system by specifying the file system ID instead of a path. Use this
by default in umount(8). This avoids the need to perform any vnode
operations to look up the mount point, so it makes it possible to
unmount a file system whose root vnode cannot be looked up (e.g.
due to a dead NFS server, or a file system that has become detached
from the hierarchy because an underlying file system was unmounted).
It also provides an unambiguous way to specify which file system is
to be unmunted.
Since the ability to unmount using a path name is retained only for
compatibility, that case now just uses a simple string comparison
of the supplied path against f_mntonname of each mounted file system.
Discussed on: freebsd-arch
mdoc help from: ru
happens to work on 32-bit platforms as sizeof(long)=sizeof(int), but
wrecks all kinds of havoc (garbage reads, corrupting writes and
misaligned loads/stores) on 64-bit architectures.
The fix for now is to use fuword32() and suword32() and change the
type of the applicable int fields to int32. This is to make it
explicit that we depend on these fields being 32-bit. We may want
to revisit this later.
Reviewed by: deischen
or unblock a thread in kernel, and allow UTS to specify whether syscall
should be restarted.
o Add ability for UTS to monitor signal comes in and removed from process,
the flag PS_SIGEVENT is used to indicate the events.
o Add a KMF_WAITSIGEVENT for KSE mailbox flag, UTS call kse_release with
this flag set to wait for above signal event.
o For SA based thread, kernel masks all signal in its signal mask, let
UTS to use kse_thr_interrupt interrupt a thread, and install a signal
frame in userland for the thread.
o Add a tm_syncsig in thread mailbox, when a hardware trap occurs,
it is used to deliver synchronous signal to userland, and upcall
is schedule, so UTS can process the synchronous signal for the thread.
Reviewed by: julian (mentor)
POSIX says siginfo pointer parameter can be NULL and if the
function success, it should return signal number but not zero.
The waitset it past should be negatived before it can be
used as thread signal mask.
nice distribution without significantly impacting interactive response.
As a side effect it should also allow batch processes to run for a
slightly longer period which will positively impact their performance.
This commit has two pieces. One half is the watchdog kernel code which lives
primarily in hardclock() in sys/kern/kern_clock.c. The other half is a userland
daemon which, when run, will keep the watchdog from firing while the userland
is intact and functioning.
Approved by: jeff (mentor)
Before, we would add/subtract the leap second when the system had been
up for an even multiple of days, rather than at the end of the day, as
a leap second is defined (at least wrt ntp). We do this by
calculating the notion of UTC earlier in the loop, and passing that to
get it adjusted. Any adjustments that ntp_update_second makes to this
time are then transferred to boot time. We can't pass it either the
boot time or the uptime because their sum is what determines when a
leap second is needed. This code adds an extra assignment and two
extra compare in the typical case, which is as cheap as I could made
it.
I have confirmed with this code the kernel time does the correct thing
for both positive and negative leap seconds. Since the ntp interface
doesn't allow for +2 or -2, those cases can't be tested (and the folks
in the know here say there will never be a +2s or -2s leap event, but
rather two +1s or -1s leap events).
There will very likely be no leap seconds for a while, given how the
earth is speeding up and slowing down, so there will be plenty of time
for this fix to propigate. UT1-UTC is currently at "about -0.4s" and
decrementing by .1s every 8 months or so. 6 * 8 is 48 months, or 4
years.
-stable has different code, but a similar bug that was introduced
about the time of the last leap second, which is why nobody has
noticed until now.
MFC After: 3 weeks
Reviewed by: phk
"Furthermore, leap seconds must die." -- Cato the Elder
incremented at the start of the leap second, not after the leap second
has been inserted. This is because at the start of the leap second,
we set the time back one second. This setting back one second is the
moment that the offset changes. The old code set it back after the
leap second, but that's one second too late. The negative leap second
case is handled correctly.
Reviewed by: phk
the MAC policy modules to improve robustness against C string
bugs and vulnerabilities. Following these revisions, all
string construction of labels for export to userspace (or
elsewhere) is performed using the sbuf API, which prevents
the consumer from having to perform laborious and intricate
pointer and buffer checks. This substantially simplifies
the externalization logic, both at the MAC Framework level,
and in individual policies; this becomes especially useful
when policies export more complex label data, such as with
compartments in Biba and MLS.
Bundled in here are some other minor fixes associated with
externalization: including avoiding malloc while holding the
process mutex in mac_lomac, and hence avoid a failure mode
when printing labels during a downgrade operation due to
the removal of the M_NOWAIT case.
This has been running in the MAC development tree for about
three weeks without problems.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
attributes from objects over vop_setextattr() with a NULL uio; if
the file system doesn't support the vop_rmextattr() method, fall
back to the vop_setextattr() method.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
interface, rather than relying on a NULL uio for the deletion
operation.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
specify what credential to use when authorizing vn_open() and later
write operations, rather than curthread->td_ucred.
When writing KTR traces to an ALQ, specify the credential of the thread
generating the sysctl request.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
Several of the subtypes have an associated vnode which is used for
stuff like the f*() functions.
By giving the vnode a speparate field, a number of checks for the specific
subtype can be replaced simply with a check for f_vnode != NULL, and
we can later free f_data up to subtype specific use.
At this point in time, f_data still points to the vnode, so any code I
might have overlooked will still work.
console, even if there is a TIOCCONS console tty. We were already
doing this after a panic, but it's also useful when entering DDB
for some other reason too.
TIOCCONS console (e.g. xconsole) via a timeout routine instead of
calling into the tty code directly from printf(). This fixes a
number of cases where calling printf() at the wrong time (such as
with locks held) would cause a panic if xconsole is running.
The TIOCCONS message buffer is 8k in size by default, but this can
be changed with the kern.consmsgbuf_size sysctl. By default, messages
are checked for 5 times per second. The timer runs and the buffer
memory remains allocated only at times when a TIOCCONS console is
active.
Discussed on: freebsd-arch
with a new implementation that has a mostly reentrant "addchar"
routine, supports multiple message buffers in the kernel, and hides
the implementation details from callers.
The new code uses a kind of sequence number to represend the current
read and write positions in the buffer. This approach (suggested
mainly by bde) permits the read and write pointers to be maintained
separately, which reduces the number of atomic operations that are
required. The "mostly reentrant" above refers to the way that while
it is now always safe to have any number of concurrent writers,
readers could see the message buffer after a writer has advanced
the pointers but before it has witten the new character.
Discussed on: freebsd-arch
causing poor interactive performance while unnice processes were running.
The new scheme still allows nice to have an effect on priority but it is
not as dramatic as the effect of the interactivity score.
before calling it for bound thread. To avoid this problem, change
thread_schedule_upcall to not put new thread on run queue, let caller
do it, so we can tweak the new thread before setting it to run.
Reported by: pho
threads in the process have already masked the signal, so job control
is delayed. But later a thread unmasking the STOP signal should enable
job control, so in issignal(), scanning all threads in process to see
if we can direct suspend some of them, not just suspend current thread.
we can deadlock because of lock order reversals. This was not
caught because Witness ignores pool mutexes right now.
Diagnosis and help: truckman
Noticed by: pho
"maxproc limit exceeded by uid %i, please see tuning(7) and login.conf(5)."
Which will be triggered whenever a user hits his/her maxproc limit or
the systemwide maxproc limit is reached.
MFC after: 1 week
mutexes are supposed to only be used as leaf mutexes, and what appear
to be separate pool mutexes could be aliased together, it is bad idea
for a thread to attempt to hold two pool mutexes at the same time.
Slightly rearrange the code in kern_open() so that FILE_UNLOCK() is
called before calling VOP_GETVOBJECT(), which will grab the v_vnlock
mutex.
systems to fail more gracefully when a file descriptor exhaustion situation
occurs.
Original patch by: David G. Andersen <dga@lcs.mit.edu>
PR: 45353
MFC after: 1 week
because the run time exceeds the largest value a signed int can hold.
The real solution involves calculating how far we are over the limit.
To quickly solve this problem we loop removing 1/5th of the current value
until it falls below the limit. The common case requires no passes.
and run time.
- Scale the sleep and run time back via sched_interact_update() in more
places. This is to keep the statistic more accurate.
- Charge a parent one tick for forking a child.
- Add only the run time and not the sleep time to the parents kg when a
thread exits. This allows us to give a penalty for having an expensive
thread exit but does not give a bonus for having an interactive thread
exit.
- Change the SLP_RUN_THROTTLE to limit us to 4/5th and not 1/2.
- Change the SLP_RUN_MAX to two seconds. This keeps bursty interactive
applications like mozilla and openoffice in the interactive range even
through expensive tasks.
- Recalculate the slice after every sleep. This ensures that once a task
has been marked interactive it only has a slice of 1 at the risk of
giving tasks that sleep for a very brief period a longer time slice.
schedules an upcall. Signal delivering to a bound thread is same as
non-threaded process. This is intended to be used by libpthread to
implement PTHREAD_SCOPE_SYSTEM thread.
2. Simplify kse_release() a bit, remove sleep loop.
panics. Before revision 1.38, we used to just point panicstr at the
format string if panicstr was NULL, but since we now use a static
buffer for the formatted panic message, we have to be careful to
only write to it during the first panic.
Pointed out by: bde
which meant no process would run for longer than 20ms.
- Slightly redo the interactivity scorer. It follows the same algorithm but
in a slightly more correct way. Previously values above half were
incorrect.
- Lower the interactivity threshold to 20. It seems that in testing non-
interactive tasks are hardly ever near there and expensive interactive
tasks can sometimes surpass it. This area needs more testing.
- Remove an unnecessary KTR.
- Fix a case where an idle thread that had an elevated priority due to
priority prop. would be placed back on the idle queue.
- Delay setting NEEDRESCHED until userret() for threads that haad their
priority elevated while in kernel. This gives us the same context switch
optimization as SCHED_4BSD.
- Limit the child's slice to 1 in sched_fork_kse() so we detect its behavior
more quickly.
- Inhert some of the run/slp time from the child in sched_exit_ksegrp().
- Redo some of the priority comparisons so they are more clear.
- Throttle the frequency of sched_pctcpu_update() so that rounding errors
do not make it invalid.
to the machine-independent parts of the VM. At the same time, this
introduces vm object locking for the non-i386 platforms.
Two details:
1. KSTACK_GUARD has been removed in favor of KSTACK_GUARD_PAGES. The
different machine-dependent implementations used various combinations
of KSTACK_GUARD and KSTACK_GUARD_PAGES. To disable guard page, set
KSTACK_GUARD_PAGES to 0.
2. Remove the (unnecessary) clearing of PG_ZERO in vm_thread_new. In
5.x, (but not 4.x,) PG_ZERO can only be set if VM_ALLOC_ZERO is passed
to vm_page_alloc() or vm_page_grab().
small but noticeable increase in performance for name lookup operations.
The code uses two zones, one for short names (less than 32 characters)
and one for long names (up to NAME_MAX). Since most file names are
fairly short, this saves a considerable amount of space that would
otherwise be wasted if we always allocated NAME_MAX bytes. The cutoff
value of 32 characters was picked arbitrarily and may benefit from some
tweaking; it could also be made into a tunable.
Submitted by: hmp
too small panics on PAE machines which have odd > 4GB sizes (4.5 gig
would render a 20MB of KVA for kmem_map instead of 200MB).
Submitted by: John Cagle <john.cagle@hp.com>, jeff
Reviewed by: jeff, peter, scottl, lots of USENIX folks
is currently executing when we try to remove it in exit1(). Without this,
it was possible for the callout to bogusly rearm itself and eventually
refire after the process had been free'd resulting in a panic.
PR: kern/51964
Reported by: Jilles Tjoelker <jilles@stack.nl>
Reviewed by: tegge, bde
curthread. Unlike td_flags, this field does not need any locking.
- Replace the td_inktr and td_inktrace variables with equivalent private
thread flags.
- Move TDF_OLDMASK over to the private flags field so it no longer requires
sched_lock.
second and equalizing the load between the two most imbalanced CPU. This
is intended to clear up long term load imbalances that would not be handled
by the 'pull' method in sched_choose().
- Pull out some bits of sched_choose() into a kseq_move() function that moves
an arbitrary thread from one kseq to another.
adding it to the nice tables. Therefore, in kseq_add_nice, we should
keep in mind that the load will be 1 if we are the only thread, and not
0.
- Assert that the sched lock is held in all the appropriate places.
- Increase the scope of the sched lock in sched_pctcpu_update().
- Hold the sched lock in sched_runnable(). It is not held by the caller.
"", temporarily map it to a call to extattr_list_vp() to provide
compatibility for older applications using the "" API to retrieve
EA lists.
Use VOP_LISTEXTATTR() to support extattr_list_vp() rather than
VOP_GETEXTATTR(..., "", ...).
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Asssociates Laboratories
specific attribute name. It will have the same semantics as the
older vop_getextattr() "retrieve the names" hack, returning
a buffer with ASCII nul-seperated names.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
we were passing in a void* representing the PCB of the parent thread.
Now we pass a pointer to the parent thread itself.
The prime reason for this change is to allow cpu_set_upcall() to copy
(parts of) the trapframe instead of having it done in MI code in each
caller of cpu_set_upcall(). Copying the trapframe cannot always be
done with a simply bcopy() or may not always be optimal that way. On
ia64 specifically the trapframe contains information that is specific
to an entry into the kernel and can only be used by the corresponding
exit from the kernel. A trapframe copied verbatim from another frame
is in most cases useless without some additional normalization.
Note that this change removes the assignment to td->td_frame in some
implementations of cpu_set_upcall(). The assignment is redundant.
A previous call to cpu_thread_setup() already did the exact same
assignment. An added benefit of removing the redundant assignment is
that we can now change td_pcb without nasty side-effects.
This change officially marks the ability on ia64 for 1:1 threading.
Not tested on: amd64, powerpc
Compile & boot tested on: alpha, sparc64
Functionally tested on: i386, ia64
extattr_list_link() system calls, which return a least of extended
attributes defined for a vnode referenced by a file descriptor
or path name. Currently, we just invoke VOP_GETEXTATTR() since
it will convert a request for an empty name into a query for a
name list, which was the old (more hackish) API. At some point
in the near future, we'll push the distinction between get and
list down to the vnode operation layer, but this provides access
to the new API for applications in the short term.
Pointed out by: Dominic Giampaolo <dbg@apple.com>
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
file/directory/link, rather than using a less explicit hack on
the extattr retrieval API:
extattr_list_fd()
extattr_list_file()
extattr_list_link()
The existing API was counter-intuitive, and poorly documented.
The prototypes for these system calls are identical to
extattr_get_*(), but without a specific attribute name to
leave NULL.
Pointed out by: Dominic Giampaolo <dbg@apple.com>
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
Don't copyin() data we are about to overwrite.
Add a flag to tell userland that KSE is officially "DONE" with the
mailbox and has gone away.
Obtained from: davidxu@
we failed to put the bucket back into the general cache/container.
Also, fix a bad assumption. There was a KASSERT() that aimed to
guarantee that whenever the pcpu container's mc_starved was > 0,
that whatever the bucket we were freeing to was an empty bucket,
assuming it belonged to the pcpu container cache. However, there
is at least one case where this is not true anymore; consider:
1) All containers empty, next thread to try to alloc will touch
a pcpu container, notice it's empty, and increment the pcpu
container's mc_starved.
2) Some other thread frees an mbuf belonging to a bucket in
the general cache/container. Then it frees another mbuf
belonging to the same bucket (still in gen container).
3) Some third thread tries to allocate an mbuf from the pcpu
container and, since empty, grabs one mbuf now available
in the general cache and moves the non-empty bucket from
which it took 1 mbuf and to which the thread in (2) freed
to, and moves it to the pcpu container.
4) A final thread tries to free an mbuf belonging to the
NON-EMPTY bucket mentionned in (2) and (3) and, since
the pcpu container's mc_starved is > 0, but the bucket
is obviously non-empty, it trips on the KASSERT.
This meant that one could potentially get a panic in some
cases when out of mbufs and clusters. The problem could
be mitigated by commenting out some cv_signal() calls,
but I'm assuming that was pure coincidence and this is
the correct fix.
- Use a hash of umtx queues to queue blocked threads. We hash on pid and the
virtual address of the umtx structure. This eliminates cases where we
previously held a lock across a casuptr call.
Reviwed by: jhb (quickly)
the lameness of the kstack code. The EPC overhaul de-lame-ified the
kstack code by removing the need for contigmalloc(). We can now
allocate stacks using malloc(). We probably want to make the stacks
swappable as well so that we can make it MI. But that's another story.
why certain exceptions are made, note an inconsistency between
FreeBSD and some other implementations regarding IPC_M, and let
suser() generate our EPERM rather than forcing it ourselves.
Remove a carriage return that crept in in the last commit.
Reviewed by: gordon
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
the stack to be changed in a way incompatible with elf32_map_insert()
where we used data_buf without initializing it for when the partial
mapping resulting in a misaligned image (typical when the page size
implied by the image is not the same as the page size in use by the
kernel). Since data_buf is passed by reference to vm_map_find(), the
compiler cannot warn about it.
While here, move all local variables to the top of the function.
in the kernel, the sysctl_register() call would fail, as expected.
However, when unloading this module again, the kernel would then panic
in sysctl_unregister(). Print a message error instead.
Submitted by: Nicolai Petri <nicolai@catpipe.net>
Reviewed by: imp
Approved by: re@ (jhb)
buf_start() to avoid triggering a panic in softdep_disk_io_initiation()
if b_iocmd happened to be BIO_READ. The later initialisation of
b_iocmd in cluster_wbuild() could probably be moved to before the
buf_start() call, but this patch keeps the change as simple as
possible.
This is reported to fix occasional "softdep_disk_io_initiation: read"
panics, especially on NFS servers.
Reported by: Nick Hilliard <nick@netability.ie>
Tested by: Nick Hilliard <nick@netability.ie>
Approved by: re (rwatson)
because we could fail due to a small buffer and loop and rerun. If this
happens, then the vsnprintf() will have already taken the arguments off
the va_list. For i386 and others, this doesn't matter because the
va_list type is a passed as a copy. But on powerpc and amd64, this is
fatal because the va_list is a reference to an external structure that
keeps the vararg state due to the more complicated argument passing system.
On amd64, arguments can be passed as follows:
First 6 int/pointer type arguments go in registers, the rest go on
the memory stack.
Float and double are similar, except using SSE registers.
long double (80 bit precision) are similar except using the x87 stack.
Where the 'next argument' comes from depends on how many have been
processed so far and what type it is. For amd64, gcc keeps this state
somewhere that is referenced by the va_list.
I found a description that showed the va_copy was required here:
http://mirrors.ccs.neu.edu/cgi-bin/unixhelp/man-cgi?va_end+9
The single unix spec doesn't mention va_copy() at all.
Anyway, the problem was that the sysctl kern.geom.conf* nodes would panic
due to walking off the end of the va_arg lists in vsnprintf. A better fix
would be to have sbuf_vprintf() use a single pass and call kvprintf()
with a callback function that stored the results and grew the buffer
as needed.
Approved by: re (scottl)
size and the kernel's heap size, specifically, vm_kmem_size. This
function allows a maximum of 40% of the vm_kmem_size to be used for
vnodes and vm objects. This is a conservative bound based upon recent
problem reports. (In other words, a slight increase in this percentage
may be safe.)
Finally, machines with less than ~3GB of RAM should be unaffected
by this change, i.e., the maximum number of vnodes should remain
the same. If necessary, machines with 3GB or more of RAM can increase
the maximum number of vnodes by increasing vm_kmem_size.
Desired by: scottl
Tested by: jake
Approved by: re (rwatson,scottl)
keep the thread state variable consistent with its real state.
i.e. Don't say it's on the run queue when it isn't.
Also clarify the associated comment.
Turns a double panic back to a single panic :-/
Approved by: re@ (jhb)
prime objectives are:
o Implement a syscall path based on the epc inststruction (see
sys/ia64/ia64/syscall.s).
o Revisit the places were we need to save and restore registers
and define those contexts in terms of the register sets (see
sys/ia64/include/_regset.h).
Secundairy objectives:
o Remove the requirement to use contigmalloc for kernel stacks.
o Better handling of the high FP registers for SMP systems.
o Switch to the new cpu_switch() and cpu_throw() semantics.
o Add a good unwinder to reconstruct contexts for the rare
cases we need to (see sys/contrib/ia64/libuwx)
Many files are affected by this change. Functionally it boils
down to:
o The EPC syscall doesn't preserve registers it does not need
to preserve and places the arguments differently on the stack.
This affects libc and truss.
o The address of the kernel page directory (kptdir) had to
be unstaticized for use by the nested TLB fault handler.
The name has been changed to ia64_kptdir to avoid conflicts.
The renaming affects libkvm.
o The trapframe only contains the special registers and the
scratch registers. For syscalls using the EPC syscall path
no scratch registers are saved. This affects all places where
the trapframe is accessed. Most notably the unaligned access
handler, the signal delivery code and the debugger.
o Context switching only partly saves the special registers
and the preserved registers. This affects cpu_switch() and
triggered the move to the new semantics, which additionally
affects cpu_throw().
o The high FP registers are either in the PCB or on some
CPU. context switching for them is done lazily. This affects
trap().
o The mcontext has room for all registers, but not all of them
have to be defined in all cases. This mostly affects signal
delivery code now. The *context syscalls are as of yet still
unimplemented.
Many details went into the removal of the requirement to use
contigmalloc for kernel stacks. The details are mostly CPU
specific and limited to exception_save() and exception_restore().
The few places where we create, destroy or switch stacks were
mostly simplified by not having to construct physical addresses
and additionally saving the virtual addresses for later use.
Besides more efficient context saving and restoring, which of
course yields a noticable speedup, this also fixes the dreaded
SMP bootup problem as a side-effect. The details of which are
still not fully understood.
This change includes all the necessary backward compatibility
code to have it handle older userland binaries that use the
break instruction for syscalls. Support for break-based syscalls
has been pessimized in favor of a clean implementation. Due to
the overall better performance of the kernel, this will still
be notived as an improvement if it's noticed at all.
Approved by: re@ (jhb)
the vnode and restart the loop. Vflush() is vulnerable since it does not
hold a reference to the vnode and it holds no other locks while waiting
for the vnode lock. The vnode will no longer be on the list when the
loop is restarted.
Approved by: re (rwatson)
PT_DETACH ptrace(2) requests from functioning as advertised in the
manual page. As described in kern/35175, the PT_DETACH request will,
under certain circumstances, pass an unwanted signal on to the traced
process upan detaching from it. The PT_CONTINUE request will
sometimes fail if you make it pass a signal that has "properties" that
differ from the properties of the signal that origionally caused the
traced process to be stopped. Since PT_KILL is nothing than
PT_CONTINUE with SIGKILL, it is broken too. In the PT_KILL case, this
leads to an unkillable process.
PR: 44011
Submitted by: Mark Kettenis <kettenis@chello.nl>
Approved by: re(jhb)
netstat(1) not display it for now because its effects are not yet
completely implemented and we're about to cut 5.2-RELEASE.
This is temporary.
Approved by: re (scottl, rwatson)
- Move struct sigacts out of the u-area and malloc() it using the
M_SUBPROC malloc bucket.
- Add a small sigacts_*() API for managing sigacts structures: sigacts_alloc(),
sigacts_free(), sigacts_copy(), sigacts_share(), and sigacts_shared().
- Remove the p_sigignore, p_sigacts, and p_sigcatch macros.
- Add a mutex to struct sigacts that protects all the members of the struct.
- Add sigacts locking.
- Remove Giant from nosys(), kill(), killpg(), and kern_sigaction() now
that sigacts is locked.
- Several in-kernel functions such as psignal(), tdsignal(), trapsignal(),
and thread_stopped() are now MP safe.
Reviewed by: arch@
Approved by: re (rwatson)
don't add the current time to it, but leave it as clear so that when the
timer is disabled, the it_value is always clear.
Reviewed by: bde
Approved by: re (rwatson)
desired buffer is found at one of the roots more than 60% of the time.
Thus, checking both roots before performing either splay eliminates
unnecessary splays on the first tree splayed.
Approved by: re (jhb)
synchronization primitives from inside DDB is generally a bad idea,
and in this case it frequently results in panics due to DDB commands
being executed from the sio fast interrupt context on a serial
console. Replace the locking with a note that a lack of locking
means that DDB may get see inconsistent views of the mount and vnode
lists, which could also result in a panic. More frequently,
though, this avoids a panic than causes it.
Discussed with ages ago: bde
Approved by: re (scottl)
reason for the duplication was that m_freem() was meant to eventually
be optimized to hold the lock of the cache being freed to as long as
possible across frees but the difficulty of implementing said
optimization right now is too high, given that in some cases (see MAC
and non-cluster external buffers), we need to call into other subsytems,
something not permissible when the cache lock is held.
This change minimizes code duplication while keeping at least the
atomic mbuf+cluster free optimization.
Suggested by: luigi
constants in question refer to the number of label slots, not the
maximum number of policies that may be loaded. This should reduce
confusion regarding an element in the MAC sysctl MIB, as well as
make it more clear what the affect of changing the compile-time
constants is.
Approved by: re (jhb)
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
(1) Accept that we're now going to use mutexes, so don't attempt
to avoid treating them as mutexes. This cleans up locking
accessor function names some.
(2) Rename variables to _mtx, _cv, _count, simplifying the naming.
(3) Add a new form of the _busy() primitive that conditionally
makes the list busy: if there are entries on the list, bump
the busy count. If there are no entries, don't bump the busy
count. Return a boolean indicating whether or not the busy
count was bumped.
(4) Break mac_policy_list into two lists: one with the same name
holding dynamic policies, and a new list, mac_static_policy_list,
which holds policies loaded before mac_late and without the
unload flag set. The static list may be accessed without
holding the busy count, since it can't change at run-time.
(5) In general, prefer making the list busy conditionally, meaning
we pay only one mutex lock per entry point if all modules are
on the static list, rather than two (since we don't have to
lower the busy count when we're done with the framework). For
systems running just Biba or MLS, this will halve the mutex
accesses in the network stack, and may offer a substantial
performance benefits.
(6) Lay the groundwork for a dynamic-free kernel option which
eliminates all locking associated with dynamically loaded or
unloaded policies, for pre-configured systems requiring
maximum performance but less run-time flexibility.
These changes have been running for a few weeks on MAC development
branch systems.
Approved by: re (jhb)
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
does the copyin stuff and then calls the second part kern_sendit to do
the hard work. Don't bother holding Giant during the copyin phase.
The intent of this is to allow the Linux emulator to impliment send*
syscalls without using the stackgap.
argument to the functions shm{at,ctl}1 and shm_find_segment_by_shmid{x}.
The BSD semantics didn't allow the usage of shared segment after
being marked for removal through IPC_RMID.
The patch involves the following functions:
- shmat
- shmctl
- shm_find_segment_by_shmid
- shm_find_segment_by_shmidx
- linux_shmat
- linux_shmctl
Submitted by: Orlando Bassotto <orlando.bassotto@ieo-research.it>
Reviewed by: marcel
trustworthy for vnode-backed objects.
- Restore the old behavior of vm_object_page_remove() when the end
of the given range is zero. Add a comment to vm_object_page_remove()
regarding this behavior.
Reported by: iedowse
double free of a mbuf occurs and cause an immediate panic, rather
than allowing free list corruption to occur.
This code is trapped under INVARIANTS, so it should not cause any
change in default performance.
Reviewed by: a bunch of people on -net
MFC after: 1 week
fini routines instead of in fork() and wait(). This has the nice side
benefit that the proc lock of any process on the allproc list is always
valid and sched_lock doesn't have to be used to test against PRS_NEW
anymore.
uptime. Where necessary, convert it back to Unix time by adding boottime
to it. This fixes a potential problem in the accounting code, which would
compute the elapsed time incorrectly if the Unix time was stepped during
the lifetime of the process.