used for up to "vfs.aio.max_buf_aio" of the requests. If a request
size is MAXPHYS, but the request base isn't page aligned, vmapbuf()
will map the end of the user space buffer into the start of the kva
allocated for the next physical buffer. Don't use a physical buffer
in this case. (This change addresses problem report 25617.)
When an aio_read/write() on a raw device has completed, timeout() is
used to schedule a signal to the process. Thus, the reporting is
delayed up to 10 ms (assuming hz is 100). The process might have
terminated in the meantime, causing a trap 12 when attempting to
deliver the signal. Thus, the timeout must be cancelled when removing
the job.
aio jobs in state JOBST_JOBQGLOBAL should be removed from the
kaio_jobqueue list during process rundown.
During process rundown, some aio jobs might move from one list to a
different list that has already been "emptied", causing the rundown to
be incomplete. Retry the rundown.
A call to BUF_KERNPROC() is needed after obtaining a physical buffer
to disassociate the lock from the running process since it can return
to userland without releasing that lock.
PR: 25617
Submitted by: tegge
if we hold a spin mutex, since we can trivially get into deadlocks if we
start switching out of processes that hold spinlocks. Checking to see if
interrupts were disabled was a sort of cheap way of doing this since most
of the time interrupts were only disabled when holding a spin lock. At
least on the i386. To fix this properly, use a per-process counter
p_spinlocks that counts the number of spin locks currently held, and
instead of checking to see if interrupts are disabled in the witness code,
check to see if we hold any spin locks. Since child processes always
start up with the sched lock magically held in fork_exit(), we initialize
p_spinlocks to 1 for child processes. Note that proc0 doesn't go through
fork_exit(), so it starts with no spin locks held.
Consulting from: cp
into an interruptable sleep and we increment a sleep count, we make sure
that we are the thread that will decrement the count when we wakeup.
Otherwise, what happens is that if we get interrupted (signal) and we
have to wake up, but before we get our mutex, some thread that wants
to wake us up detects that the count is non-zero and so enters wakeup_one(),
but there's nothing on the sleep queue and so we don't get woken up. The
thread will still decrement the sleep count, which is bad because we will
also decrement it again later (as we got interrupted) and are already off
the sleep queue.
more robust. They would correctly return ENOMEM for the first time when
the buffer was exhausted, but subsequent calls in this case could cause
writes ouside of the buffer bounds.
Approved by: rwatson
structure rather than assuming that the device vnode would reside
in the FFS filesystem (which is obviously a broken assumption with
the device filesystem).
in the hopes that they will actually *read* the comment above
it and *follow* the instructions so as to cause all the rest
of us less a lot less grief.
- Don't try to grab Giant before postsig() in userret() as it is no longer
needed.
- Don't grab Giant before psignal() in ast() but get the proc lock instead.
Giant. The only exception is the CANSIGNAL() macro. Unlocking the proc
lock around sendsig() in trapsignal() is also questionable. Note that
the functions sigexit(), psignal(), and issignal() must be called with
the proc lock of the process in question held. postsig() and
trapsignal() should not be called with the proc lock held, but they
also do not require Giant anymore either.
- Remove spl's that are now no longer needed as they are fully replaced.
don't end up back at ourselves which would indicate deadlock.
- Add the proc lock to the witness dup_list as we may hold more than one
process lock at a time.
- Don't assert a mutex is owned in _mtx_unlock_sleep() as that is too late.
We do the checks in the macros instead.
mutex operations in kthread_create().
- Lock a kthread's proc before changing its parent via proc_reparent().
- Test P_KTHREAD not P_SYSTEM in kthread_suspend() and kthread_resume().
P_SYSTEM just means that the process shouldn't be swapped and is used
for vinum's daemon for example.
- Lock all the signal state used for suspending and resuming kthreads with
the proc lock.
- Add proc locking to fork1(). Always lock the child procoess (new
process) first when both processes need to be locked at the same
time.
- Remove unneeded spl()'s as the data they protected is now locked.
- Ensure that the proctree is exclusively locked and the new process is
locked when setting up the parent process pointer.
- Lock the check for P_KTHREAD in p_flag in fork_exit().
possible for us to see a process in the early stages of fork before p_fd
has been initialized. Ideally, we wouldn't stick a process on the allproc
list until it was fully created however.
than dinking around in the process lists explicitly.
- Hold both the proctree lock and proc lock of the child process when
reparenting a process via proc_reparent.
- Lock processes while sending them signals.
- Miscellaenous proc locking.
- proc_reparent() now asserts that the child is locked in addition to an
exclusive proctree lock.
INVARIANTS case, define the actual KASSERT() in _SX_ASSERT_[SX]LOCKED
macros that are used in the sx code itself and convert the
SX_ASSERT_[SX]LOCKED macros to simple wrappers that grab the mutex for the
duration of the check.
support implementations of ACLs in file systems. Introduce the
following new functions:
vaccess_acl_posix1e() vaccess() that accepts an ACL
acl_posix1e_mode_to_perm() Convert mode bits to ACL rights
acl_posix1e_mode_to_entry() Build ACL entry from mode/uid/gid
acl_posix1e_perms_to_mode() Generate file mode from ACL
acl_posix1e_check() Syntax verification for ACL
These functions allow a file system to rely on central ACL evaluation
and syntax checking, as well as providing useful utilities to
allow ACL-based file systems to generate mode/owner/etc information
to return via VOP_GETATTR(), and to support file systems that split
their ACL information over their existing inode storage (mode, uid,
gid) and extended ACL into extended attributes (additional users,
groups, ACL mask).
o Add prototypes for exported functions to sys/acl.h, sys/vnode.h
Reviewed by: trustedbsd-discuss, freebsd-arch
Obtained from: TrustedBSD Project
but potentially significant in -4.x.)
Eliminate a pointless parameter to aio_fphysio().
Remove unnecessary casts from aio_fphysio() and aio_physwakeup().
- Add sx_xholder member to sx struct which is used for INVARIANTS-enabled
assertions. It indicates the thread that presently owns the xlock.
- Add some assertions to the sx lock code that will detect the fatal
API abuse:
xlock --> xlock
xlock --> slock
which now works thanks to sx_xholder.
Notice that the remaining two problematic cases:
slock --> xlock
slock --> slock (a little less problematic, but still recursion)
will need to be handled by witness eventually, as they are more
involved.
Reviewed by: jhb, jake, jasone
aiocb's allocated by zalloc(). In other words, zfree() was never
called. Now, we call zfree(). Why eliminate this micro-
optimization? At some later point, when we multithread the AIO
system, we would need a mutex to synchronize access to aio_freejobs,
making its use nearly indistinguishable in cost from zalloc() and
zfree().
Remove unnecessary fhold() and fdrop() calls from aio_qphysio(),
undo'ing a part of revision 1.86. The reference count on the file
structure is already incremented by _aio_aqueue() before it calls
aio_qphysio(). (Update the comments to document this fact.)
Remove unnecessary casts from _aio_aqueue(), aio_read(), aio_write()
and aio_waitcomplete().
Remove an unnecessary "return;" from aio_process().
Add "static" in various places.
related code from aio_read() and aio_write(). This field was
intended, but never used, to allow a mythical user-level library to
make an aio_read() or aio_write() behave like an ordinary read() or
write(), i.e., a blocking I/O operation.
- Add a KASSERT() to ensure an ithread has a backing kernel thread when we
schedule it.
- Don't attempt to preemptively switch to an ithread if p_stat of curproc
is not SRUN.
An initial tidyup of the mount() syscall and VFS mount code.
This code replaces the earlier work done by jlemon in an attempt to
make linux_mount() work.
* the guts of the mount work has been moved into vfs_mount().
* move `type', `path' and `flags' from being userland variables into being
kernel variables in vfs_mount(). `data' remains a pointer into
userspace.
* Attempt to verify the `type' and `path' strings passed to vfs_mount()
aren't too long.
* rework mount() and linux_mount() to take the userland parameters
(besides data, as mentioned) and pass kernel variables to vfs_mount().
(linux_mount() already did this, I've just tidied it up a little more.)
* remove the copyin*() stuff for `path'. `data' still requires copyin*()
since its a pointer into userland.
* set `mount->mnt_statf_mntonname' in vfs_mount() rather than in each
filesystem. This variable is generally initialised with `path', and
each filesystem can override it if they want to.
* NOTE: f_mntonname is intiailised with "/" in the case of a root mount.
`rootvnode' pointer, but vfs_syscalls.c's checkdirs() assumed that
it did. This bug reliably caused a panic at reboot time if any
filesystem had been mounted directly over /.
The checkdirs() function is called at mount time to find any process
fd_cdir or fd_rdir pointers referencing the covered mountpoint
vnode. It transfers these to point at the root of the new filesystem.
However, this process was not reversed at unmount time, so processes
with a cwd/root at a mount point would unexpectedly lose their
cwd/root following a mount-unmount cycle at that mountpoint.
This change should fix both of the above issues. Start_init() now
holds an extra vnode reference corresponding to `rootvnode', and
dounmount() releases this reference when the root filesystem is
unmounted just before reboot. Dounmount() now undoes the actions
taken by checkdirs() at mount time; any process cdir/rdir pointers
that reference the root vnode of the unmounted filesystem are
transferred to the now-uncovered vnode.
Reviewed by: bde, phk
hit on the client side and prevent the server side from retiring writes.
Pipeline operations turned off for all READs (no big loss since reads are
usually synchronous) and for NFS writes, and left on for the default bwrite().
(MFC expected prior to 4.3 freeze)
Testing by: mjacob, dillon
update native priority, it is diffcult to get right and likely
to end up horribly wrong. Use an honestly wrong fixed value
that seems to work; PUSER for user threads, and the interrupt
priority for ithreads. Set it once when the process is created
and forget about it.
Suggested by: bde
Pointy hat: me
process's priority go through the roof when it released a (contested)
mutex. Only set the native priority in mtx_lock if hasn't already
been set.
Reviewed by: jhb
to be more like Xint0x80_syscall and less like c function syscall().
- Reduce code duplication between the int0x80 and lcall handlers by
shuffling the elfags into the right place, saving the sizeof the
instruction in tf_err and jumping into the common int0x80 code.
Reviewed by: peter
passed in filename and line number in the KTR tracepoint message.
- Even though it is #if 0'd code, change the code to detect that a process
is an interrupt thread to check p->p_ithd against NULL rather than
checking non-existant process flags from BSD/OS.
- Use '%p' to print pointers in KTR log messages instead of assuming
sizeof(int) == sizeof(void *).
- Don't set p_mtxname to NULL when releasing a mutex. It doesn't hurt
to leave it set (we don't clear w_mesg for example) and at least at
one time in the past, there used to be race conditions in the kernel
that would result in setting this to NULL causing the kernel to
dereference NULL.
- Make the _mtx_assert() function be compiled in if INVARIANTS_SUPPORT is
defined rather than if INVARIANTS is defined so that a KLD compiled
with INVARIANTS that uses mtx_assert() can be used with a kernel that
just has INVARIANT_SUPPORT compiled in.
allow the watermark to be passed in via the data field during the EV_ADD
operation.
Hook this up to the socket read/write filters; if specified, it overrides
the so_{rcv|snd}.sb_lowat values in the filter.
Inspired by: "Ronald F. Guilmette" <rfg@monkeys.com>
the current socket error in fflags. This may be useful for determining
why a connect() request fails.
Inspired by: "Jonathan Graehl" <jonathan@graehl.org>
depend on this. The linux ABI emulator tries to use it for some linux
binaries too. VM86 had a bigger cost than this and it was made default
a while ago.
Reviewed by: jhb, imp
interrupts.
Protect usage of the per processor switchtime variable against
interrupts in calcru().
This seem to eliminate the "microuptime() went backwards" warnings.
the the original trapframe of the syscall, trap, or interrupt that entered
the kernel. Before SMPng, ast's were handled via a psuedo trap at the
end of doerti. With the SMPng commit, ast's were broken out into a
separate ast() function that was called from doreti to match the behavior
of other architectures. Unfortunately, when this was done, the
p_md.md_regs member of curproc was not updateda in ast(), thus when
signals are handled by userret() after an interrupt that returns to
userland, we end up using a stale trapframe that will result in the
registers from the old trapframe overwriting the real trapframe and
smashing all the registers right before we return to usermode. The saved
%cs:%eip from where we were in usermode are saved in the trapframe for
example.
- Don't use an atomic operation to update cnt.v_soft in ast(). This is
the only place the variable is written to, and sched_lock is always
held when it is written, so it is already protected and the mutex release
of sched_lock asserts a memory barrier that ensures the value will be
updated in a timely fashion.
- Don't hold sched_lock around addupc_task() as this apparently breaks
profiling badly due to sched_lock being held across copyin().
Reported by: bde (2)
an interrupt thread while the interrupt thread is blocked on Giant waiting
to execute the interrupt handler being removed. The result was that the
intrhand structure would be free'd, and we would call 0xdeadc0de. The work
around is to check to see if the interrupt thread is idle when removing a
handler. If not, then we mark the interrupt handler as being dead using
the new IH_DEAD flag and don't remove it from the interrupt threads' list
of handlers. When the interrupt thread resumes, it will see a dead handler
while traversing the list of handlers and will remove the handler then.
work because opt_preemption.h wasn't #include'd. Instead, make use of the
do_switch parameter to ithread_schedule() and do the check in the alpha
interrupt code.
credential structure, ucred (cr->cr_prison).
o Allow jail inheritence to be a function of credential inheritence.
o Abstract prison structure reference counting behind pr_hold() and
pr_free(), invoked by the similarly named credential reference
management functions, removing this code from per-ABI fork/exit code.
o Modify various jail() functions to use struct ucred arguments instead
of struct proc arguments.
o Introduce jailed() function to determine if a credential is jailed,
rather than directly checking pointers all over the place.
o Convert PRISON_CHECK() macro to prison_check() function.
o Move jail() function prototypes to jail.h.
o Emulate the P_JAILED flag in fill_kinfo_proc() and no longer set the
flag in the process flags field itself.
o Eliminate that "const" qualifier from suser/p_can/etc to reflect
mutex use.
Notes:
o Some further cleanup of the linux/jail code is still required.
o It's now possible to consider resolving some of the process vs
credential based permission checking confusion in the socket code.
o Mutex protection of struct prison is still not present, and is
required to protect the reference count plus some fields in the
structure.
Reviewed by: freebsd-arch
Obtained from: TrustedBSD Project
filename insteada of copying the first 32 characters of it.
- Add in const modifiers for the passed in format strings and filenames
and their respective members in the ktr_entry struct.
scheduling an interrupt thread to run when needed. This has the side
effect of enabling support for entropy gathering from interrupts on
all architectures.
- Change the software interrupt and x86 and alpha hardware interrupt code
to use ithread_schedule() for most of their processing when scheduling
an interrupt to run.
- Remove the pesky Warning message about interrupt threads having entropy
enabled. I'm not sure why I put that in there in the first place.
- Add more error checking for parameters and change some cases that
returned EINVAL to panic on failure instead via KASSERT().
- Instead of doing a documented evil hack of setting the P_NOLOAD flag
on every interrupt thread whose pri was SWI_CLOCK, set the flag
explicity for clk_ithd's proc during start_softintr().
- Add pager capability to the 'show ktr' command. It functions much like
'ps': Enter at the prompt displays one more entry, Space displays
another page, and any other key quits.
This is useful when doing copies of packet where some leading
space has been preallocated to insert protocol headers.
Note that there are in fact almost no users of m_copypacket.
MFC candidate.
in mi_switch() just before calling cpu_switch() so that the first switch
after a resched request will satisfy the request.
- While I'm at it, move a few things into mi_switch() and out of
cpu_switch(), specifically set the p_oncpu and p_lastcpu members of
proc in mi_switch(), and handle the sched_lock state change across a
context switch in mi_switch().
- Since cpu_switch() no longer handles the sched_lock state change, we
have to setup an initial state for sched_lock in fork_exit() before we
release it.
is sent to a process, psignal() needs to schedule an AST for the
process if the process is runnable, not just if it is current, so that
pending signals get checked for on the next return of the process to
user mode. This wasn't practical until recently because the AST flag
was per-cpu so setting it for a non-current process would usually just
cause a bogus AST for the current process.
For non-current processes looping in user mode, it took accidental
(?) magic to deliver signals at all. Signals were usually delivered
late as a side effect of rescheduling (need_resched() sets astpending,
etc.). In pre-SMPng, delivery was delayed by at most 1 quantum (the
need_resched() call in roundrobin() is certain to occur within 1
quantum for looping processes). In -current, things are complicated
by normal interrupt handlers being threads. Missing handling of the
complications makes roundrobin() a bogus no-op, but preemptive
scheduling sort of works anyway due to even larger bogons elsewhere.
always on curproc. This is needed to implement signal delivery properly
(see a future log message for kern_sig.c).
Debogotified the definition of aston(). aston() was defined in terms
of signotify() (perhaps because only the latter already operated on
a specified process), but aston() is the primitive.
Similar changes are needed in the ia64 versions of cpu.h and trap.c.
I didn't make them because the ia64 is missing the prerequisite changes
to make astpending and need_resched per-process and those changes are
too large to make without testing.
actually in the kernel. This structure is a different size than
what is currently in -CURRENT, but should hopefully be the last time
any application breakage is caused there. As soon as any major
inconveniences are removed, the definition of the in-kernel struct
ucred should be conditionalized upon defined(_KERNEL).
This also changes struct export_args to remove dependency on the
constantly-changing struct ucred, as well as limiting the bounds
of the size fields to the correct size. This means: a) mountd and
friends won't break all the time, b) mountd and friends won't crash
the kernel all the time if they don't know what they're doing wrt
actual struct export_args layout.
Reviewed by: bde