that works in the new threaded kernel. It was commented out of
the disksort routine earlier this year for the reasons given in
kern/subr_disklabel.c (which is where this code used to reside
before it moved to kern/subr_disk.c):
----------------------------
revision 1.65
date: 2002/04/22 06:53:20; author: phk; state: Exp; lines: +5 -0
Comment out Kirks io-request priority hack until we can do this in a
civilized way which doesn't cause grief.
The problem is that it is not generally safe to cast a "struct bio
*" to a "struct buf *". Things like ccd, vinum, ata-raid and GEOM
constructs bio's which are not entrails of a struct buf.
Also, curthread may or may not have anything to do with the I/O request
at hand.
The correct solution can either be to tag struct bio's with a
priority derived from the requesting threads nice and have disksort
act on this field, this wouldn't address the "silly-seek syndrome"
where two equal processes bang the diskheads from one edge to the
other of the disk repeatedly.
Alternatively, and probably better: a sleep should be introduced
either at the time the I/O is requested or at the time it is completed
where we can be sure to sleep in the right thread.
The sleep also needs to be in constant timeunits, 1/hz can be practicaly
any sub-second size, at high HZ the current code practically doesn't
do anything.
----------------------------
As suggested in this comment, it is no longer located in the disk sort
routine, but rather now resides in spec_strategy where the disk operations
are being queued by the thread that is associated with the process that
is really requesting the I/O. At that point, the disk queues are not
visible, so the I/O for positively niced processes is always slowed
down whether or not there is other activity on the disk.
On the issue of scaling HZ, I believe that the current scheme is
better than using a fixed quantum of time. As machines and I/O
subsystems get faster, the resolution on the clock also rises.
So, ten years from now we will be slowing things down for shorter
periods of time, but the proportional effect on the system will
be about the same as it is today. So, I view this as a feature
rather than a drawback. Hence this patch sticks with using HZ.
Sponsored by: DARPA & NAI Labs.
Reviewed by: Poul-Henning Kamp <phk@critter.freebsd.dk>
that use it. Specifically, vop_stdlock uses the lock pointed to by
vp->v_vnlock. By default, getnewvnode sets up vp->v_vnlock to
reference vp->v_lock. Filesystems that wish to use the default
do not need to allocate a lock at the front of their node structure
(as some still did) or do a lockinit. They can simply start using
vn_lock/VOP_UNLOCK. Filesystems that wish to manage their own locks,
but still use the vop_stdlock functions (such as nullfs) can simply
replace vp->v_vnlock with a pointer to the lock that they wish to
have used for the vnode. Such filesystems are responsible for
setting the vp->v_vnlock back to the default in their vop_reclaim
routine (e.g., vp->v_vnlock = &vp->v_lock).
In theory, this set of changes cleans up the existing filesystem
lock interface and should have no function change to the existing
locking scheme.
Sponsored by: DARPA & NAI Labs.
devfs VOP symlink creation by introducing a new entry point to determine
the label of the devfs_dirent prior to allocation of a vnode for the
symlink.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
on a process's pending signals, use the signal queue flattener,
ksiginfo_to_sigset_t, on the process, and on a local sigset_t, and then work
with that as needed.
gets signals operating based on a TailQ, and is good enough to run X11,
GNOME, and do job control. There are some intricate parts which could be
more refined to match the sigset_t versions, but those require further
evaluation of directions in which our signal system can expand and contract
to fit our needs.
After this has been in the tree for a while, I will make in kernel API
changes, most notably to trapsignal(9) and sendsig(9), to use ksiginfo
more robustly, such that we can actually pass information with our
(queued) signals to the userland. That will also result in using a
struct ksiginfo pointer, rather than a signal number, in a lot of
kern_sig.c, to refer to an individual pending signal queue member, but
right now there is no defined behaviour for such.
CODAFS is unfinished in this regard because the logic is unclear in
some places.
Sponsored by: New Gold Technology
Reviewed by: bde, tjr, jake [an older version, logic similar]
that a particular device driver is not Giant-challenged.
SPECFS will DROP_GIANT() ... PICKUP_GIANT() around calls to the
driver in question.
Notice that the interrupt path is not affected by this!
This does _NOT_ work for drivers accessed through cdevsw->d_strategy()
ie drivers for disk(-like), some tapes, maybe others.
/h/des/src/sys/coda/coda_venus.c: In function `venus_ioctl':
/h/des/src/sys/coda/coda_venus.c:277: warning: cast from pointer to integer of
different size
/h/des/src/sys/coda/coda_venus.c:292: warning: cast from pointer to integer of
different size
/h/des/src/sys/coda/coda_venus.c: In function `venus_readlink':
/h/des/src/sys/coda/coda_venus.c:380: warning: cast from pointer to integer of
different size
/h/des/src/sys/coda/coda_venus.c: In function `venus_readdir':
/h/des/src/sys/coda/coda_venus.c:637: warning: cast from pointer to integer of
different size
Submitted by: des-alpha-tinderbox
implement worthful VOP_BMAP() handler, so it expect the blkno not to be
changed by VOP_BMAP(). Otherwise, it'll have to find some tricky way to
determine if bp was VOP_BMAP()ed or not in VOP_STRATEGY().
PR: kern/42139
unlocked accesses to v_usecount.
- Lock access to the buf lists in the various sync routines. interlock
locking could be avoided almost entirely in leaf filesystems if the
fsync function had a generic helper.
constants VM_MIN_ADDRESS, VM_MAXUSER_ADDRESS, USRSTACK and PS_STRINGS.
This is mainly so that they can be variable even for the native abi, based
on different machine types. Get stack protections from the sysentvec too.
This makes it trivial to map the stack non-executable for certain abis, on
machines that support it.
wasn't doing. Rather than just lock and unlock the vnode around the call
to VOP_FSYNC(), implement rwatson's suggestion to lock the file vnode
in kern_link() before calling VOP_LINK(), since the other filesystems
also locked the file vnode right away in their link methods. Remove the
locking and and unlocking from the leaf filesystem link methods.
Reviewed by: rwatson, bde (except for the unionfs_link() changes)
file servers fail to do it in the right way.
New NFLUSHWIRE flag marks pending flush request(s).
NB: not all cases covered by this commit.
Obtained from: Darwin
v_tag is now const char * and should only be used for debugging.
Additionally:
1. All users of VT_NTS now check vfsconf->vf_type VFCF_NETWORK
2. The user of VT_PROCFS now checks for the new flag VV_PROCDEP, which
is propagated by pseudofs to all child vnodes if the fs sets PFS_PROCDEP.
Suggested by: phk
Reviewed by: bde, rwatson (earlier version)
s/SNGL/SINGLE/
s/SNGLE/SINGLE/
Fix abbreviation for P_STOPPED_* etc flags, in original code they were
inconsistent and difficult to distinguish between them.
Approved by: julian (mentor)
accept an 'active_cred' argument reflecting the credential of the thread
initiating the ioctl operation.
- Change fo_ioctl() to accept active_cred; change consumers of the
fo_ioctl() interface to generally pass active_cred from td->td_ucred.
- In fifofs, initialize filetmp.f_cred to ap->a_cred so that the
invocations of soo_ioctl() are provided access to the calling f_cred.
Pass ap->a_td->td_ucred as the active_cred, but note that this is
required because we don't yet distinguish file_cred and active_cred
in invoking VOP's.
- Update kqueue_ioctl() for its new argument.
- Update pipe_ioctl() for its new argument, pass active_cred rather
than td_ucred to MAC for authorization.
- Update soo_ioctl() for its new argument.
- Update vn_ioctl() for its new argument, use active_cred rather than
td->td_ucred to authorize VOP_IOCTL() and the associated VOP_GETATTR().
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
fo_read() and fo_write(): explicitly use the cred argument to fo_poll()
as "active_cred" using the passed file descriptor's f_cred reference
to provide access to the file credential. Add an active_cred
argument to fo_stat() so that implementers have access to the active
credential as well as the file credential. Generally modify callers
of fo_stat() to pass in td->td_ucred rather than fp->f_cred, which
was redundantly provided via the fp argument. This set of modifications
also permits threads to perform these operations on behalf of another
thread without modifying their credential.
Trickle this change down into fo_stat/poll() implementations:
- badfo_poll(), badfo_stat(): modify/add arguments.
- kqueue_poll(), kqueue_stat(): modify arguments.
- pipe_poll(), pipe_stat(): modify/add arguments, pass active_cred to
MAC checks rather than td->td_ucred.
- soo_poll(), soo_stat(): modify/add arguments, pass fp->f_cred rather
than cred to pru_sopoll() to maintain current semantics.
- sopoll(): moidfy arguments.
- vn_poll(), vn_statfile(): modify/add arguments, pass new arguments
to vn_stat(). Pass active_cred to MAC and fp->f_cred to VOP_POLL()
to maintian current semantics.
- vn_close(): rename cred to file_cred to reflect reality while I'm here.
- vn_stat(): Add active_cred and file_cred arguments to vn_stat()
and consumers so that this distinction is maintained at the VFS
as well as 'struct file' layer. Pass active_cred instead of
td->td_ucred to MAC and to VOP_GETATTR() to maintain current semantics.
- fifofs: modify the creation of a "filetemp" so that the file
credential is properly initialized and can be used in the socket
code if desired. Pass ap->a_td->td_ucred as the active
credential to soo_poll(). If we teach the vnop interface about
the distinction between file and active credentials, we would use
the active credential here.
Note that current inconsistent passing of active_cred vs. file_cred to
VOP's is maintained. It's not clear why GETATTR would be authorized
using active_cred while POLL would be authorized using file_cred at
the file system level.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
these in the main filesystems. This does not change the resulting code
but makes the source a little bit more grepable.
Sponsored by: DARPA and NAI Labs.
- v_vflag is protected by the vnode lock and is used when synchronization
with VOP calls is needed.
- v_iflag is protected by interlock and is used for dealing with vnode
management issues. These flags include X/O LOCK, FREE, DOOMED, etc.
- All accesses to v_iflag and v_vflag have either been locked or marked with
mp_fixme's.
- Many ASSERT_VOP_LOCKED calls have been added where the locking was not
clear.
- Many functions in vfs_subr.c were restructured to provide for stronger
locking.
Idea stolen from: BSD/OS
embedded into their file_entry descriptor. This is more for
correctness, since these files cannot be bmap'ed/mmap'ed anyways.
Enforce this restriction.
Submitted by: tes@sgi.com
kernel access control.
Teach devfs how to respond to pathconf() _POSIX_MAC_PRESENT queries,
allowing it to indicate to user processes that individual vnode labels
are available.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
kernel access control.
Modify procfs so that (when mounted multilabel) it exports process MAC
labels as the vnode labels of procfs vnodes associated with processes.
Approved by: des
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
kernel access control.
Modify pseudofs so that it can support synthetic file systems with
the multilabel flag set. In particular, implement vop_refreshlabel()
as pn_refreshlabel(). Implement pfs_refreshlabel() to invoke this,
and have it fall back to the mount label if the file system does
not implement pn_refreshlabel() for the node. Otherwise, permit
the file system to determine how the service is provided.
Approved by: des
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
kernel access control.
Instrument devfs to support per-dirent MAC labels. In particular,
invoke MAC framework when devfs directory entries are instantiated
due to make_dev() and related calls, and invoke the MAC framework
when vnodes are instantiated from these directory entries. Implement
vop_setlabel() for devfs, which pushes the label update into the
devfs directory entry for semi-persistant store. This permits the MAC
framework to assign labels to devices and directories as they are
instantiated, and export access control information via devfs vnodes.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
except for the fact tha they are presently swapped out. Also add a process
flag to indicate that the process has started the struggle to swap
back in. This will be needed for the case where multiple threads
start the swapin action top a collision. Also add code to stop
a process fropm being swapped out if one of the threads in this
process is actually off running on another CPU.. that might hurt...
Submitted by: Seigo Tanimura <tanimura@r.dl.itc.u-tokyo.ac.jp>
ruleset. If we do, that means there's a ruleset loop (10 includes 20
include 30 includes 10), which will quickly cause a double fault due
to stack overflow (since "include" is implemented by recursion).
(Previously, we only checked that X didn't include X.)
administrator to define certain properties of new devfs nodes before
they become visible to the userland. Both static (e.g., /dev/speaker)
and dynamic (e.g., /dev/bpf*, some removable devices) nodes are
supported. Each DEVFS mount may have a different ruleset assigned to
it, permitting different policies to be implemented for things like
jails.
Approved by: phk
- Initialize lock structure in vncache_alloc
- Return locked vnodes from vncache_alloc
- Setup vnode op vectors to use default lock, unlock, and islocked
- Implement simple locking scheme required for lookup
The ability to schedule multiple threads per process
(one one cpu) by making ALL system calls optionally asynchronous.
to come: ia64 and power-pc patches, patches for gdb, test program (in tools)
Reviewed by: Almost everyone who counts
(at various times, peter, jhb, matt, alfred, mini, bernd,
and a cast of thousands)
NOTE: this is still Beta code, and contains lots of debugging stuff.
expect slight instability in signals..
adding vnode to hash. The fix is to use atomic hash-lookup-and-add-if-
not-found operation. The odd thing is that this race can't happen
actually because the lowervp vnode is locked exclusively now during the
whole process of null node creation. This must be thought as a step
toward shared lookups.
Also remove vp->v_mount checks when looking for a match in the hash,
as this is the vestige.
Also add comments and cosmetic changes.
byte offset of the directory entry for the inode number for all types
of files except directories, although this breaks hard links for
non-directories even if it doesn't cause overflow. Just ignore this
broken inode number for stat() and readdir() and return a less broken
one (the block offset of the file), so that applications normally can't
see the brokenness.
This leaves at least the following brokenness:
- extra inodes, vnodes and caching for hard links.
- various overflow bugs. cd9660 supports 64-bit block numbers, but we
silently ignore the top 32 bits in isonum_733() and then drop another
10 bits for our broken inode numbers. We may also have sign extension
bugs from storing 32-bit extents in ints and longs even if ints are
32-bits. These bugs affect DVDs. mkisofs apparently limits them
by writing directory entries first.
Inode numbers were broken mainly in 4.4BSD-Lite2. FreeBSD-1.1.5 seems
to have a correct implementation modulo the overflow bugs. We need
to look up directory entries from inodes for symlinks only. FreeBSD-1.1.5
use separate fields (iso_parent_extent, iso_parent) to point to the
directory entry. 4.4BSD-Lite doesn't have these, and abuses i_ino to
point to the directory entry. Correct pointers are impossible for
hard links, but symlinks can't be hard links.
o Add a mutex (sb_mtx) to struct sockbuf. This protects the data in a
socket buffer. The mutex in the receive buffer also protects the data
in struct socket.
o Determine the lock strategy for each members in struct socket.
o Lock down the following members:
- so_count
- so_options
- so_linger
- so_state
o Remove *_locked() socket APIs. Make the following socket APIs
touching the members above now require a locked socket:
- sodisconnect()
- soisconnected()
- soisconnecting()
- soisdisconnected()
- soisdisconnecting()
- sofree()
- soref()
- sorele()
- sorwakeup()
- sotryfree()
- sowakeup()
- sowwakeup()
Reviewed by: alfred
pointer instead of a proc pointer and require the process pointed to
by the second argument to be locked. We now use the thread ucred reference
for the credential checks in p_can*() as a result. p_canfoo() should now
no longer need Giant.
make sure it's a correct operation for devfs, do it only in the
ISLASTCN case. If we don't, we are assuming that the final file will
be in devfs, which is not true if another partition is mounted on top
of devfs or with special filenames (like /dev/net/../../foo).
Reviewed by: phk
the block to read and copy out. This removes the hack in
udf_readatoffset() for only reading one block at a time. WooHoo!
Remove a redundant test for fragmented fids in both udf_readdir()
and udf_lookup(). Add comment to both as to why the test is
written the way it is. Add a few more safety checks for brelse().
Thanks to Timothy Shimmin <tes@boing.melbourne.sgi.com> for pointing
out these problems.
Requested by: bde
Since locking sigio_lock is usually followed by calling pgsigio(),
move the declaration of sigio_lock and the definitions of SIGIO_*() to
sys/signalvar.h.
While I am here, sort include files alphabetically, where possible.
of a process pointer.
- Move the p_candebug() at the start of procfs_control() a bit to make
locking feasible. We still perform the access check before doing
anything, we just now perform it after acquiring locks.
- Don't lock the sched_lock for TRACE_WAIT_P() and when checking to see if
p_stat is SSTOP. We lock the process while setting p_stat to SSTOP
so locking the process is sufficient to do a read to see if p_stat is
SSTOP or not.
Setting of timestamps on devices had no effect visible to userland
because timestamps for devices were set in places that are never used.
This broke:
- update of file change time after a change of an attribute
- setting of file access and modification times.
The VA_UTIMES_NULL case did not work. Revs 1.31-1.32 were supposed to
fix this by copying correct bits from ufs, but had little or no effect
because the old checks were not removed.
files. We didn't clear the update marks when we set the times, so
some of the settings were sometimes clobbered with the current time a
little later. This caused cp -p even by root to almost always fail
to preserve any times despite not reporting any errors in attempting
to preserve them.
Don't forget to set the archive attribute when we set the read-only
attribute. We should only set the archive attribute if we actually
change something, but we mostly don't bother avoiding setting it
elsewhere, so don't bother here yet.
MFC after: 1 week
most cases NULL is passed, but in some cases such as network driver locks
(which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used.
Tested on: i386, alpha, sparc64
they aren't in the usual path of execution for syscalls and traps.
The main complication for this is that we have to set flags to control
ast() everywhere that changes the signal mask.
Avoid locking in userret() in most of the remaining cases.
Submitted by: luoqi (first part only, long ago, reorganized by me)
Reminded by: dillon
general cleanup of the API. The entire API now consists of two functions
similar to the pre-KSE API. The suser() function takes a thread pointer
as its only argument. The td_ucred member of this thread must be valid
so the only valid thread pointers are curthread and a few kernel threads
such as thread0. The suser_cred() function takes a pointer to a struct
ucred as its first argument and an integer flag as its second argument.
The flag is currently only used for the PRISON_ROOT flag.
Discussed on: smp@
provided the latter is nonzero. At this point, the former is a fairly
arbitrary default value (DFTPHYS), so changing it to any reasonable
value specified by the device driver is safe. Using the maximum of
these limits broke ffs clustered i/o for devices whose si_iosize_max
is < DFLTPHYS. Using the minimum would break device drivers' ability
to increase the active limit from DFTLPHYS up to MAXPHYS.
Copied the code for this and the associated (unnecessary?) fixup of
mp_iosize_max to all other filesystems that use clustering (ext2fs and
msdosfs). It was completely missing.
PR: 36309
MFC-after: 1 week
as it leaves the nullfs vnode allocated, but with no identity. The
effect is that a null mount can slowly accumulate all the vnodes
in the system, reclaiming them only when it is unmounted. Thus
the null_inactive state instead accelerates the release of the
null vnode by calling vrecycle which will in turn call the
null_reclaim operator. The null_reclaim routine then does the
freeing actions previosuly (incorrectly) done in null_inactive.
locking flags when acquiring a vnode. The immediate purpose is
to allow polling lock requests (LK_NOWAIT) needed by soft updates
to avoid deadlock when enlisting other processes to help with
the background cleanup. For the future it will allow the use of
shared locks for read access to vnodes. This change touches a
lot of files as it affects most filesystems within the system.
It has been well tested on FFS, loopback, and CD-ROM filesystems.
only lightly on the others, so if you find a problem there, please
let me (mckusick@mckusick.com) know.
the bio and buffer structures to have daddr64_t bio_pblkno,
b_blkno, and b_lblkno fields which allows access to disks
larger than a Terabyte in size. This change also requires
that the VOP_BMAP vnode operation accept and return daddr64_t
blocks. This delta should not affect system operation in
any way. It merely sets up the necessary interfaces to allow
the development of disk drivers that work with these larger
disk block addresses. It also allows for the development of
UFS2 which will use 64-bit block addresses.
New locks are:
- pgrpsess_lock which locks the whole pgrps and sessions,
- pg_mtx which protects the pgrp members, and
- s_mtx which protects the session members.
Please refer to sys/proc.h for the coverage of these locks.
Changes on the pgrp/session interface:
- pgfind() needs the pgrpsess_lock held.
- The caller of enterpgrp() is responsible to allocate a new pgrp and
session.
- Call enterthispgrp() in order to enter an existing pgrp.
- pgsignal() requires a pgrp lock held.
Reviewed by: jhb, alfred
Tested on: cvsup.jp.FreeBSD.org
(which is a quad-CPU machine running -current)
- clobbering of jsp's $Id$ by FreeBSD's old $Id$.
- long lines in recent KSE changes (procfs_ctl.c).
- other style bugs in KSE changes (most related to an shadowed variable
in procfs_status.c -- the td in the outer scope is obfuscated by
PFS_FILL_ARGS).
Approved by: des
- clobbering of jsp's $Id$ by FreeBSD's old $Id$.
- lost Berkeley id in procfs_dbregs.c
- long lines in recent KSE changes.
- various gratuitous differences between procfs_*regs.c.
o Modify the system call syntax for extattr_{get,set}_{fd,file}() so
as not to use the scatter gather API (which appeared not to be used
by any consumers, and be less portable), rather, accepts 'data'
and 'nbytes' in the style of other simple read/write interfaces.
This changes the API and ABI.
o Modify system call semantics so that extattr_get_{fd,file}() return
a size_t. When performing a read, the number of bytes read will
be returned, unless the data pointer is NULL, in which case the
number of bytes of data are returned. This changes the API only.
o Modify the VOP_GETEXTATTR() vnode operation to accept a *size_t
argument so as to return the size, if desirable. If set to NULL,
the size will not be returned.
o Update various filesystems (pseodofs, ufs) to DTRT.
These changes should make extended attributes more useful and more
portable. More commits to rebuild the system call files, as well
as update userland utilities to follow.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
this is a low-functionality change that changes the kernel to access the main
thread of a process via the linked list of threads rather than
assuming that it is embedded in the process. It IS still embeded there
but remove all teh code that assumes that in preparation for the next commit
which will actually move it out.
Reviewed by: peter@freebsd.org, gallatin@cs.duke.edu, benno rice,
Backout revision 1.56 and 1.57 of fifo_vnops.c.
Introduce a new poll op "POLLINIGNEOF" that can be used to ignore
EOF on a fifo, POLLIN/POLLRDNORM is converted to POLLINIGNEOF within
the FIFO implementation to effect the correct behavior.
This should allow one to view a fifo pretty much as a data source
rather than worry about connections coming and going.
Reviewed by: bde
functions just grab f_data and don't muck with anything else so this
should be ok.
this fixes a panic with invariants where it thinks we've doubly initialized
the filetmp mutex even though all we've done is neglect to bzero it.
Seigo Tanimura (tanimura) posted the initial delta.
I've polished it quite a bit reducing the need for locking and
adapting it for KSE.
Locks:
1 mutex in each filedesc
protects all the fields.
protects "struct file" initialization, while a struct file
is being changed from &badfileops -> &pipeops or something
the filedesc should be locked.
1 mutex in each struct file
protects the refcount fields.
doesn't protect anything else.
the flags used for garbage collection have been moved to
f_gcflag which was the FILLER short, this doesn't need
locking because the garbage collection is a single threaded
container.
could likely be made to use a pool mutex.
1 sx lock for the global filelist.
struct file * fhold(struct file *fp);
/* increments reference count on a file */
struct file * fhold_locked(struct file *fp);
/* like fhold but expects file to locked */
struct file * ffind_hold(struct thread *, int fd);
/* finds the struct file in thread, adds one reference and
returns it unlocked */
struct file * ffind_lock(struct thread *, int fd);
/* ffind_hold, but returns file locked */
I still have to smp-safe the fget cruft, I'll get to that asap.
socreate(), rather than getting it implicitly from the thread
argument.
o Make NFS cache the credential provided at mount-time, and use
the cached credential (nfsmount->nm_cred) when making calls to
socreate() on initially connecting, or reconnecting the socket.
This fixes bugs involving NFS over TCP and ipfw uid/gid rules, as well
as bugs involving NFS and mandatory access control implementations.
Reviewed by: freebsd-arch
against VM_WAIT in the pageout code. Both fixes involve adjusting
the lockmgr's timeout capability so locks obtained with timeouts do not
interfere with locks obtained without a timeout.
Hopefully MFC: before the 4.5 release
the wrong VOP descriptor. This misuse caused VFS-cached vnodes to be
re-cached, resulting in the leak. This commit is an interim fix until DES
has a chance to rework the code involved.
With this change, mounting an smb share (using mount_smb, which is not
yet included in the tree) without any of smbfs, libiconv or libmchain
compiled into the kernel or loaded works.
not the calling process. While we're here, also unstaticize procfs_doprocfile() and
procfs_docurproc() so linprocfs can call them directly instead of duplicating them.
Submitted by: Dominic Mitchell <dom@semantico.com>
mutable contents of struct prison (hostname, securelevel, refcount,
pr_linux, ...)
o Generally introduce mtx_lock()/mtx_unlock() calls throughout kern/
so as to enforce these protections, in particular, in kern_mib.c
protection sysctl access to the hostname and securelevel, as well as
kern_prot.c access to the securelevel for access control purposes.
o Rewrite linux emulator abstractions for accessing per-jail linux
mib entries (osname, osrelease, osversion) so that they don't return
a pointer to the text in the struct linux_prison, rather, a copy
to an array passed into the calls. Likewise, update linprocfs to
use these primitives.
o Update in_pcb.c to always use prison_getip() rather than directly
accessing struct prison.
Reviewed by: jhb
The problem was that the ISO9660 code wasn't opening the device prior to
issuing ioctl calls. In particular, the device must be open before
iso_get_ssector() is called in iso_mountroot().
If the device isn't opened first, the disk layer blows up due to an
uninitialized variable.
The solution was to open the device, call iso_get_ssector() and then close
it again.
The ATAPI CDROM driver doesn't have this problem because it doesn't use the
disk layer, and evidently doesn't mind if someone issues an ioctl without
first issuing an open call.
Thanks to phk for pointing me at the source of this problem.
Tested by: dirk
MFC after: 1 week
pathconf() variables for directories, and set st_size and st_blocks
(of struct stat) for directories as appropriate. Note that st_size is
always set to DEV_BSIZE, since the size of the directories is not
currently kept.
Reviewed by: phk, bde
Basically FIFOs become a real pain to abuse as a rendevous point without
this change because you can't really select(2) on them because they always
return ready even though there is no writer (to signal EOF).
Obtained from: BSD/os
quad_t cannot be printed with %lld on 64 bit systems.
Dont waste cpu to round user and system times up to long long, it is
highly improbable that a process will have accumulated 68 years of
user or system cpu time (not wall clock time) before a reboot or
process restart.
structure changes now rather then piecemeal later on. mnt_nvnodelist
currently holds all the vnodes under the mount point. This will eventually
be split into a 'dirty' and 'clean' list. This way we only break kld's once
rather then twice. nvnodelist will eventually turn into the dirty list
and should remain compatible with the klds.
that a buffer's b_blkno would be valid. This is true when vmiodirenable
is turned off because the B_MALLOC'd buffer's data is invalidated when
the buffer is destroyed. But when vmiodirenable is turned on a buffer
can be reconstituted from its VMIO backing store. The reconstituted buffer
will have no knowledge of the physical block translation and the result is
serious directory corruption of the CDROM.
The solution is to fix cd9660_blkatoff() to always BMAP the buffer if
b_lblkno == b_blkno.
MFC after: 0 days
- crhold() returns a reference to the ucred whose refcount it bumps.
- crcopy() now simply copies the credentials from one credential to
another and has no return value.
- a new crshared() primitive is added which returns true if a ucred's
refcount is > 1 and false (0) otherwise.
Until now, the ptrace syscall was implemented as a wrapper that called
various functions in procfs depending on which ptrace operation was
requested. Most of these functions were themselves wrappers around
procfs_{read,write}_{,db,fp}regs(), with only some extra error checks,
which weren't necessary in the ptrace case anyway.
This commit moves procfs_rwmem() from procfs_mem.c into sys_process.c
(renaming it to proc_rwmem() in the process), and implements ptrace()
directly in terms of procfs_{read,write}_{,db,fp}regs() instead of
having it fake up a struct uio and then call procfs_do{,db,fp}regs().
It also moves the prototypes for procfs_{read,write}_{,db,fp}regs()
and proc_rwmem() from proc.h to ptrace.h, and marks all procfs files
except procfs_machdep.c as "optional procfs" instead of "standard".
the target process was being held locked during the uiomove() call. If the
process calling readdir() was the same as the target process (for instance
'ls /proc/curproc/'), and uiomove() caused a page fault, the result would
be a proc lock recursion. I have no idea how long this has been broken -
possibly ever since pfind() was changed to lock the process it returns.
Also replace the one and only call to procfs_findtextvp() with a direct
test of td->td_proc->p_textvp.
YA pseudofs megacommit, part 2:
- Merge the pfs_vnode and pfs_vdata structures, and make the vnode cache
a doubly-linked list. This eliminates the need to walk the list in
pfs_vncache_free().
- Add an exit callout which revokes vnodes associated with the process
that just exited. Since it needs to lock the cache when it does this,
pfs_vncache_mutex needs MTX_RECURSE.
- Add a third callback to the pfs_node structure. This one simply returns
non-zero if the specified requesting process is allowed to access the
specified node for the specified target process. This is used in
addition to the usual permission checks, e.g. when certain files don't
make sense for certain (system) processes.
- Make sure that pfs_lookup() and pfs_readdir() don't yap about files
which aren't pfs_visible(). Also check pfs_visible() before performing
reads and writes, to prevent the kind of races reported in SA-00:77 and
SA-01:55 (fork a child, open /proc/child/ctl, have that child fork a
setuid binary, and assume control of it).
- Add some more trace points.
- Rearrange the flag constants a little to simplify specifying and testing
for readability and writeability.
pseudofs_vnops.c:
- Track the aforementioned change.
- Add checks to pfs_open() to prevent opening read-only files for writing
or vice versa (pfs_{read,write} would block the actual reads and writes,
but it's still a bug to allow the open() to succeed). Also, return
EOPNOTSUPP if the caller attempts to lock the file.
- Add more trace points.
- Remove hardcoded uid, gid, mode from struct pfs_node; make pfs_getattr()
smart enough to get it right most of the time, and allow for callbacks
to handle the remaining cases. Rework the definition macros to match.
- Add lots of (conditional) debugging output.
- Fix a long-standing bug inherited from procfs: don't pretend to be a
read-only file system. Instead, return EOPNOTSUPP for operations we
truly can't support and allow others to fail silently. In particular,
pfs_lookup() now treats CREATE as LOOKUP. This may need more work.
- In pfs_lookup(), if the parent node is process-dependent, check that
the process in question still exists.
- Implement pfs_open() - its only current function is to check that the
process opening the file can see the process it belongs to.
- Finish adding support for writeable nodes.
- Bump module version number.
- Introduce lots of new bugs.
- Use some simple #define's at the top of the files for proc -> thread
changes instead of having lots of needless #ifdef's in the code.
- Don't try to use struct thread in !FreeBSD code.
- Don't use a few struct lwp's in some of the NetBSD code since it isn't
in their HEAD.
The new diff relative to before KSE is now signficantly smaller and easier
to maintain.
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.
Sorry john! (your next MFC will be a doosie!)
Reviewed by: peter@freebsd.org, dillon@freebsd.org
X-MFC after: ha ha ha ha
linux_getdents uses VOP_READDIR( ..., &ncookies, &cookies ) instead of
VOP_READDIR( ..., NULL, NULL ) because it seems to need the offsets for
linux_dirent and sizeof(dirent) != sizeof(linux_dirent)...
PR: 29467
Submitted by: Michael Reifenberger <root@nihil.plaut.de>
Reviewed by: phk
as there are now "unusual" protection properties to Pmem that differ
from the other files. While I'm at it, introduce proc locking for
the other files, which was previously present only in the Pmem case.
Obtained from: TrustedBSD Project
and so special-casing was introduced to provide extra procfs privilege
to the kmem group. With the advent of non-setgid kmem ps, this code
is no longer required, and in fact, can is potentially harmful as it
allocates privilege to a gid that is increasingly less meaningful.
Knowledge of specific gid's in kernel is also generally bad precedent,
as the kernel security policy doesn't distinguish gid's specifically,
only uid 0.
This commit removes reference to kmem in procfs, both in terms of
access control decisions, and the applying of gid kmem to the
/proc/*/mem file, simplifying the associated code considerably.
Processes are still permitted to access the mem file based on
the debugging policy, so ps -e still works fine for normal
processes and use.
Reviewed by: tmm
Obtained from: TrustedBSD Project
The p_can(...) construct was a premature (and, it turns out,
awkward) abstraction. The individual calls to p_canxxx() better
reflect differences between the inter-process authorization checks,
such as differing checks based on the type of signal. This has
a side effect of improving code readability.
o Replace direct credential authorization checks in ktrace() with
invocation of p_candebug(), while maintaining the special case
check of KTR_ROOT. This allows ktrace() to "play more nicely"
with new mandatory access control schemes, as well as making its
authorization checks consistent with other "debugging class"
checks.
o Eliminate "privused" construct for p_can*() calls which allowed the
caller to determine if privilege was required for successful
evaluation of the access control check. This primitive is currently
unused, and as such, serves only to complicate the API.
Approved by: ({procfs,linprocfs} changes) des
Obtained from: TrustedBSD Project
(this commit is just the first stage). Also add various GIANT_ macros to
formalize the removal of Giant, making it easy to test in a more piecemeal
fashion. These macros will allow us to test fine-grained locks to a degree
before removing Giant, and also after, and to remove Giant in a piecemeal
fashion via sysctl's on those subsystems which the authors believe can
operate without Giant.