some cases, move to priv_check() if it was an operation on a thread and
no other flags were present.
Eliminate caller-side jail exception checking (also now-unused); jail
privilege exception code now goes solely in kern_jail.c.
We can't yet eliminate suser() due to some cases in the KAME code where
a privilege check is performed and then used in many different deferred
paths. Do, however, move those prototypes to priv.h.
Reviewed by: csjp
Obtained from: TrustedBSD Project
This patch fixes places where they should be called atomically changing
their locking requirements (both assume per-proc spinlock held) and
introducing rufetchcalc which wrappers both calls to be performed in
atomic way.
Reviewed by: jeff
Approved by: jeff (mentor)
- Use thread_lock() rather than sched_lock for per-thread scheduling
sychronization.
- Use the per-process spinlock rather than the sched_lock for per-process
scheduling synchronization.
Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
Now, we assume no more sched_lock protection for some of them and use the
distribuited loads method for vmmeter (distribuited through CPUs).
Reviewed by: alc, bde
Approved by: jeff (mentor)
argument from being file descriptor index into the pointer to struct file:
part 2. Convert calls missed in the first big commit.
Noted by: rwatson
Pointy hat to: kib
Probabilly, a general approach is not the better solution here, so we should
solve the sched_lock protection problems separately.
Requested by: alc
Approved by: jeff (mentor)
Change the VOP_OPEN(), vn_open() vnode operation and d_fdopen() cdev operation
argument from being file descriptor index into the pointer to struct file.
Proposed and reviewed by: jhb
Reviewed by: daichi (unionfs)
Approved by: re (kensmith)
function calls are no more generated for vop_lock.
Rename _vop_lock to vop_lock1 to satisfy tools/vnode_if.awk assumption
about vop naming conventions. This restores pre/post-condition calls.
vmcnts. This can be used to abstract away pcpu details but also changes
to use atomics for all counters now. This means sched lock is no longer
responsible for protecting counts in the switch routines.
Contributed by: Attilio Rao <attilio@FreeBSD.org>
where similar data structures exist to support devfs and the MAC
Framework, but are named differently.
Obtained from: TrustedBSD Project
Sponsored by: SPARTA, Inc.
be applied to dev entries. This leaves us with file times like "Jan 1 1970."
Work around this problem by replacing the tv_sec == 0 check with a
<= 3600 check. It's doubtful anyone will be booting within an hour of the
Epoch, let alone care about a few seconds worth of nonzero timestamps. It's
a hackish work around, but it does work and I have not experienced any
negatives in my testing.
Discussed with: bde
"Ok with me: phk
requests where uio_offset is not 0 to begin with. This fixes a long-
standing bug where e.g. 'cat /proc/$$/regs' would loop forever.
MFC after: 3 weeks
The pfs_info mutex is only needed to lock pi_unrhdr. Everything else
in struct pfs_info is modified only while Giant is held (during
vfs_init() / vfs_uninit()); add assertions to that effect.
Simplify pfs_destroy somewhat.
Remove superfluous arguments from pfs_fileno_{alloc,free}(), and the
assertions which were added in the previous commit to ensure they were
consistent.
Assert that Giant is held while the vnode cache is initialized and
destroyed. Also assert that the cache is empty when it is destroyed.
Rename the vnode cache mutex for consistency.
Fix a long-standing bug in pfs_getattr(): it would uncritically return
the node's pn_fileno as st_ino. This would result in st_ino being 0
if the node had not previously been visited by readdir(), and also in
an incorrect st_ino for process directories and any files contained
therein. Correct this by abstracting the fileno manipulations
previously done in pfs_readdir() into a new function, pfs_fileno(),
which is used by both pfs_getattr() and pfs_readdir().
specific nodes when the process exits)
Move the vnode-cache-walking loop which was duplicated in pfs_exit() and
pfs_disable() into its own function, pfs_purge(), which looks for vnodes
marked as dead and / or belonging to the specified pfs_node and reclaims
them. Note that this loop is still extremely inefficient.
Add a comment in pfs_vncache_alloc() explaining why we have to purge the
vnode from the vnode cache before returning, in case anyone should be
tempted to remove the call to cache_purge().
Move the special handling for pfstype_root nodes into pfs_fileno_alloc()
and pfs_fileno_free() (the root node's fileno must always be 2). This
also fixes a bug where pfs_fileno_free() would reclaim the root node's
fileno, triggering a panic in the unr code, as that fileno was never
allocated from unr to begin with.
When destroying a pfs_node, release its fileno and purge it from the
vnode cache. I wish we could put off the call to pfs_purge() until
after the entire tree had been destroyed, but then we'd have vnodes
referencing freed pfs nodes. This probably doesn't matter while we're
still under Giant, but might become an issue later.
When destroying a pseudofs instance, destroy the tree before tearing
down the fileno allocator.
In pfs_mount(), acquire the mountpoint interlock when required.
MFC after: 3 weeks
and flags with an sxlock. This leads to a significant and measurable
performance improvement as a result of access to shared locking for
frequent lookup operations, reduced general overhead, and reduced overhead
in the event of contention. All of these are imported for threaded
applications where simultaneous access to a shared file descriptor array
occurs frequently. Kris has reported 2x-4x transaction rate improvements
on 8-core MySQL benchmarks; smaller improvements can be expected for many
workloads as a result of reduced overhead.
- Generally eliminate the distinction between "fast" and regular
acquisisition of the filedesc lock; the plan is that they will now all
be fast. Change all locking instances to either shared or exclusive
locks.
- Correct a bug (pointed out by kib) in fdfree() where previously msleep()
was called without the mutex held; sx_sleep() is now always called with
the sxlock held exclusively.
- Universally hold the struct file lock over changes to struct file,
rather than the filedesc lock or no lock. Always update the f_ops
field last. A further memory barrier is required here in the future
(discussed with jhb).
- Improve locking and reference management in linux_at(), which fails to
properly acquire vnode references before using vnode pointers. Annotate
improper use of vn_fullpath(), which will be replaced at a future date.
In fcntl(), we conservatively acquire an exclusive lock, even though in
some cases a shared lock may be sufficient, which should be revisited.
The dropping of the filedesc lock in fdgrowtable() is no longer required
as the sxlock can be held over the sleep operation; we should consider
removing that (pointed out by attilio).
Tested by: kris
Discussed with: jhb, kris, attilio, jeff
late stages of unmount). On failure, the vnode is recycled.
Add insmntque1(), to allow for file system specific cleanup when
recycling vnode on failure.
Change getnewvnode() to no longer call insmntque(). Previously,
embryonic vnodes were put onto the list of vnode belonging to a file
system, which is unsafe for a file system marked MPSAFE.
Change vfs_hash_insert() to no longer lock the vnode. The caller now
has that responsibility.
Change most file systems to lock the vnode and call insmntque() or
insmntque1() after a new vnode has been sufficiently setup. Handle
failed insmntque*() calls by propagating errors to callers, possibly
after some file system specific cleanup.
Approved by: re (kensmith)
Reviewed by: kib
In collaboration with: kib
function which is called from pfs_destroy() before the node is reclaimed.
Modify pfs_create_{dir,file,link}() to accept a pointer to a destructor
function in addition to the usual attr / fill / vis pointers.
This breaks both the programming and binary interfaces between pseudofs
and its consumers. It is believed that there are no pseudofs consumers
outside the source tree, so that the impact of this change is minimal.
Submitted by: Aniruddha Bohra <bohra@cs.rutgers.edu>
This way we may support multiple structures in v_data vnode field within
one file system without using black magic.
Vnode-to-file-handle should be VOP in the first place, but was made VFS
operation to keep interface as compatible as possible with SUN's VFS.
BTW. Now Solaris also implements vnode-to-file-handle as VOP operation.
VFS_VPTOFH() was left for API backward compatibility, but is marked for
removal before 8.0-RELEASE.
Approved by: mckusick
Discussed with: many (on IRC)
Tested with: ufs, msdosfs, cd9660, nullfs and zfs
the value of p_textvp. This way, we always unlock the locked vnode.
While there, vhold() the vnode around the vn_lock().
Reported and tested by: Guy Helmer (ghelmer palisadesys com)
Approved by: des (procfs maintainer)
MFC after: 1 week
#ifdef MSDOSFS_LARGE to run-time checks to see if "-o large" was specified.
Test case provided by Oliver Fromme:
truncate -s 200G test.img
mdconfig -a -t vnode -f test.img -u 9
newfs_msdos -s 419430400 -n 1 /dev/md9 zip250
mount -t msdosfs /dev/md9 /mnt # should fail
mount -t msdosfs -o large /dev/md9 /mnt # should succeed
PR: 105964
Requested by: Oliver Fromme <olli lurza secnetix de>
Tested by: trhodes
MFC after: 2 weeks
--------------------------
[Deadlock] is caused by a lock order reversal in vfs_lookup(), where
[some] process is trying to lock a directory vnode, that is the parent
directory of covered vnode) while holding an exclusive vnode lock on
covering vnode.
A simplified scenario:
root fs var fs
/ A / (/var) D
/var B /log (/var/log) E
vfs lock C vfs lock F
Within each file system, the lock order is clear: C->A->B and F->D->E
When traversing across mounts, the system can choose between two lock orders,
but everything must then follow that lock order:
L1: C->A->B
|
+->F->D->E
L2: F->D->E
|
+->C->A->B
The lookup() process for namei("/var") mixes those two lock orders:
VOP_LOOKUP() obtains B while A is held
vfs_busy() obtains a shared lock on F while A and B are held (follows L1,
violates L2)
vput() releases lock on B
VOP_UNLOCK() releases lock on A
VFS_ROOT() obtains lock on D while shared lock on F is held
vfs_unbusy() releases shared lock on F
vn_lock() obtains lock on A while D is held (violates L1, follows L2)
dounmount() follows L1 (B is locked while F is drained).
Without unmount activity, vfs_busy() will always succeed without blocking
and the deadlock isn't triggered (the system behaves as if L2 is followed).
With unmount, you can get 4 processes in a deadlock:
p1: holds D, want A (in lookup())
p2: holds shared lock on F, want D (in VFS_ROOT())
p3: holds B, want drain lock on F (in dounmount())
p4: holds A, want B (in VOP_LOOKUP())
You can have more than one instance of p2.
The reversal was introduced in revision 1.81 of src/sys/kern/vfs_lookup.c and
MFCed to revision 1.80.2.1, probably to avoid a cascade of vnode locks when nfs
servers are dead (VFS_ROOT() just hangs) spreading to the root fs root vnode.
- Tor Egge
To fix the LOR, ups@ noted that when crossing the mount point, ni_dvp
is actually not used by the callers of namei. Thus, placeholder deadfs
vnode vp_crossmp is introduced that is filled into ni_dvp.
Idea by: ups
Reviewed by: tegge, ups, jeff, rwatson (mac interaction)
Tested by: Peter Holm
MFC after: 2 weeks
from just before extending a file. This has the desired effect
of keeping the write speed constant. And yes, that helps a lot
copying large files always at full speed now, and I have seen
improvements using benchmarks/bonnie.
Stolen from: NetBSD
Reviewed by: bde
The code is modelled after cd9660, including support for simple read-ahead
courtesy of clustered read.
Fix udf_strategy to DTRT.
This change fixes sendfile(2) not to send out garbage.
Reviewed by: scottl
MFC after: 1 month
do not call markvoldirty() until the mount has been flagged as read-write.
Due to the nature of the msdosfs code, this bug only seemed to appear for
FAT-16 and FAT-32.
This fixes the testcase:
#!/bin/sh
dd if=/dev/zero bs=1m count=1 oseek=119 of=image.msdos
mdconfig -a -t vnode -f image.msdos
newfs_msdos -F 16 /dev/md0 fd120m
mount_msdosfs -o ro /dev/md0 /mnt
mount | grep md0
mount -u -o rw /dev/md0; echo $?
mount | grep md0
umount /mnt
mdconfig -d -u 0
PR: 105412
Tested by: Eugene Grosbein <eugen grosbein pp ru>
functions now more closely resemble similar functions in nullfs.
This also eliminates some errors.
Submitted by: daichi, Masanori OZAWA <ozawa ongs co jp>
This macro was written expecting a 32-bit unsigned long, and
doesn't work properly on 64-bit systems. This bug caused vn_stat()
to return incorrect values for files larger than 2gb on msdosfs filesystems
on 64-bit systems.
PR: 106703
Submitted by: Axel Gonzalez <loox e-shell net>
MFC after: 3 days
This bug caused vn_stat() to fail on files larger than 2gb on msdosfs
filesystems on AMD64.
PR: 106703
Tested by: Axel Gonzalez <loox e-shell net>
MFC after: 3 days
an "export" flag indicating that we are trying to NFS export the
filesystem, and the MSDOSFS_LARGEFS flag is set on the filesystem,
then deny the mount update and export request. Otherwise,
let the full mount update proceed normally.
MSDOSFS_LARGES and NFS don't mix because of the way inodes are calculated
for MSDOSFS_LARGEFS.
MFC after: 3 days
field to "unsigned long" so that it actually works.
Thanks to Robert Sciuk for sending me a DVD that
demonstrated ISO9660-formatted media with a file >2G.
I've now fixed this both in libarchive and in the cd9660
filesystem.
MFC after: 14 days
Make part of John Birrell's KSE patch permanent..
Specifically, remove:
Any reference of the ksegrp structure. This feature was
never fully utilised and made things overly complicated.
All code in the scheduler that tried to make threaded programs
fair to unthreaded programs. Libpthread processes will already
do this to some extent and libthr processes already disable it.
Also:
Since this makes such a big change to the scheduler(s), take the opportunity
to rename some structures and elements that had to be moved anyhow.
This makes the code a lot more readable.
The ULE scheduler compiles again but I have no idea if it works.
The 4bsd scheduler still reqires a little cleaning and some functions that now do
ALMOST nothing will go away, but I thought I'd do that as a separate commit.
Tested by David Xu, and Dan Eischen using libthr and libpthread.
set birthtime to FAT CTime (creation time) and in the other cases
set birthtime to -1.
o Set ctime to mtime instead of FAT CTime which has completely
different meaning.
PR: kern/106018
Submitted by: Oliver Fromme
MFC after: 1 month
and Daichi GOTO <daichi@FreeBSD.org> for submitting this
major rewrite of unionfs. This rewrite was done to
try to solve many of the longstanding crashing and locking
issues in the existing unionfs implementation. This
implementation also adds a 'MASQUERADE mode', which allows
the user to set different user, group, and file permission
modes in the upper layer.
Submitted by: daichi, Masanori OZAWA
Reviewed by: rodrigc (modified for minor style issues)
LCASE_BASE or LCASE_EXT or both are set. But dos2unixfn uses
dos2unixchr separately for the basename and the extension. So if
either LCASE_BASE or LCASE_EXT is set, dos2unixfn will convert both
the basename and extension to lowercase because it is blindly
passing in the state of both flags to dos2unixchr. The bit masks I
used ensure that only the state of LCASE_BASE gets passed to
dos2unixchr when the basename is converted, and only the state of
LCASE_EXT is passed in when the extension is converted.
PR: kern/86655
Submitted by: Micah Lieske
MFC after: 3 weeks
events. &p->p_stype is explicitely woken up on process exit for us.
Now, truss /nonexistent exits with error instead of waiting until killed
by signal.
Reported by: Nikos Vassiliadis nvass at teledomenet gr
Reviewed by: jhb
MFC after: 1 week
specific privilege names to a broad range of privileges. These may
require some future tweaking.
Sponsored by: nCircle Network Security, Inc.
Obtained from: TrustedBSD Project
Discussed on: arch@
Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri,
Alex Lyashkov <umka at sevcity dot net>,
Skip Ford <skip dot ford at verizon dot net>,
Antoine Brodin <antoine dot brodin at laposte dot net>
begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.
This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.
Obtained from: TrustedBSD Project
Sponsored by: SPARTA
vnode' v_rdev and increment the dev threadcount , as well as clear it
(in devfs_reclaim) under the dev_lock().
Reviewed by: tegge
Approved by: pjd (mentor)
Unlock the vnode in devfs_close() while calling into the driver d_close()
routine.
devfs_revoke() changes by: ups
Reviewed and bugfixes by: tegge
Tested by: mbr, Peter Holm
Approved by: pjd (mentor)
MFC after: 1 week
ioctls passing integer arguments should use the _IOWINT() macro.
This fixes a lot of ioctl's not working on sparc64, most notable
being keyboard/syscons ioctls.
Full ABI compatibility is provided, with the bonus of fixing the
handling of old ioctls on sparc64.
Reviewed by: bde (with contributions)
Tested by: emax, marius
MFC after: 1 week
and drop_dm_lock is true, no unlocking shall be attempted. The lock is
already dropped and memory is freed.
Found with: Coverity Prevent(tm)
CID: 1536
Approved by: pjd (mentor)
vnode lock in devfs_allocv. Do this by temporary dropping dm_lock around
vnode locking.
For safe operation, add hold counters for both devfs_mount and devfs_dirent,
and DE_DOOMED flag for devfs_dirent. The facilities allow to continue after
dropping of the dm_lock, by making sure that referenced memory does not
disappear.
Reviewed by: tegge
Tested by: kris
Approved by: kan (mentor)
PR: kern/102335
synchronized by the lock on the object containing the page.
Transition PG_WANTED and PG_SWAPINPROG to use the new field,
eliminating the need for holding the page queues lock when setting
or clearing these flags. Rename PG_WANTED and PG_SWAPINPROG to
VPO_WANTED and VPO_SWAPINPROG, respectively.
Eliminate the assertion that the page queues lock is held in
vm_page_io_finish().
Eliminate the acquisition and release of the page queues lock
around calls to vm_page_io_finish() in kern_sendfile() and
vfs_unbusy_pages().
skip the actual type 1 length (6 bytes). With this change, it is now possible
to correctly spot the VAT partition map in certain discs.
Submitted by: Pedro Martelletto <pedro@ambientworks.net>
one byte less than needed.
This is a RELENG_x_y candidate, since it fixes a problem with Oracle 10.
Noticed by: Dmitry Ganenko <dima@apk-inform.com>
Testcase by: Dmitry Ganenko <dima@apk-inform.com>
Reviewed by: des
Submitted by: rdivacky
Sponsored by: Google SoC 2006
MFC after: 1 week
- remove call to getmntopts(), and just pass -o options to
nmount(). This removes some confusion as to what options
msdosfs can parse, by pushing the responsibility of option parsing
to the VFS and FS specific code in the kernel.
msdosfs_vfsops.c:
- add "force" and "sync" to msdosfs_opts. They used to be specified
in mount_msdosfs.c, so move them here. It's not clear whethere these
options should be placed into global_opts in vfs_mount.c or not.
Motivated by: marcus
Correct a bug in the handling of backslash characters in smbfs which can
allow an attacker to escape from a chroot(2). [2]
Security: FreeBSD-SA-06:15.ypserv [1]
Security: FreeBSD-SA-06:16.smbfs [2]
will allow the NFS server to call vfs_stdcheckexp() on the exported nullfs
filesystem, not the underlying filesystem being nullfs mounted.
If the lower filesystem was not NFS exported, then the NFS exported
null filesystem would not work.
Pointed out by: scottl
PR: kern/87906
MFC after: 1 week
If B_NOCACHE is set the pages of vm backed buffers will be invalidated.
However clean buffers can be backed by dirty VM pages so invalidating them
can lead to data loss.
Add support for flush dirty page in the data invalidation function
of some network file systems.
This fixes data losses during vnode recycling (and other code paths
using invalbuf(*,V_SAVE,*,*)) for data written using an mmaped file.
Collaborative effort by: jhb@,mohans@,peter@,ps@,ups@
Reviewed by: tegge@
MFC after: 7 days
if a process's uid or gid has changed, but the /proc/<PID> directory
itself was also set to mode 0. Assuming this doesn't open any
security holes, open access to the /proc/<PID> directory for users
other than root to read or search the directory.
Reviewed by: des (back in February)
MFC after: 3 weeks
mount(2) system call:
* Add cmount hook to fdescfs and pseudofs (and, by extension, procfs and
linprocfs). This (mostly) restores the ability to mount these
filesystems using the old mount(2) system call (see below for the
rest of the fix).
* Remove not-NULL check for the data argument from the mount(2) entry
point. Per the mount(2) man page, it is up to the individual
filesystem being mounted to verify data. Or, in the case of procfs,
etc. the filesystem is free to ignore the data parameter if it does
not use it. Enforcing data to be not-NULL in the mount(2) system call
entry point prevented passing NULL to filesystems which ignored the
data pointer value. Apparently, passing NULL was common practice
in such cases, as even our own mount_std(8) used to do it in the
pre-nmount(2) world.
All userland programs in the tree were converted to nmount(2) long ago,
but I've found at least one external program which broke due to this
(presumably unintentional) mount(2) API change. One could argue that
external programs should also be converted to nmount(2), but then there
isn't much point in keeping the mount(2) interface for backward
compatibility if it isn't backward compatible.
rather than panicking later. This can occur if the kernel calls
vn_open() on a fifo, as there will be no associated file descriptor,
and therefore the file descriptor operations cannot be modified to
point to the fifo operation set.
MFC after: 3 days
Reported by: Martin <nakal at nurfuerspam dot de>
PR: 94278
processing, this actually means there's a double slash recorded in the
symbolic link's path name. We used to start over from / then, which
caused link targets like ../../bsdi.1.0/include//pathnames.h to be
interpreted as /pathnahes.h. This is both contradictionary to our
conventional slash interpretation, as well as potentially dangerous.
The right thing to do is (obviously) to just ignore that element.
bde once pointed out that mistake when he noticed it on the
4.4BSD-Lite2 CD-ROM, and asked me for help.
Reviewed by: bde (about half a year ago)
MFC after: 3 days
Otherwise a kernel build would break in the coda5 module if the main
kernel conf file enabled CODA_COMPAT_5, too. Redefined symbols are
strictly disallowed by -Werror.
To overcome this issue, introduce a different symbol indicating coda5
build, CODA5_MODULE, and translate it to CODA_COMPAT_5 appropriately
in /sys/coda/coda.h.
MFC after: 3 days
- Reorder the events in exit(2) slightly so that we trigger the S_EXIT
stop event earlier. After we have signalled that, we set P_WEXIT and
then wait for any processes with a hold on the vmspace via PHOLD to
release it. PHOLD now KASSERT()'s that P_WEXIT is clear when it is
invoked, and PRELE now does a wakeup if P_WEXIT is set and p_lock drops
to zero.
- Change proc_rwmem() to require that the processing read from has its
vmspace held via PHOLD by the caller and get rid of all the junk to
screw around with the vmspace reference count as we no longer need it.
- In ptrace() and pseudofs(), treat a process with P_WEXIT set as if it
doesn't exist.
- Only do one PHOLD in kern_ptrace() now, and do it earlier so it covers
FIX_SSTEP() (since on alpha at least this can end up calling proc_rwmem()
to clear an earlier single-step simualted via a breakpoint). We only
do one to avoid races. Also, by making the EINVAL error for unknown
requests be part of the default: case in the switch, the various
switch cases can now just break out to return which removes a _lot_ of
duplicated PRELE and proc unlocks, etc. Also, it fixes at least one bug
where a LWP ptrace command could return EINVAL with the proc lock still
held.
- Changed the locking for ptrace_single_step(), ptrace_set_pc(), and
ptrace_clear_single_step() to always be called with the proc lock
held (it was a mixed bag previously). Alpha and arm have to drop
the lock while the mess around with breakpoints, but other archs
avoid extra lock release/acquires in ptrace(). I did have to fix a
couple of other consumers in kern_kse and a few other places to
hold the proc lock and PHOLD.
Tested by: ps (1 mostly, but some bits of 2-4 as well)
MFC after: 1 week
associated with the passed in pfs_node. If it does return a pointer, it
keeps the process locked. This allows a lot of places that were calling
pfind() again right after pfs_visible() to not have to do that and avoids
races since we don't drop the proc lock just to turn around and lock it
again. This will become more important with future changes to fix races
between procfs/ptrace and exit(2). Also, removed a duplicate pfs_visible()
call in pfs_getextattr().
Reviewed by: des
MFC after: 1 week
vop_lock_post do not trigger.
- Rearrange null_inactive to null_hashrem earlier so there is no chance
of finding the null node on the hash list after the locks have been
switched.
- We should never have a NULL lowervp in null_reclaim() so there is
no need to handle this situation. panic instead.
MFC After: 1 week
- Simplify the logic dealing with recycled vnodes in null_hashget() and
null_hashins(). Since we hold the lower node locked in both cases
the null node can not be undergoing recycling unless reclaim somehow
called null_nodeget(). The logic that was in place was not safe and
was essentially dead code.
MFC After: 1 week
directory. vrele() may lock the passed vnode, which in these cases would
give an invalid lock order of child -> parent. These situations are
deadlock prone although do not typically deadlock because the vrele
is typically not releasing the last reference to the vnode. Users of
vrele must consider it as a call to vn_lock() and order it appropriately.
MFC After: 1 week
Sponsored by: Isilon Systems, Inc.
Tested by: kkenn
last few days. I tracked it down to the fact that nfs_reclaim()
is setting vp->v_data to NULL _before_ calling vnode_destroy_object().
After silence from the mailing list I checked further and discovered
that ufs_reclaim() is unique among FreeBSD filesystems for calling
vnode_destroy_object() early, long before tossing v_data or much
of anything else, for that matter. The rest, including NFS, appear
to be identical, as if they were just clones of one original routine.
The enclosed patch fixes all file systems in essentially the same
way, by moving the call to vnode_destroy_object() to early in the
routine (before the call to vfs_hash_remove(), if any). I have
only tested NFS, but I've now run for over eighteen hours with the
patch where I wouldn't get past four or five without it.
Submitted by: Frank Mayhar
Requested by: Mohan Srinivasan
MFC After: 1 week
POSIX. This also makes the struct correct we ever implement an i386-time64
architecture. Not that we need too.
Reviewed by: imp, brooks
Approved by: njl (acpica), des (no objects, touches procfs)
Tested with: make universe
since mount_smbfs(8) assumed long name mounting by default unless "-n long"
was explicitly specified.
Rather than supplying a "long" option in mount_smbfs(8), this commit brings
back the original behaviour by associating SMBFS_MOUNT_NO_LONG with the
"nolong" option. This should fix the broken long file names on smbfs people
observed recently.
Reported by: Vladimir Grebenschikov <vova at fbsd dot ru>
Reviewed by: phk
Tested by: Slawa Olhovchenkov <slw at zxy dot spb dot ru>
If the complete reply on the TRANS2_FIND_FIRST2 request fits exactly
into one responce packet, then next call to TRANS2_FIND_NEXT2 will return
zero entries and server will close current transaction. To avoid
subsequent errors we should not perform FIND_CLOSE2 request.
PR: kern/78953
Submitted by: Jim Carroll
synonyms for "shortname" and "longname" mount options. The old
(before nmount()) mount_msdosfs program accepted "shortnames" and "longnames",
but the kernel nmount() checked for "shortname" and "longname".
So, make the kernel accept "shortnames", "longnames", "shortname", "longname"
for forwards and backwarsd compatibility.
Discovered by: Rainer Hurling <rhurlin at gwdg dot de>
The PR and patch have the details. The ultimate fix requires architectural
changes and clarifications to the VFS API, but this will prevent the system
from panicking when someone does "ls /dev" while running in a shell under the
linuxulator.
This issue affects HEAD and RELENG_6 only.
PR: 88249
Submitted by: "Devon H. O'Dell" <dodell@ixsystems.com>
MFC after: 3 days
- Prefer '_' to ' ', as it results in more easily parsed results in
memory monitoring tools such as vmstat.
- Remove punctuation that is incompatible with using memory type names
as file names, such as '/' characters.
- Disambiguate some collisions by adding subsystem prefixes to some
memory types.
- Generally prefer lower case to upper case.
- If the same type is defined in multiple architecture directories,
attempt to use the same name in additional cases.
Not all instances were caught in this change, so more work is required to
finish this conversion. Similar changes are required for UMA zone names.
cache_lookup() has returned a ref'ed and locked vnode since
vfs_cache.c:1.96, dated Tue Mar 29 12:59:06 2005 UTC. This change
is similar to the change made to smbfs_lookup() in smbfs_vnops.c:1.58.
Tested by: "Antony Mawer" ant AT mawer.org
MFC after: 2 weeks
I benchmarked this by simultaneously extracting 4 large tarballs (basically
world images) on a 4-processor AMD64 system, in a malloc-backed md.
With this patch, system time was reduced by 43%, and wall clock time by 33%.
Submitted by: jeff
MFC after: 1 week
cd9660_lookup() that was used to fix an actual race in ufs_lookup.c:1.78.
This is not currently a hazard, but the bug would be activated by
marking cd9660 as MPSAFE.
Requested by: bde
provided in the kernel build directory, fix modules that were
failing to build this way due to not quite correct kernel option
usage. In particular:
ng_mppc.c uses two complementary options, both of which are listed
in sys/conf/files. Ideally, there should be a separate option for
including ng_mppc.c in kernel build, but now only
NETGRAPH_MPPC_ENCRYPTION is usable anyway, the other one requires
proprietary files.
nwfs and smbfs were trying to ensure they were built with proper
network components, but the check was rather questionable.
Discussed with: ru
changes in MD code are trivial, before this change, trapsignal and
sendsig use discrete parameters, now they uses member fields of
ksiginfo_t structure. For sendsig, this change allows us to pass
POSIX realtime signal value to user code.
2. Remove cpu_thread_siginfo, it is no longer needed because we now always
generate ksiginfo_t data and feed it to libpthread.
3. Add p_sigqueue to proc structure to hold shared signals which were
blocked by all threads in the proc.
4. Add td_sigqueue to thread structure to hold all signals delivered to
thread.
5. i386 and amd64 now return POSIX standard si_code, other arches will
be fixed.
6. In this sigqueue implementation, pending signal set is kept as before,
an extra siginfo list holds additional siginfo_t data for signals.
kernel code uses psignal() still behavior as before, it won't be failed
even under memory pressure, only exception is when deleting a signal,
we should call sigqueue_delete to remove signal from sigqueue but
not SIGDELSET. Current there is no kernel code will deliver a signal
with additional data, so kernel should be as stable as before,
a ksiginfo can carry more information, for example, allow signal to
be delivered but throw away siginfo data if memory is not enough.
SIGKILL and SIGSTOP have fast path in sigqueue_add, because they can
not be caught or masked.
The sigqueue() syscall allows user code to queue a signal to target
process, if resource is unavailable, EAGAIN will be returned as
specification said.
Just before thread exits, signal queue memory will be freed by
sigqueue_flush.
Current, all signals are allowed to be queued, not only realtime signals.
Earlier patch reviewed by: jhb, deischen
Tested on: i386, amd64
the UDF specification specifies a logical sectorsize of 2048.
Instead, get it from GEOM.
- When reading the UDF Anchor Volume Descriptor, use the logical
sectorsize of 2048 when calculating the offset to read from, but
use the actual sectorsize to determine how much to read.
- works with reading a DVD disk and a DVD disk image file via mdconfig
- correctly returns EINVAL if we try to mount_udf an audio CD, instead
of panicking inside GEOM when INVARIANTS is set
sys/fs/nwfs/nwfs_vfsop= s.c, introduced with the conversion to
nmount with revision 1.38. This causes mount_nwfs to fail with
the error message:
mount_nwfs: mount error: /mnt/netware: syserr = No such file or directo=
ry
This is caused by a typo on line 178, which specifies "nwfw_args"
rather than "nwfs_args".
Submitted by: Antony Mawer <gnats@mawer.org>
Fat fingers: phk
PR: 86757
MFC: 3 days
There seems to be very little documentary evidence outside this
implementation to suggest a these checks are neccessary, and more
than one camera-formatted flash disk fails the check, but mounts
successfully on most other systems.
Reviewed By: bde@
a fifo. While this did indeed close the race, confirming suspicions
about the nature of the problem, it causes difficulties with blocking
I/O on fifos.
Discussed with: ups
Also spotted by: Peter Holm <peter at holm dot cc>
that socket during open, not the write socket receive buffer. This
might explain clearing of the sb_state SB_LOCK flag seen occasionally
in soreceive() on fifos.
MFC after: 3 days
Spotted by: ups
when we mount and get zero cost if no rules are used in a mountpoint.
Add code to deref rules on unmount.
Switch from SLIST to TAILQ.
Drop SYSINIT, use SX_SYSINIT and static initializer of TAILQ instead.
Drop goto, a break will do.
Reduce double pointers to single pointers.
Combine reaping and destroying rulesets.
Avoid memory leaks in a some error cases.
underlying the POSIX fifo implementation. In 6.x/7.x, fifo access is
moved from the VFS layer, where it was serialized using the vnode
lock, to the file descriptor layer, where access is protected by a
reference count but not serialized. This exposed socket buffer
locking to high levels of parallelism in specific fifo workloads, such
as make -j 32, which expose as yet unresolved socket buffer bugs.
fi_sx re-adds serialization about the read and write routines,
although not paths that simply test socket buffer mbuf queue state,
such as the poll and kqueue methods. This restores the extra locking
cost previously present in some cases, but is an effective workaround
for the instability that has been experienced. This workaround should
be removed once the bug in socket buffer handling has been fixed.
Reported by: kris, jhb, Julien Gabel <jpeg at thilelli dot net>,
Peter Holm <peter at holm dot cc>, others
MFC after: 3 days
Give DEVFS a proper inode called struct cdev_priv. It is important
to keep in mind that this "inode" is shared between all DEVFS
mountpoints, therefore it is protected by the global device mutex.
Link the cdev_priv's into a list, protected by the global device
mutex. Keep track of each cdev_priv's state with a flag bit and
of references from mountpoints with a dedicated usecount.
Reap the benefits of much improved kernel memory allocator and the
generally better defined device driver APIs to get rid of the tables
of pointers + serial numbers, their overflow tables, the atomics
to muck about in them and all the trouble that resulted in.
This makes RAM the only limit on how many devices we can have.
The cdev_priv is actually a super struct containing the normal cdev
as the "public" part, and therefore allocation and freeing has moved
to devfs_devs.c from kern_conf.c.
The overall responsibility is (to be) split such that kern/kern_conf.c
is the stuff that deals with drivers and struct cdev and fs/devfs
handles filesystems and struct cdev_priv and their private liason
exposed only in devfs_int.h.
Move the inode number from cdev to cdev_priv and allocate inode
numbers properly with unr. Local dirents in the mountpoints
(directories, symlinks) allocate inodes from the same pool to
guarantee against overlaps.
Various other fields are going to migrate from cdev to cdev_priv
in the future in order to hide them. A few fields may migrate
from devfs_dirent to cdev_priv as well.
Protect the DEVFS mountpoint with an sx lock instead of lockmgr,
this lock also protects the directory tree of the mountpoint.
Give each mountpoint a unique integer index, allocated with unr.
Use it into an array of devfs_dirent pointers in each cdev_priv.
Initially the array points to a single element also inside cdev_priv,
but as more devfs instances are mounted, the array is extended with
malloc(9) as necessary when the filesystem populates its directory
tree.
Retire the cdev alias lists, the cdev_priv now know about all the
relevant devfs_dirents (and their vnodes) and devfs_revoke() will
pick them up from there. We still spelunk into other mountpoints
and fondle their data without 100% good locking. It may make better
sense to vector the revoke event into the tty code and there do a
destroy_dev/make_dev on the tty's devices, but that's for further
study.
Lots of shuffling of stuff and churn of bits for no good reason[2].
XXX: There is still nothing preventing the dev_clone EVENTHANDLER
from being invoked at the same time in two devfs mountpoints. It
is not obvious what the best course of action is here.
XXX: comment out an if statement that lost its body, until I can
find out what should go there so it doesn't do damage in the meantime.
XXX: Leave in a few extra malloc types and KASSERTS to help track
down any remaining issues.
Much testing provided by: Kris
Much confusion caused by (races in): md(4)
[1] You are not supposed to understand anything past this point.
[2] This line should simplify life for the peanut gallery.
running" panics.
Previously, recursion through the "include" feature was prevented by
marking each ruleset as "running" when applied. This doesn't work for
the case where two DEVFS instances try to apply the same ruleset at
the same time.
Instead introduce the sysctl vfs.devfs.rule_depth (default == 1) which
limits how many levels of "include" we will traverse.
Be aware that traversal of "include" is recursive and kernel stack
size is limited.
MFC: after 3 days
fifo_kqfilter() VOP implementations, since they in theory are used
only on open file descriptors, in which case the ioctls are via
fifo_ioctl_f() and kqueue requests are via fifo_kqfilter_f().
Generate warnings if they are entered for now. These printf()
calls should become panic() calls.
Annotate and re-implement fifo_ioctl_f(): don't arbitrarily
forward ioctls to the socket layer, only forward the ones we
explicitly support for fifos. In the case of FIONREAD, don't
forward the request to the write socket on a read-write fifo, or
the read result is overwritten. Annotate a nasty case for the
undefined POSIX O_RDWR on fifos, in which failure of the second
ioctl will result in the socket pair being in an inconsistent
state.
Assert copyright as I find myself rewriting non-trivial parts of
fifofs.
MFC after: 3 days
be held when entering a kqueue filter for fifos via a socket buffer
event: as such, assert the lock unconditionally rather than acquiring
it conditionall.
MFC after: 3 days
1) fifo_kqfilter() is not actually ever used, it likely should be GC'd.
2) fifo_kqfilter_f() doesn't implement EVFILT_VNODE, so detecting events
on the underlying vnode for a fifo no longer works (it did in 4.x).
Likely, fifo_kqfilter_f() should forward the request to the VFS using
fp->f_vnode, which would work once fifo_kqfilter() was detached from
the vnode operation vector (removing the fifo override).
Discussed with: phk
used when a read filter is requested on a write-only fifo descriptor, or
a write filter is requested on a read-only fifo descriptor. This
permits the filters to be registered, but never raises the event, which
causes kqueue behavior for fifos to more closely match similar semantics
for poll and select, which permit testing for the condition even though
the condition will never be raised, and is consistent with POSIX's notion
that a fifo has identical semantics to a one-way IPC channel created
using pipe() on most operating systems.
The fifo regression test suite can now run to completion on HEAD without
errors.
MFC after: 3 days
according to POSIX, not to mention the fact that it doesn't make sense
(and hence isn't really implemented). This causes the fifo_misc
regression test to succeed.
file descriptor. Otherwise, the read end of a fifo might return that it
is writable (which it isn't).
Only poll the fifo for write events if the fifo attached to a writable
file descriptor. Otherwise, the write end of a fifo might return that
it is readable (which it isn't).
In the event that a file is FREAD|FWRITE (which is allowed by POSIX, but
has undefined behavior), we poll for both.
MFC after: 3 days
to poll the write socket for, the fifo polling code proceeded to poll
for the complete set of events. Use 'levents' instead of 'events' as
the argument to poll, and only poll the write socket if there is
interest in write events.
MFC after: 3 days
while sleeping to allocate fifo state: due to using the vnode lock to
serialize access to a fifo during open, it shouldn't happen (tm).
MFC after: 3 days
some of the options test, specifically the joliet and rockridge tests.
Since the root mount callchain doesn't go through cd9660_cmount, the
default mount options aren't set. Rather than having the main codepath
assume the options are there, test for the absence of the inverted
optioin
e.g. instead of vfs_flagopt(.. "joliet" ..), test for
!vfs_flagopt(.. "nojoliet" ..)
This works for root mount, non-root mount and future nmount cases.
- in cd9660_cmount, remove inadvertent setting of "gens" when "extatt"
was set.
Reported by: grehan, Dario Freni <saturnero at freesbie org>
Tested by: Dario Freni
Not objected to by: phk
MFC after: 3 days
event handler, dev_clone, which accepts a credential argument.
Implementors of the event can ignore it if they're not interested,
and most do. This avoids having multiple event handler types and
fall-back/precedence logic in devfs.
This changes the kernel API for /dev cloning, and may affect third
party packages containg cloning kernel modules.
Requested by: phk
MFC after: 3 days
If a character cannot be converted to DOS code page,
unix2doschr() returned `0'. As a result, unix2dosfn()
was forced to return `0', so we saw a file which was
composed of these characters as `Invalid argument'.
To correct this, if a character can be converted to
Unicode, unix2doschr() now returns `1' which is a magic
number to make unix2dosfn() know that the character
must be converted to `_'.
[2] unix2dosfn()
The above-mentioned solution only works if a file
has both of Unicode name and DOS code page name.
Unicode name would not be recorded if file name
can be settled within 11 bytes (DOS short name)
and if no conversion from Unix charset to DOS code
page has occurred. Thus, FreeBSD can create a file
which has only short name, but there is no guarantee
that the short name contains allways valid characters
because we leave it to people by using mount_msdosfs(8)
to select which conversion is used between DOS code
page and unix charset.
To avoid this, Unicode file name should be recorded
unless a character is an ascii character. This is
the way Windows XP do.
PR: 77074 [1]
MFC after: 1 week
process that caused the clone event to take place for the device driver
creating the device. This allows cloned device drivers to adapt the
device node based on security aspects of the process, such as the uid,
gid, and MAC label.
- Add a cred reference to struct cdev, so that when a device node is
instantiated as a vnode, the cloning credential can be exposed to
MAC.
- Add make_dev_cred(), a version of make_dev() that additionally
accepts the credential to stick in the struct cdev. Implement it and
make_dev() in terms of a back-end make_dev_credv().
- Add a new event handler, dev_clone_cred, which can be registered to
receive the credential instead of dev_clone, if desired.
- Modify the MAC entry point mac_create_devfs_device() to accept an
optional credential pointer (may be NULL), so that MAC policies can
inspect and act on the label or other elements of the credential
when initializing the skeleton device protections.
- Modify tty_pty.c to register clone_dev_cred and invoke make_dev_cred(),
so that the pty clone credential is exposed to the MAC Framework.
While currently primarily focussed on MAC policies, this change is also
a prerequisite for changes to allow ptys to be instantiated with the UID
of the process looking up the pty. This requires further changes to the
pty driver -- in particular, to immediately recycle pty nodes on last
close so that the credential-related state can be recreated on next
lookup.
Submitted by: Andrew Reisse <andrew.reisse@sparta.com>
Obtained from: TrustedBSD Project
Sponsored by: SPAWAR, SPARTA
MFC after: 1 week
MFC note: Merge to 6.x, but not 5.x for ABI reasons
This is good enough to be able to run a RELENG_4 gdb binary against
a RELENG_4 application, along with various other tools (eg: 4.x gcore).
We use this at work.
ia32_reg.[ch]: handle the 32 bit register file format, used by ptrace,
procfs and core dumps.
procfs_*regs.c: vary the format of proc/XXX/*regs depending on the client
and target application.
procfs_map.c: Don't print a 64 bit value to 32 bit consumers, or their
sscanf fails. They expect an unsigned long.
imgact_elf.c: produce a valid 32 bit coredump for 32 bit apps.
sys_process.c: handle 32 bit consumers debugging 32 bit targets. Note
that 64 bit consumers can still debug 32 bit targets.
IA64 has got stubs for ia32_reg.c.
Known limitations: a 5.x/6.x gdb uses get/setcontext(), which isn't
implemented in the 32/64 wrapper yet. We also make a tiny patch to
gdb pacify it over conflicting formats of ld-elf.so.1.
Approved by: re
ioctl numbers in backwards compatability mode. eg: an IOC_IN ioctl with
a size of zero. Traditionally this was what you did before IOC_VOID
existed, and we had some established users of this in the tree, namely
procfs. Certain 3rd party drivers with binary userland components also
have this too.
This is necessary to have 4.x and 5.x binaries use these ioctl's. We
found this at work when trying to run 4.x binaries.
Approved by: re
To check a directory's in-use bitmap bit by bit, we use
a pointer to an 8 bit wide unsigned value.
The index used to dereference this pointer is calculated
by shifting the bit index right 3 bits. Then we do a
logical AND with the bit# represented by the lower 3
bits of the bit index.
This is an idiomatic way of iterating through a bit map
with simple bitwise operations.
This commit fixes the bug that we only checked bits
3:0 of each 8 bit chunk, because we only used bits 1:0
of the bit index for the bit# in the current 8 bit value.
This resulted in files not being returned by getdirentries(2).
Change the type of the bit map pointer from `char *' to
`u_int8_t *'.
perfect solution as the lower vm object can change at unpredictable times
if our lower vp happens to be on another unionfs, etc.
Submitted by: Oleg Sharoiko <os@rsu.ru>
Since the name cache is case-sensitive and msdosfs isn't,
creating a file 'foo' won't invalidate a negative entry for 'FOO'.
There are similar problems related to 8.3 filenames.
A better solution is to override VOP_LOOKUP with a method that
canonicalizes the name, then calls vfs_cache_lookup(). Unfortunately,
it's not quite that simple because vfs_cache_lookup() will call
msdosfs_lookup() on a cache miss, and msdosfs_lookup() needs a way to
get at the original component name.
than WIN_CHARS bytes, we shift the suffix (previous substrings) upwards
by the amount this substring exceeds its WIN_CHARS slot. Profiling shows
this change is indistinguishable from the previous code at 95% confidence.
This bug would result in attempts to access or create files or directories
with multi-byte characters returning an error but no data loss.
Reported and tested by: avatar
MFC after: 3 days