Commit Graph

2181 Commits

Author SHA1 Message Date
Bruce Evans
b7837a91c9 Fix and update the comments about the effect of the read-only flag on writing.
They are still too verbose.

Remove nearby unreachable code for handling symlinks.

Approved by:	re (kensmith) (blanket)
2007-08-07 05:42:10 +00:00
Bruce Evans
e3117f852e Fix some style bugs (don't assume that off_t == int64_t; fix some comments;
remove some parentheses; fix some whitespace errors; fix only one case of
a boolean comparison of a non-boolean).

Improve an error message by quoting ".", and by not printing large positive
values as negative ones.

Approved by:	re (kensmith) (blanket)
2007-08-07 03:59:49 +00:00
Bruce Evans
c0f5121cac Fix some style bugs (don't assume that off_t == int64_t; fix some comments;
remove some parentheses; fix only a couple of whtespace errors).

Approved by:	re (kensmith) (blanket)
2007-08-07 03:43:28 +00:00
Bruce Evans
2d7c6b2724 Fix some style bugs (mainly some whitespace errors).
Approved by:	re (kensmith) (blanket)
2007-08-07 03:38:36 +00:00
Bruce Evans
b6d0381e7e Fix some style bugs (some whitespace errors only).
Approved by:	re (kensmith) (blanket)
2007-08-07 03:22:10 +00:00
Bruce Evans
d2bb66bacd Sort includes.
Remove rotted banal comment attached to includes.

Approved by:	re (kensmith) (blanket)
2007-08-07 02:28:33 +00:00
Bruce Evans
6becd1c855 Sort includes.
Remove banal comments attached to includes.

Approved by:	re (kensmith) (blanket)
2007-08-07 02:27:35 +00:00
Bruce Evans
5696c6e0b2 Sort includes.
Remove banal comments before includes.  Remove rotted banal comments attached
to includes.

Approved by:	re (kensmith) (blanket)
2007-08-07 02:20:37 +00:00
Bruce Evans
9b0802c90b Remove unused include(s).
Remove banal comments before includes.

Approved by:	re (kensmith) (blanket)
2007-08-07 02:11:16 +00:00
Bruce Evans
a878a31c13 Remove unused include(s).
Approved by:	re (kensmith) (blanket)
2007-08-07 02:08:06 +00:00
Bruce Evans
eba34270fa Include <sys/mutex.h> and its prerequisite <sys/lock.h> instead of
depending on namespace pollution in <sys/buf.h> and/or <sys/vnode.h>

Approved by:	re (kensmith) (blanket)
2007-08-07 01:40:27 +00:00
Bruce Evans
1103771d95 Include <sys/mutex.h>'s prerequisite <sys/lock.h> instead of depending on
namespace pollution in <sys/vnode.h>.

Sort the include of <sys/mutex.h> instead of unsorting it after
<sys/vnode.h> and depending on the pollution there.

Approved by:	re (kensmith) (blanket)
2007-08-07 01:37:59 +00:00
Bruce Evans
6fd81fc7a6 Remove unused include(s).
Approved by:	re (kensmith) (blanket)
2007-08-07 01:07:16 +00:00
Bruce Evans
8d61a735c6 Silently fix up the estimated next free cluster number from the fsinfo
sector, instead of failing the whole mount if it is garbage.  Fields
in the fsinfo sector are only advisory, so there are better sanity
checks than this, and we already silently fix up the only other advisory
field in the fsinfo (the free cluster count).

This wasn't handled quite right in rev.1.92, 1.117, or in NetBSD.  1.92
also failed the whole mount for the non-garbage magic value 0xffffffff
1.117 fixed this well enough in practice since garbage values shouldn't
occur in practice, but left the error handling larger and more convoluted
than necessary.  Now we handle the magic value as a special case of
fixing up all out of bounds values.

Also fix up the estimated next free cluster number when there is no
fsinfo sector.  We were using 0, but CLUST_FIRST is safer.

Approved by:	re (kensmith)
2007-08-05 12:58:34 +00:00
Bruce Evans
3726942956 Oops, fix the fix for the i/o size of the fsinfo block. Its log
message explained why the size is 1 sector, but the code used a
size of 1 cluster.

I/o sizes larger than necessary may cause serious coherency problems
in the buffer cache.  Here I think there were only minor efficiency
problems, since a too-large fsinfo buffer could only get far enough
to overlap buffers for the same vnode (the device vnode), so mappings
are coherent at the page level although not at the buffer level, and
the former is probably enough due to our limited use of the fsinfo
buffer.

Approved by:	re (kensmith)
2007-08-03 23:13:50 +00:00
Xin LI
fb7557140e MFp4 - Refine locking to eliminate some potential race/panics:
- Copy before testing a pointer.  This closes a race window.
 - Use msleep with the node interlock instead of tsleep.
 - Do proper locking around access to tn_vpstate.
 - Assert vnode VOP lock for dir_{atta,de}tach to capture
   inconsistent locking.

Suggested by:	kib
Submitted by:	delphij
Reviewed by:	Howard Su
Approved by:	re (tmpfs blanket)
2007-08-03 06:24:31 +00:00
Pawel Jakub Dawidek
57fd3d5572 When we do open, we should lock the vnode exclusively. This fixes few races:
- fifo race, where two threads assign v_fifoinfo,
- v_writecount modifications,
- v_object modifications,
- and probably more...

Discussed with:	kib, ups
Approved by:	re (rwatson)
2007-07-26 16:58:09 +00:00
Xin LI
f62e5595fd MFp4: Force 64-bit arithmatic when caculating the maximum file size.
This fixes tmpfs caculations on 32-bit systems equipped with more than
4GB swap.

Reported by:	Craig Boston <craig xfoil gank org>
PR:		kern/114870
Approved by:	re (tmpfs blanket)
2007-07-24 17:14:53 +00:00
Bruce Evans
4eb3abf0a5 Make using msdosfs as the root file system sort of work:
o Initialize ownerships and permissions.  They were garbage (0) for
  root mounts since vfs_mountroot_try() doesn't ask for them to be set
  and msdosfs's old incomplete code to set them was removed.  The
  garbage happened to give the correct ownerships root:wheel, but it
  gave permissions 000 so init could not be execed.  Use the macros
  for root: wheel and 0755.  (The removed code gave 0:0 and 0777.  0755
  is more normal and secure, thought wrong for /tmp.)

o Check the readonly flag for initial (non-MNT_UPDATE) mounts in the
  correct place, as in ffs.  For root mounts, it is only passed in
  mp->mnt_flags, since vfs_mountroot_try() only passes it as a flag
  and nothing translates the flag to the "ro" option string.  msdosfs
  only looked for it in the string, so it gave a rw mount for root
  mounts without even clearing the flag in mp->mnt_flags, so the final
  state was inconsistent.  Checking the flag only in mp->mnt_flags
  works for initial userland mounts too.  The MNT_UPDATE case is
  messier.

The main point that should work but doesn't is fsck of msdosfs root
while it is mounted ro.  This needs mainly MNT_RELOAD support to work.
It should be possible to run fsck -p and succeed provided the fs is
consistent, not just for msdosfs, but this fails because fsck -p always
tries to open the device rw.  The hack that allows open for writing
in ffs is not implemented in msdosfs, since without MNT_RELOAD support
writing could only be harmful.  So fsck must be turned off to use
msdosfs as root.  This is quite dangerous, since msdosfs is still missing
actually using its fs-dirty flag internally, so it is happy to mount
dirty fileystems rw.

Unrelated changes:
- Fix missing error handling for MNT_UPDATE from rw to ro.
- Catch up with renaming msdos to msdosfs in a string.

Approved by:	re (kensmith)
2007-07-23 07:10:17 +00:00
Xin LI
7280082944 MFp4: When swapping is not enabled, allow creating files by taking
physical memory pages into account for tm_maxfilesize.

Reported by:	Dominique Goncalves <dominique.goncalves gmail.com>
Submitted by:	Howard Su
Approved by:	re (tmpfs blanket)
2007-07-23 06:54:58 +00:00
Bruce Evans
6b6c5f5ef9 Implement vfs clustering for msdosfs.
This gives a very large speedup for small block sizes (in my tests,
about 5 times for write and 3 times for read with a block size of 512,
if clustering is possible) and a moderate speedup for the moderatatly
large block sizes that should be used on non-small media (4K is the
best size in most cases, and the speedup for that is about 1.3 times
for write and 1.2 times for read).  mmap() should benefit from clustering
like read()/write(), but the current implementation of vm only supports
clustering (at least for getpages) if the fs block size is >= PAGE SIZE.

msdosfs is now only slightly slower than ffs with soft updates for
writing and slightly faster for reading when both use their best block
sizes.  Writing is slower for msdosfs because of more sync writes.
Reading is faster for msdosfs because indirect blocks interfere with
clustering in ffs.

The changes in msdosfs_read() and msdosfs_write() are simpler merges
of corresponding code in ffs (after fixing some style bugs in ffs).
msdosfs_bmap() needs fs-specific code.  This implementation loops
calling a lower level bmap function to do the hard parts.  This is a
bit inefficient, but is efficient enough since msdsfs_bmap() is only
called when there is physical i/o to do.

Approved by:	re (hrs)
2007-07-20 17:06:57 +00:00
Bruce Evans
d34b0a1bac Clean up before implementing vfs clustering for msdosfs:
In msdosfs_read(), mainly reorder the main loop to the same order as in
ffs_read().

In msdosfs_write() and extendfile(), use vfs_bio_clrbuf() instead of
clrbuf().  I think this just just a bogus optimization, but ffs always
does it and msdosfs already did it in one place, and it is what I've
tested.

In msdosfs_write(), merge good bits from a comment in ffs_write(), and
fix 1 style bug.

In the main comment for msdosfs_pcbmap(), improve wording and catch
up with 13 years of changes in the function.  This comment belongs in
VOP_BMAP.9 but that doesn't exist.

In msdosfs_bmap(), return EFBIG if the requested cluster number is out
of bounds instead of blindly truncating it, and fix many style bugs.

Approved by:	re (hrs)
2007-07-20 16:21:47 +00:00
Robert Watson
825eaf3470 Make sure we release the control vnode in Coda:
We allocate coda_ctlvp when /coda is mounted, but never release it.
During the unmount this vnode was marked as UNMOUNTING and when venus
is started a second time the system would hang, possibly waiting for
the old vnode to disappear.

So now we call vrele on the control vnode when file system is unmounted
to drop the reference we got during the mount. I'm pretty sure it is
also necessary to not skip the handling in coda_inactive for the control
vnode, it seems like that is the place we actually get rid of the vnode
once the refcount has dropped to 0.

Submitted by:	Jan Harkes <jaharkes at cs dot cmu dot edu>
Approved by:	re (kensmith)
2007-07-20 11:14:51 +00:00
Xin LI
c5be778305 MFp4: Rework on tmpfs's mapped read/write procedures. This
should finally fix fsx test case.

The printf's added here would be eventually turned into
assertions.

Submitted by:	Mingyan Guo (mostly)
Approved by:	re (tmpfs blanket)
2007-07-19 03:34:50 +00:00
Robert Watson
00f05dc847 Complete repo-copy and move of Coda from src/sys/coda to src/sys/fs/coda
by removing files from src/sys/coda, and updating include paths in the
new location, kernel configuration, and  Makefiles.  In one case add
$FreeBSD$.

Discussed with:		anderson, Jan Harkes <jaharkes@cs.cmu.edu>
Approved by:		re (kensmith)
Repo-copy madness:	simon
2007-07-12 21:04:58 +00:00
Robert Watson
d21e51d059 Forced commit to recognize repo-copy of Coda files from src/sys/coda to
src/sys/fs/coda.

Discussed with:         anderson, Jan Harkes <jaharkes@cs.cmu.edu>
Approved by:            re (kensmith)
Repo-copy madness:      simon
2007-07-12 20:40:38 +00:00
Bruce Evans
93fe42b62f Round up the FAT block size to a multiple of the sector size so that i/o
to the FAT is possible.

Make the FAT block size less arbitrary before it is rounded up:
- for FAT12, default to 3*512 instead of to 3 sectors.  The magic 3 is
  the default number of 512-byte FAT sectors on a floppy drive.  That
  many sectors is too many if the sector size is larger.
- for !FAT12, default to PAGE_SIZE instead of to 4096.  Remove
  MSDOSFS_DFLTBSIZE since it only obfuscated this 4096.

For reading the BPB, use a block size of 8192 instead of 2048 so that
sector sizes up to 8192 can work.  We should try several sizes, or just
try the maximum supported size (MAXBSIZE = 64K).  I use 8192 because
that is enough for DVD-RW's (even 2048 is enough) and 8192 has been
tested a lot in use by ffs.

This completes fixing msdosfs for some large sector sizes (up to 8K
for read and 64K for write).  Microsoft documents support for sector
sizes up to 4K in mdosfs.  ffs is currently limited to 8K for both
read and write.

Approved by:	re (kensmith)
Approved by:	nyan (several years ago)
2007-07-12 17:17:47 +00:00
Bruce Evans
fd7c4230b2 Fix some bugs involving the fsinfo block (many remain unfixed). This is
part of fixing msdosfs for large sector sizes.  One of the fixed bugs
was fatal for large sector sizes.

1. The fsinfo block has size 512, but it was misunderstood and declared
   as having size 1024, with nothing in the second 512 bytes except a
   signature at the end.  The second 512 bytes actually normally (if
   the file system was created by Windows) consist of a second boot
   sector which is normally (in WinXP) empty except for a signature --
   the normal layout is one boot sector, one fsinfo sector, another
   boot sector, then these 3 sectors duplicated.  However, other
   layouts are valid.  newfs_msdos produces a valid layout with one
   boot sector, one fsinfo sector, then these 2 sectors duplicated.
   The signature check for the extra part of the fsinfo was thus
   normally checking the signature in either the second boot sector
   or the first boot sector in the copy, and thus accidentally
   succeeding.  The extra signature check would just fail for weirder
   layouts with 512-byte sectors, and for normal layouts with any other
   sector size.

   Remove the extra bytes and the extra signature check.

2. Old versions did i/o to the fsinfo block using size 1024, with the
   second half only used for the extra signature check on read.  This
   was harmless for sector size 512, and worked accidentally for sector
   size 1024.  The i/o just failed for larger sector sizes.

   The version being fixed did i/o to the fsinfo block using size
   fsi_size(pmp) = (1024 << ((pmp)->pm_BlkPerSec >> 2)).  This
   expression makes no sense.  It happens to work for sector small
   sector sizes, but for sector size 32K it gives the preposterous
   value of 64M and thus causes panics.  A sector size of 32768 is
   necessary for at least some DVD-RW's (where the minimum write size
   is 32768 although the minimum read size is 2048).

   Now that the size of the fsinfo block is 512, it always fits in
   one sector so there is no need for a macro to express it.  Just
   use the sector size where the old code uses 1024.

Approved by:	re (kensmith)
Approved by:	nyan (several years ago for a different version of (2))
2007-07-12 16:09:07 +00:00
Robert Watson
26e3bc3a96 Fix ioctls on the control vnode: ioctls on a character device fail with
ENOTTY.  Make the control vnode a regular file so that ioctls are passed
through to our kernel module.

Submitted by:	Jan Harkes <jaharkes@cs.cmu.edu>
Approved by:	re (kensmith)
2007-07-11 21:34:41 +00:00
Robert Watson
0e3ce855cc Avoid a panic in insmntque when we pass a NULL mount: this reenables
some previously disabled code which according to the comment caused a
problem during shutdown.  But even that is still better than
triggering a kernel panic whenever venus is started.

Submitted by:	Jan Harkes <jaharkes@cs.cmu.edu>
Approved by:	re (kensmith)
2007-07-11 21:33:46 +00:00
Robert Watson
74d326ada8 Replace CODA_OPEN with CODA_OPEN_BY_FD: coda_open was disabled because
we can't open container files by device/inode number pair anymore.
Replace the CODA_OPEN upcall with CODA_OPEN_BY_FD, where venus returns
an open file descriptor for the container file.  We can then grab a
reference on the vnode coda_psdev.c:vc_nb_write and use this vnode for
further accesses to the container file.

Submitted by:	Jan Harkes <jaharkes@cs.cmu.edu>
Approved by:	re (kensmith)
2007-07-11 21:32:08 +00:00
Robert Watson
934030b2c9 Resolve Coda mount failing because Coda failed to match the device
operations.  But we don't have to, if we find the coda_mntinfo structure
for this device in our linked list, we know the device is good.

Submitted by:	Jan Harkes <jaharkes@cs.cmu.edu>
Approved by:	re (kensmith)
2007-07-11 21:21:55 +00:00
Robert Watson
7263babb85 Avoid crash when opening Coda device: when allocating coda_mntinfo, we
need to initialize dev so that we can actually find the allocated
coda_mntinfo structure later on.

Submitted by:	Jan Harkes <jaharkes@cs.cmu.edu>
Approved by:	re (kensmith)
2007-07-11 20:39:53 +00:00
Xin LI
8d9a89a3a0 MFp4: Make use of the kernel unit number allocation facility
for tmpfs nodes.

Submitted by:	Mingyan Guo <guomingyan gmail com>
Approved by:	re (tmpfs blanket)
2007-07-11 14:26:27 +00:00
Bruce Evans
8e55bfaf4b Don't use almost perfectly pessimal cluster allocation. Allocation
of the the first cluster in a file (and, if the allocation cannot be
continued contiguously, for subsequent clusters in a file) was randomized
in an attempt to leave space for contiguous allocation of subsequent
clusters in each file when there are multiple writers.  This reduced
internal fragmentation by a few percent, but it increased external
fragmentation by up to a few thousand percent.

Use simple sequential allocation instead.  Actually maintain the fsinfo
sequence index for this.  The read and write of this index from/to
disk still have many non-critical bugs, but we now write an index that
has something to do with our allocations instead of being modified
garbage.  If there is no fsinfo on the disk, then we maintain the index
internally and don't go near the bugs for writing it.

Allocating the first free cluster gives a layout that is almost as good
(better in some cases), but takes too much CPU if the FAT is large and
the first free cluster is not near the beginning.

The effect of this change for untar and tar of a slightly reduced copy
of /usr/src on a new file system was:

Before (msdosfs 4K-clusters):
untar:  459.57 real              untar from cached file (actually a pipe)
tar:    342.50 real              tar from uncached tree to /dev/zero
Before (ffs2 soft updates 4K-blocks 4K-frags)
untar:   39.18 real
tar:     29.94 real
Before (ffs2 soft updates 16K-blocks 2K-frags)
untar:   31.35 real
tar:     18.30 real

After (msdosfs 4K-clusters):
untar    54.83 real
tar      16.18 real

All of these times can be improved further.

With multiple concurrent writers or readers (especially readers), the
improvement is smaller, but I couldn't find any case where it is
negative.  342 seconds for tarring up about 342 MB on a ~47MB/S partition
is just hard to unimprove on.  (This operation would take about 7.3
seconds with reasonably localized allocation and perfect read-ahead.)
However, for active file systems, 342 seconds is closer to normal than
the 16+ seconds above or the 11 seconds with other changes (best I've
measured -- won easily by msdosfs!).  E.g., my active /usr/src on ffs1
is quite old and fragmented, so reading to prepare for the above
benchmark takes about 6 times longer than reading back the fresh copies
of it.

Approved by:	re (kensmith)
2007-07-10 13:20:24 +00:00
Xin LI
1df86a323d MFp4:
- Plug memory leak.
 - Respect underlying vnode's properties rather than assuming that
   the user want root:wheel + 0755.  Useful for using tmpfs(5) for
   /tmp.
 - Use roundup2 and howmany macros instead of rolling our own version.
 - Try to fix fsx -W -R foo case.
 - Instead of blindly zeroing a page, determine whether we need a pagein
   order to prevent data corruption.
 - Fix several bugs reported by Coverity.

Submitted by:	Mingyan Guo <guomingyan gmail com>, Howard Su, delphij
Coverity ID:	CID 2550, 2551, 2552, 2557
Approved by:	re (tmpfs blanket)
2007-07-08 15:56:12 +00:00
Konstantin Belousov
de10ffa527 Since rev. 1.199 of sys/kern/kern_conf.c, the thread that calls
destroy_dev() from d_close() cdev method would self-deadlock.
devfs_close() bump device thread reference counter, and destroy_dev()
sleeps, waiting for si_threadcount to reach zero for cdev without
d_purge method.

destroy_dev_sched() could be used instead from d_close(), to
schedule execution of destroy_dev() in another context. The
destroy_dev_sched_drain() function can be used to drain the scheduled
calls to destroy_dev_sched(). Similarly, drain_dev_clone_events() drains
the events clone to make sure no lingering devices are left after
dev_clone event handler deregistered.

make_dev_credf(MAKEDEV_REF) function should be used from dev_clone
event handlers instead of make_dev()/make_dev_cred() to ensure that created
device has reference counter bumped before cdev mutex is dropped inside
make_dev().

Reviewed by:	tegge (early versions), njl (programming interface)
Debugging help and testing by:	Peter Holm
Approved by:	re (kensmith)
2007-07-03 17:42:37 +00:00
Xin LI
9b258fca27 MFp4:
- Remove unnecessary NULL checks after M_WAITOK allocations.
 - Use VOP_ACCESS instead of hand-rolled suser_cred()
   calls. [1]
 - Use malloc(9) KPI to allocate memory for string.  The
   optimization taken from NetBSD is not valid for FreeBSD
   because our malloc(9) already act that way. [2]

Requested by:	rwatson [1]
Submitted by:	Howard Su [2]
Approved by:	re (tmpfs blanket)
2007-06-29 05:23:15 +00:00
Xin LI
a321f489a5 Space/style cleanups after last set of commits.
Approved by:	re (tmpfs blanket)
2007-06-28 02:39:31 +00:00
Xin LI
a96539bf8f Staticify most of fifo/vn operations, they should not
be directly exposed outside.

Approved by:	re (tmpfs blanket)
2007-06-28 02:36:41 +00:00
Xin LI
8d5892eeab Use vfs_timestamp instead of nanotime when obtaining
a timestamp for use with timekeeping.

Approved by:	re (tmpfs blanket)
2007-06-28 02:34:32 +00:00
Xin LI
5ff9b9158f Reorder tf_gen and tf_id in struct tmpfs_fid. This
saves 8 bytes on amd64 architecture.

Obtained from:	NetBSD
Approved by:	re (tmpfs blanket)
2007-06-28 02:32:44 +00:00
Xin LI
6ca4416347 Remove two function prototypes that are no longer used.
Approved by:	re (tmpfs blanket)
2007-06-26 02:08:29 +00:00
Xin LI
974fd8c650 - Sync with NetBSD's RCSID (HEAD preferred).
- Correct a typo.

Approved by:	re (tmpfs blanket)
2007-06-26 02:07:08 +00:00
Xin LI
7adb177693 MFp4: Several clean-ups and improvements over tmpfs:
- Remove tmpfs_zone_xxx KPI, the uma(9) wrapper, since
   they does not bring any value now.
 - Use |= instead of = when applying VV_ROOT flag.
 - Remove tm_avariable_nodes list.  Use uma to hold the
   released nodes.
 - init/destory interlock mutex of node when init/fini
   instead of ctor/dtor.
 - Change memory computing using u_int to fix negative
   value in 2G mem machine.
 - Remove unnecessary bzero's
 - Rely uma logic to make file id allocation harder to
   guess.
 - Fix some unsigned/signed related things.  Make sure
   we respect -o size=xxxx
 - Use wire instead of hold a page.
 - Pass allocate_zero to obtain zeroed pages upon first
   use.

Submitted by:	Howard Su
Approved by:	re (tmpfs blanket, kensmith)
2007-06-25 18:46:13 +00:00
Rong-En Fan
534046e301 - Remove UMAP filesystem. It was disconnected from build three years ago,
and it is seriously broken.

Discussed on:   freebsd-arch@
Approved by:	re (mux)
2007-06-25 05:06:57 +00:00
Xin LI
b746bf0820 Use vfs_timestamp() instead of nanotime() - make it up to
the user to make decisions about how detail they wanted
timestamps to have.
2007-06-18 14:40:19 +00:00
Xin LI
21cf0e3907 MFp4: fix two locking problems:
- Hold TMPFS_LOCK while updating tm_pages_used.
 - Hold vm page while doing uiomove.

This will hopefully fix all known panics.

Submitted by:	Howard Su
2007-06-18 01:43:13 +00:00
Xin LI
d1fa59e9e1 MFp4: Add tmpfs, an efficient memory file system.
Please note that, this is currently considered as an
experimental feature so there could be some rough
edges.  Consult http://wiki.freebsd.org/TMPFS for
more information.

For now, connect tmpfs to build on i386 and amd64
architectures only.  Please let us know if you have
success with other platforms.

This work was developed by Julio M. Merino Vidal
for NetBSD as a SoC project; Rohit Jalan ported it
from NetBSD to FreeBSD.  Howard Su and Glen Leeder
are worked on it to continue this effort.

Obtained from:	NetBSD via p4
Submitted by:	Howard Su (with some minor changes)
Approved by:	re (kensmith)
2007-06-16 01:56:05 +00:00
Robert Watson
32f9753cfb Eliminate now-unused SUSER_ALLOWJAIL arguments to priv_check_cred(); in
some cases, move to priv_check() if it was an operation on a thread and
no other flags were present.

Eliminate caller-side jail exception checking (also now-unused); jail
privilege exception code now goes solely in kern_jail.c.

We can't yet eliminate suser() due to some cases in the KAME code where
a privilege check is performed and then used in many different deferred
paths.  Do, however, move those prototypes to priv.h.

Reviewed by:	csjp
Obtained from:	TrustedBSD Project
2007-06-12 00:12:01 +00:00
Remko Lodder
5df29e0ce9 Correct corrupt read when the read starts at a non-aligned offset.
PR:		kern/77234
MFC After:	1 week
Approved by:	imp (mentor)
Requested by:	many many people
Submitted by:	Andriy Gapon <avg at icyb dot net dot ua>
2007-06-11 20:14:44 +00:00
Attilio Rao
a1fe14bc33 rufetch and calcru sometimes should be called atomically together.
This patch fixes places where they should be called atomically changing
their locking requirements (both assume per-proc spinlock held) and
introducing rufetchcalc which wrappers both calls to be performed in
atomic way.

Reviewed by: jeff
Approved by: jeff (mentor)
2007-06-09 21:48:44 +00:00
Bruce A. Mah
5cca41595d Fix off-by-one error (introduced in r1.60) that had the effect of
disallowing a read of exactly MAXPHYS bytes.

Reviewed by:	des, rdivacky
MFC after:	1 week
Sponsored by:	nCircle Network Security
2007-06-07 15:04:30 +00:00
Jeff Roberson
982d11f836 Commit 14/14 of sched_lock decomposition.
- Use thread_lock() rather than sched_lock for per-thread scheduling
   sychronization.
 - Use the per-process spinlock rather than the sched_lock for per-process
   scheduling synchronization.

Tested by:      kris, current@
Tested on:      i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
2007-06-05 00:00:57 +00:00
Attilio Rao
b4b7081961 Do proper "locking" for missing vmmeters part.
Now, we assume no more sched_lock protection for some of them and use the
distribuited loads method for vmmeter (distribuited through CPUs).

Reviewed by: alc, bde
Approved by: jeff (mentor)
2007-06-04 21:45:18 +00:00
Tom Rhodes
1be5bc7459 Revert previous, part of NFS that I didn't know about. 2007-06-01 17:06:46 +00:00
Tom Rhodes
a33ebaecf6 Garbage collect msdosfs_fhtovp; it appears unused and I have been using
MSDOSFS without this function and problems for the last month.
2007-06-01 14:57:19 +00:00
Konstantin Belousov
7a31868ed0 Change the VOP_OPEN(), vn_open() vnode operation and d_fdopen() cdev operation
argument from being file descriptor index into the pointer to struct file:
part 2. Convert calls missed in the first big commit.

Noted by:	rwatson
Pointy hat to:	kib
2007-06-01 14:33:11 +00:00
Attilio Rao
2feb50bf7d Revert VMCNT_* operations introduction.
Probabilly, a general approach is not the better solution here, so we should
solve the sched_lock protection problems separately.

Requested by: alc
Approved by: jeff (mentor)
2007-05-31 22:52:15 +00:00
Konstantin Belousov
9e223287c0 Revert UF_OPENING workaround for CURRENT.
Change the VOP_OPEN(), vn_open() vnode operation and d_fdopen() cdev operation
argument from being file descriptor index into the pointer to struct file.

Proposed and reviewed by:	jhb
Reviewed by:	daichi (unionfs)
Approved by:	re (kensmith)
2007-05-31 11:51:53 +00:00
Robert Watson
97cd541437 Where I previously removed calls to kdb_enter(), now remove include of
kdb.h.

Pointed out by:	bde
2007-05-29 11:28:28 +00:00
Robert Watson
86fc5557a6 Rather than entering the debugger via kdb_enter() when detecting memory
corruption under SMBUFS_NAME_DEBUG, panic() with the same error message.
2007-05-27 13:12:36 +00:00
Robert Watson
cf29f18a25 Rather than entering the debugger via kdb_enter() in the event the
root vnode is unexpectedly locked under NULLFS_DEBUG in nullfs and
then returning EDEADLK, panic.
2007-05-27 13:10:16 +00:00
Konstantin Belousov
d413d21071 Since renaming of vop_lock to _vop_lock, pre- and post-condition
function calls are no more generated for vop_lock.
Rename _vop_lock to vop_lock1 to satisfy tools/vnode_if.awk assumption
about vop naming conventions. This restores pre/post-condition calls.
2007-05-18 13:02:13 +00:00
Jeff Roberson
222d01951f - define and use VMCNT_{GET,SET,ADD,SUB,PTR} macros for manipulating
vmcnts.  This can be used to abstract away pcpu details but also changes
   to use atomics for all counters now.  This means sched lock is no longer
   responsible for protecting counts in the switch routines.

Contributed by:		Attilio Rao <attilio@FreeBSD.org>
2007-05-18 07:10:50 +00:00
Dag-Erling Smørgrav
1d776018d4 The process lock is held when procfs_ioctl() is called. Assert that this
is so, and PHOLD the process while sleeping since msleep() will release
the lock.
2007-05-01 12:59:20 +00:00
Dag-Erling Smørgrav
b77d604841 Fix old locking bugs which were revealed when pseudofs was made MPSAFE.
Submitted by:	tegge
2007-04-23 19:17:01 +00:00
Robert Watson
305759909e Rename mac*devfsdirent*() to mac*devfs*() to synchronize with SEDarwin,
where similar data structures exist to support devfs and the MAC
Framework, but are named differently.

Obtained from:	TrustedBSD Project
Sponsored by:	SPARTA, Inc.
2007-04-23 13:36:54 +00:00
Alan Cox
cf75c506db Add synchronization. Eliminate the acquisition and release of Giant.
Reviewed by: tegge
2007-04-23 06:12:24 +00:00
Tom Rhodes
164554dec4 In some cases, like whenever devfs file times are zero, the fix(aa) will not
be applied to dev entries.  This leaves us with file times like "Jan 1 1970."
Work around this problem by replacing the tv_sec == 0 check with a
<= 3600 check.  It's doubtful anyone will be booting within an hour of the
Epoch, let alone care about a few seconds worth of nonzero timestamps.  It's
a hackish work around, but it does work and I have not experienced any
negatives in my testing.

Discussed with:	bde
"Ok with me:	phk
2007-04-20 01:47:05 +00:00
Dag-Erling Smørgrav
8edf8ae133 Avoid "unused variable" warning when building without PSEUDOFS_TRACE. 2007-04-15 20:35:18 +00:00
Dag-Erling Smørgrav
388596dffc Make pseudofs (and consequently procfs, linprocfs and linsysfs) MPSAFE. 2007-04-15 17:10:01 +00:00
Dag-Erling Smørgrav
b1f9e8cec9 Instead of stating GIANT_REQUIRED, just acquire and release Giant where
needed.  This does not make a difference now, but will when procfs is
marked MPSAFE.
2007-04-15 17:06:09 +00:00
Dag-Erling Smørgrav
302762c344 Fix the same bug as in procfs_doproc{,db}regs(): check that uio_offset is
0 upon entry, and don't reset it before returning.

MFC after:	3 weeks
2007-04-15 13:29:36 +00:00
Dag-Erling Smørgrav
66cd74a611 Don't reset uio_offset to 0 before returning. Instead, refuse to service
requests where uio_offset is not 0 to begin with.  This fixes a long-
standing bug where e.g. 'cat /proc/$$/regs' would loop forever.

MFC after:	3 weeks
2007-04-15 13:24:03 +00:00
Dag-Erling Smørgrav
f61bc4ea5e Further pseudofs improvements:
The pfs_info mutex is only needed to lock pi_unrhdr.  Everything else
in struct pfs_info is modified only while Giant is held (during
vfs_init() / vfs_uninit()); add assertions to that effect.

Simplify pfs_destroy somewhat.

Remove superfluous arguments from pfs_fileno_{alloc,free}(), and the
assertions which were added in the previous commit to ensure they were
consistent.

Assert that Giant is held while the vnode cache is initialized and
destroyed.  Also assert that the cache is empty when it is destroyed.

Rename the vnode cache mutex for consistency.

Fix a long-standing bug in pfs_getattr(): it would uncritically return
the node's pn_fileno as st_ino.  This would result in st_ino being 0
if the node had not previously been visited by readdir(), and also in
an incorrect st_ino for process directories and any files contained
therein.  Correct this by abstracting the fileno manipulations
previously done in pfs_readdir() into a new function, pfs_fileno(),
which is used by both pfs_getattr() and pfs_readdir().
2007-04-14 14:08:30 +00:00
Dag-Erling Smørgrav
15bad11fdb Add a flag to struct pfs_vdata to mark the vnode as dead (e.g. process-
specific nodes when the process exits)

Move the vnode-cache-walking loop which was duplicated in pfs_exit() and
pfs_disable() into its own function, pfs_purge(), which looks for vnodes
marked as dead and / or belonging to the specified pfs_node and reclaims
them.  Note that this loop is still extremely inefficient.

Add a comment in pfs_vncache_alloc() explaining why we have to purge the
vnode from the vnode cache before returning, in case anyone should be
tempted to remove the call to cache_purge().

Move the special handling for pfstype_root nodes into pfs_fileno_alloc()
and pfs_fileno_free() (the root node's fileno must always be 2).  This
also fixes a bug where pfs_fileno_free() would reclaim the root node's
fileno, triggering a panic in the unr code, as that fileno was never
allocated from unr to begin with.

When destroying a pfs_node, release its fileno and purge it from the
vnode cache.  I wish we could put off the call to pfs_purge() until
after the entire tree had been destroyed, but then we'd have vnodes
referencing freed pfs nodes.  This probably doesn't matter while we're
still under Giant, but might become an issue later.

When destroying a pseudofs instance, destroy the tree before tearing
down the fileno allocator.

In pfs_mount(), acquire the mountpoint interlock when required.

MFC after:	3 weeks
2007-04-11 22:40:57 +00:00
Dag-Erling Smørgrav
56c62ab69c Whitespace nits. 2007-04-05 13:43:00 +00:00
Robert Watson
5e3f7694b1 Replace custom file descriptor array sleep lock constructed using a mutex
and flags with an sxlock.  This leads to a significant and measurable
performance improvement as a result of access to shared locking for
frequent lookup operations, reduced general overhead, and reduced overhead
in the event of contention.  All of these are imported for threaded
applications where simultaneous access to a shared file descriptor array
occurs frequently.  Kris has reported 2x-4x transaction rate improvements
on 8-core MySQL benchmarks; smaller improvements can be expected for many
workloads as a result of reduced overhead.

- Generally eliminate the distinction between "fast" and regular
  acquisisition of the filedesc lock; the plan is that they will now all
  be fast.  Change all locking instances to either shared or exclusive
  locks.

- Correct a bug (pointed out by kib) in fdfree() where previously msleep()
  was called without the mutex held; sx_sleep() is now always called with
  the sxlock held exclusively.

- Universally hold the struct file lock over changes to struct file,
  rather than the filedesc lock or no lock.  Always update the f_ops
  field last. A further memory barrier is required here in the future
  (discussed with jhb).

- Improve locking and reference management in linux_at(), which fails to
  properly acquire vnode references before using vnode pointers.  Annotate
  improper use of vn_fullpath(), which will be replaced at a future date.

In fcntl(), we conservatively acquire an exclusive lock, even though in
some cases a shared lock may be sufficient, which should be revisited.
The dropping of the filedesc lock in fdgrowtable() is no longer required
as the sxlock can be held over the sleep operation; we should consider
removing that (pointed out by attilio).

Tested by:	kris
Discussed with:	jhb, kris, attilio, jeff
2007-04-04 09:11:34 +00:00
Kris Kennaway
6455de0029 Annotate that this giant acqusition is dependent on tty locking. 2007-03-26 21:56:46 +00:00
Maxim Konovalov
4b12bb048f o cd9660 code repo-copied, update a comment. 2007-03-24 22:40:16 +00:00
Tor Egge
61b9d89ff0 Make insmntque() externally visibile and allow it to fail (e.g. during
late stages of unmount).  On failure, the vnode is recycled.

Add insmntque1(), to allow for file system specific cleanup when
recycling vnode on failure.

Change getnewvnode() to no longer call insmntque().  Previously,
embryonic vnodes were put onto the list of vnode belonging to a file
system, which is unsafe for a file system marked MPSAFE.

Change vfs_hash_insert() to no longer lock the vnode.  The caller now
has that responsibility.

Change most file systems to lock the vnode and call insmntque() or
insmntque1() after a new vnode has been sufficiently setup.  Handle
failed insmntque*() calls by propagating errors to callers, possibly
after some file system specific cleanup.

Approved by:	re (kensmith)
Reviewed by:	kib
In collaboration with:	kib
2007-03-13 01:50:27 +00:00
Dag-Erling Smørgrav
771709eb78 Add a pn_destroy field to pfs_node. This field points to a destructor
function which is called from pfs_destroy() before the node is reclaimed.

Modify pfs_create_{dir,file,link}() to accept a pointer to a destructor
function in addition to the usual attr / fill / vis pointers.

This breaks both the programming and binary interfaces between pseudofs
and its consumers.  It is believed that there are no pseudofs consumers
outside the source tree, so that the impact of this change is minimal.

Submitted by:	Aniruddha Bohra <bohra@cs.rutgers.edu>
2007-03-12 12:16:52 +00:00
Mike Pritchard
45cdcb7aab Change fifo_printinfo to check if the vnode v_fifoinfo pointer
is NULL and print a message to that effect to prevent a panic.
2007-03-02 00:10:11 +00:00
John Baldwin
4d70511ac3 Use pause() rather than tsleep() on stack variables and function pointers. 2007-02-27 17:23:29 +00:00
Olivier Houchard
9bf1500921 Check that the error returned by vfs_getopts() is not ENOENT before assuming
there's actually an error.
This is just in order to unbreak ntfs on current, before a proper solution is
committed.
2007-02-21 00:30:09 +00:00
Robert Watson
969e5bdcd0 Do allow PIOCSFL in jail for setguid processes; this is more consistent
with other debugging checks elsewhere.  XXX comment on the fact that
p_candebug() is not being used here remains.
2007-02-19 13:04:25 +00:00
Pawel Jakub Dawidek
10bcafe9ab Move vnode-to-file-handle translation from vfs_vptofh to vop_vptofh method.
This way we may support multiple structures in v_data vnode field within
one file system without using black magic.

Vnode-to-file-handle should be VOP in the first place, but was made VFS
operation to keep interface as compatible as possible with SUN's VFS.
BTW. Now Solaris also implements vnode-to-file-handle as VOP operation.

VFS_VPTOFH() was left for API backward compatibility, but is marked for
removal before 8.0-RELEASE.

Approved by:	mckusick
Discussed with:	many (on IRC)
Tested with:	ufs, msdosfs, cd9660, nullfs and zfs
2007-02-15 22:08:35 +00:00
Craig Rodrigues
a8d36d0d9a Forced commit and #include changes for repo copy from
sys/isofs/cd9660 to sys/fs/cd9660.

Discussed on freebsd-current.
2007-02-11 13:54:25 +00:00
Craig Rodrigues
d6140aaa69 Add noatime to the list of mount options that msdosfs accepts.
PR:		108896
Submitted by:	Eugene Grosbein <eugen grosbein pp ru>
2007-02-08 02:30:55 +00:00
Craig Rodrigues
dc9a617afb Style fixes: use ANSI C function declarations. 2007-02-08 02:25:35 +00:00
Konstantin Belousov
a257337698 Fix the race of dereferencing /proc/<pid>/file with execve(2) by caching
the value of p_textvp. This way, we always unlock the locked vnode.
While there, vhold() the vnode around the vn_lock().

Reported and tested by:	Guy Helmer (ghelmer palisadesys com)
Approved by:		des (procfs maintainer)
MFC after:		1 week
2007-02-07 10:30:49 +00:00
Craig Rodrigues
8a4cab026b Eliminate some dead code which was introduced in 1.23, yet was always
commented out.
2007-02-06 03:30:58 +00:00
Pawel Jakub Dawidek
5ab5525469 coda_vptofh is never defined nor used. 2007-02-02 15:47:28 +00:00
Tai-hwa Liang
61ad2e26ef Fixing compilation bustage by removing references to opt_msdosfs.h.
This auto-generated header file no longer exists since the removal of
MSDOSFS_LARGE in sys/conf/options:1.574.
2007-01-30 08:05:04 +00:00
Tom Rhodes
bade0e00f3 Fix spacing from my previous commit to this file:
Noticed by:	fjoe
2007-01-30 04:41:38 +00:00
Craig Rodrigues
f458f2a553 Add a "-o large" mount option for msdosfs. Convert compile-time checks for
#ifdef MSDOSFS_LARGE to run-time checks to see if "-o large" was specified.

Test case provided by Oliver Fromme:
  truncate -s 200G test.img
  mdconfig -a -t vnode -f test.img -u 9
  newfs_msdos -s 419430400 -n 1 /dev/md9 zip250
  mount -t msdosfs /dev/md9 /mnt    # should fail
  mount -t msdosfs -o large /dev/md9 /mnt   # should succeed

PR:		105964
Requested by:	Oliver Fromme <olli lurza secnetix de>
Tested by:	trhodes
MFC after:	2 weeks
2007-01-30 03:11:45 +00:00
Konstantin Belousov
7f92c4ee02 Below is slightly edited description of the LOR by Tor Egge:
--------------------------
[Deadlock] is caused by a lock order reversal in vfs_lookup(), where
[some] process is trying to lock a directory vnode, that is the parent
directory of covered vnode) while holding an exclusive vnode lock on
covering vnode.

A simplified scenario:

root fs					var fs
/    		A			/    (/var)	D
/var		B			/log (/var/log) E
vfs lock	C			vfs lock	F

Within each file system, the lock order is clear: C->A->B and F->D->E

When traversing across mounts, the system can choose between two lock orders,
but everything must then follow that lock order:

      L1: C->A->B
		|
	        +->F->D->E

      L2: F->D->E
	     |
             +->C->A->B

The lookup() process for namei("/var") mixes those two lock orders:

    VOP_LOOKUP() obtains B while A is held
    vfs_busy() obtains a shared lock on F while A and B are held (follows L1,
    violates L2)
    vput() releases lock on B
    VOP_UNLOCK() releases lock on A
    VFS_ROOT() obtains lock on D while shared lock on F is held
    vfs_unbusy() releases shared lock on F
    vn_lock() obtains lock on A while D is held (violates L1, follows L2)

dounmount() follows L1 (B is locked while F is drained).

Without unmount activity, vfs_busy() will always succeed without blocking
and the deadlock isn't triggered (the system behaves as if L2 is followed).

With unmount, you can get 4 processes in a deadlock:

     p1: holds D, want A (in lookup())
     p2: holds shared lock on F, want D (in VFS_ROOT())
     p3: holds B, want drain lock on F (in dounmount())
     p4: holds A, want B (in VOP_LOOKUP())

You can have more than one instance of p2.

The reversal was introduced in revision 1.81 of src/sys/kern/vfs_lookup.c and
MFCed to revision 1.80.2.1, probably to avoid a cascade of vnode locks when nfs
servers are dead (VFS_ROOT() just hangs) spreading to the root fs root vnode.

- Tor Egge

To fix the LOR, ups@ noted that when crossing the mount point, ni_dvp
is actually not used by the callers of namei. Thus, placeholder deadfs
vnode vp_crossmp is introduced that is filled into ni_dvp.

Idea by:	ups
Reviewed by:	tegge, ups, jeff, rwatson (mac interaction)
Tested by:	Peter Holm
MFC after:	2 weeks
2007-01-22 11:25:22 +00:00
Tom Rhodes
752945d6c0 Add a 3rd entry in the cache, which keeps the end position
from just before extending a file.  This has the desired effect
of keeping the write speed constant.  And yes, that helps a lot
copying large files always at full speed now, and I have seen
improvements using benchmarks/bonnie.

Stolen from:	NetBSD
Reviewed by:	bde
2007-01-16 23:43:14 +00:00
Pav Lucistnik
0c09ac0d57 Rewrite the udf_read() routine to use a file vnode instead of the devvp vnode.
The code is modelled after cd9660, including support for simple read-ahead
courtesy of clustered read.

Fix udf_strategy to DTRT.

This change fixes sendfile(2) not to send out garbage.

Reviewed by:	scottl
MFC after:	1 month
2007-01-15 18:45:36 +00:00
Pav Lucistnik
9f3eef13ca Tell backing v_object the filesize right on it's creation.
MFC after:	1 week
2007-01-07 23:53:16 +00:00
Craig Rodrigues
82c59ec651 When performing a mount update to change a mount from read-only to read-write,
do not call markvoldirty() until the mount has been flagged as read-write.
Due to the nature of the msdosfs code, this bug only seemed to appear for
FAT-16 and FAT-32.

This fixes the testcase:
#!/bin/sh
dd if=/dev/zero bs=1m count=1 oseek=119 of=image.msdos
mdconfig -a -t vnode -f image.msdos
newfs_msdos -F 16 /dev/md0 fd120m
mount_msdosfs -o ro /dev/md0 /mnt
mount | grep md0
mount -u -o rw /dev/md0; echo $?
mount | grep md0
umount /mnt
mdconfig -d -u 0

PR:		105412
Tested by:	Eugene Grosbein <eugen grosbein pp ru>
2007-01-06 20:46:02 +00:00
Craig Rodrigues
dda4f444de Simplify code in union_hashins() and union_hashget() functions. These
functions now more closely resemble similar functions in nullfs.
This also eliminates some errors.

Submitted by:	daichi, Masanori OZAWA <ozawa ongs co jp>
2007-01-05 14:06:42 +00:00
Craig Rodrigues
9170c87faa Eliminate obsolete comment, now that getushort() is implemented in
terms of functions in <sys/endian.h>.
2007-01-05 05:28:57 +00:00
Craig Rodrigues
98155f1f51 Eliminate ASSERT_VOP_ELOCKED panics when doing mkdir or symlink when
sysctl vfs.lookup_shared=1.

Submitted by:	daichi, Masanori OZAWA <ozawa ongs co jp>
2007-01-05 02:25:44 +00:00
John Baldwin
b082761327 Use the vnode interlock to close a race where pfs_vncache_alloc() could
attempt to vn_lock() a destroyed vnode resulting in a hang.

MFC after:	1 week
Submitted by:	ups
Reviewed by:	des
2007-01-02 17:27:52 +00:00
Pav Lucistnik
35e0662415 Call vnode_create_vobject() in VOP_OPEN. Makes mmap work on UDF filesystem.
PR:		kern/92040
Approved by:	scottl
MFC after:	1 week
2006-12-23 18:53:22 +00:00
Marcel Moolenaar
94632b9fe1 Unbreak 64-bit little-endian systems that do require alignment.
The fix involves using le16dec(), le32dec(), le16enc() and
le32enc(). This eliminates invalid casts and duplicated logic.
2006-12-21 05:40:46 +00:00
Craig Rodrigues
3244bb8a12 For big-endian version of getulong() macro, cast result to u_int32_t.
This macro was written expecting a 32-bit unsigned long, and
doesn't work properly on 64-bit systems.  This bug caused vn_stat()
to return incorrect values for files larger than 2gb on msdosfs filesystems
on 64-bit systems.

PR:		106703
Submitted by:	Axel Gonzalez <loox e-shell net>
MFC after:	3 days
2006-12-19 02:31:58 +00:00
Craig Rodrigues
d01e83878b Fix get_ulong() macro on AMD64 (or any little-endian 64-bit platform).
This bug caused vn_stat() to fail on files larger than 2gb on msdosfs
filesystems on AMD64.

PR:		106703
Tested by:	Axel Gonzalez <loox e-shell net>
MFC after:	3 days
2006-12-19 01:55:45 +00:00
Craig Rodrigues
b05872f29b Remove unused variable in unionfs_root().
Submitted by:	daichi, Masanori OZAWA
2006-12-09 17:24:18 +00:00
Craig Rodrigues
1e370dbbdc Use vfs_mount_error() in a few places to give more descriptive mount error
messages.
2006-12-09 17:21:25 +00:00
Craig Rodrigues
30d471e654 Add locking around calls to unionfs_get_node_status()
in unionfs_ioctl() and unionfs_poll().

Submitted by:	daichi, Masanori OZAWA <ozawa@ongs.co.jp>
Prompted by:	kris
2006-12-09 16:51:09 +00:00
Craig Rodrigues
b16f4eec16 In unionfs_readdir(), prevent a possible NULL dereference.
CID:		1667
Found by:	Coverity Prevent (tm)
2006-12-09 16:34:37 +00:00
Craig Rodrigues
acc4bab11b In unionfs_hashrem(), use LIST_FOREACH_SAFE when iterating over
the list of nodes to free them.

CID:		1668
Found by:	Coverity Prevent (tm)
2006-12-09 16:27:50 +00:00
Craig Rodrigues
e9022ef898 Minor cleanup. If we are doing a mount update, and we pass in
an "export" flag indicating that we are trying to NFS export the
filesystem, and the MSDOSFS_LARGEFS flag is set on the filesystem,
then deny the mount update and export request.  Otherwise,
let the full mount update proceed normally.
MSDOSFS_LARGES and NFS don't mix because of the way inodes are calculated
for MSDOSFS_LARGEFS.

MFC after:	3 days
2006-12-09 01:49:19 +00:00
Tim Kientzle
8d3027e203 The ISO9660 spec does allow files up to 4G. Change the i_size
field to "unsigned long" so that it actually works.
Thanks to Robert Sciuk for sending me a DVD that
demonstrated ISO9660-formatted media with a file >2G.
I've now fixed this both in libarchive and in the cd9660
filesystem.

MFC after: 14 days
2006-12-08 07:43:53 +00:00
Julian Elischer
ad1e7d285a Threading cleanup.. part 2 of several.
Make part of John Birrell's KSE patch permanent..
Specifically, remove:
Any reference of the ksegrp structure. This feature was
never fully utilised and made things overly complicated.
All code in the scheduler that tried to make threaded programs
fair to unthreaded programs.  Libpthread processes will already
do this to some extent and libthr processes already disable it.

Also:
Since this makes such a big change to the scheduler(s), take the opportunity
to rename some structures and elements that had to be moved anyhow.
This makes the code a lot more readable.

The ULE scheduler compiles again but I have no idea if it works.

The 4bsd scheduler still reqires a little cleaning and some functions that now do
ALMOST nothing will go away, but I thought I'd do that as a separate commit.

Tested by David Xu, and Dan Eischen using libthr and libpthread.
2006-12-06 06:34:57 +00:00
Maxim Konovalov
1c5cf521ae o Do not leave uninitialized birthtime: in MSDOSFSMNT_LONGNAME
set birthtime to FAT CTime (creation time) and in the other cases
set birthtime to -1.

o Set ctime to mtime instead of FAT CTime which has completely
different meaning.

PR:		kern/106018
Submitted by:	Oliver Fromme
MFC after:	1 month
2006-12-03 19:04:26 +00:00
Craig Rodrigues
3d253c11cf Add missing includes for <sys/buf.h> and <sys/bio.h>. 2006-12-02 22:30:30 +00:00
Craig Rodrigues
d00947d83a Many, many thanks to Masanori OZAWA <ozawa@ongs.co.jp>
and Daichi GOTO <daichi@FreeBSD.org> for submitting this
major rewrite of unionfs.  This rewrite was done to
try to solve many of the longstanding crashing and locking
issues in the existing unionfs implementation.  This
implementation also adds a 'MASQUERADE mode', which allows
the user to set different user, group, and file permission
modes in the upper layer.

Submitted by:	daichi, Masanori OZAWA
Reviewed by:	rodrigc (modified for minor style issues)
2006-12-02 19:35:56 +00:00
Maxim Konovalov
cc005bb92c o From the submitter: dos2unixchr will convert to lower case if
LCASE_BASE or LCASE_EXT or both are set.  But dos2unixfn uses
dos2unixchr separately for the basename and the extension.  So if
either LCASE_BASE or LCASE_EXT is set, dos2unixfn will convert both
the basename and extension to lowercase because it is blindly
passing in the state of both flags to dos2unixchr.  The bit masks I
used ensure that only the state of LCASE_BASE gets passed to
dos2unixchr when the basename is converted, and only the state of
LCASE_EXT is passed in when the extension is converted.

PR:		kern/86655
Submitted by:	Micah Lieske
MFC after:	3 weeks
2006-11-26 18:49:44 +00:00
Lukas Ertl
9df1370eab Fix an integer overflow and allow access to files larger than 4GB on
NTFS.
2006-11-20 19:28:36 +00:00
Konstantin Belousov
dbf989ea6a Wake up PIOCWAIT handler on the process exit in addition to the stop
events. &p->p_stype is explicitely woken up on process exit for us.

Now, truss /nonexistent exits with error instead of waiting until killed
by signal.

Reported by:	Nikos Vassiliadis nvass at teledomenet gr
Reviewed by:	jhb
MFC after:	1 week
2006-11-17 14:52:38 +00:00
Kip Macy
2f6a774be4 change vop_lock handling to allowing tracking of callers' file and line for
acquisition of lockmgr locks

Approved by: scottl (standing in for mentor rwatson)
2006-11-13 05:51:22 +00:00
Robert Watson
acd3428b7d Sweep kernel replacing suser(9) calls with priv(9) calls, assigning
specific privilege names to a broad range of privileges.  These may
require some future tweaking.

Sponsored by:           nCircle Network Security, Inc.
Obtained from:          TrustedBSD Project
Discussed on:           arch@
Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri,
                        Alex Lyashkov <umka at sevcity dot net>,
                        Skip Ford <skip dot ford at verizon dot net>,
                        Antoine Brodin <antoine dot brodin at laposte dot net>
2006-11-06 13:42:10 +00:00
Boris Popov
fb8e9ead37 Create a bidirectional mapping of the DOS 'read only' attribute
to the 'w' flag.

PR:		kern/77958
Submitted by:	ghozzy gmail com
MFC after:	1 month
2006-11-05 06:38:42 +00:00
John Birrell
8460a577a4 Make KSE a kernel option, turned on by default in all GENERIC
kernel configs except sun4v (which doesn't process signals properly
with KSE).

Reviewed by:	davidxu@
2006-10-26 21:42:22 +00:00
Poul-Henning Kamp
3c925ad2aa Ditch crummy fattime <--> timespec conversion functions 2006-10-24 11:55:18 +00:00
Poul-Henning Kamp
4a4cd136b4 Drop crummy fattime to timespec conversion routines.
Leave a XXX here for anybody able to test.
2006-10-24 11:43:41 +00:00
Poul-Henning Kamp
3c960d9379 Replace slightly crummy fattime<->timespec conversion functions. 2006-10-24 11:14:05 +00:00
Robert Watson
aed5570872 Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h
begun with a repo-copy of mac.h to mac_framework.h.  sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.

This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.

Obtained from:	TrustedBSD Project
Sponsored by:	SPARTA
2006-10-22 11:52:19 +00:00
Tom Rhodes
94a28290c1 Fake the link count until we have no choice but to load data from the
MFT.

PR:		86965
Submitted by:	Lowell Gilbert <lgfbsd@be-well.ilk.org>
2006-10-21 08:17:17 +00:00
Konstantin Belousov
16f50bcd80 Update the access and modification times for dev while still holding
thread reference on it.

Reviewed by:	tegge
Approved by:	pjd (mentor)
2006-10-20 08:03:42 +00:00
Konstantin Belousov
1663075c64 Fix the race between devfs_fp_check and devfs_reclaim. Derefence the
vnode' v_rdev and increment the dev threadcount , as well as clear it
(in devfs_reclaim) under the dev_lock().

Reviewed by:	tegge
Approved by:	pjd (mentor)
2006-10-20 07:59:50 +00:00
Konstantin Belousov
828d6d12da Properly lock the vnode around vgone() calls.
Unlock the vnode in devfs_close() while calling into the driver d_close()
routine.

devfs_revoke() changes by:	ups
Reviewed and bugfixes by:	tegge
Tested by:	mbr, Peter Holm
Approved by:	pjd (mentor)
MFC after:	1 week
2006-10-18 11:17:14 +00:00
Poul-Henning Kamp
e5037a18a9 Use utc_offset() where applicable, and hide the internals of it
as static variables.
2006-10-02 18:23:37 +00:00
Poul-Henning Kamp
f645b0b51c First part of a little cleanup in the calendar/timezone/RTC handling.
Move relevant variables to <sys/clock.h> and fix #includes as necessary.

Use libkern's much more time- & spamce-efficient BCD routines.
2006-10-02 12:59:59 +00:00
Ruslan Ermilov
9fddcc6661 Fix our ioctl(2) implementation when the argument is "int". New
ioctls passing integer arguments should use the _IOWINT() macro.
This fixes a lot of ioctl's not working on sparc64, most notable
being keyboard/syscons ioctls.

Full ABI compatibility is provided, with the bonus of fixing the
handling of old ioctls on sparc64.

Reviewed by:	bde (with contributions)
Tested by:	emax, marius
MFC after:	1 week
2006-09-27 19:57:02 +00:00
Tor Egge
5da56ddb21 Use mount interlock to protect all changes to mnt_flag and mnt_kern_flag.
This eliminates a race where MNT_UPDATE flag could be lost when nmount()
raced against sync(), sync_fsync() or quotactl().
2006-09-26 04:12:49 +00:00
Konstantin Belousov
af72db7175 Fix the bug in rev. 1.134. In devfs_allocv_drop_refs(), when not_found == 2
and drop_dm_lock is true, no unlocking shall be attempted. The lock is
already dropped and memory is freed.

Found with:	Coverity Prevent(tm)
CID:	1536
Approved by:	pjd (mentor)
2006-09-19 14:03:02 +00:00
Konstantin Belousov
e7f9b74438 Resolve the devfs deadlock caused by LOR between devfs_mount->dm_lock and
vnode lock in devfs_allocv. Do this by temporary dropping dm_lock around
vnode locking.

For safe operation, add hold counters for both devfs_mount and devfs_dirent,
and DE_DOOMED flag for devfs_dirent. The facilities allow to continue after
dropping of the dm_lock, by making sure that referenced memory does not
disappear.

Reviewed by:	tegge
Tested by:	kris
Approved by:	kan (mentor)
PR:		kern/102335
2006-09-18 13:23:08 +00:00
Warner Losh
b4583894aa Put the osta.c license on osta.h. The license is the same.
Approved by: scottl@
2006-09-12 19:02:34 +00:00
Warner Losh
1a3c917f9d while (0); -> while (0) in multi-line macros 2006-08-17 22:50:33 +00:00
Alan Cox
5786be7cc7 Introduce a field to struct vm_page for storing flags that are
synchronized by the lock on the object containing the page.

Transition PG_WANTED and PG_SWAPINPROG to use the new field,
eliminating the need for holding the page queues lock when setting
or clearing these flags.  Rename PG_WANTED and PG_SWAPINPROG to
VPO_WANTED and VPO_SWAPINPROG, respectively.

Eliminate the assertion that the page queues lock is held in
vm_page_io_finish().

Eliminate the acquisition and release of the page queues lock
around calls to vm_page_io_finish() in kern_sendfile() and
vfs_unbusy_pages().
2006-08-09 17:43:27 +00:00
Yaroslav Tykhiy
776fc0e90e Commit the results of the typo hunt by Darren Pilgrim.
This change affects documentation and comments only,
no real code involved.

PR:		misc/101245
Submitted by:	Darren Pilgrim <darren pilgrim bitfreak org>
Tested by:	md5(1)
MFC after:	1 week
2006-08-04 07:56:35 +00:00
Xin LI
bcc4260f3b When the volume is being downgraded from a read-write mode, mark
it as clean.

PR:		kern/85366
Submitted by:	Dan Lukes <dan at obluda dot cz>
MFC After:	2 weeks
2006-08-03 03:55:52 +00:00
Yaroslav Tykhiy
69f0212f52 In udf_find_partmaps(), when we find a type 1 partition map, we have to
skip the actual type 1 length (6 bytes). With this change, it is now possible
to correctly spot the VAT partition map in certain discs.

Submitted by:	Pedro Martelletto <pedro@ambientworks.net>
2006-07-25 14:15:50 +00:00
John Baldwin
c2de792e32 Update comment. 2006-07-18 22:29:54 +00:00
John Baldwin
fe78538353 Lock the smb share before doing a 'put' on it in smbfs_unmount().
Tested by:	"Jiawei Ye" <leafy7382 at gmail>
2006-07-17 16:13:42 +00:00
Poul-Henning Kamp
9c499ad92f Remove the NDEVFSINO and NDEVFSOVERFLOW options which no longer exists in
DEVFS.

Remove the opt_devfs.h file now that it is empty.
2006-07-17 09:07:02 +00:00
Stephan Uphoff
56eeb277cb Add vnode interlocking to devfs.
This prevents race conditions that can cause pagefaults or devfs
to use arbitrary vnodes.

MFC after:	1 week
2006-07-12 20:25:35 +00:00
John Baldwin
c1cccebe8b Add a kern_close() so that the ABIs can close a file descriptor w/o having
to populate a close_args struct and change some of the places that do.
2006-07-08 20:03:39 +00:00
Robert Watson
be54a5eeb3 Remove unneeded mac.h include.
MFC after:	3 days
2006-07-06 13:25:01 +00:00
Robert Watson
2551d4f66e Remove now unneeded opt_mac.h and mac.h includes.
MFC after:	3 days
2006-07-06 13:24:22 +00:00
Robert Watson
83ff52a7f3 Use #include "", not #include <> for opt_foo.h.
MFC after:	3 days
2006-07-06 13:22:08 +00:00
Alexander Leidinger
85646f7eb1 Correctly calculate a buffer length. It was off by one so a read() returned
one byte less than needed.

This is a RELENG_x_y candidate, since it fixes a problem with Oracle 10.

Noticed by:	Dmitry Ganenko <dima@apk-inform.com>
Testcase by:	Dmitry Ganenko <dima@apk-inform.com>
Reviewed by:	des
Submitted by:	rdivacky
Sponsored by:	Google SoC 2006
MFC after:	1 week
2006-06-27 20:21:38 +00:00
Scott Long
09e19031ab Fix a memory leak and a nested 'for' loop in the spare table handling.
Submitted by: Pedro Martelletto
2006-06-26 03:21:19 +00:00
Guy Helmer
3266c22854 Upon further review, DES prefers this change over that in revision 1.13
to resolve the directory access problem for processes with P_SUGID flag
set.

Suggested by: des
2006-06-05 16:41:27 +00:00
Craig Rodrigues
829b898c7c mount_msdosfs.c:
- remove call to getmntopts(), and just pass -o options to
    nmount().  This removes some confusion as to what options
    msdosfs can parse, by pushing the responsibility of option parsing
    to the VFS and FS specific code in the kernel.

msdosfs_vfsops.c:
  - add "force" and "sync" to msdosfs_opts.  They used to be specified
    in mount_msdosfs.c, so move them here.  It's not clear whethere these
    options should be placed into global_opts in vfs_mount.c or not.

Motivated by:	marcus
2006-06-01 02:25:00 +00:00
Colin Percival
72f6a0fa7a Enable inadvertantly disabled "securenet" access controls in ypserv. [1]
Correct a bug in the handling of backslash characters in smbfs which can
allow an attacker to escape from a chroot(2). [2]

Security:	FreeBSD-SA-06:15.ypserv [1]
Security:	FreeBSD-SA-06:16.smbfs [2]
2006-05-31 22:32:22 +00:00
Craig Rodrigues
05c0f5c1e2 Remove incorrect null_checkexp() routine. This
will allow the NFS server to call vfs_stdcheckexp() on the exported nullfs
filesystem, not the underlying filesystem being nullfs mounted.
If the lower filesystem was not NFS exported, then the NFS exported
null filesystem would not work.

Pointed out by:	scottl
PR:		kern/87906
MFC after:	1 week
2006-05-28 22:45:52 +00:00
Craig Rodrigues
ebbf93fd4c Modify MNT_UPDATE behavior for nullfs so that it does not
return EOPNOTSUPP if an "export" parameter was passed in.
This should allow nullfs mounts to be NFS exported.

PR:		kern/87906
MFC after:	1 week
2006-05-28 20:09:18 +00:00
Craig Rodrigues
23badd1016 Remove calls to vfs_export() for exporting a filesystem for NFS mounting
from individual filesystems.  Call it instead in vfs_mount.c,
after we call VFS_MOUNT() for a specific filesystem.
2006-05-26 01:21:51 +00:00
Craig Rodrigues
5eb304a91a Remove calls to vfs_export() for exporting a filesystem for NFS mounting
from individual filesystems.  Call it instead in vfs_mount.c,
after we call VFS_MOUNT() for a specific filesystem.
2006-05-26 00:32:21 +00:00
Stephan Uphoff
6c1b7d16c2 Call vm_object_page_clean() with the object lock held.
Submitted by:	kensmith@
Reviewed by:	mohans@
MFC after:	6 days
2006-05-25 17:16:11 +00:00
Stephan Uphoff
dcf67e65d2 Do not set B_NOCACHE on buffers when releasing them in flushbuflist().
If B_NOCACHE is set the pages of vm backed buffers will be invalidated.
However clean buffers can be backed by dirty VM pages so invalidating them
can lead to data loss.
Add support for flush dirty page in the data invalidation function
of some network file systems.

This fixes data losses during vnode recycling (and other code paths
using invalbuf(*,V_SAVE,*,*)) for data written using an mmaped file.

Collaborative effort by: jhb@,mohans@,peter@,ps@,ups@
Reviewed by:	tegge@
MFC after:	7 days
2006-05-25 01:00:35 +00:00
Guy Helmer
e06dbd3229 Revision 1.4 set access for all sensitive files in /proc/<PID> to mode 0
if a process's uid or gid has changed, but the /proc/<PID> directory
itself was also set to mode 0.  Assuming this doesn't open any
security holes, open access to the /proc/<PID> directory for users
other than root to read or search the directory.

Reviewed by:	des (back in February)
MFC after:	3 weeks
2006-05-24 14:03:51 +00:00
Poul-Henning Kamp
c40da00ca3 Since DELAY() was moved, most <machine/clock.h> #includes have been
unnecessary.
2006-05-16 14:37:58 +00:00
Kelly Yancey
c9ad8a67af Restore the ability to mount procfs and fdescfs filesystems via the
mount(2) system call:

  * Add cmount hook to fdescfs and pseudofs (and, by extension, procfs and
    linprocfs).  This (mostly) restores the ability to mount these
    filesystems using the old mount(2) system call (see below for the
    rest of the fix).

  * Remove not-NULL check for the data argument from the mount(2) entry
    point.  Per the mount(2) man page, it is up to the individual
    filesystem being mounted to verify data.  Or, in the case of procfs,
    etc. the filesystem is free to ignore the data parameter if it does
    not use it.  Enforcing data to be not-NULL in the mount(2) system call
    entry point prevented passing NULL to filesystems which ignored the
    data pointer value.  Apparently, passing NULL was common practice
    in such cases, as even our own mount_std(8) used to do it in the
    pre-nmount(2) world.

All userland programs in the tree were converted to nmount(2) long ago,
but I've found at least one external program which broke due to this
(presumably unintentional) mount(2) API change.  One could argue that
external programs should also be converted to nmount(2), but then there
isn't much point in keeping the mount(2) interface for backward
compatibility if it isn't backward compatible.
2006-05-15 19:42:10 +00:00
Pawel Jakub Dawidek
00a480ac5c Remove unused prototypes. 2006-04-12 12:17:29 +00:00
Jeff Roberson
23b77994f2 - Add a bogus vhold/vdrop around vgone() in devfs_revoke. Without this
the vnode is never recycled.  It is bogus because the reference really
   should be associated with the devfs dirent.
2006-03-31 23:37:29 +00:00
Tor Egge
87f0769a57 Call vn_start_write() before locking vnode. 2006-03-19 20:45:06 +00:00
Robert Watson
eca7e73743 Add a_fdidx to comment prototype for fifo_open().
MFC after:	3 days
Submitted by:	Kostik Belousov <kostikbel at gmail dot com>
2006-03-15 10:15:35 +00:00
Robert Watson
945a519a23 If fifo_open() is called with a negative file descriptor, return EINVAL
rather than panicking later.  This can occur if the kernel calls
vn_open() on a fifo, as there will be no associated file descriptor,
and therefore the file descriptor operations cannot be modified to
point to the fifo operation set.

MFC after:	3 days
Reported by:	Martin <nakal at nurfuerspam dot de>
PR:		94278
2006-03-14 19:29:45 +00:00
Joerg Wunsch
f7d5a5328f When encountering a ISO_SUSP_CFLAG_ROOT element in Rock Ridge
processing, this actually means there's a double slash recorded in the
symbolic link's path name.  We used to start over from / then, which
caused link targets like ../../bsdi.1.0/include//pathnames.h to be
interpreted as /pathnahes.h.  This is both contradictionary to our
conventional slash interpretation, as well as potentially dangerous.

The right thing to do is (obviously) to just ignore that element.

bde once pointed out that mistake when he noticed it on the
4.4BSD-Lite2 CD-ROM, and asked me for help.

Reviewed by:	bde (about half a year ago)
MFC after:	3 days
2006-03-13 22:32:33 +00:00
Jeff Roberson
4bf5133b1f - Define a null_getwritemount to get the mount-point for the lower
filesystem so that nullfs doesn't permit you to circumvent snapshots.

Discussed with:		tegge
Sponsored by:		Isilon Systems, Inc.
2006-03-12 04:58:18 +00:00
Kris Kennaway
7d65872fff Correct the vnode locking in fdescfs.
PR:		kern/93905
Submitted by:	Kostik Belousov <kostikbel@gmail.com>
Reviewed by:	jeff
MFC After:	1 week
2006-02-28 00:05:44 +00:00
Yaroslav Tykhiy
82967ff0b8 CODA_COMPAT_5 may not be defined unconditionally in the coda5 module.
Otherwise a kernel build would break in the coda5 module if the main
kernel conf file enabled CODA_COMPAT_5, too.  Redefined symbols are
strictly disallowed by -Werror.

To overcome this issue, introduce a different symbol indicating coda5
build, CODA5_MODULE, and translate it to CODA_COMPAT_5 appropriately
in /sys/coda/coda.h.

MFC after:	3 days
2006-02-27 12:04:13 +00:00
John Baldwin
06ad42b2f7 Close some races between procfs/ptrace and exit(2):
- Reorder the events in exit(2) slightly so that we trigger the S_EXIT
  stop event earlier.  After we have signalled that, we set P_WEXIT and
  then wait for any processes with a hold on the vmspace via PHOLD to
  release it.  PHOLD now KASSERT()'s that P_WEXIT is clear when it is
  invoked, and PRELE now does a wakeup if P_WEXIT is set and p_lock drops
  to zero.
- Change proc_rwmem() to require that the processing read from has its
  vmspace held via PHOLD by the caller and get rid of all the junk to
  screw around with the vmspace reference count as we no longer need it.
- In ptrace() and pseudofs(), treat a process with P_WEXIT set as if it
  doesn't exist.
- Only do one PHOLD in kern_ptrace() now, and do it earlier so it covers
  FIX_SSTEP() (since on alpha at least this can end up calling proc_rwmem()
  to clear an earlier single-step simualted via a breakpoint).  We only
  do one to avoid races.  Also, by making the EINVAL error for unknown
  requests be part of the default: case in the switch, the various
  switch cases can now just break out to return which removes a _lot_ of
  duplicated PRELE and proc unlocks, etc.  Also, it fixes at least one bug
  where a LWP ptrace command could return EINVAL with the proc lock still
  held.
- Changed the locking for ptrace_single_step(), ptrace_set_pc(), and
  ptrace_clear_single_step() to always be called with the proc lock
  held (it was a mixed bag previously).  Alpha and arm have to drop
  the lock while the mess around with breakpoints, but other archs
  avoid extra lock release/acquires in ptrace().  I did have to fix a
  couple of other consumers in kern_kse and a few other places to
  hold the proc lock and PHOLD.

Tested by:	ps (1 mostly, but some bits of 2-4 as well)
MFC after:	1 week
2006-02-22 18:57:50 +00:00
John Baldwin
f8e3eeb519 Change pfs_visible() to optionally return a pointer to the process
associated with the passed in pfs_node.  If it does return a pointer, it
keeps the process locked.  This allows a lot of places that were calling
pfind() again right after pfs_visible() to not have to do that and avoids
races since we don't drop the proc lock just to turn around and lock it
again.  This will become more important with future changes to fix races
between procfs/ptrace and exit(2).  Also, removed a duplicate pfs_visible()
call in pfs_getextattr().

Reviewed by:	des
MFC after:	1 week
2006-02-22 17:24:54 +00:00
John Baldwin
7a61c1a3cb Hold the proc lock while calling proc_sstep() since the function asserts
it and remove a PRELE() that didn't have a matching PHOLD().  The calling
code already has a PHOLD anyway.

MFC after:	1 week
2006-02-22 17:20:37 +00:00
Jeff Roberson
f50b03bfd6 - We must hold a reference to a vnode before calling vgone() otherwise
it may not be removed from the freelist.

MFC After:	1 week
Found by:	kris
2006-02-22 09:05:40 +00:00
Jeff Roberson
f5cacb3964 - spell VOP_LOCK(vp, LK_RELEASE... VOP_UNLOCK(vp,... so that asserts in
vop_lock_post do not trigger.
 - Rearrange null_inactive to null_hashrem earlier so there is no chance
   of finding the null node on the hash list after the locks have been
   switched.
 - We should never have a NULL lowervp in null_reclaim() so there is
   no need to handle this situation.  panic instead.

MFC After:	1 week
2006-02-22 06:17:31 +00:00
Jeff Roberson
9c12e63100 - Assert that the lowervp is locked in null_hashget().
- Simplify the logic dealing with recycled vnodes in null_hashget() and
   null_hashins().  Since we hold the lower node locked in both cases
   the null node can not be undergoing recycling unless reclaim somehow
   called null_nodeget().  The logic that was in place was not safe and
   was essentially dead code.

MFC After:	1 week
2006-02-22 06:15:12 +00:00
Jeff Roberson
578abc8e54 - Deadfs should not use the std GETWRITEMOUNT routine. Add one that always
returns NULL.

MFC After:	1 week
2006-02-22 06:11:59 +00:00
John Baldwin
ccabcacb30 Correctly set MNTK_MPSAFE flag from the lower vnode's mount rather than
always turning it on along with any flags set in the lower mount.

Tested by:	kris
Reviewed by:	jeff
MFC after:	3 days
2006-02-10 18:06:49 +00:00
Jeff Roberson
fbf586bd40 - No need to WANTPARENT when we're just going to vrele it in a deadlock
prone way later.

Reported by:	kkenn
MFC After:	3 days
2006-02-07 11:31:32 +00:00
Will Andrews
937a238777 Make UDF endian-safe.
Submitted by:	Pedro Martelletto <pedro@ambientworks.net> (via scottl)
Tested on:	sparc64
2006-02-03 15:25:52 +00:00
Jeff Roberson
89b0e10910 - Reorder calls to vrele() after calls to vput() when the vrele is a
directory.  vrele() may lock the passed vnode, which in these cases would
   give an invalid lock order of child -> parent.  These situations are
   deadlock prone although do not typically deadlock because the vrele
   is typically not releasing the last reference to the vnode.  Users of
   vrele must consider it as a call to vn_lock() and order it appropriately.

MFC After: 	1 week
Sponsored by:	Isilon Systems, Inc.
Tested by:	kkenn
2006-02-01 00:25:26 +00:00
Jeff Roberson
3b77d80cdd - Remove a stale comment. This function was rewritten to be SMP safe some
time ago.

Sponsored by:	Isilon Systems, Inc.
2006-01-30 08:24:14 +00:00
Tom Rhodes
9fc31f8a5f Update incorrect comments here, there should not be a call to panic()
over fs corruption.

Discussed with:	alfred, phk
2006-01-23 17:45:57 +00:00
Max Khon
710a9accfe Do not assume that `char direntry::deExtension[3]' starts right after
`char direntry::deName[8]' and access deExtension[] explicitly.

Found by:	Coverity Prevent(tm)
CID:		350, 351, 352
2006-01-22 21:09:38 +00:00
Robert Watson
0bdfeca765 Convert last four functions in coda_vnops.c to ANSI C function
declarations.  I knew I would get to fix something in Coda
eventually.

MFC after:	1 week
2006-01-21 19:51:47 +00:00
Alfred Perlstein
92e73f5711 I ran into an nfs client panic a couple of times in a row over the
last few days.  I tracked it down to the fact that nfs_reclaim()
is setting vp->v_data to NULL _before_ calling vnode_destroy_object().
After silence from the mailing list I checked further and discovered
that ufs_reclaim() is unique among FreeBSD filesystems for calling
vnode_destroy_object() early, long before tossing v_data or much
of anything else, for that matter.  The rest, including NFS, appear
to be identical, as if they were just clones of one original routine.

The enclosed patch fixes all file systems in essentially the same
way, by moving the call to vnode_destroy_object() to early in the
routine (before the call to vfs_hash_remove(), if any).  I have
only tested NFS, but I've now run for over eighteen hours with the
patch where I wouldn't get past four or five without it.

Submitted by: Frank Mayhar
Requested by: Mohan Srinivasan
MFC After: 1 week
2006-01-17 17:29:03 +00:00
Tor Egge
82be0a5a24 Add marker vnodes to ensure that all vnodes associated with the mount point are
iterated over when using MNT_VNODE_FOREACH.

Reviewed by:	truckman
2006-01-09 20:42:19 +00:00
Maxim Konovalov
036cd12a8d o Fix typo in the define: s/MRAK_INT_GEN/MARK_INT_GEN/. The typo
was harmless because the define is not used in coda_vfsops.c.

Submitted by:	Hugo Meiland
2006-01-09 18:07:06 +00:00
Maxim Konovalov
98a95f61fa o Typo in the debug message: s/skiped/skipped.
PR:		kern/91346
Submitted by:	Gavin Atkinson
2006-01-05 13:39:23 +00:00
Robert Watson
8f0d99d790 When returning EIO from DEVFSIO_RADD ioctl, drop the exclusive rule
lock.  Otherwise the system comes to a rather sudden and grinding
halt.

MFC after:	1 week
2006-01-03 09:49:10 +00:00
Tom Rhodes
09c00166e4 Make tv_sec a time_t on all platforms but alpha. Brings us more in line with
POSIX.  This also makes the struct correct we ever implement an i386-time64
architecture.  Not that we need too.

Reviewed by:	imp, brooks
Approved by:	njl (acpica), des (no objects, touches procfs)
Tested with:	make universe
2005-12-24 22:22:17 +00:00
Dag-Erling Smørgrav
0430a5e289 Eradicate caddr_t from the VFS API. 2005-12-14 00:49:52 +00:00
Tai-hwa Liang
8bfc230455 Recent nmount(2) adoption in mount_smbfs(8) did not flag the "long" option
since mount_smbfs(8) assumed long name mounting by default unless "-n long"
was explicitly specified.

Rather than supplying a "long" option in mount_smbfs(8), this commit brings
back the original behaviour by associating SMBFS_MOUNT_NO_LONG with the
"nolong" option.  This should fix the broken long file names on smbfs people
observed recently.

Reported by:	Vladimir Grebenschikov <vova at fbsd dot ru>
Reviewed by:	phk
Tested by:	Slawa Olhovchenkov <slw at zxy dot spb dot ru>
2005-12-05 19:05:06 +00:00
Ruslan Ermilov
342ed5d948 Fix -Wundef warnings found when compiling i386 LINT, GENERIC and
custom kernels.
2005-12-05 11:58:35 +00:00
Ruslan Ermilov
3238c6bd33 Fix -Wundef from compiling the amd64 LINT. 2005-12-04 10:06:06 +00:00
Ruslan Ermilov
f4e9888107 Fix -Wundef. 2005-12-04 02:12:43 +00:00
Boris Popov
cc518d3b67 Fix interaction with Windows 2000/XP based servers:
If the complete reply on the TRANS2_FIND_FIRST2 request fits exactly
into one responce packet, then next call to TRANS2_FIND_NEXT2 will return
zero entries and server will close current transaction.  To avoid
subsequent errors we should not perform FIND_CLOSE2 request.

PR:		kern/78953
Submitted by:	Jim Carroll
2005-11-22 07:13:00 +00:00
Craig Rodrigues
d75b2048db Properly parse the nowin95 mount option.
Tested by:	Rainer Hurling <rhurlin at gwdg dot de>
2005-11-19 16:38:39 +00:00
Craig Rodrigues
4ab125739b Add "shortnames" and "longnames" mount options which are
synonyms for "shortname" and "longname" mount options.  The old
(before nmount()) mount_msdosfs program accepted "shortnames" and "longnames",
but the kernel nmount() checked for "shortname" and "longname".
So, make the kernel accept "shortnames", "longnames", "shortname", "longname"
for forwards and backwarsd compatibility.

Discovered by:	Rainer Hurling <rhurlin at gwdg dot de>
2005-11-18 22:34:31 +00:00
Craig Rodrigues
43fa5bf534 - Add errmsg to the list of smbfs mount options.
- Use vfs_mount_error() to propagate smbfs mount errors back to userspace.

Reviewed by:	bp (smbfs maintainer)
2005-11-16 02:26:25 +00:00
Doug White
16e35dcc39 This is a workaround for a complicated issue involving VFS cookies and devfs.
The PR and patch have the details. The ultimate fix requires architectural
changes and clarifications to the VFS API, but this will prevent the system
from panicking when someone does "ls /dev" while running in a shell under the
linuxulator.

This issue affects HEAD and RELENG_6 only.

PR:		88249
Submitted by:	"Devon H. O'Dell" <dodell@ixsystems.com>
MFC after:	3 days
2005-11-09 22:03:50 +00:00
Robert Watson
5bb84bc84b Normalize a significant number of kernel malloc type names:
- Prefer '_' to ' ', as it results in more easily parsed results in
  memory monitoring tools such as vmstat.

- Remove punctuation that is incompatible with using memory type names
  as file names, such as '/' characters.

- Disambiguate some collisions by adding subsystem prefixes to some
  memory types.

- Generally prefer lower case to upper case.

- If the same type is defined in multiple architecture directories,
  attempt to use the same name in additional cases.

Not all instances were caught in this change, so more work is required to
finish this conversion.  Similar changes are required for UMA zone names.
2005-10-31 15:41:29 +00:00
Poul-Henning Kamp
3b72f38b5e Use correct cirteria for determining which directory entries we can
purge right away and which we merely can hide.

Beaten into my skull by:	kris
2005-10-18 20:21:25 +00:00
Dag-Erling Smørgrav
a92fef8afc Implement the full range of ISO9660 number conversion routines in iso.h.
MFC after:	2 weeks
2005-10-18 13:35:08 +00:00
Craig Rodrigues
c583f369a7 Unconditionally mount a CD9660 filesystem as read-only, instead of
returning EROFS if we forget to mount it as read-only.
2005-10-17 03:29:53 +00:00
Craig Rodrigues
b137e1c8ba Use the actual sector size of the media instead of hard-coding it to 2048.
This eliminates KASSERTs in GEOM if we accidentally mount an audio CD
as a cd9660 filesystem.
2005-10-17 03:27:35 +00:00
Craig Rodrigues
073833a420 Unconditionally mount a UDF filesystem as read-only, instead of
returning an EROFS if we forget to mount it as read-only.
2005-10-17 03:07:36 +00:00
Florent Thoumie
86391603da - Fix typo.
Approved by:	ssouhlal
MFC after:	1 week
2005-10-17 00:04:35 +00:00
Don Lewis
8bcc0d3f95 Update nwfs_lookup() to match the current cache_lookup() API.
cache_lookup() has returned a ref'ed and locked vnode since
vfs_cache.c:1.96, dated Tue Mar 29 12:59:06 2005 UTC.  This change
is similar to the change made to smbfs_lookup() in smbfs_vnops.c:1.58.

Tested by:	"Antony Mawer" ant AT mawer.org
MFC after:	2 weeks
2005-10-16 21:54:35 +00:00
Kris Kennaway
3554cddbfa Reflect mpsafety of the underlying filesystem in the nullfs image.
I benchmarked this by simultaneously extracting 4 large tarballs (basically
world images) on a 4-processor AMD64 system, in a malloc-backed md.

With this patch, system time was reduced by 43%, and wall clock time by 33%.

Submitted by:	jeff
MFC after: 	1 week
2005-10-16 21:45:25 +00:00
Don Lewis
d31c91fbcf Apply the same fix to a potential race in the ISDOTDOT code in
cd9660_lookup() that was used to fix an actual race in ufs_lookup.c:1.78.
This is not currently a hazard, but the bug would be activated by
marking cd9660 as MPSAFE.

Requested by:	bde
2005-10-16 21:41:54 +00:00
Yaroslav Tykhiy
10d645b7e5 In preparation for making the modules actually use opt_*.h files
provided in the kernel build directory, fix modules that were
failing to build this way due to not quite correct kernel option
usage.  In particular:

ng_mppc.c uses two complementary options, both of which are listed
in sys/conf/files.  Ideally, there should be a separate option for
including ng_mppc.c in kernel build, but now only
NETGRAPH_MPPC_ENCRYPTION is usable anyway, the other one requires
proprietary files.

nwfs and smbfs were trying to ensure they were built with proper
network components, but the check was rather questionable.

Discussed with:	ru
2005-10-14 23:17:45 +00:00
David Xu
9104847f21 1. Change prototype of trapsignal and sendsig to use ksiginfo_t *, most
changes in MD code are trivial, before this change, trapsignal and
   sendsig use discrete parameters, now they uses member fields of
   ksiginfo_t structure. For sendsig, this change allows us to pass
   POSIX realtime signal value to user code.

2. Remove cpu_thread_siginfo, it is no longer needed because we now always
   generate ksiginfo_t data and feed it to libpthread.

3. Add p_sigqueue to proc structure to hold shared signals which were
   blocked by all threads in the proc.

4. Add td_sigqueue to thread structure to hold all signals delivered to
   thread.

5. i386 and amd64 now return POSIX standard si_code, other arches will
   be fixed.

6. In this sigqueue implementation, pending signal set is kept as before,
   an extra siginfo list holds additional siginfo_t data for signals.
   kernel code uses psignal() still behavior as before, it won't be failed
   even under memory pressure, only exception is when deleting a signal,
   we should call sigqueue_delete to remove signal from sigqueue but
   not SIGDELSET. Current there is no kernel code will deliver a signal
   with additional data, so kernel should be as stable as before,
   a ksiginfo can carry more information, for example, allow signal to
   be delivered but throw away siginfo data if memory is not enough.
   SIGKILL and SIGSTOP have fast path in sigqueue_add, because they can
   not be caught or masked.
   The sigqueue() syscall allows user code to queue a signal to target
   process, if resource is unavailable, EAGAIN will be returned as
   specification said.
   Just before thread exits, signal queue memory will be freed by
   sigqueue_flush.
   Current, all signals are allowed to be queued, not only realtime signals.

Earlier patch reviewed by: jhb, deischen
Tested on: i386, amd64
2005-10-14 12:43:47 +00:00
Craig Rodrigues
a3d7f575c0 - Do not hardcode the bsize to a sectorsize of 2048, even though
the UDF specification specifies a logical sectorsize of 2048.
  Instead, get it from GEOM.
- When reading the UDF Anchor Volume Descriptor, use the logical
  sectorsize of 2048 when calculating the offset to read from, but
  use the actual sectorsize to determine how much to read.

- works with reading a DVD disk and a DVD disk image file via mdconfig
- correctly returns EINVAL if we try to mount_udf an audio CD, instead
  of panicking inside GEOM when INVARIANTS is set
2005-10-09 04:45:33 +00:00
Pawel Jakub Dawidek
8597a1c5b2 We don't need 'imp' here. 2005-10-07 10:30:47 +00:00
Robert Watson
2affdbee3e Second attempt at a work-around for fifo-related socket panics during
make -j with high levels of parallelism: acquire Giant in fifo I/O
routines.

Discussed with:	ups
MFC after:	3 days
2005-10-01 20:15:41 +00:00
Poul-Henning Kamp
73a2c3a32e The NWFS code in RELENG_6 is broken due to a typo in
sys/fs/nwfs/nwfs_vfsop= s.c, introduced with the conversion to
nmount with revision 1.38. This causes mount_nwfs to fail with
the error message:

  mount_nwfs: mount error: /mnt/netware: syserr = No such file or directo=
ry

This is caused by a typo on line 178, which specifies "nwfw_args"
rather than "nwfs_args".

Submitted by:	Antony Mawer <gnats@mawer.org>
Fat fingers:	phk
PR:		86757
MFC:		3 days
2005-09-30 18:21:05 +00:00
Peter Edwards
20c5ba3685 Remove checks for BOOTSIG[23] from FAT32 bootblocks.
There seems to be very little documentary evidence outside this
implementation to suggest a these checks are neccessary, and more
than one camera-formatted flash disk fails the check, but mounts
successfully on most other systems.

Reviewed By: bde@
2005-09-29 14:09:46 +00:00
Robert Watson
a0e81bce69 Back out fifo_vnops.c:1.127, which introduced an sx lock around I/O on
a fifo.  While this did indeed close the race, confirming suspicions
about the nature of the problem, it causes difficulties with blocking
I/O on fifos.

Discussed with:		ups
Also spotted by:	Peter Holm <peter at holm dot cc>
2005-09-27 16:45:22 +00:00
Robert Watson
454c3d13be Assert v_fifoinfo is non-NULL in fifo_close() in order to catch
non-conforming cases sooner.

MFC after:	3 days
Reported by:	Peter Holm <peter at holm dot cc>
2005-09-26 08:17:03 +00:00
Robert Watson
ee47648770 Lock the read socket receive buffer when frobbing the sb_state flag on
that socket during open, not the write socket receive buffer.  This
might explain clearing of the sb_state SB_LOCK flag seen occasionally
in soreceive() on fifos.

MFC after:	3 days
Spotted by:	ups
2005-09-25 19:52:09 +00:00
Poul-Henning Kamp
e515ee7832 Make rule zero really magical, that way we don't have to do anything
when we mount and get zero cost if no rules are used in a mountpoint.

Add code to deref rules on unmount.

Switch from SLIST to TAILQ.

Drop SYSINIT, use SX_SYSINIT and static initializer of TAILQ instead.

Drop goto, a break will do.

Reduce double pointers to single pointers.

Combine reaping and destroying rulesets.

Avoid memory leaks in a some error cases.
2005-09-24 07:03:09 +00:00
Robert Watson
5d3df5cc1b For reasons of consistency (and necessity), assert an exclusive vnode
lock on the fifo vnode in fifo_open(): we rely on the vnode lock to
serialize access to v_fifoinfo.

MFC after:	3 days
2005-09-23 12:39:51 +00:00
Robert Watson
7028887eac Add fi_sx, an sx lock to serialize I/O operations on the socket pair
underlying the POSIX fifo implementation.  In 6.x/7.x, fifo access is
moved from the VFS layer, where it was serialized using the vnode
lock, to the file descriptor layer, where access is protected by a
reference count but not serialized.  This exposed socket buffer
locking to high levels of parallelism in specific fifo workloads, such
as make -j 32, which expose as yet unresolved socket buffer bugs.

fi_sx re-adds serialization about the read and write routines,
although not paths that simply test socket buffer mbuf queue state,
such as the poll and kqueue methods.  This restores the extra locking
cost previously present in some cases, but is an effective workaround
for the instability that has been experienced.  This workaround should
be removed once the bug in socket buffer handling has been fixed.

Reported by:	kris, jhb, Julien Gabel <jpeg at thilelli dot net>,
		Peter Holm <peter at holm dot cc>, others
MFC after:	3 days
2005-09-22 10:51:12 +00:00
Poul-Henning Kamp
e606a3c63e Rewamp DEVFS internals pretty severely [1].
Give DEVFS a proper inode called struct cdev_priv.  It is important
to keep in mind that this "inode" is shared between all DEVFS
mountpoints, therefore it is protected by the global device mutex.

Link the cdev_priv's into a list, protected by the global device
mutex.  Keep track of each cdev_priv's state with a flag bit and
of references from mountpoints with a dedicated usecount.

Reap the benefits of much improved kernel memory allocator and the
generally better defined device driver APIs to get rid of the tables
of pointers + serial numbers, their overflow tables,  the atomics
to muck about in them and all the trouble that resulted in.

This makes RAM the only limit on how many devices we can have.

The cdev_priv is actually a super struct containing the normal cdev
as the "public" part, and therefore allocation and freeing has moved
to devfs_devs.c from kern_conf.c.

The overall responsibility is (to be) split such that kern/kern_conf.c
is the stuff that deals with drivers and struct cdev and fs/devfs
handles filesystems and struct cdev_priv and their private liason
exposed only in devfs_int.h.

Move the inode number from cdev to cdev_priv and allocate inode
numbers properly with unr.  Local dirents in the mountpoints
(directories, symlinks) allocate inodes from the same pool to
guarantee against overlaps.

Various other fields are going to migrate from cdev to cdev_priv
in the future in order to hide them.  A few fields may migrate
from devfs_dirent to cdev_priv as well.

Protect the DEVFS mountpoint with an sx lock instead of lockmgr,
this lock also protects the directory tree of the mountpoint.

Give each mountpoint a unique integer index, allocated with unr.
Use it into an array of devfs_dirent pointers in each cdev_priv.
Initially the array points to a single element also inside cdev_priv,
but as more devfs instances are mounted, the array is extended with
malloc(9) as necessary when the filesystem populates its directory
tree.

Retire the cdev alias lists, the cdev_priv now know about all the
relevant devfs_dirents (and their vnodes) and devfs_revoke() will
pick them up from there.  We still spelunk into other mountpoints
and fondle their data without 100% good locking.  It may make better
sense to vector the revoke event into the tty code and there do a
destroy_dev/make_dev on the tty's devices, but that's for further
study.

Lots of shuffling of stuff and churn of bits for no good reason[2].

XXX: There is still nothing preventing the dev_clone EVENTHANDLER
from being invoked at the same time in two devfs mountpoints.  It
is not obvious what the best course of action is here.

XXX: comment out an if statement that lost its body, until I can
find out what should go there so it doesn't do damage in the meantime.

XXX: Leave in a few extra malloc types and KASSERTS to help track
down any remaining issues.

Much testing provided by:		Kris
Much confusion caused by (races in):	md(4)

[1] You are not supposed to understand anything past this point.

[2] This line should simplify life for the peanut gallery.
2005-09-19 19:56:48 +00:00
Robert Watson
526e258d3a Assert that (vp) is locked in fifo_close(), since we rely on the
exclusive vnode lock to synchronize the reference counts on struct
fifoinfo.

MFC after:	3 days
2005-09-18 10:44:50 +00:00
Poul-Henning Kamp
59307b0dfe Don't attempt to recurse lockmgr, it doesn't like it. 2005-09-15 21:16:43 +00:00
Alexander Kabaev
d11c07ba56 Handle a race condition where NULLFS vnode can be cleaned while threads
can still be asleep waiting for lowervp lock.

Tested by:	kkenn
Discussed with: ssouhlal, jeffr
2005-09-15 19:21:26 +00:00
Robert Watson
ca17bccaa1 The socket pointers in fifoinfo are not permitted to be NULL, so
don't check if they are, it just confuses the fifo code more.

MFC after:	3 days
2005-09-15 15:45:34 +00:00
Poul-Henning Kamp
214c8ff0e4 Various minor polishing. 2005-09-15 10:28:19 +00:00
Poul-Henning Kamp
6556102dcb Protect the devfs rule internal global lists with a sx lock, the per
mount locks are not enough.  Finer granularity (x)locking could be
implemented, but I prefer to keep it simple for now.
2005-09-15 08:50:16 +00:00
Poul-Henning Kamp
ab32e95296 Absolve devfs_rule.c from locking responsibility and call it with
all necessary locking held.
2005-09-15 08:36:37 +00:00
Poul-Henning Kamp
5e080af41f Close a race which could result in unwarranted "ruleset %d already
running" panics.

Previously, recursion through the "include" feature was prevented by
marking each ruleset as "running" when applied.  This doesn't work for
the case where two DEVFS instances try to apply the same ruleset at
the same time.

Instead introduce the sysctl vfs.devfs.rule_depth (default == 1) which
limits how many levels of "include" we will traverse.

Be aware that traversal of "include" is recursive and kernel stack
size is limited.

MFC:	after 3 days
2005-09-15 06:57:28 +00:00
Robert Watson
447bbaa2cf Trim down now (believed to be) unused fifo_ioctl() and
fifo_kqfilter() VOP implementations, since they in theory are used
only on open file descriptors, in which case the ioctls are via
fifo_ioctl_f() and kqueue requests are via fifo_kqfilter_f().
Generate warnings if they are entered for now.  These printf()
calls should become panic() calls.

Annotate and re-implement fifo_ioctl_f(): don't arbitrarily
forward ioctls to the socket layer, only forward the ones we
explicitly support for fifos.  In the case of FIONREAD, don't
forward the request to the write socket on a read-write fifo, or
the read result is overwritten.  Annotate a nasty case for the
undefined POSIX O_RDWR on fifos, in which failure of the second
ioctl will result in the socket pair being in an inconsistent
state.

Assert copyright as I find myself rewriting non-trivial parts of
fifofs.

MFC after:	3 days
2005-09-13 17:46:48 +00:00
Robert Watson
8a22e151be As a result of kqueue locking work, socket buffer locks will always
be held when entering a kqueue filter for fifos via a socket buffer
event: as such, assert the lock unconditionally rather than acquiring
it conditionall.

MFC after:	3 days
2005-09-13 10:39:24 +00:00
Robert Watson
db7a6c2f43 Annotate two issues:
1) fifo_kqfilter() is not actually ever used, it likely should be GC'd.

2) fifo_kqfilter_f() doesn't implement EVFILT_VNODE, so detecting events
   on the underlying vnode for a fifo no longer works (it did in 4.x).
   Likely, fifo_kqfilter_f() should forward the request to the VFS using
   fp->f_vnode, which would work once fifo_kqfilter() was detached from
   the vnode operation vector (removing the fifo override).

Discussed with:	phk
2005-09-13 09:23:22 +00:00
Robert Watson
88f39e8e95 Introduce no-op nosup fifo kqueue filter and detach routine, which are
used when a read filter is requested on a write-only fifo descriptor, or
a write filter is requested on a read-only fifo descriptor.  This
permits the filters to be registered, but never raises the event, which
causes kqueue behavior for fifos to more closely match similar semantics
for poll and select, which permit testing for the condition even though
the condition will never be raised, and is consistent with POSIX's notion
that a fifo has identical semantics to a one-way IPC channel created
using pipe() on most operating systems.

The fifo regression test suite can now run to completion on HEAD without
errors.

MFC after:	3 days
2005-09-12 19:59:12 +00:00
Robert Watson
48afebb83d When a request is made to register a filter on a fifo that doesn't
apply to the fifo (i.e., not EVFILT_READ or EVFILT_WRITE), reject
it as EINVAL, not by returning 1 (EPERM).

MFC after:	3 days
2005-09-12 18:07:49 +00:00
Robert Watson
114538d85b Remove DFLAG_SEEKABLE from fifo file descriptors: fifos are not seekable
according to POSIX, not to mention the fact that it doesn't make sense
(and hence isn't really implemented).  This causes the fifo_misc
regression test to succeed.
2005-09-12 12:15:12 +00:00
Robert Watson
6dd84b0bdc Only poll the fifo for read events if the fifo is attached to a readable
file descriptor.  Otherwise, the read end of a fifo might return that it
is writable (which it isn't).

Only poll the fifo for write events if the fifo attached to a writable
file descriptor.  Otherwise, the write end of a fifo might return that
it is readable (which it isn't).

In the event that a file is FREAD|FWRITE (which is allowed by POSIX, but
has undefined behavior), we poll for both.

MFC after:	3 days
2005-09-12 10:16:18 +00:00
Robert Watson
845e8e827b After going to some trouble to identify only the write-related events
to poll the write socket for, the fifo polling code proceeded to poll
for the complete set of events.  Use 'levents' instead of 'events' as
the argument to poll, and only poll the write socket if there is
interest in write events.

MFC after:	3 days
2005-09-12 10:13:15 +00:00