freebsd-skq

Author	SHA1	Message	Date
kib	df9f3d2faa	Fix locking for f_offset, vn_read() and vn_write() cases only, for now. It seems that intended locking protocol for struct file f_offset field was as follows: f_offset should always be changed under the vnode lock (except fcntl(2) and lseek(2) did not followed the rules). Since read(2) uses shared vnode lock, FOFFSET_LOCKED block is additionally taken to serialize shared vnode lock owners. This was broken first by enabling shared lock on writes, then by fadvise changes, which moved f_offset assigned from under vnode lock, and last by vn_io_fault() doing chunked i/o. More, due to uio_offset not yet valid in vn_io_fault(), the range lock for reads was taken on the wrong region. Change the locking for f_offset to always use FOFFSET_LOCKED block, which is placed before rangelocks in the lock order. Extract foffset_lock() and foffset_unlock() functions which implements FOFFSET_LOCKED lock, and consistently lock f_offset with it in the vn_io_fault() both for reads and writes, even if MNTK_NO_IOPF flag is not set for the vnode mount. Indicate that f_offset is already valid for vn_read() and vn_write() calls from vn_io_fault() with FOF_OFFSET flag, and assert that all callers of vn_read() and vn_write() follow this protocol. Extract get_advice() function to calculate the POSIX_FADV_XXX value for the i/o region, and use it were appropriate. Reviewed by: jhb Tested by: pho MFC after: 2 weeks	2012-06-21 09:19:41 +00:00
jhb	571562fffb	Further refine the implementation of POSIX_FADV_NOREUSE. First, extend the changes in r230782 to better handle the common case of using NOREUSE with sequential reads. A NOREUSE file descriptor will now track the last implicit DONTNEED request it made as a result of a NOREUSE read. If a subsequent NOREUSE read is adjacent to the previous range, it will apply the DONTNEED request to the entire range of both the previous read and the current read. The effect is that each read of a file accessed sequentially will apply the DONTNEED request to the entire range that has been read. This allows NOREUSE to properly handle misaligned reads by flushing each buffer to cache once it has been completely read. Second, apply the same changes made to read(2) by r230782 and this change to writes. This provides much better performance in the sequential write case as it allows writes to still be clustered. It also provides much better performance for misaligned writes. It does mean that NOREUSE will be generally ineffective for non-sequential writes as the current implementation relies on a future NOREUSE write's implicit DONTNEED request to flush the dirty buffer from the current write. MFC after: 2 weeks	2012-06-19 18:42:24 +00:00
jhb	176ddf31c3	Split the second half of vn_open_cred() (after a vnode has been found via a lookup or created via VOP_CREATE()) into a new vn_open_vnode() function and use this function in fhopen() instead of duplicating code from vn_open_cred() directly. Tested by: pho Reviewed by: kib MFC after: 2 weeks	2012-06-08 18:32:09 +00:00
kib	472101ae88	Add a knob to disable vn_io_fault. MFC after: 1 month	2012-06-03 16:19:37 +00:00
kib	d977144831	Count and export the number of prefaulting happen. MFC after: 1 month	2012-06-03 16:06:56 +00:00
kib	080f2e89d9	vn_io_fault() is a facility to prevent page faults while filesystems perform copyin/copyout of the file data into the usermode buffer. Typical filesystem hold vnode lock and some buffer locks over the VOP_READ() and VOP_WRITE() operations, and since page fault handler may need to recurse into VFS to get the page content, a deadlock is possible. The facility works by disabling page faults handling for the current thread and attempting to execute i/o while allowing uiomove() to access the usermode mapping of the i/o buffer. If all buffer pages are resident, uiomove() is successfull and request is finished. If EFAULT is returned from uiomove(), the pages backing i/o buffer are faulted in and held, and the copyin/out is performed using uiomove_fromphys() over the held pages for the second attempt of VOP call. Since pages are hold in chunks to prevent large i/o requests from starving free pages pool, and since vnode lock is only taken for i/o over the current chunk, the vnode lock no longer protect atomicity of the whole i/o request. Use newly added rangelocks to provide the required atomicity of i/o regardind other i/o and truncations. Filesystems need to explicitely opt-in into the scheme, by setting the MNTK_NO_IOPF struct mount flag, and optionally by using vn_io_fault_uiomove(9) helper which takes care of calling uiomove() or converting uio into request for uiomove_fromphys(). Reviewed by: bf (comments), mdf, pjd (previous version) Tested by: pho Tested by: flo, Gustau P?rez <gperez entel upc edu> (previous version) MFC after: 2 months	2012-05-30 16:42:08 +00:00
kib	08dbe8fa01	Add a vn_bmap_seekhole(9) vnode helper which can be used by any filesystem which supports VOP_BMAP(9) to implement SEEK_HOLE/SEEK_DATA commands for lseek(2). MFC after: 2 weeks	2012-05-26 05:28:47 +00:00
jhb	19feaba08b	Add KTR_VFS traces to track modifications to a vnode's writecount.	2012-03-08 20:27:20 +00:00
kib	80ae8fe82c	Fix found places where uio_resid is truncated to int. Add the sysctl debug.iosize_max_clamp, enabled by default. Setting the sysctl to zero allows to perform the SSIZE_MAX-sized i/o requests from the usermode. Discussed with: bde, das (previous versions) MFC after: 1 month	2012-02-21 01:05:12 +00:00
jhb	d372b841c6	Refine the implementation of POSIX_FADV_NOREUSE for the read(2) case such that instead of using direct I/O it allows read-ahead similar to POSIX_FADV_NORMAL, but invokes VOP_ADVISE(POSIX_FADV_DONTNEED) after the read(2) has completed to purge just-read data. The write(2) path continues to use direct I/O for POSIX_FADV_NOREUSE for now. Note that NOREUSE works optimally if an application reads and writes full fs blocks.	2012-01-30 19:35:15 +00:00
jhb	78c075174e	Add the posix_fadvise(2) system call. It is somewhat similar to madvise(2) except that it operates on a file descriptor instead of a memory region. It is currently only supported on regular files. Just as with madvise(2), the advice given to posix_fadvise(2) can be divided into two types. The first type provide hints about data access patterns and are used in the file read and write routines to modify the I/O flags passed down to VOP_READ() and VOP_WRITE(). These modes are thus filesystem independent. Note that to ease implementation (and since this API is only advisory anyway), only a single non-normal range is allowed per file descriptor. The second type of hints are used to hint to the OS that data will or will not be used. These hints are implemented via a new VOP_ADVISE(). A default implementation is provided which does nothing for the WILLNEED request and attempts to move any clean pages to the cache page queue for the DONTNEED request. This latter case required two other changes. First, a new V_CLEANONLY flag was added to vinvalbuf(). This requests vinvalbuf() to only flush clean buffers for the vnode from the buffer cache and to not remove any backing pages from the vnode. This is used to ensure clean pages are not wired into the buffer cache before attempting to move them to the cache page queue. The second change adds a new vm_object_page_cache() method. This method is somewhat similar to vm_object_page_remove() except that instead of freeing each page in the specified range, it attempts to move clean pages to the cache queue if possible. To preserve the ABI of struct file, the f_cdevpriv pointer is now reused in a union to point to the currently active advice region if one is present for regular files. Reviewed by: jilles, kib, arch@ Approved by: re (kib) MFC after: 1 month	2011-11-04 04:02:50 +00:00
kmacy	99851f359e	In order to maximize the re-usability of kernel code in user space this patch modifies makesyscalls.sh to prefix all of the non-compatibility calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel entry points and all places in the code that use them. It also fixes an additional name space collision between the kernel function psignal and the libc function of the same name by renaming the kernel psignal kern_psignal(). By introducing this change now we will ease future MFCs that change syscalls. Reviewed by: rwatson Approved by: re (bz)	2011-09-16 13:58:51 +00:00
mm	e104c96f01	Generalize ffs_pages_remove() into vn_pages_remove(). Remove mapped pages for all dataset vnodes in zfs_rezget() using new vn_pages_remove() to fix mmapped files changed by zfs rollback or zfs receive -F. PR: kern/160035, kern/156933 Reviewed by: kib, pjd Approved by: re (kib) MFC after: 1 week	2011-08-25 08:17:39 +00:00
kib	011f42054d	Add the fo_chown and fo_chmod methods to struct fileops and use them to implement fchown(2) and fchmod(2) support for several file types that previously lacked it. Add MAC entries for chown/chmod done on posix shared memory and (old) in-kernel posix semaphores. Based on the submission by: glebius Reviewed by: rwatson Approved by: re (bz)	2011-08-16 20:07:47 +00:00
mdf	bbbc4c5455	Use a name instead of a magic number for kern_yield(9) when the priority should not change. Fetch the td_user_pri under the thread lock. This is probably not necessary but a magic number also seems preferable to knowing the implementation details here. Requested by: Jason Behmer < jason DOT behmer AT isilon DOT com >	2011-05-13 05:27:58 +00:00
mdf	33ee365b55	Based on discussions on the svn-src mailing list, rework r218195: - entirely eliminate some calls to uio_yeild() as being unnecessary, such as in a sysctl handler. - move should_yield() and maybe_yield() to kern_synch.c and move the prototypes from sys/uio.h to sys/proc.h - add a slightly more generic kern_yield() that can replace the functionality of uio_yield(). - replace source uses of uio_yield() with the functional equivalent, or in some cases do not change the thread priority when switching. - fix a logic inversion bug in vlrureclaim(), pointed out by bde@. - instead of using the per-cpu last switched ticks, use a per thread variable for should_yield(). With PREEMPTION, the only reasonable use of this is to determine if a lock has been held a long time and relinquish it. Without PREEMPTION, this is essentially the same as the per-cpu variable.	2011-02-08 00:16:36 +00:00
pjd	6ff3cc04b0	Correct arguments order.	2010-06-26 21:44:45 +00:00
trasz	f26ccb52af	Avoid overflow. Submitted by: bde@	2010-05-06 18:52:41 +00:00
trasz	e6f92048fa	Style fixes and removal of unneeded variable. Submitted by: bde@	2010-05-06 18:43:19 +00:00
trasz	402e3baade	Move checking against RLIMIT_FSIZE into one place, vn_rlimit_fsize(). Reviewed by: kib	2010-05-05 16:44:25 +00:00
avg	91cd4478c2	vn_stat: take into account va_blocksize when setting st_blksize As currently st_blksize is always PAGE_SIZE, it is playing safe to not use any smaller value. For some cases this might not be optimal, but at least nothing should get broken. Generally I don't expect this commit to change much for the following reasons (in case of VREG, VDIR): - application I/O and physical I/O are sufficiently decoupled by filesystem code, buffer cache code, cluster and read-ahead logic - not all applications use st_blksize as a hint, some use f_iosize, some use fixed block sizes I expect writes to the middle of files on ZFS to benefit the most from this change. Silence from: fs@ MFC after: 2 weeks	2010-04-03 08:39:00 +00:00
ed	4f08ecd7ed	Rename st_timespec fields to st_tim for POSIX 2008 compliance. A nice thing about POSIX 2008 is that it finally standardizes a way to obtain file access/modification/change times in sub-second precision, namely using struct timespec, which we already have for a very long time. Unfortunately POSIX uses different names. This commit adds compatibility macros, so existing code should still build properly. Also change all source code in the kernel to work without any of the compatibility macros. This makes it all a less ambiguous. I am also renaming st_birthtime to st_birthtim, even though it was a local extension anyway. It seems Cygwin also has a st_birthtim.	2010-03-28 13:13:22 +00:00
ed	6156503467	Actually make O_DIRECTORY work. According to POSIX open() must return ENOTDIR when the path name does not refer to a path name. Change vn_open() to respect this flag. This also simplifies the Linuxolator a bit.	2010-03-21 20:43:23 +00:00
trasz	110b737502	Don't add VAPPEND if the file is not being opened for writing. Note that this only affects cases where open(2) is being used improperly - i.e. when the user specifies O_APPEND without O_WRONLY or O_RDWR. Reviewed by: rwatson	2009-12-08 20:47:10 +00:00
trasz	faaa27ec8e	Revert r198874, pending further discussion.	2009-11-04 07:14:16 +00:00
trasz	7ff2a9de0b	Make sure we don't end up with VAPPEND without VWRITE, if someone calls open(2) like this: open(..., O_APPEND).	2009-11-04 06:48:34 +00:00
delphij	79f2f8c774	Add two new fcntls to enable/disable read-ahead: - F_READAHEAD: specify the amount for sequential access. The amount is specified in bytes and is rounded up to nearest block size. - F_RDAHEAD: Darwin compatible version that use 128KB as the sequential access size. A third argument of zero disables the read-ahead behavior. Please note that the read-ahead amount is also constrainted by sysctl variable, vfs.read_max, which may need to be raised in order to better utilize this feature. Thanks Igor Sysoev for proposing the feature and submitting the original version, and kib@ for his valuable comments. Submitted by: Igor Sysoev <is rambler-co ru> Reviewed by: kib@ MFC after: 1 month	2009-09-28 16:59:47 +00:00
kib	32ffadd900	Fix mount reference leak when V_XSLEEP is specified to vn_start_write(). Submitted by: tegge	2009-09-01 12:05:39 +00:00
kib	eab33b2665	Make the mnt_writeopcount and mnt_secondary_writes counters, used by the suspension code, not greater then mnt_ref reference counter value. Increment mnt_ref together with write counter in vn_start_write()/ vn_start_secondary_write(), releasing in vn_finished_write/vn_finished_secondary_write(). Since r186197, unmount code requires that no writers occured after all references are expired. We still could get write counter incremented for freed or reused struct mount, but it seems to be innocent, since corresponding vnode should be referenced and reclaimed then. Reported by: pho (last half a year), erwin Reviewed by: attilio Tested by: pho, erwin MFC after: 1 week	2009-08-31 10:20:52 +00:00
kib	350f96b4bf	In vn_vget_ino() and their inline equivalents, mnt_ref() the mount point around the sequence that drop vnode lock and then busies the mount point. Not having vlocked node or direct reference to the mp allows for the forced unmount to proceed, making mp unmounted or reused. Tested by: pho Reviewed by: jeff Approved by: re (kensmith) MFC after: 2 weeks	2009-07-02 18:02:55 +00:00
kib	171c37f865	Add another flags argument to vn_open_cred. Use it to specify that some vn_open_cred invocations shall not audit namei path. In particular, specify VN_OPEN_NOAUDIT for dotdot lookup performed by default implementation of vop_vptocnp, and for the open done for core file. vn_fullpath is called from the audit code, and vn_open there need to disable audit to avoid infinite recursion. Core file is created on return to user mode, that, in particular, happens during syscall return. The creation of the core file is audited by direct calls, and we do not want to overwrite audit information for syscall. Reported, reviewed and tested by: rwatson	2009-06-21 13:41:32 +00:00
ps	ae099c88f7	Simply shared vnode locking and extend it to also include fsync. Also, in vop_write, no longer assert for exclusive locks on the vnode. Reviewed by: jhb, kmacy, jeffr	2009-06-08 21:23:54 +00:00
rwatson	f4934662e5	Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC and used in a large number of files, but also because an increasing number of incorrect uses of MAC calls were sneaking in due to copy-and-paste of MAC-aware code without the associated opt_mac.h include. Discussed with: pjd	2009-06-05 14:55:22 +00:00
ps	46a77b2b36	When checking for shared writes, use the struct mount returned from vn_start_write. Reviewed by: jhb	2009-06-04 16:50:03 +00:00
ps	4505aa56ed	Support shared vnode locks for write operations when the offset is provided on filesystems that support it. This really improves mysql + innodb performance on ZFS. Reviewed by: jhb, kmacy, jeffr	2009-06-04 16:18:07 +00:00
attilio	1dcb84131b	Remove the thread argument from the FSD (File-System Dependent) parts of the VFS. Now all the VFS_* functions and relating parts don't want the context as long as it always refers to curthread. In some points, in particular when dealing with VOPs and functions living in the same namespace (eg. vflush) which still need to be converted, pass curthread explicitly in order to retain the old behaviour. Such loose ends will be fixed ASAP. While here fix a bug: now, UFS_EXTATTR can be compiled alone without the UFS_EXTATTR_AUTOSTART option. VFS KPI is heavilly changed by this commit so thirdy parts modules needs to be recompiled. Bump __FreeBSD_version in order to signal such situation.	2009-05-11 15:33:26 +00:00
kib	78e147b4e4	Eliminate the loop and the call to pause(9) in vfs_vget_ino(). If vfs_busy(MBF_NOWAIT) failed, unlock the vnode and sleep in vfs_busy(). Suggested and reviewed by: jeff Tested by: pho MFC after: 3 weeks	2009-05-07 18:14:21 +00:00
kmacy	de9c351c80	- use a shared lock for reads - remove stale comment Reviewed by: jeffr	2009-04-13 23:09:44 +00:00
rwatson	fba90f2e03	Remove VOP_LEASE and supporting functions. This hasn't been used since the removal of NQNFS, but was left in in case it was required for NFSv4. Since our new NFSv4 client and server can't use it for their requirements, GC the old mechanism, as well as other unused lease- related code and interfaces. Due to its impact on kernel programming and binary interfaces, this change should not be MFC'd. Proposed by: jeff Reviewed by: jeff Discussed with: rmacklem, zach loafman @ isilon	2009-04-10 10:52:19 +00:00
jhb	520acdaf69	Add a new internal mount flag (MNTK_EXTENDED_SHARED) to indicate that a filesystem supports additional operations using shared vnode locks. Currently this is used to enable shared locks for open() and close() of read-only file descriptors. - When an ISOPEN namei() request is performed with LOCKSHARED, use a shared vnode lock for the leaf vnode only if the mount point has the extended shared flag set. - Set LOCKSHARED in vn_open_cred() for requests that specify O_RDONLY but not O_CREAT. - Use a shared vnode lock around VOP_CLOSE() if the file was opened with O_RDONLY and the mountpoint has the extended shared flag set. - Adjust md(4) to upgrade the vnode lock on the vnode it gets back from vn_open() since it now may only have a shared vnode lock. - Don't enable shared vnode locks on FIFO vnodes in ZFS and UFS since FIFO's require exclusive vnode locks for their open() and close() routines. (My recent MPSAFE patches for UDF and cd9660 already included this change.) - Enable extended shared operations on UFS, cd9660, and UDF. Submitted by: ups Reviewed by: pjd (ZFS bits) MFC after: 1 month	2009-03-11 14:13:47 +00:00
kib	44eef9d9bb	Move the code from ufs_lookup.c used to do dotdot lookup, into the helper function. It is supposed to be useful for any filesystem that has to unlock dvp to walk to the ".." entry in lookup routine. Requested by: jhb Tested by: pho MFC after: 1 month	2009-01-21 14:51:38 +00:00
pjd	881f5f6bef	Improve KASSERT() call a bit: - Print flags in hex. - Note that flags can be fine and panic can be due unexpected error condition. - Remove redundant new line character. Eventhough panic message excess 80 characters keep it in one line so it is easier to grep.	2008-11-29 12:40:14 +00:00
kib	ec4c199459	Revert r184118. There is actually a code in the kernel, for instance in kern_unlinkat(), that expects that vn_start_write() actually fills the mp even when the call failed. As Tor noted, that pattern relies on the the type stability of the mount points, as well as that suspended mount points are never freed and V_XSLEEP is always passed to vn_start_write() when called on a freed mount point. Reported by: stass Reviewed by: tegge PR: 123768	2008-11-16 21:56:29 +00:00
jhb	ee8312c8bb	Use shared vnode locks instead of exclusive vnode locks for the access(), chdir(), chroot(), eaccess(), fpathconf(), fstat(), fstatfs(), lseek() (when figuring out the current size of the file in the SEEK_END case), pathconf(), readlink(), and statfs() system calls. Submitted by: ups (mostly) Tested by: pho MFC after: 1 month	2008-11-03 20:31:00 +00:00
attilio	e1f493235e	Improve VFS locking: - Implement real draining for vfs consumers by not relying on the mnt_lock and using instead a refcount in order to keep track of lock requesters. - Due to the change above, remove the mnt_lock lockmgr because it is now useless. - Due to the change above, vfs_busy() is no more linked to a lockmgr. Change so its KPI by removing the interlock argument and defining 2 new flags for it: MBF_NOWAIT which basically replaces the LK_NOWAIT of the old version (which was unlinked from the lockmgr alredy) and MBF_MNTLSTLOCK which provides the ability to drop the mountlist_mtx once the mnt interlock is held (ability still desired by most consumers). - The stub used into vfs_mount_destroy(), that allows to override the mnt_ref if running for more than 3 seconds, make it totally useless. Remove it as it was thought to work into older versions. If a problem of "refcount held never going away" should appear, we will need to fix properly instead than trust on such hackish solution. - Fix a bug where returning (with an error) from dounmount() was still leaving the MNTK_MWAIT flag on even if it the waiters were actually woken up. Just a place in vfs_mount_destroy() is left because it is going to recycle the structure in any case, so it doesn't matter. - Remove the markercnt refcount as it is useless. This patch modifies VFS ABI and breaks KPI for vfs_busy() so manpages and __FreeBSD_version will be modified accordingly. Discussed with: kib Tested by: pho	2008-11-02 10:15:42 +00:00
trasz	0ad8692247	Introduce accmode_t. This is required for NFSv4 ACLs - it will be neccessary to add more V* constants, and the variables changed by this patch were often being assigned to mode_t variables, which is 16 bit. Approved by: rwatson (mentor)	2008-10-28 13:44:11 +00:00
kib	cc3d7dc928	Change vn_start_write() to clear *mpp on all failures when non-NULL vp is supplied, since vm_pageout_scan() expects it to be cleared on error. Submitted by: tegge PR: 123768 MFC after: 1 week	2008-10-21 09:55:49 +00:00
kib	e4785f6af4	Assert that v_holdcnt is non-zero before entering lockmgr in vn_lock and ffs_lock. This cannot catch situations where holdcnt is incremented not by curthread, but I think it is useful. Reviewed by: tegge, attilio Tested by: pho MFC after: 2 weeks	2008-10-20 10:11:33 +00:00
kib	570463af80	Initialize va_rdev to NODEV and va_fsid to VNOVAL before the VOP_GETATTR() call in vn_stat(). Thus if a file system doesn't initialize those fields in VOP_GETATTR() they will have a sane default value. Submitted by: Jaakko Heinonen <jh saunalahti fi> Discussed on: freebsd-fs MFC after: 1 month	2008-09-20 19:48:24 +00:00
kib	c6232cabbe	Initialize birthtime fields in vn_stat() to prevent stat(2) from returning uninitialized birthtime. Most file systems don't initialize birthtime properly in their VOP_GETTATTR(). Submitted by: Jaakko Heinonen <jh saunalahti fi> Reviewed by: bde Discussed on: freebsd-fs MFC after: 1 month	2008-09-20 19:43:22 +00:00

1 2 3 4 5 ...

315 Commits