Commit Graph

506 Commits

Author SHA1 Message Date
Alan Somers
0cfc1ef38d FIOBMAP2: inline vn_ioc_bmap2
Reported by:	kib
Reviewed by:	kib
MFC after:	2 weeks
MFC-With:	349238
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20783
2019-06-27 23:39:06 +00:00
Alan Somers
4f53d57e8c fcntl: style changes to r349248
Reported by:	bde
MFC after:	2 weeks
MFC-With:	349248
Sponsored by:	The FreeBSD Foundation
2019-06-25 19:44:22 +00:00
Alan Somers
38b06f8ac4 fcntl: fix overflow when setting F_READAHEAD
VOP_READ and VOP_WRITE take the seqcount in blocks in a 16-bit field.
However, fcntl allows you to set the seqcount in bytes to any nonnegative
31-bit value. The result can be a 16-bit overflow, which will be
sign-extended in functions like ffs_read. Fix this by sanitizing the
argument in kern_fcntl. As a matter of policy, limit to IO_SEQMAX rather
than INT16_MAX.

Also, fifos have overloaded the f_seqcount field for a completely different
purpose ever since r238936.  Formalize that by using a union type.

Reviewed by:	cem
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20710
2019-06-20 23:07:20 +00:00
Alan Somers
d49b446bfb Add FIOBMAP2 ioctl
This ioctl exposes VOP_BMAP information to userland. It can be used by
programs like fragmentation analyzers and optimized cp implementations. But
I'm using it to test fusefs's VOP_BMAP implementation. The "2" in the name
distinguishes it from the similar but incompatible FIBMAP ioctls in NetBSD
and Linux.  FIOBMAP2 differs from FIBMAP in that it uses a 64-bit block
number instead of 32-bit, and it also returns runp and runb.

Reviewed by:	mckusick
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20705
2019-06-20 14:13:10 +00:00
Conrad Meyer
daec92844e Include ktr.h in more compilation units
Similar to r348026, exhaustive search for uses of CTRn() and cross reference
ktr.h includes.  Where it was obvious that an OS compat header of some kind
included ktr.h indirectly, .c files were left alone.  Some of these files
clearly got ktr.h via header pollution in some scenarios, or tinderbox would
not be passing prior to this revision, but go ahead and explicitly include it
in files using it anyway.

Like r348026, these CUs did not show up in tinderbox as missing the include.

Reported by:	peterj (arm64/mp_machdep.c)
X-MFC-With:	r347984
Sponsored by:	Dell EMC Isilon
2019-05-21 20:38:48 +00:00
Konstantin Belousov
78022527bb Switch to use shared vnode locks for text files during image activation.
kern_execve() locks text vnode exclusive to be able to set and clear
VV_TEXT flag. VV_TEXT is mutually exclusive with the v_writecount > 0
condition.

The change removes VV_TEXT, replacing it with the condition
v_writecount <= -1, and puts v_writecount under the vnode interlock.
Each text reference decrements v_writecount.  To clear the text
reference when the segment is unmapped, it is recorded in the
vm_map_entry backed by the text file as MAP_ENTRY_VN_TEXT flag, and
v_writecount is incremented on the map entry removal

The operations like VOP_ADD_WRITECOUNT() and VOP_SET_TEXT() check that
v_writecount does not contradict the desired change.  vn_writecheck()
is now racy and its use was eliminated everywhere except access.
Atomic check for writeability and increment of v_writecount is
performed by the VOP.  vn_truncate() now increments v_writecount
around VOP_SETATTR() call, lack of which is arguably a bug on its own.

nullfs bypasses v_writecount to the lower vnode always, so nullfs
vnode has its own v_writecount correct, and lower vnode gets all
references, since object->handle is always lower vnode.

On the text vnode' vm object dealloc, the v_writecount value is reset
to zero, and deadfs vop_unset_text short-circuit the operation.
Reclamation of lowervp always reclaims all nullfs vnodes referencing
lowervp first, so no stray references are left.

Reviewed by:	markj, trasz
Tested by:	mjg, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
Differential revision:	https://reviews.freebsd.org/D19923
2019-05-05 11:20:43 +00:00
Konstantin Belousov
ae90941431 Add vn_fsync_buf().
Provide a convenience function to avoid the hack with filling fake
struct vop_fsync_args and then calling vop_stdfsync().

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-04-09 20:20:04 +00:00
Konstantin Belousov
4ae3f5a7fd vn_vmap_seekhole(): align running offset to the block boundary.
Otherwise we might miss the last iteration where EOF appears below
unaligned noff.

Reported and reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D19811
2019-04-05 16:14:16 +00:00
Konstantin Belousov
4f77f48884 Implement O_BENEATH and AT_BENEATH.
Flags prevent open(2) and *at(2) vfs syscalls name lookup from
escaping the starting directory.  Supposedly the interface is similar
to the same proposed Linux flags.

Reviewed by:	jilles (code, previous version of manpages), 0mp (manpages)
Discussed with:	allanjude, emaste, jonathan
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D17547
2018-10-25 22:16:34 +00:00
Gordon Tetlow
c9e562b188 Correct ELF header parsing code to prevent invalid ELF sections from
disclosing memory.

Submitted by:	markj
Reported by:	Thomas Barabosch, Fraunhofer FKIE
Approved by:	re (implicit)
Approved by:	so
Security:	FreeBSD-SA-18:12.elf
Security:	CVE-2018-6924
Sponsored by:	The FreeBSD Foundation
2018-09-12 04:57:34 +00:00
Ed Maste
b8d908b71e ANSIfy sys/kern 2018-06-01 13:26:45 +00:00
Matt Macy
b99aa0fbb2 hwpmc: don't enter epoch section across mmap hook 2018-05-29 18:03:48 +00:00
Konstantin Belousov
161bf65f8a In vn_io_fault1(), reduce the scope where pagefaults are disabled.
Most important for the future use, do not call
vm_fault_quick_hold_pages() with disabled pagefaults.

Reported and tested by:	pho (as part of the larger patch)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-03-24 13:13:52 +00:00
Pedro F. Giffuni
51369649b0 sys: further adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
2017-11-20 19:43:44 +00:00
Konstantin Belousov
03311f117b Use whole mnt_stat.f_fsid bits for st_dev.
Since ino64 expanded dev_t to 64bit, make VOP_GETATTR(9) provide all
bits of mnt_stat.f_fsid as va_fsid for vnodes on filesystems which use
f_fsid.  In particular, NFSv3 and sometimes NFSv4, and ZFS use this
method or reporting st_dev by stat(2).

Provide a new helper vn_fsid() to avoid duplicating code to copy
f_fsid to va_fsid.

Note that the change is mostly cosmetic.  Its motivation is to avoid
sign-extension of f_fsid[0] into 64bit dev_t value which happens after
dev_t becomes 64bit..

Reviewed by:	avg(zfs), rmacklem (nfs) (both for previous version)
Sponsored by:	The FreeBSD Foundation
2017-05-27 17:00:30 +00:00
Konstantin Belousov
6992112349 Commit the 64-bit inode project.
Extend the ino_t, dev_t, nlink_t types to 64-bit ints.  Modify
struct dirent layout to add d_off, increase the size of d_fileno
to 64-bits, increase the size of d_namlen to 16-bits, and change
the required alignment.  Increase struct statfs f_mntfromname[] and
f_mntonname[] array length MNAMELEN to 1024.

ABI breakage is mitigated by providing compatibility using versioned
symbols, ingenious use of the existing padding in structures, and
by employing other tricks.  Unfortunately, not everything can be
fixed, especially outside the base system.  For instance, third-party
APIs which pass struct stat around are broken in backward and
forward incompatible ways.

Kinfo sysctl MIBs ABI is changed in backward-compatible way, but
there is no general mechanism to handle other sysctl MIBS which
return structures where the layout has changed. It was considered
that the breakage is either in the management interfaces, where we
usually allow ABI slip, or is not important.

Struct xvnode changed layout, no compat shims are provided.

For struct xtty, dev_t tty device member was reduced to uint32_t.
It was decided that keeping ABI compat in this case is more useful
than reporting 64-bit dev_t, for the sake of pstat.

Update note: strictly follow the instructions in UPDATING.  Build
and install the new kernel with COMPAT_FREEBSD11 option enabled,
then reboot, and only then install new world.

Credits: The 64-bit inode project, also known as ino64, started life
many years ago as a project by Gleb Kurtsou (gleb).  Kirk McKusick
(mckusick) then picked up and updated the patch, and acted as a
flag-waver.  Feedback, suggestions, and discussions were carried
by Ed Maste (emaste), John Baldwin (jhb), Jilles Tjoelker (jilles),
and Rick Macklem (rmacklem).  Kris Moore (kris) performed an initial
ports investigation followed by an exp-run by Antoine Brodin (antoine).
Essential and all-embracing testing was done by Peter Holm (pho).
The heavy lifting of coordinating all these efforts and bringing the
project to completion were done by Konstantin Belousov (kib).

Sponsored by:	The FreeBSD Foundation (emaste, kib)
Differential revision:	https://reviews.freebsd.org/D10439
2017-05-23 09:29:05 +00:00
Ed Maste
3e85b721d6 Remove register keyword from sys/ and ANSIfy prototypes
A long long time ago the register keyword told the compiler to store
the corresponding variable in a CPU register, but it is not relevant
for any compiler used in the FreeBSD world today.

ANSIfy related prototypes while here.

Reviewed by:	cem, jhb
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D10193
2017-05-17 00:34:34 +00:00
Konstantin Belousov
ecc6c515ab Apply noexec mount option for mmap(PROT_EXEC).
Right now the noexec mount option disallows image activators to try
execve the files on the mount point.  Also, after r127187, noexec
also limits max_prot map entries permissions for mappings of files
from such mounts, but not the actual mapping permissions.

As result, the API behaviour is inconsistent.  The files from noexec
mount can be mapped with PROT_EXEC, but if mprotect(2) drops execution
permission, it cannot be re-enabled later.  Make this consistent
logically and aligned with behaviour of other systems, by disallowing
PROT_EXEC for mmap(2).

Note that this change only ensures aligned results from mmap(2) and
mprotect(2), it does not prevent actual code execution from files
coming from noexec mount.  Such files can always be read into
anonymous executable memory and executed from there.

Reported by:	shamaz.mazum@gmail.com
PR:	217062
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-02-19 20:51:04 +00:00
Konstantin Belousov
987ff18184 Consistently handle negative or wrapping offsets in the mmap(2) syscalls.
For regular files and posix shared memory, POSIX requires that
[offset, offset + size) range is legitimate.  At the maping time,
check that offset is not negative.  Allowing negative offsets might
expose the data that filesystem put into vm_object for internal use,
esp. due to OFF_TO_IDX() signess treatment.  Fault handler verifies
that the mapped range is valid, assuming that mmap(2) checked that
arithmetic gives no undefined results.

For device mappings, leave the semantic of negative offsets to the
driver.  Correct object page index calculation to not erronously
propagate sign.

In either case, disallow overflow of offset + size.

Update mmap(2) man page to explain the requirement of the range
validity, and behaviour when the range becomes invalid after mapping.

Reported and tested by:	royger (previous version)
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2017-02-12 21:05:44 +00:00
Konstantin Belousov
e83a71c656 Fix r313495.
The file type DTYPE_VNODE can be assigned as a fallback if VOP_OPEN()
did not initialized file type.  This is a typical code path used by
normal file systems.

Also, change error returned for inappropriate file type used for
O_EXLOCK to EOPNOTSUPP, as declared in the open(2) man page.

Reported by:	cy, dhw, Iblis Lin <iblis@hs.ntnu.edu.tw>
Tested by:	dhw
Sponsored by:	The FreeBSD Foundation
MFC after:	13 days
2017-02-10 14:49:04 +00:00
Konstantin Belousov
e628e1b919 Increase a chance of devfs_close() calling d_close cdevsw method.
If a file opened over a vnode has an advisory lock set at close,
vn_closefile() acquires additional vnode use reference to prevent
freeing the vnode in vn_close().  Side effect is that for device
vnodes, devfs_close() sees that vnode reference count is greater than
one and refuses to call d_close().  Create internal version of
vn_close() which can avoid dropping the vnode reference if needed, and
use this to execute VOP_CLOSE() without acquiring a new reference.

Note that any parallel reference to the vnode would still prevent
d_close call, if the reference is not from an opened file, e.g. due to
stat(2).

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2017-02-09 23:36:50 +00:00
Konstantin Belousov
7903b00087 Do not establish advisory locks when doing open(O_EXLOCK) or open(O_SHLOCK)
for files which do not have DTYPE_VNODE type.

Both flock(2) and fcntl(2) syscalls refuse to acquire advisory lock on
a file which type is not DTYPE_VNODE.  Do the same when lock is
requested from open(2).

Restructure the block in vn_open_vnode() which handles O_EXLOCK and
O_SHLOCK open flags to make it easier to quit its execution earlier
with an error.

Tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2017-02-09 23:35:57 +00:00
Mateusz Guzik
f1f7f1cb29 hwpmc: partially depessimize mmap handling if the module is not loaded
In particular this means the pmc sx lock is no longer taken when an
executable mapping succeeds.

MFC after:	1 week
2017-01-27 22:13:15 +00:00
Konstantin Belousov
25c6816845 More style cleanup. Use ANSI C definition for vn_closefile(). Switch
to VNASSERT in _vn_lock(), simplify messages.

Sponsored by:	The FreeBSD Foundation
X-MFC with:	r312600, r312601, r312602, r312606
2017-01-22 19:38:45 +00:00
Mateusz Guzik
eaf0969bda vfs: fix LK_RETRY logic braino in r312600 2017-01-21 20:34:20 +00:00
Mateusz Guzik
829857c893 vfs: __predict_false the need to handle F_HASLOCK
Also reorder the check with DTYPE_VNODE. Passed files are vnodes vast
majority of the time, so it is typically true.
2017-01-21 19:01:42 +00:00
Mateusz Guzik
abbc538d9a vfs: fix whitespace damage in r312600
While here wrap the previously overly long line so that it fits 80 chars.
2017-01-21 18:56:58 +00:00
Mateusz Guzik
1091fb52c1 vfs: refactor _vn_lock
Stop testing for LK_RETRY and error multiple times. Also postpone the
VI_DOOMED until after LK_RETRY was seen as it reads from the vnode.

No functional changes.
2017-01-21 18:38:16 +00:00
Ed Maste
69a2875821 Renumber license clauses in sys/kern to avoid skipping #3 2016-09-15 13:16:20 +00:00
Robert Watson
c3c0088bb0 Audit additional vnode information in the implementation of the
ftruncate(2) system call.  This was not required by the Common
Criteria, which needed only open-time audit.

MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
2016-08-20 18:51:48 +00:00
Conrad Meyer
af326ace9d devfs: Move most ioctl logic down to vnode layer
Devfs' file layer ioctl is now just a thin shim around the vnode layer.

Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D7286
2016-07-25 16:28:02 +00:00
Robert Watson
971711fb7c Call audit hooks to capture vnode attributes for three file-descriptor
method implementations: fstat(2), close(2), and poll(2).  This change
synchronises auditing here with similar auditing for VFS-specific system
calls such as stat(2) that audit more complete vnode information.

Sponsored by:	DARPA, AFRL
Approved by:	re (kib)
MFC after:	1 week
2016-07-05 16:37:01 +00:00
Konstantin Belousov
3f7ca894de Ensure that ftruncate(2) is performed synchronously when file is
opened in O_SYNC mode, at least for UFS.  This also handles
truncation, done due to the O_SYNC | O_TRUNC flags combination to
open(2), in synchronous way.

Noted by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-05-18 12:03:57 +00:00
Pedro F. Giffuni
31b6732008 sys/kern: spelling fixes.
Mostly on comments but affects some debug messages.

MFC after: 2 weeks
2016-04-29 21:54:28 +00:00
Pedro F. Giffuni
74b8d63dcc Cleanup unnecessary semicolons from the kernel.
Found with devel/coccinelle.
2016-04-10 23:07:00 +00:00
Konstantin Belousov
6adf19481c The struct file f_advice member is overlaid with the devfs f_cdevpriv
data.  If vnode bypass for devfs file failed, vn_read/vn_write are
called and might try to dereference f_advice.  Limit the accesses to
f_advice to VREG vnodes only, which is the type ensured by
posix_fadvise().

The f_advice for regular files is protected by mtxpool lock.  Recheck
that f_advice is not NULL after lock is taken.

Reported and tested by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
2016-01-22 20:35:20 +00:00
Konstantin Belousov
ce958bdefe When cleaning up from failed adv locking and checking for write, do
not call VOP_CLOSE() manually.  Instead, delegate the close to
fo_close() performed as part of the fdrop() on the file failed to
open.  For this, finish constructing file on error, in particular, set
f_vnode and f_ops.

Forcibly resetting f_ops to badfileops disabled additional cleanups
performed by fo_close() for some file types, in this case it was noted
that cdevpriv data was corrupted.  Since fo_close() call must be
enabled for some file types, it makes more sense to enable it for all
files opened through vn_open_cred().

In collaboration with:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-01-17 08:40:51 +00:00
Fabien Thomas
78e79434d2 Fix r283998 that broke mapin events for hwpmc.
Reviewed by:	jhb
Sponsored by:	Stormshield
2015-10-08 09:54:33 +00:00
Mark Johnston
3138cd3670 As a step towards the elimination of PG_CACHED pages, rework the handling
of POSIX_FADV_DONTNEED so that it causes the backing pages to be moved to
the head of the inactive queue instead of being cached.

This affects the implementation of POSIX_FADV_NOREUSE as well, since it
works by applying POSIX_FADV_DONTNEED to file ranges after they have been
read or written.  At that point the corresponding buffers may still be
dirty, so the previous implementation would coalesce successive ranges and
apply POSIX_FADV_DONTNEED to the result, ensuring that pages backing the
dirty buffers would eventually be cached.  To preserve this behaviour in an
efficient manner, this change adds a new buf flag, B_NOREUSE, which causes
the pages backing a VMIO buf to be placed at the head of the inactive queue
when the buf is released.  POSIX_FADV_NOREUSE then works by setting this
flag in bufs that underlie the specified range.

Reviewed by:	alc, kib
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D3726
2015-09-30 23:06:29 +00:00
Konstantin Belousov
9e18c9eb27 For open("name", O_DIRECTORY | O_CREAT), do not try to create the
named node, open(2) cannot create directories.  But do allow the flag
combination to succeed if the directory already exists.

Declare the open("name", O_DIRECTORY | O_CREAT | O_EXCL) always
invalid for the same reason, since open(2) cannot create directory.

Note that there is an argument that O_DIRECTORY | O_CREAT should be
invalid always, regardless of the target directory existence or
O_EXCL.  The current fix is conservative and allows the call to
succeed in the situation where it succeeded before the patch.

Reported by:	Tom Ridge <freebsd@tom-ridge.com>
Reviewed by:	rwatson
PR:	 202892
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-09-09 19:31:08 +00:00
Conrad Meyer
14bdbaf2e4 Detect badly behaved coredump note helpers
Coredump notes depend on being able to invoke dump routines twice; once
in a dry-run mode to get the size of the note, and another to actually
emit the note to the corefile.

When a note helper emits a different length section the second time
around than the length it requested the first time, the kernel produces
a corrupt coredump.

NT_PROCSTAT_FILES output length, when packing kinfo structs, is tied to
the length of filenames corresponding to vnodes in the process' fd table
via vn_fullpath.  As vnodes may move around during dump, this is racy.

So:

 - Detect badly behaved notes in putnote() and pad underfilled notes.

 - Add a fail point, debug.fail_point.fill_kinfo_vnode__random_path to
   exercise the NT_PROCSTAT_FILES corruption.  It simply picks random
   lengths to expand or truncate paths to in fo_fill_kinfo_vnode().

 - Add a sysctl, kern.coredump_pack_fileinfo, to allow users to
   disable kinfo packing for PROCSTAT_FILES notes.  This should avoid
   both FILES note corruption and truncation, even if filenames change,
   at the cost of about 1 kiB in padding bloat per open fd.  Document
   the new sysctl in core.5.

 - Fix note_procstat_files to self-limit in the 2nd pass.  Since
   sometimes this will result in a short write, pad up to our advertised
   size.  This addresses note corruption, at the risk of sometimes
   truncating the last several fd info entries.

 - Fix NT_PROCSTAT_FILES consumers libutil and libprocstat to grok the
   zero padding.

With suggestions from:	bjk, jhb, kib, wblock
Approved by:	markj (mentor)
Relnotes:	yes
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D3548
2015-09-03 20:32:10 +00:00
Konstantin Belousov
8917728875 vn_io_fault() handling of the LOR for i/o into the file-backed buffers
has observable overhead when the buffer pages are not resident or not
mapped.  The overhead comes at least from two factors, one is the
additional work needed to detect the situation, prepare and execute
the rollbacks.  Another is the consequence of the i/o splitting into
the batches of the held pages, causing filesystems see series of the
smaller i/o requests instead of the single large request.

Note that expected case of the resident i/o buffer does not expose
these issues.  Provide a prefaulting for the userspace i/o buffers,
disabled by default.  I am careful of not enabling prefaulting by
default for now, since it would be detrimental for the applications
which speculatively pass extra-large buffers of anonymous memory to
not deal with buffer sizing (if such apps exist).

Found and tested by:	bde, emaste
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-07-31 04:12:51 +00:00
Mark Johnston
5f34e93c58 Check suspendability on the mountpoint returned by VOP_GETWRITEMOUNT.
This obviates the need for a MNTK_SUSPENDABLE flag, since passthrough
filesystems like nullfs and unionfs no longer need to inherit this
information from their lower layer(s). This change also restores the
pre-r273336 behaviour of using the presence of a susp_clean VFS method to
request suspension support.

Reviewed by:	kib, mjg
Differential Revision:	https://reviews.freebsd.org/D2937
2015-07-05 22:37:33 +00:00
Mateusz Guzik
f6f6d24062 Implement lockless resource limits.
Use the same scheme implemented to manage credentials.

Code needing to look at process's credentials (as opposed to thred's) is
provided with *_proc variants of relevant functions.

Places which possibly had to take the proc lock anyway still use the proc
pointer to access limits.
2015-06-10 10:48:12 +00:00
John Baldwin
7077c42623 Add a new file operations hook for mmap operations. File type-specific
logic is now placed in the mmap hook implementation rather than requiring
it to be placed in sys/vm/vm_mmap.c.  This hook allows new file types to
support mmap() as well as potentially allowing mmap() for existing file
types that do not currently support any mapping.

The vm_mmap() function is now split up into two functions.  A new
vm_mmap_object() function handles the "back half" of vm_mmap() and accepts
a referenced VM object to map rather than a (handle, handle_type) tuple.
vm_mmap() is now reduced to converting a (handle, handle_type) tuple to a
a VM object and then calling vm_mmap_object() to handle the actual mapping.
The vm_mmap() function remains for use by other parts of the kernel
(e.g. device drivers and exec) but now only supports mapping vnodes,
character devices, and anonymous memory.

The mmap() system call invokes vm_mmap_object() directly with a NULL object
for anonymous mappings.  For mappings using a file descriptor, the
descriptors fo_mmap() hook is invoked instead.  The fo_mmap() hook is
responsible for performing type-specific checks and adjustments to
arguments as well as possibly modifying mapping parameters such as flags
or the object offset.  The fo_mmap() hook routines then call
vm_mmap_object() to handle the actual mapping.

The fo_mmap() hook is optional.  If it is not set, then fo_mmap() will
fail with ENODEV.  A fo_mmap() hook is implemented for regular files,
character devices, and shared memory objects (created via shm_open()).

While here, consistently use the VM_PROT_* constants for the vm_prot_t
type for the 'prot' variable passed to vm_mmap() and vm_mmap_object()
as well as the vm_mmap_vnode() and vm_mmap_cdev() helper routines.
Previously some places were using the mmap()-specific PROT_* constants
instead.  While this happens to work because PROT_xx == VM_PROT_xx,
using VM_PROT_* is more correct.

Differential Revision:	https://reviews.freebsd.org/D2658
Reviewed by:	alc (glanced over), kib
MFC after:	1 month
Sponsored by:	Chelsio
2015-06-04 19:41:15 +00:00
Konstantin Belousov
2db0e1f50d Add V_MNTREF flag to the vn_start_write(9) and
vn_start_secondary_write(9) functions.  The flag indicates that the
caller already owns a reference on the mount point, and the functions
can consume it.  The reference is released by vn_finished_write(9) and
vn_finished_secondary_write(9) in due course.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-05-27 09:21:47 +00:00
Craig Rodrigues
d5fec48956 Support file verification in MAC.
* Add VCREAT flag to indicate when a new file is being created
* Add VVERIFY to indicate verification is required
* Both VCREAT and VVERIFY are only passed on the MAC method vnode_check_open
  and are removed from the accmode after
* Add O_VERIFY flag to rtld open of objects
* Add 'v' flag to __sflags to set O_VERIFY flag.

Submitted by:		Steve Kiernan <stevek@juniper.net>
Obtained from:		Juniper Networks, Inc.
GitHub Pull Request:	https://github.com/freebsd/freebsd/pull/27
Relnotes:		yes
2015-04-22 01:54:25 +00:00
Konstantin Belousov
8ee9765a9d Add VN_OPEN_NAMECACHE flag for vn_open_cred(9), which requests that
the created file name was cached.  Use the flag for core dumps.

Requested by:	rpaulo
Tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-12-21 13:32:07 +00:00
Konstantin Belousov
6c21f6edb8 The VOP_LOOKUP() implementations for CREATE op do not put the name
into namecache, to avoid cache trashing when doing large operations.
E.g., tar archive extraction is not usually followed by access to many
of the files created.

Right now, each VOP_LOOKUP() implementation explicitely knowns about
this quirk and tests for both MAKEENTRY flag presence and op != CREATE
to make the call to cache_enter().  Centralize the handling of the
quirk into VFS, by deciding to cache only by MAKEENTRY flag in VOP.
VFS now sets NOCACHE flag for CREATE namei() calls.

Note that the change in semantic is backward-compatible and could be
merged to the stable branch, and is compatible with non-changed
third-party filesystems which correctly handle MAKEENTRY.

Suggested by:	Chris Torek <torek@pi-coral.com>
Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-12-18 10:01:12 +00:00
Konstantin Belousov
0061ddb3ed Only sleep interruptible while waiting for suspension end when
filesystem specified VFCF_SBDRY flag, i.e. for NFS.

There are two issues with the sleeps.  First, applications may get
unexpected EINTR from the disk i/o syscalls.  Second, interruptible
sleep allows the stop of the process, and since mount point is
referenced while thread sleeps, unmount cannot free mount point
structure' memory, blocking unmount indefinitely.

Even for NFS, it is probably only reasonable to enable PCATCH for intr
mounts, but this information is currently not available at VFS level.

Reported and tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-12-13 16:07:01 +00:00
Konstantin Belousov
5fab60a071 In vfs_write_suspend_umnt(), if suspension cannot be established, do
not forget to restore write ops count when returning the error.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-11-14 11:31:10 +00:00
Mateusz Guzik
4fce16e4c9 Provide vfs suspension support only for filesystems which need it, take
two.

nullfs and unionfs need to request suspension if underlying filesystem(s)
use it. Utilize mnt_kern_flag for this purpose.

This is a fixup for 273271.

No strong objections from: kib
Pointy hat to: mjg
MFC after:	2 weeks
2014-10-20 18:00:50 +00:00
Mateusz Guzik
020b8f17a0 Provide vfs suspension support only for filesystems which need it.
Need is expressed by providing vfs_susp_clean function in vfsops.

Differential Revision:	D952
Reviewed by:	kib (previous version)
MFC after:	2 weeks
2014-10-19 06:59:33 +00:00
Konstantin Belousov
4142462eeb Slightly reword comment. Move code, which is described by the
comment, after it.

Discussed with:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-10-04 18:51:55 +00:00
Konstantin Belousov
e3d6feceb1 Add IO_RANGELOCKED flag for vn_rdwr(9), which specifies that vnode is
not locked, but range is.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-10-04 18:28:27 +00:00
John Baldwin
9696feebe2 Add a new fo_fill_kinfo fileops method to add type-specific information to
struct kinfo_file.
- Move the various fill_*_info() methods out of kern_descrip.c and into the
  various file type implementations.
- Rework the support for kinfo_ofile to generate a suitable kinfo_file object
  for each file and then convert that to a kinfo_ofile structure rather than
  keeping a second, different set of code that directly manipulates
  type-specific file information.
- Remove the shm_path() and ksem_info() layering violations.

Differential Revision:	https://reviews.freebsd.org/D775
Reviewed by:	kib, glebius (earlier version)
2014-09-22 16:20:47 +00:00
Mateusz Guzik
037755fd15 Fix up races with f_seqcount handling.
It was possible that the kernel would overwrite user-supplied hint.

Abuse vnode lock for this purpose.

In collaboration with: kib
MFC after:	1 week
2014-08-26 08:17:22 +00:00
Konstantin Belousov
895b3782c6 Extract the code to put a filesystem into the suspended state (at the
unmount time) in the helper vfs_write_suspend_umnt().  Use it instead
of two inline copies in FFS.

Fix the bug in the FFS unmount, when suspension failed, the ufs
extattrs were not reinitialized.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-07-14 09:10:00 +00:00
Konstantin Belousov
a69452162a Generalize vn_get_ino() to allow filesystems to use custom vnode
producer, instead of hard-coding VFS_VGET().  New function, which
takes callback, is called vn_get_ino_gen(), standard callback for
vn_get_ino() is provided.

Convert inline copies of vn_get_ino() in msdosfs and cd9660 into the
uses of vn_get_ino_gen().

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-07-14 08:34:54 +00:00
Konstantin Belousov
7b81a399a4 In msdosfs_setattr(), add a check for result of the utimes(2)
permissions test, forgotten in r164033.

Refactor the permission checks for utimes(2) into vnode helper
function vn_utimes_perm(9), and simplify its code comparing with the
UFS origin, by writing the call to VOP_ACCESSX only once.  Use the
helper for UFS(5), tmpfs(5), devfs(5) and msdosfs(5).

Reported by:	bde
Reviewed by:	bde, trasz
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-06-17 07:11:00 +00:00
Konstantin Belousov
2e501b0a9e Use vn_io_fault for the writes from core dumping code. Recursing into
VM due to copyin(9) faulting while VFS locks are held is
deadlock-prone there in the same way as for the write(2) syscall.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-06-15 04:51:53 +00:00
John-Mark Gurney
6f2b769cac change td_retval into a union w/ off_t, with defines to mask the
change...  This eliminates a cast, and also forces td_retval
(often 2 32-bit registers) to be aligned so that off_t's can be
stored there on arches with strict alignment requirements like
armeb (AVILA)...  On i386, this doesn't change alignment, and on
amd64 it doesn't either, as register_t is already 64bits...

This will also prevent future breakage due to people adding additional
fields to the struct...

This gets AVILA booting a bit farther...

Reviewed by:	bde
2014-03-16 00:53:40 +00:00
Konstantin Belousov
65f05eeb3d If vn_open_vnode() succeeded in opening the vnode, but subsequent
advisory lock cannot be obtained, prevent double-close of the vnode in
vn_close() called from the fdrop(), by resetting file' f_ops methods.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-12-17 17:31:16 +00:00
Konstantin Belousov
7e14088d93 Revert back to use int for the page counts. In vn_io_fault(), the i/o
is chunked to pieces limited by integer io_hold_cnt tunable, while
vm_fault_quick_hold_pages() takes integer max_count as the upper bound.

Rearrange the checks to correctly handle overflowing address arithmetic.

Submitted by:	bde
Tested by:	pho
Discussed with:	alc
MFC after:	1 week
2013-11-20 08:45:26 +00:00
Konstantin Belousov
d005ed537c Avoid overflow for the page counts.
Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-11-12 08:47:58 +00:00
Konstantin Belousov
6272798a3f Both vn_close() and VFS_PROLOGUE() evaluate vp->v_mount twice, without
holding the vnode lock; vp->v_mount is checked first for NULL
equiality, and then dereferenced if not NULL.  If vnode is reclaimed
meantime, second dereference would still give NULL.  Change
VFS_PROLOGUE() to evaluate the mp once, convert MNTK_SHARED_WRITES and
MNTK_EXTENDED_SHARED tests into inline functions.

Reviewed by:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-11-09 20:30:13 +00:00
Konstantin Belousov
9bec6325ad When opening or closing fifo, ensure that the vnode is locked
exclusively.  Filesystems are assumed to disable shared locking for
the fifo vnode locks, but some do not.

Reported and tested by:	olgeni
Discussed with:	avg
Sponsored by:   The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (glebius)
2013-09-13 06:52:23 +00:00
Konstantin Belousov
c0a46535c4 Make the seek a method of the struct fileops.
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2013-08-21 17:36:01 +00:00
Konstantin Belousov
b1dd38f408 Restore the previous sendfile(2) behaviour on the block devices.
Provide valid .fo_sendfile method for several missed struct fileops.

Reviewed by:	glebius
Sponsored by:	The FreeBSD Foundation
2013-08-16 14:22:20 +00:00
Gleb Smirnoff
ca04d21d5f Make sendfile() a method in the struct fileops. Currently only
vnode backed file descriptors have this method implemented.

Reviewed by:	kib
Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix
2013-08-15 07:54:31 +00:00
Konstantin Belousov
cc3d8c35f5 There are several code sequences like
vfs_busy(mp);
      vfs_write_suspend(mp);
which are problematic if other thread starts unmount between two
calls.  The unmount starts a write, while vfs_write_suspend() drain
writers.  On the other hand, unmount drains busy references, causing
the deadlock.

Add a flag argument to vfs_write_suspend and require the callers of it
to specify VS_SKIP_UNMOUNT flag, when the call is performed not in the
mount path, i.e. the covered vnode is not locked.  The suspension is
not attempted if VS_SKIP_UNMOUNT is specified and unmount is in
progress.

Reported and tested by:	Andreas Longwitz <longwitz@incore.de>
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
2013-07-09 20:49:32 +00:00
John Baldwin
3d4c503cf0 Style fixes to vn_ioctl().
Suggested by:	bde
2013-05-31 16:15:22 +00:00
John Baldwin
dfa66c01ae Fix FIONREAD on regular files. The computed result was being ignored and
it was being passed down to VOP_IOCTL() where it promptly resulted in
ENOTTY due to a missing else for the past 8 years.  While here, use a
shared vnode lock while fetching the current file's size.

MFC after:	1 week
2013-05-03 19:08:58 +00:00
Matthew D Fleming
926cd204c7 Use a shared lock for VOP_GETEXTATTR, as it is a read-like operation.
MFC after:	1 week
2013-03-30 15:09:04 +00:00
Konstantin Belousov
aed5a114d7 Separate the copyright lines and the informational block by a blank line.
Requested by:	joel
MFC after:	2 weeks
2013-03-15 14:01:37 +00:00
Konstantin Belousov
5791cee883 Add my copyright for the 2012 year work, in particular vn_io_fault()
and f_offset locking.  Add required Foundation notice for r248319.

Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-03-15 12:57:30 +00:00
Konstantin Belousov
5f5f055441 Implement the helper function vn_io_fault_pgmove(), intended to use by
the filesystem VOP_READ() and VOP_WRITE() implementations in the same
way as vn_io_fault_uiomove() over the unmapped buffers.  Helper
provides the convenient wrapper over the pmap_copy_pages() for struct
uio consumers, taking care of the TDP_UIOHELD situations.

Sponsored by:	The FreeBSD Foundation
Tested by:	pho
MFC after:	2 weeks
2013-03-15 11:16:12 +00:00
Attilio Rao
89f6b8632c Switch the vm_object mutex to be a rwlock. This will enable in the
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.

The change is mostly mechanical but few notes are reported:
* The KPI changes as follow:
  - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
  - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
  - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
  - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
    (in order to avoid visibility of implementation details)
  - The read-mode operations are added:
    VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
    VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
  sys/mutex.h in consumers directly to cater its inlining functions
  using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
  consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
  the compat layer because the name clash between FreeBSD and solaris
  versions must be avoided.
  At this purpose zfs redefines the vm_object locking functions
  directly, isolating the FreeBSD components in specific compat stubs.

The KPI results heavilly broken by this commit.  Thirdy part ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).

Sponsored by:	EMC / Isilon storage division
Reviewed by:	jeff
Reviewed by:	pjd (ZFS specific review)
Discussed with:	alc
Tested by:	pho
2013-03-09 02:32:23 +00:00
Pawel Jakub Dawidek
71ac38e896 Remove unnecessary variables. 2013-03-01 21:58:56 +00:00
Sergey Kandaurov
d7ffa24831 vn_io_faults_cnt:
- use u_long consistently
- use SYSCTL_ULONG to match the type of variable

Reviewed by:	kib
MFC after:	1 week
2013-02-15 14:22:05 +00:00
Pawel Jakub Dawidek
9e2677fd6d Simplify code a bit. This is leftover after Giant removal from VFS. 2013-01-31 22:12:48 +00:00
Konstantin Belousov
ddd6b3fc33 Add flags argument to vfs_write_resume() and remove
vfs_write_resume_flags().

Sponsored by:	The FreeBSD Foundation
2013-01-11 06:08:32 +00:00
Konstantin Belousov
f99cb34c4f The process_deferred_inactive() function locks the vnodes of the ufs
mount, which means that is must not be called while the snaplock is
owned.  The vfs_write_resume(9) does call the function as the
VFS_SUSP_CLEAN() method, which is too early and falls into the region
still protected by snaplock.

Add yet another flag for the vfs_write_resume_flags() to avoid calling
suspension cleanup handler after the suspend is lifted, and use it in
the ffs_snapshot() call to vfs_write_resume.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-01-01 16:14:48 +00:00
Konstantin Belousov
91e9474552 Make it possible to atomically resume writes on the mount and account
the write start, by adding a variation of the vfs_write_resume(9)
which accepts flags.

Use the new function to prevent a deadlock between parallel suspension
and snapshotting a UFS mount.  The ffs_snapshot() code performed
vfs_write_resume() followed by vn_start_write() while owning the
snaplock.  If the suspension intervene between resume and
vn_start_write(), the deadlock occured after the suspending thread
tried to lock the snaplock, most typically during the write in the
ffs_copyonwrite().

Reported and tested by:	Andreas Longwitz <longwitz@incore.de>
Reviewed by:	mckusick
MFC after:	2 weeks
X-MFC-note:	make the vfs_write_resume(9) function a macro after the MFC,
	in HEAD
2012-12-28 23:08:30 +00:00
Pawel Jakub Dawidek
f121e3e81d - Add NOCAPCHECK flag to namei that allows lookup to work even if the process
is in capability mode.
- Add VN_OPEN_NOCAPCHECK flag for vn_open_cred() to will ne converted into
  NOCAPCHECK namei flag.

This functionality will be used to enable core dumps for sandboxed processes.

Reviewed by:	rwatson
Obtained from:	WHEEL Systems
MFC after:	2 weeks
2012-11-27 10:32:35 +00:00
Konstantin Belousov
140dedb81c The r241025 fixed the case when a binary, executed from nullfs mount,
was still possible to open for write from the lower filesystem.  There
is a symmetric situation where the binary could already has file
descriptors opened for write, but it can be executed from the nullfs
overlay.

Handle the issue by passing one v_writecount reference to the lower
vnode if nullfs vnode has non-zero v_writecount.  Note that only one
write reference can be donated, since nullfs only keeps one use
reference on the lower vnode.  Always use the lower vnode v_writecount
for the checks.

Introduce the VOP_GET_WRITECOUNT to read v_writecount, which is
currently always bypassed to the lower vnode, and VOP_ADD_WRITECOUNT
to manipulate the v_writecount value, which manages a single bypass
reference to the lower vnode.  Caling the VOPs instead of directly
accessing v_writecount provide the fix described in the previous
paragraph.

Tested by:	pho
MFC after:	3 weeks
2012-11-02 13:56:36 +00:00
Konstantin Belousov
5050aa86cf Remove the support for using non-mpsafe filesystem modules.
In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.

The VFS_VERSION is bumped to indicate the interface change which does
not result in the interface signatures changes.

Conducted and reviewed by:	attilio
Tested by:	pho
2012-10-22 17:50:54 +00:00
Konstantin Belousov
877d24ac8a Fix the mis-handling of the VV_TEXT on the nullfs vnodes.
If you have a binary on a filesystem which is also mounted over by
nullfs, you could execute the binary from the lower filesystem, or
from the nullfs mount. When executed from lower filesystem, the lower
vnode gets VV_TEXT flag set, and the file cannot be modified while the
binary is active. But, if executed as the nullfs alias, only the
nullfs vnode gets VV_TEXT set, and you still can open the lower vnode
for write.

Add a set of VOPs for the VV_TEXT query, set and clear operations,
which are correctly bypassed to lower vnode.

Tested by:	pho (previous version)
MFC after:	2 weeks
2012-09-28 11:25:02 +00:00
Pawel Jakub Dawidek
8c706ce0d0 vn_write() always expects FOF_OFFSET flag, which is asserted at the begining,
so there is no need to check for it.

Sponsored by:	FreeBSD Foundation
MFC after:	2 weeks
2012-09-25 21:31:17 +00:00
John Baldwin
e838f09cd0 Reorder the managament of advisory locks on open files so that the advisory
lock is obtained before the write count is increased during open() and the
lock is released after the write count is decreased during close().

The first change closes a race where an open() that will block with O_SHLOCK
or O_EXLOCK can increase the write count while it waits.  If the process
holding the current lock on the file then tries to call exec() on the file
it has locked, it can fail with ETXTBUSY even though the advisory lock is
preventing other threads from succesfully completeing a writable open().

The second change closes a race where a read-only open() with O_SHLOCK or
O_EXLOCK may return successfully while the write count is non-zero due to
another descriptor that had the advisory lock and was blocking the open()
still being in the process of closing.  If the process that completed the
open() then attempts to call exec() on the file it locked, it can fail with
ETXTBUSY even though the other process that held a write lock has closed
the file and released the lock.

Reviewed by:	kib
MFC after:	1 month
2012-07-31 18:25:00 +00:00
Konstantin Belousov
c5c1199c83 Extend the KPI to lock and unlock f_offset member of struct file. It
now fully encapsulates all accesses to f_offset, and extends f_offset
locking to other consumers that need it, in particular, to lseek() and
variants of getdirentries().

Ensure that on 32bit architectures f_offset, which is 64bit quantity,
always read and written under the mtxpool protection. This fixes
apparently easy to trigger race when parallel lseek()s or lseek() and
read/write could destroy file offset.

The already broken ABI emulations, including iBCS and SysV, are not
converted (yet).

Tested by:	pho
No objections from:	jhb
MFC after:    3 weeks
2012-07-02 21:01:03 +00:00
Konstantin Belousov
854c3ce7ac Fix locking for f_offset, vn_read() and vn_write() cases only, for now.
It seems that intended locking protocol for struct file f_offset field
was as follows: f_offset should always be changed under the vnode lock
(except fcntl(2) and lseek(2) did not followed the rules). Since
read(2) uses shared vnode lock, FOFFSET_LOCKED block is additionally
taken to serialize shared vnode lock owners.

This was broken first by enabling shared lock on writes, then by
fadvise changes, which moved f_offset assigned from under vnode lock,
and last by vn_io_fault() doing chunked i/o. More, due to uio_offset
not yet valid in vn_io_fault(), the range lock for reads was taken on
the wrong region.

Change the locking for f_offset to always use FOFFSET_LOCKED block,
which is placed before rangelocks in the lock order.

Extract foffset_lock() and foffset_unlock() functions which implements
FOFFSET_LOCKED lock, and consistently lock f_offset with it in the
vn_io_fault() both for reads and writes, even if MNTK_NO_IOPF flag is
not set for the vnode mount. Indicate that f_offset is already valid
for vn_read() and vn_write() calls from vn_io_fault() with FOF_OFFSET
flag, and assert that all callers of vn_read() and vn_write() follow
this protocol.

Extract get_advice() function to calculate the POSIX_FADV_XXX value
for the i/o region, and use it were appropriate.

Reviewed by:	jhb
Tested by:	pho
MFC after:	2 weeks
2012-06-21 09:19:41 +00:00
John Baldwin
cd4ecf3cd2 Further refine the implementation of POSIX_FADV_NOREUSE.
First, extend the changes in r230782 to better handle the common case
of using NOREUSE with sequential reads.  A NOREUSE file descriptor
will now track the last implicit DONTNEED request it made as a result
of a NOREUSE read.  If a subsequent NOREUSE read is adjacent to the
previous range, it will apply the DONTNEED request to the entire range
of both the previous read and the current read.  The effect is that
each read of a file accessed sequentially will apply the DONTNEED
request to the entire range that has been read.  This allows NOREUSE
to properly handle misaligned reads by flushing each buffer to cache
once it has been completely read.

Second, apply the same changes made to read(2) by r230782 and this
change to writes.  This provides much better performance in the
sequential write case as it allows writes to still be clustered.  It
also provides much better performance for misaligned writes.  It does
mean that NOREUSE will be generally ineffective for non-sequential
writes as the current implementation relies on a future NOREUSE
write's implicit DONTNEED request to flush the dirty buffer from the
current write.

MFC after:	2 weeks
2012-06-19 18:42:24 +00:00
John Baldwin
7ac1b61aac Split the second half of vn_open_cred() (after a vnode has been found via
a lookup or created via VOP_CREATE()) into a new vn_open_vnode() function
and use this function in fhopen() instead of duplicating code from
vn_open_cred() directly.

Tested by:	pho
Reviewed by:	kib
MFC after:	2 weeks
2012-06-08 18:32:09 +00:00
Konstantin Belousov
bba080854d Add a knob to disable vn_io_fault.
MFC after:	1 month
2012-06-03 16:19:37 +00:00
Konstantin Belousov
bb2f52a61d Count and export the number of prefaulting happen.
MFC after:	 1 month
2012-06-03 16:06:56 +00:00
Konstantin Belousov
41014d996a vn_io_fault() is a facility to prevent page faults while filesystems
perform copyin/copyout of the file data into the usermode
buffer. Typical filesystem hold vnode lock and some buffer locks over
the VOP_READ() and VOP_WRITE() operations, and since page fault
handler may need to recurse into VFS to get the page content, a
deadlock is possible.

The facility works by disabling page faults handling for the current
thread and attempting to execute i/o while allowing uiomove() to
access the usermode mapping of the i/o buffer. If all buffer pages are
resident, uiomove() is successfull and request is finished. If EFAULT
is returned from uiomove(), the pages backing i/o buffer are faulted
in and held, and the copyin/out is performed using uiomove_fromphys()
over the held pages for the second attempt of VOP call.

Since pages are hold in chunks to prevent large i/o requests from
starving free pages pool, and since vnode lock is only taken for
i/o over the current chunk, the vnode lock no longer protect atomicity
of the whole i/o request. Use newly added rangelocks to provide the
required atomicity of i/o regardind other i/o and truncations.

Filesystems need to explicitely opt-in into the scheme, by setting the
MNTK_NO_IOPF struct mount flag, and optionally by using
vn_io_fault_uiomove(9) helper which takes care of calling uiomove() or
converting uio into request for uiomove_fromphys().

Reviewed by:	bf (comments), mdf, pjd (previous version)
Tested by:	pho
Tested by:	flo, Gustau P?rez <gperez entel upc edu> (previous version)
MFC after:	2 months
2012-05-30 16:42:08 +00:00
Konstantin Belousov
292520f710 Add a vn_bmap_seekhole(9) vnode helper which can be used by any
filesystem which supports VOP_BMAP(9) to implement SEEK_HOLE/SEEK_DATA
commands for lseek(2).

MFC after:	2 weeks
2012-05-26 05:28:47 +00:00
John Baldwin
b47f624183 Add KTR_VFS traces to track modifications to a vnode's writecount. 2012-03-08 20:27:20 +00:00
Konstantin Belousov
526d0bd547 Fix found places where uio_resid is truncated to int.
Add the sysctl debug.iosize_max_clamp, enabled by default. Setting the
sysctl to zero allows to perform the SSIZE_MAX-sized i/o requests from
the usermode.

Discussed with:	bde, das (previous versions)
MFC after:	1 month
2012-02-21 01:05:12 +00:00
John Baldwin
2bd3e4c2c2 Refine the implementation of POSIX_FADV_NOREUSE for the read(2) case such
that instead of using direct I/O it allows read-ahead similar to
POSIX_FADV_NORMAL, but invokes VOP_ADVISE(POSIX_FADV_DONTNEED) after the
read(2) has completed to purge just-read data.  The write(2) path continues
to use direct I/O for POSIX_FADV_NOREUSE for now.  Note that NOREUSE works
optimally if an application reads and writes full fs blocks.
2012-01-30 19:35:15 +00:00
John Baldwin
936c09ac0f Add the posix_fadvise(2) system call. It is somewhat similar to
madvise(2) except that it operates on a file descriptor instead of a
memory region.  It is currently only supported on regular files.

Just as with madvise(2), the advice given to posix_fadvise(2) can be
divided into two types.  The first type provide hints about data access
patterns and are used in the file read and write routines to modify the
I/O flags passed down to VOP_READ() and VOP_WRITE().  These modes are
thus filesystem independent.  Note that to ease implementation (and
since this API is only advisory anyway), only a single non-normal
range is allowed per file descriptor.

The second type of hints are used to hint to the OS that data will or
will not be used.  These hints are implemented via a new VOP_ADVISE().
A default implementation is provided which does nothing for the WILLNEED
request and attempts to move any clean pages to the cache page queue for
the DONTNEED request.  This latter case required two other changes.
First, a new V_CLEANONLY flag was added to vinvalbuf().  This requests
vinvalbuf() to only flush clean buffers for the vnode from the buffer
cache and to not remove any backing pages from the vnode.  This is
used to ensure clean pages are not wired into the buffer cache before
attempting to move them to the cache page queue.  The second change adds
a new vm_object_page_cache() method.  This method is somewhat similar to
vm_object_page_remove() except that instead of freeing each page in the
specified range, it attempts to move clean pages to the cache queue if
possible.

To preserve the ABI of struct file, the f_cdevpriv pointer is now reused
in a union to point to the currently active advice region if one is
present for regular files.

Reviewed by:	jilles, kib, arch@
Approved by:	re (kib)
MFC after:	1 month
2011-11-04 04:02:50 +00:00
Kip Macy
8451d0dd78 In order to maximize the re-usability of kernel code in user space this
patch modifies makesyscalls.sh to prefix all of the non-compatibility
calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel
entry points and all places in the code that use them. It also
fixes an additional name space collision between the kernel function
psignal and the libc function of the same name by renaming the kernel
psignal kern_psignal(). By introducing this change now we will ease future
MFCs that change syscalls.

Reviewed by:	rwatson
Approved by:	re (bz)
2011-09-16 13:58:51 +00:00
Martin Matuska
82378711f9 Generalize ffs_pages_remove() into vn_pages_remove().
Remove mapped pages for all dataset vnodes in zfs_rezget() using
new vn_pages_remove() to fix mmapped files changed by
zfs rollback or zfs receive -F.

PR:		kern/160035, kern/156933
Reviewed by:	kib, pjd
Approved by:	re (kib)
MFC after:	1 week
2011-08-25 08:17:39 +00:00
Konstantin Belousov
9c00bb9190 Add the fo_chown and fo_chmod methods to struct fileops and use them
to implement fchown(2) and fchmod(2) support for several file types
that previously lacked it. Add MAC entries for chown/chmod done on
posix shared memory and (old) in-kernel posix semaphores.

Based on the submission by:	glebius
Reviewed by:	rwatson
Approved by:	re (bz)
2011-08-16 20:07:47 +00:00
Matthew D Fleming
3d08a76bbc Use a name instead of a magic number for kern_yield(9) when the priority
should not change.  Fetch the td_user_pri under the thread lock.  This
is probably not necessary but a magic number also seems preferable to
knowing the implementation details here.

Requested by:	Jason Behmer < jason DOT behmer AT isilon DOT com >
2011-05-13 05:27:58 +00:00
Matthew D Fleming
e7ceb1e99b Based on discussions on the svn-src mailing list, rework r218195:
- entirely eliminate some calls to uio_yeild() as being unnecessary,
   such as in a sysctl handler.

 - move should_yield() and maybe_yield() to kern_synch.c and move the
   prototypes from sys/uio.h to sys/proc.h

 - add a slightly more generic kern_yield() that can replace the
   functionality of uio_yield().

 - replace source uses of uio_yield() with the functional equivalent,
   or in some cases do not change the thread priority when switching.

 - fix a logic inversion bug in vlrureclaim(), pointed out by bde@.

 - instead of using the per-cpu last switched ticks, use a per thread
   variable for should_yield().  With PREEMPTION, the only reasonable
   use of this is to determine if a lock has been held a long time and
   relinquish it.  Without PREEMPTION, this is essentially the same as
   the per-cpu variable.
2011-02-08 00:16:36 +00:00
Pawel Jakub Dawidek
3297cdd096 Correct arguments order. 2010-06-26 21:44:45 +00:00
Edward Tomasz Napierala
77dda2b96f Avoid overflow.
Submitted by:	bde@
2010-05-06 18:52:41 +00:00
Edward Tomasz Napierala
307d88b787 Style fixes and removal of unneeded variable.
Submitted by:	bde@
2010-05-06 18:43:19 +00:00
Edward Tomasz Napierala
b5f770bd86 Move checking against RLIMIT_FSIZE into one place, vn_rlimit_fsize().
Reviewed by:	kib
2010-05-05 16:44:25 +00:00
Andriy Gapon
364b8a7b33 vn_stat: take into account va_blocksize when setting st_blksize
As currently st_blksize is always PAGE_SIZE, it is playing safe to not
use any smaller value.  For some cases this might not be optimal, but
at least nothing should get broken.

Generally I don't expect this commit to change much for the following
reasons (in case of VREG, VDIR):
- application I/O and physical I/O are sufficiently decoupled by
  filesystem code, buffer cache code, cluster and read-ahead logic
- not all applications use st_blksize as a hint, some use f_iosize, some
  use fixed block sizes

I expect writes to the middle of files on ZFS to benefit the most from
this change.

Silence from:	fs@
MFC after:	2 weeks
2010-04-03 08:39:00 +00:00
Ed Schouten
510ea843ba Rename st_*timespec fields to st_*tim for POSIX 2008 compliance.
A nice thing about POSIX 2008 is that it finally standardizes a way to
obtain file access/modification/change times in sub-second precision,
namely using struct timespec, which we already have for a very long
time. Unfortunately POSIX uses different names.

This commit adds compatibility macros, so existing code should still
build properly. Also change all source code in the kernel to work
without any of the compatibility macros. This makes it all a less
ambiguous.

I am also renaming st_birthtime to st_birthtim, even though it was a
local extension anyway. It seems Cygwin also has a st_birthtim.
2010-03-28 13:13:22 +00:00
Ed Schouten
0fef797f4a Actually make O_DIRECTORY work.
According to POSIX open() must return ENOTDIR when the path name does
not refer to a path name. Change vn_open() to respect this flag. This
also simplifies the Linuxolator a bit.
2010-03-21 20:43:23 +00:00
Edward Tomasz Napierala
9d7031a6d6 Don't add VAPPEND if the file is not being opened for writing. Note that this
only affects cases where open(2) is being used improperly - i.e. when the user
specifies O_APPEND without O_WRONLY or O_RDWR.

Reviewed by:	rwatson
2009-12-08 20:47:10 +00:00
Edward Tomasz Napierala
3b1b09809f Revert r198874, pending further discussion. 2009-11-04 07:14:16 +00:00
Edward Tomasz Napierala
8fafa5cecf Make sure we don't end up with VAPPEND without VWRITE, if someone calls open(2)
like this: open(..., O_APPEND).
2009-11-04 06:48:34 +00:00
Xin LI
82aebf697c Add two new fcntls to enable/disable read-ahead:
- F_READAHEAD: specify the amount for sequential access.  The amount is
   specified in bytes and is rounded up to nearest block size.
 - F_RDAHEAD: Darwin compatible version that use 128KB as the sequential
   access size.

A third argument of zero disables the read-ahead behavior.

Please note that the read-ahead amount is also constrainted by sysctl
variable, vfs.read_max, which may need to be raised in order to better
utilize this feature.

Thanks Igor Sysoev for proposing the feature and submitting the original
version, and kib@ for his valuable comments.

Submitted by:	Igor Sysoev <is rambler-co ru>
Reviewed by:	kib@
MFC after:	1 month
2009-09-28 16:59:47 +00:00
Konstantin Belousov
579b976090 Fix mount reference leak when V_XSLEEP is specified to vn_start_write().
Submitted by:	tegge
2009-09-01 12:05:39 +00:00
Konstantin Belousov
a505c2c72c Make the mnt_writeopcount and mnt_secondary_writes counters,
used by the suspension code, not greater then mnt_ref reference
counter value. Increment mnt_ref together with write counter
in vn_start_write()/ vn_start_secondary_write(), releasing in
vn_finished_write/vn_finished_secondary_write().

Since r186197, unmount code requires that no writers occured after all
references are expired. We still could get write counter incremented
for freed or reused struct mount, but it seems to be innocent, since
corresponding vnode should be referenced and reclaimed then.

Reported by:	pho (last half a year), erwin
Reviewed by:	attilio
Tested by:	pho, erwin
MFC after:	1 week
2009-08-31 10:20:52 +00:00
Konstantin Belousov
f1eccd05ec In vn_vget_ino() and their inline equivalents, mnt_ref() the mount point
around the sequence that drop vnode lock and then busies the mount point.
Not having vlocked node or direct reference to the mp allows for the
forced unmount to proceed, making mp unmounted or reused.

Tested by:	pho
Reviewed by:	jeff
Approved by:	re (kensmith)
MFC after:	2 weeks
2009-07-02 18:02:55 +00:00
Konstantin Belousov
e0c161b89c Add another flags argument to vn_open_cred. Use it to specify that some
vn_open_cred invocations shall not audit namei path.

In particular, specify VN_OPEN_NOAUDIT for dotdot lookup performed by
default implementation of vop_vptocnp, and for the open done for core
file. vn_fullpath is called from the audit code, and vn_open there need
to disable audit to avoid infinite recursion. Core file is created on
return to user mode, that, in particular, happens during syscall return.
The creation of the core file is audited by direct calls, and we do not
want to overwrite audit information for syscall.

Reported, reviewed and tested by: rwatson
2009-06-21 13:41:32 +00:00
Paul Saab
27bfb741a0 Simply shared vnode locking and extend it to also include fsync.
Also, in vop_write, no longer assert for exclusive locks on the
vnode.

Reviewed by:	jhb, kmacy, jeffr
2009-06-08 21:23:54 +00:00
Robert Watson
bcf11e8d00 Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with:	pjd
2009-06-05 14:55:22 +00:00
Paul Saab
983b43c5de When checking for shared writes, use the struct mount returned from
vn_start_write.

Reviewed by:	jhb
2009-06-04 16:50:03 +00:00
Paul Saab
a6d545d8ed Support shared vnode locks for write operations when the offset is
provided on filesystems that support it.  This really improves mysql
+ innodb performance on ZFS.

Reviewed by:	jhb, kmacy, jeffr
2009-06-04 16:18:07 +00:00
Attilio Rao
dfd233edd5 Remove the thread argument from the FSD (File-System Dependent) parts of
the VFS.  Now all the VFS_* functions and relating parts don't want the
context as long as it always refers to curthread.

In some points, in particular when dealing with VOPs and functions living
in the same namespace (eg. vflush) which still need to be converted,
pass curthread explicitly in order to retain the old behaviour.
Such loose ends will be fixed ASAP.

While here fix a bug: now, UFS_EXTATTR can be compiled alone without the
UFS_EXTATTR_AUTOSTART option.

VFS KPI is heavilly changed by this commit so thirdy parts modules needs
to be recompiled.  Bump __FreeBSD_version in order to signal such
situation.
2009-05-11 15:33:26 +00:00
Konstantin Belousov
41b72e6e50 Eliminate the loop and the call to pause(9) in vfs_vget_ino(). If
vfs_busy(MBF_NOWAIT) failed, unlock the vnode and sleep in vfs_busy().

Suggested and reviewed by:	jeff
Tested by:	pho
MFC after:	3 weeks
2009-05-07 18:14:21 +00:00
Kip Macy
5e6a926611 - use a shared lock for reads
- remove stale comment

Reviewed by:	jeffr
2009-04-13 23:09:44 +00:00
Robert Watson
885868cd8f Remove VOP_LEASE and supporting functions. This hasn't been used since
the removal of NQNFS, but was left in in case it was required for NFSv4.
Since our new NFSv4 client and server can't use it for their
requirements, GC the old mechanism, as well as other unused lease-
related code and interfaces.

Due to its impact on kernel programming and binary interfaces, this
change should not be MFC'd.

Proposed by:    jeff
Reviewed by:    jeff
Discussed with: rmacklem, zach loafman @ isilon
2009-04-10 10:52:19 +00:00
John Baldwin
33fc362512 Add a new internal mount flag (MNTK_EXTENDED_SHARED) to indicate that a
filesystem supports additional operations using shared vnode locks.
Currently this is used to enable shared locks for open() and close() of
read-only file descriptors.
- When an ISOPEN namei() request is performed with LOCKSHARED, use a
  shared vnode lock for the leaf vnode only if the mount point has the
  extended shared flag set.
- Set LOCKSHARED in vn_open_cred() for requests that specify O_RDONLY but
  not O_CREAT.
- Use a shared vnode lock around VOP_CLOSE() if the file was opened with
  O_RDONLY and the mountpoint has the extended shared flag set.
- Adjust md(4) to upgrade the vnode lock on the vnode it gets back from
  vn_open() since it now may only have a shared vnode lock.
- Don't enable shared vnode locks on FIFO vnodes in ZFS and UFS since
  FIFO's require exclusive vnode locks for their open() and close()
  routines.  (My recent MPSAFE patches for UDF and cd9660 already included
  this change.)
- Enable extended shared operations on UFS, cd9660, and UDF.

Submitted by:	ups
Reviewed by:	pjd (ZFS bits)
MFC after:	1 month
2009-03-11 14:13:47 +00:00
Konstantin Belousov
e9aff35739 Move the code from ufs_lookup.c used to do dotdot lookup, into
the helper function. It is supposed to be useful for any filesystem
that has to unlock dvp to walk to the ".." entry in lookup routine.

Requested by:	jhb
Tested by:	pho
MFC after:	1 month
2009-01-21 14:51:38 +00:00
Pawel Jakub Dawidek
0c6a80e78d Improve KASSERT() call a bit:
- Print flags in hex.
- Note that flags can be fine and panic can be due unexpected error condition.
- Remove redundant new line character.

Eventhough panic message excess 80 characters keep it in one line so it is
easier to grep.
2008-11-29 12:40:14 +00:00
Konstantin Belousov
c5f77bf986 Revert r184118. There is actually a code in the kernel, for instance in
kern_unlinkat(), that expects that vn_start_write() actually fills the mp
even when the call failed.

As Tor noted, that pattern relies on the the type stability of the mount
points, as well as that suspended mount points are never freed and
V_XSLEEP is always passed to vn_start_write() when called on a freed
mount point.

Reported by:	stass
Reviewed by:	tegge
PR:		123768
2008-11-16 21:56:29 +00:00
John Baldwin
21fc02d271 Use shared vnode locks instead of exclusive vnode locks for the access(),
chdir(), chroot(), eaccess(), fpathconf(), fstat(), fstatfs(), lseek()
(when figuring out the current size of the file in the SEEK_END case),
pathconf(), readlink(), and statfs() system calls.

Submitted by:	ups (mostly)
Tested by:	pho
MFC after:	1 month
2008-11-03 20:31:00 +00:00
Attilio Rao
83b3bdbc8a Improve VFS locking:
- Implement real draining for vfs consumers by not relying on the
  mnt_lock and using instead a refcount in order to keep track of lock
  requesters.
- Due to the change above, remove the mnt_lock lockmgr because it is now
  useless.
- Due to the change above, vfs_busy() is no more linked to a lockmgr.
  Change so its KPI by removing the interlock argument and defining 2 new
  flags for it: MBF_NOWAIT which basically replaces the LK_NOWAIT of the
  old version (which was unlinked from the lockmgr alredy) and
  MBF_MNTLSTLOCK which provides the ability to drop the mountlist_mtx
  once the mnt interlock is held (ability still desired by most consumers).
- The stub used into vfs_mount_destroy(), that allows to override the
  mnt_ref if running for more than 3 seconds, make it totally useless.
  Remove it as it was thought to work into older versions.
  If a problem of "refcount held never going away" should appear, we will
  need to fix properly instead than trust on such hackish solution.
- Fix a bug where returning (with an error) from dounmount() was still
  leaving the MNTK_MWAIT flag on even if it the waiters were actually
  woken up. Just a place in vfs_mount_destroy() is left because it is
  going to recycle the structure in any case, so it doesn't matter.
- Remove the markercnt refcount as it is useless.

This patch modifies VFS ABI and breaks KPI for vfs_busy() so manpages and
__FreeBSD_version will be modified accordingly.

Discussed with:	kib
Tested by:	pho
2008-11-02 10:15:42 +00:00
Edward Tomasz Napierala
15bc6b2bd8 Introduce accmode_t. This is required for NFSv4 ACLs - it will be neccessary
to add more V* constants, and the variables changed by this patch were often
being assigned to mode_t variables, which is 16 bit.

Approved by:	rwatson (mentor)
2008-10-28 13:44:11 +00:00
Konstantin Belousov
3ba28ace77 Change vn_start_write() to clear *mpp on all failures when non-NULL vp
is supplied, since vm_pageout_scan() expects it to be cleared on error.

Submitted by:	tegge
PR:	123768
MFC after:	1 week
2008-10-21 09:55:49 +00:00
Konstantin Belousov
016f98f947 Assert that v_holdcnt is non-zero before entering lockmgr in vn_lock
and ffs_lock. This cannot catch situations where holdcnt is incremented
not by curthread, but I think it is useful.

Reviewed by:	tegge, attilio
Tested by:	pho
MFC after:	2 weeks
2008-10-20 10:11:33 +00:00
Konstantin Belousov
0fbbf2ea56 Initialize va_rdev to NODEV and va_fsid to VNOVAL before the
VOP_GETATTR() call in vn_stat(). Thus if a file system doesn't
initialize those fields in VOP_GETATTR() they will have a sane default
value.

Submitted by:   Jaakko Heinonen <jh saunalahti fi>
Discussed on:   freebsd-fs
MFC after:	1 month
2008-09-20 19:48:24 +00:00
Konstantin Belousov
ea60a5f526 Initialize birthtime fields in vn_stat() to prevent stat(2) from
returning uninitialized birthtime. Most file systems don't initialize
birthtime properly in their VOP_GETTATTR().

Submitted by:   Jaakko Heinonen <jh saunalahti fi>
Reviewed by:	bde
Discussed on:   freebsd-fs
MFC after:	1 month
2008-09-20 19:43:22 +00:00
Konstantin Belousov
2814d5ba5f When attempt is made to suspend a filesystem that is already syspended,
wait until the current suspension is lifted instead of silently returning
success immediately. The consequences of calling vfs_write() resume when
not owning the suspension are not well-defined at best.

Add the vfs_susp_clean() mount method to be called from
vfs_write_resume(). Set it to process_deferred_inactive() for ffs, and
stop calling it manually.

Add the thread flag TDP_IGNSUSP that allows to bypass the suspension
point in the vn_start_write. It is intended for use by VFS in the
situations where the suspender want to do some i/o requiring calls to
vn_start_write(), and this i/o cannot be done later.

Reviewed by:	tegge
In collaboration with:	pho
MFC after:	 1 month
2008-09-16 11:51:06 +00:00
Konstantin Belousov
bdb8094763 Garbage-collect vn_write_suspend_wait().
Suggested and reviewed by:	tegge
Tested by:	pho
MFC after:	1 month
2008-09-16 11:09:26 +00:00
Attilio Rao
0359a12ead Decontextualize the couplet VOP_GETATTR / VOP_SETATTR as the passed thread
was always curthread and totally unuseful.

Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
2008-08-28 15:23:18 +00:00
Robert Watson
1d986c5ff1 Remove broken code to replace st_mode value with ACCESSPERMS when
lstat(2) is called on symlinks -- this code appears never to have
worked.  The PR this addresses suggests that the intended
original behavior is the right one, but as bde points out in the
PR comments, we do actually support storing a mode on symlinks,
so returning it seems reasonable.

This is consistent with Mac OS X, which despite documentation to
the contrary does return the mode set on a symlink, but not some
other platforms.  The Single Unix Spec requires only that the
returned bits be "meaningful", which seems at best unhelpful as
advice goes.

PR:		25018
MFC after:	3 days
2008-08-03 15:44:56 +00:00
Konstantin Belousov
e314f69fff Add the support for the O_EXEC open(2) mode, as specified by the
POSIX Extended API Set Part 2 extension specification.

Reviewed by:	rwatson, rdivacky
Tested by:	pho
2008-03-31 11:57:18 +00:00
Jeff Roberson
5634d48667 - Don't allow calls to vn_lock() with no lock type requested. Callers
which simply want a reference should use vref().  Callers which want
   to check validity need to hold a lock while performing any action
   based on that validity.  vn_lock() would always release the interlock
   before returning making any action synchronous with the validity check
   impossible.
2008-03-29 23:36:26 +00:00
Jeff Roberson
804e60d4cf - Don't acquire the vnode interlock in _vn_lock() unless no lock type
is requested.  Handle this case specially before the while loop.
 - Use the held vnode lock to check for VI_DOOMED.  The vnode lock and
   interlock must both be held to set VI_DOOMED so either one held, even
   shared, is sufficient to check it.

No objection by:	kib
2008-03-24 04:17:35 +00:00
Attilio Rao
22db15c06f VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in
conjuction with 'thread' argument passing which is always curthread.
Remove the unuseful extra-argument and pass explicitly curthread to lower
layer functions, when necessary.

KPI results broken by this change, which should affect several ports, so
version bumping and manpage update will be further committed.

Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>
2008-01-13 14:44:15 +00:00
Attilio Rao
cb05b60a89 vn_lock() is currently only used with the 'curthread' passed as argument.
Remove this argument and pass curthread directly to underlying
VOP_LOCK1() VFS method. This modify makes the code cleaner and in
particular remove an annoying dependence helping next lockmgr() cleanup.
KPI results, obviously, changed.

Manpage and FreeBSD_version will be updated through further commits.

As a side note, would be valuable to say that next commits will address
a similar cleanup about VFS methods, in particular vop_lock1 and
vop_unlock.

Tested by:	Diego Sardina <siarodx at gmail dot com>,
		Andrea Di Pasquale <whyx dot it at gmail dot com>
2008-01-10 01:10:58 +00:00