Commit Graph

3399 Commits

Author SHA1 Message Date
Pedro F. Giffuni
daf884fa9f ext4: mount panic from freeing invalid pointers
Initialize the struct with those fields to zeroes on allocation,
preventing the panic.

Patch by:	Damjan Jovanovic.

PR:		206056
MFC after:	3 days
2016-01-11 19:25:43 +00:00
Pedro F. Giffuni
e813d9d7fa ext4: add support for reading sparse files
Add support for sparse files in ext4. Also implement read-ahead, which
greatly increases the performance when transferring files from ext4.

Both features implemented by Damjan Jovanovic.

PR:		205816
MFC after:	1 week
2016-01-11 19:14:55 +00:00
Andrey V. Elsukov
c829016e85 Change the type of newsize argument in the smbfs_smb_setfsize() function
from int to int64.
MSDN says that SMB_SET_FILE_END_OF_FILE_INFO uses signed 64-bit integer
to specify offset, but since smbfs_smb_setfsize() has used plain int,
a value was truncated in case when offset was larger than 2G.
	https://msdn.microsoft.com/en-us/library/ff469975.aspx

In particular, now `truncate -s 10G` will work correctly on the mounted
SMB share.

Reported and tested by:	Eugene Grosbein <eugen at grosbein dot net>
MFC after:	1 week
2016-01-11 18:11:06 +00:00
Pedro F. Giffuni
7135ca50c1 ext2fs: reading mmaped file in Ext4 causes panic
Always call brelse(path.ep_bp), fixing reading EXT4 files using mmap().

Patch by Damjan Jovanovic.

PR:		205938
MFC after:	1 week
2016-01-07 21:43:43 +00:00
Konstantin Belousov
fb57d63e47 Hide transient EBADF errors caused by the parallel revoke(2) or forced
unmount of devfs mounts, by restarting the failed syscall.

When restarted, failing syscalls eventually either stop finding the
node and returning ENOENT, or the vnode op vectors finally transition
to the deadfs vop.  The later return EIO or other error, more
appropriate for the operation.

Submitted by:	bde
Tested by:	pho
MFC after:	3 weeks
2016-01-02 20:29:28 +00:00
Konstantin Belousov
d52aff3c7a Minor style cleanup.
Submitted by:	bde
MFC after:	1 week
2016-01-01 15:48:48 +00:00
Konstantin Belousov
6f73b583d9 Force nullfs vnode reclaim after unlinking, to potentially unlink
lower vnode.  Otherwise, reference to the lower vnode from the upper
one prevents final unlink.

PR:	178238
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-12-30 19:49:22 +00:00
Pedro F. Giffuni
26069aec57 ext2: recognize ext4 INCOMPAT_RECOVER flag
This is a flag specific for journalling in ext4.
Add it to the list of ext4 features we ignore for
read-only purposes.

PR:		205668
MFC after:	1 week
2015-12-29 15:51:52 +00:00
Konstantin Belousov
cccac8a1ef Make it possible for the cdevsw d_close() driver method to detect last
close and close due to revoke(2)-like operation.

A new FLASTCLOSE flag indicates that this is last close.  FREVOKE is
set for revokes, and FNONBLOCK is also set, same as is already done
for VOP_CLOSE() call from vgonel().

The flags reuse user open(2) flags which are never stored in f_flag,
to not consume bit space in the ABI visible way.  Assert this with the
static check.

Requested and reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-12-22 20:37:34 +00:00
Konstantin Belousov
b63d070ad1 Keep devfs mount locked for the whole duration of the devfs_setattr(),
and ensure that our dirent is instantiated.

Reported and tested by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-12-22 20:22:17 +00:00
Hans Petter Selasky
beebd9aac8 Make CUSE usable with platforms where the size of "unsigned long" is
different from the size of a pointer.
2015-12-22 09:55:44 +00:00
Hans Petter Selasky
9fb69c9d9d Make CUSE usable with platforms where the size of "unsigned long" is
different from the size of a pointer.
2015-12-22 09:41:33 +00:00
Hans Petter Selasky
ac60e80129 Guard against the same process being both CUSE server and client at
the same time. This can easily lead to a deadlock when destroying the
character devices nodes.
2015-12-22 09:26:24 +00:00
Gleb Smirnoff
f17f88d3e0 Fix breakage caused by r292373 in ZFS/FUSE/NFS/SMBFS.
With the new VOP_GETPAGES() KPI the "count" argument counts pages already,
and doesn't need to be translated from bytes to pages.

While here make it consistent that *rbehind and *rahead are updated only
if we doesn't return error.

Pointy hat to:	glebius
2015-12-16 23:48:50 +00:00
Gleb Smirnoff
b0cd20172d A change to KPI of vm_pager_get_pages() and underlying VOP_GETPAGES().
o With new KPI consumers can request contiguous ranges of pages, and
  unlike before, all pages will be kept busied on return, like it was
  done before with the 'reqpage' only. Now the reqpage goes away. With
  new interface it is easier to implement code protected from race
  conditions.

  Such arrayed requests for now should be preceeded by a call to
  vm_pager_haspage() to make sure that request is possible. This
  could be improved later, making vm_pager_haspage() obsolete.

  Strenghtening the promises on the business of the array of pages
  allows us to remove such hacks as swp_pager_free_nrpage() and
  vm_pager_free_nonreq().

o New KPI accepts two integer pointers that may optionally point at
  values for read ahead and read behind, that a pager may do, if it
  can. These pages are completely owned by pager, and not controlled
  by the caller.

  This shifts the UFS-specific readahead logic from vm_fault.c, which
  should be file system agnostic, into vnode_pager.c. It also removes
  one VOP_BMAP() request per hard fault.

Discussed with:	kib, alc, jeff, scottl
Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix
2015-12-16 21:30:45 +00:00
John Baldwin
8d7e0f5889 The cdevpriv_dtr_t typedef was not able to be used in a function prototype
like the various d_*_t typedefs since it declared a function pointer rather
than a function.  Add a new d_priv_dtor_t typedef that declares the function
and can be used as a function prototype.  The previous typedef wasn't
useful outside of the cdevpriv implementation, so retire it.

The name d_priv_dtor_t was chosen to be more consistent with cdev methods
since it is commonly used in place of d_close_t even though it is not a
direct pointer in struct cdevsw.

Reviewed by:	kib, imp
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D4340
2015-12-02 18:27:30 +00:00
Rick Macklem
65171ebbc8 Fix the memory leak that occurs when the nfscommon.ko module is unloaded.
This leak was introduced by r291527.
Since the nfscommon.ko module is rarely unloaded, this leak would not
have been much of an issue.

MFC after:	2 weeks
2015-12-02 02:47:13 +00:00
Rick Macklem
10b2e06e3e Delete the TUNABLE_INT() line. It was in r291527 so that it could be
MFC'd to stable/10 and still work.
2015-11-30 23:37:09 +00:00
Rick Macklem
84be7e0952 Add kernel support to the NFS server for the "-manage-gids"
option that will be added to the nfsuserd daemon in a future
commit. It modifies the cache used by NFSv4 for name<-->id
translation (both username/uid and group/gid) to support this.
When "-manage-gids" is set, the server looks up each uid
for the RPC and uses the list of groups cached in the server
instead of the list of groups provided in the RPC request.
The cached group list is acquired for the cache by the nfsuserd
daemon via getgrouplist(3).
This avoids the 16 groups limit for the list in the RPC request.
Since the cache is now used for every RPC when "-manage-gids"
is enabled, the code also modifies the cache to use a separate
mutex for each hash list instead of a single global mutex.

Suggested by:	jpaetzel
Tested by:	jpaetzel
MFC after:	2 weeks
2015-11-30 21:54:27 +00:00
Kirk McKusick
43a993bb7d For performance reasons, it is useful to have a single string used as
the name of a filesystem when setting it as the first parameter to the
getnewvnode() function. Most filesystems call getnewvnode from just one
place so can use a literal string as the first parameter. However, NFS
calls getnewvnode from two places, so we create a global constant string
that can be used by the two instances. This change also collapses two
instances of getnewvnode() in the UFS filesystem to a single call.

Reviewed by: kib
Tested by:   Peter Holm
2015-11-29 21:01:02 +00:00
Rick Macklem
a0962bf8bc When the nfsd threads are terminated, the NFSv4 server state
(opens, locks, etc) is retained, which I believe is correct behaviour.
However, for NFSv4.1, the server also retained a reference to the xprt
(RPC transport socket structure) for the backchannel. This caused
svcpool_destroy() to not call SVC_DESTROY() for the xprt and allowed
a socket upcall to occur after the mutexes in the svcpool were destroyed,
causing a crash.
This patch fixes the code so that the backchannel xprt structure is
dereferenced just before svcpool_destroy() is called, so the code
does do an SVC_DESTROY() on the xprt, which shuts down the socket upcall.

Tested by:	g_amanakis@yahoo.com
PR:		204340
MFC after:	2 weeks
2015-11-21 23:55:46 +00:00
Rick Macklem
b179878dde Revert r283330 since it broke directory caching in the client.
At this time I cannot see a way to fix directory caching when it
has partial blocks in the buffer cache, due to the fact that the
syscall's uio_offset won't stay the same as the lblkno * NFS_DIRBLKSIZ
offset.

Reported by:	bde
MFC after:	2 weeks
2015-11-21 00:15:41 +00:00
Rick Macklem
f315383406 mnt_stat.f_iosize (which is used to set bo_bsize) must be set to
the largest size of buffer cache block or the mapping of the buffer
is bogus. When a mount with rsize=4096,wsize=4096 was done, f_iosize
would be set to 4096. This resulted in corrupted directory data, since
the buffer cache block size for directories is NFS_DIRBLKSIZ (8192).
This patch fixes the code so that it always sets f_iosize to at least
NFS_DIRBLKSIZ.

Tested by:	krichy@cflinux.hu
PR:		177971
MFC after:	2 weeks
2015-11-17 01:44:26 +00:00
Mark Johnston
d28713378a - Consistently use PROC_ASSERT_HELD() to verify that a process' hold count
is non-zero.
- Include the process address in the PROC_ASSERT_HELD() and
  PROC_ASSERT_NOT_HELD() assertion messages so that the corresponding
  process can be found easily when debugging.

MFC after:	1 week
2015-11-08 01:38:56 +00:00
Konstantin Belousov
1d48f121d8 Ensure that when a blockable open of fifo returns success, a valid
file descriptor opened for complimentary access exists as well.

The implementation of the guarantee is done by counting the
generations of readers and writers opens.  We return success and not
EINTR or ERESTART error, when the sleep for complimentary opening is
interrupted, but the generation was changed during the sleep.

Longer explanation: assume there are two threads, A doing open("fifo",
O_RDONLY) and B doing open("fifo", O_WRONLY), and no other threads
either trying to open the fifo, nor there are any file descriptors
referencing the fifo.  Before the change, it was possible e.g. for for
thread A to return a valid file descriptor, while thread B returned
EINTR if a signal to B was delivered simultaneously with the wakeup
from A.  After the change, in this situation both A::open() and
B::open() succeed and the signal is made "as if" it was noticed
slightly later.  Note that the signal actual delivery is not changed,
it is done by ast on syscall return path, so signal handler is still
executed before first instruction after syscall.

See PR for the code demonstrating the issue.

PR:	203162
Reported by:	Victor Stinner victor.stinner@gmail.com
Reviewed by:	jilles
Tested by:	bapt, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-09-20 21:18:33 +00:00
Edward Tomasz Napierala
1d4c0424c8 Fix an NFS server bug that manifested in "ls -al" displaying a plus
sign on every directory exported via NFSv4 with NFSv4 ACLs enabled.

Reviewed by:	rmacklem@
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D3502
2015-08-28 14:26:11 +00:00
Edward Tomasz Napierala
643e5ec210 Make it possible to forcibly unmount devfs.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2015-08-24 14:04:44 +00:00
Edward Tomasz Napierala
6e572e084b After r286237 it should be fine to call vgone(9) on a busy GEOM vnode;
remove KASSERT that would prevent forced devfs unmount from working.

MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2015-08-23 14:53:54 +00:00
Rick Macklem
29dc40b6be For the case where an NFSv4.1 ExchangeID operation has the client identifier
that already has a confirmed ClientID, the nfsrv_setclient() function would
not fill in the clientidp being returned. As such, the value of ClientID
returned would be whatever garbage was on the stack.
An NFSv4.1 client would not normally do this, but it appears that it can
happen for certain Linux clients. When it happens, the client persistently
retries the ExchangeID and Create_session after Create_session fails when
it uses the bogus clientid. With this patch, the correct clientid is replied.
This problem was identified in a packet trace supplied by
Ahmed Kamal via email.

Reported by:	email.ahmedkamal@googlemail.com
MFC after:	2 weeks
2015-08-14 22:02:14 +00:00
John Baldwin
fada4adf95 The changes that introduced fo_mmap() treated all character device
mappings as if MAP_SHARED was always present since in general MAP_PRIVATE
is not permitted for character devices.  However, there is one exception
in that MAP_PRIVATE mappings are permitted for /dev/zero.

Only require a writable file descriptor (FWRITE) for shared, writable
mappings of character devices.  vm_mmap_cdev() will reject any private
mappings for other devices.

Reviewed by:	kib
Reported by:	sbruno (broke qemu cross-builds), peter
Differential Revision:	https://reviews.freebsd.org/D3316
2015-08-06 16:50:37 +00:00
Conrad Meyer
b5af3f30a7 nfsclient: Protest loudly when GETATTR responses are invalid
BROKEN NFS SERVER OR MIDDLEWARE: Certain WAN "accelerators" attempt to cache
NFS GETATTR traffic, but actually corrupt it (e.g., responding to requests
with attributes for totally different files).

Warn very verbosely when this is detected. Linux' NFS client has a similar
warning.

Adds a sysctl/tunable (vfs.nfs.fileid_maxwarnings) to configure the quantity
of warnings; default to 10. (Zero disables; -1 is unlimited.)

Adds a failpoint to aid in validating the warning / behavior with a
non-broken server. Use something like:

    sysctl 'debug.fail_point.nfscl_force_fileid_warning=10%return(1)'

Reviewed by:	rmacklem
Approved by:	markj (mentor)
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D3304
2015-08-05 22:27:30 +00:00
Rick Macklem
25f37276e5 This patch fixes a problem where, if the NFSv4 server has a previous
unconfirmed clientid structure for the same client on the last hash list,
this old entry would not be removed/deleted. I do not think this bug would have
caused serious problems, since the new entry would have been before the old one
on the list. This old entry would have eventually been scavenged/removed.
Detected while reading the code looking for another bug.

MFC after:	3 days
2015-07-29 23:06:30 +00:00
Jeff Roberson
7d07bfd8a3 - Remove some dead code copied from ffs. 2015-07-29 03:06:08 +00:00
Christian Brueffer
382353e2e8 In tmpfs_chtimes(), remove checks on the nanosecond level when
determining whether a node changed.

Other filesystems, e.g., UFS, only check on seconds, when determining
whether something changed.

This also corrects the birthtime case, where we checked tv_nsec
twice, instead of tv_sec and tv_nsec (PR).

PR:			201284
Submitted by:		David Binderman
Patch suggested by:	kib
Reviewed by:		kib
MFC after:		2 weeks
Committed from:		Essen FreeBSD Hackathon
2015-07-26 08:33:46 +00:00
Konstantin Belousov
b4490c6e93 The si_status field of the siginfo_t, provided by the waitid(2) and
SIGCHLD signal, should keep full 32 bits of the status passed to the
_exit(2).

Split the combined p_xstat of the struct proc into the separate exit
status p_xexit for normal process exit, and signalled termination
information p_xsig.  Kernel-visible macro KW_EXITCODE() reconstructs
old p_xstat from p_xexit and p_xsig.  p_xexit contains complete status
and copied out into si_status.

Requested by:	Joerg Schilling
Reviewed by:	jilles (previous version), pho
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2015-07-18 09:02:50 +00:00
Mark Johnston
5f34e93c58 Check suspendability on the mountpoint returned by VOP_GETWRITEMOUNT.
This obviates the need for a MNTK_SUSPENDABLE flag, since passthrough
filesystems like nullfs and unionfs no longer need to inherit this
information from their lower layer(s). This change also restores the
pre-r273336 behaviour of using the presence of a susp_clean VFS method to
request suspension support.

Reviewed by:	kib, mjg
Differential Revision:	https://reviews.freebsd.org/D2937
2015-07-05 22:37:33 +00:00
Mateusz Guzik
f131759f54 fd: make 'rights' a manadatory argument to fget* functions 2015-07-05 19:05:16 +00:00
Rick Macklem
2a3508eb48 If a "principal" argument isn't provided for a Kerberized NFS mount,
the kernel would generate a bogus one with a ":/<path>" suffix.
This would only occur for the case where there was no explicit
"principal" argument and the getaddrinfo() call in mount_nfs.c failed to a
return a cannonical name for the server.
This patch fixes this unusual case.

PR:		201073
Submitted by:	masato@itc.naist.jp
MFC after:	2 weeks
2015-07-03 22:11:07 +00:00
Rick Macklem
d189dcb6e2 Alex Burlyga reported a POLA violation for the new NFS client as
compared to the old NFS client via email to the freebsd-fs@ mailing list.
For the new client, when multiple clients attempted to create a symbolic
link concurrently, more that one client would report success instead of
EEXIST. This was caused by code in the new client that mapped EEXIST to
OK assuming it was caused by a retried RPC request.
Since the old client did not do this, the patch defaults to the old
behaviour and permits the new behaviour to be enabled via a sysctl.

Reported by:	alex.burlyga.ietf@gmail.com
Tested by:	alex.burlyga.ietf@gmail.com
MFC after:	2 weeks
2015-07-03 01:15:21 +00:00
Mark Murray
d1b06863fb Huge cleanup of random(4) code.
* GENERAL
- Update copyright.
- Make kernel options for RANDOM_YARROW and RANDOM_DUMMY. Set
  neither to ON, which means we want Fortuna
- If there is no 'device random' in the kernel, there will be NO
  random(4) device in the kernel, and the KERN_ARND sysctl will
  return nothing. With RANDOM_DUMMY there will be a random(4) that
  always blocks.
- Repair kern.arandom (KERN_ARND sysctl). The old version went
  through arc4random(9) and was a bit weird.
- Adjust arc4random stirring a bit - the existing code looks a little
  suspect.
- Fix the nasty pre- and post-read overloading by providing explictit
  functions to do these tasks.
- Redo read_random(9) so as to duplicate random(4)'s read internals.
  This makes it a first-class citizen rather than a hack.
- Move stuff out of locked regions when it does not need to be
  there.
- Trim RANDOM_DEBUG printfs. Some are excess to requirement, some
  behind boot verbose.
- Use SYSINIT to sequence the startup.
- Fix init/deinit sysctl stuff.
- Make relevant sysctls also tunables.
- Add different harvesting "styles" to allow for different requirements
  (direct, queue, fast).
- Add harvesting of FFS atime events. This needs to be checked for
  weighing down the FS code.
- Add harvesting of slab allocator events. This needs to be checked for
  weighing down the allocator code.
- Fix the random(9) manpage.
- Loadable modules are not present for now. These will be re-engineered
  when the dust settles.
- Use macros for locks.
- Fix comments.

* src/share/man/...
- Update the man pages.

* src/etc/...
- The startup/shutdown work is done in D2924.

* src/UPDATING
- Add UPDATING announcement.

* src/sys/dev/random/build.sh
- Add copyright.
- Add libz for unit tests.

* src/sys/dev/random/dummy.c
- Remove; no longer needed. Functionality incorporated into randomdev.*.

* live_entropy_sources.c live_entropy_sources.h
- Remove; content moved.
- move content to randomdev.[ch] and optimise.

* src/sys/dev/random/random_adaptors.c src/sys/dev/random/random_adaptors.h
- Remove; plugability is no longer used. Compile-time algorithm
  selection is the way to go.

* src/sys/dev/random/random_harvestq.c src/sys/dev/random/random_harvestq.h
- Add early (re)boot-time randomness caching.

* src/sys/dev/random/randomdev_soft.c src/sys/dev/random/randomdev_soft.h
- Remove; no longer needed.

* src/sys/dev/random/uint128.h
- Provide a fake uint128_t; if a real one ever arrived, we can use
  that instead. All that is needed here is N=0, N++, N==0, and some
  localised trickery is used to manufacture a 128-bit 0ULLL.

* src/sys/dev/random/unit_test.c src/sys/dev/random/unit_test.h
- Improve unit tests; previously the testing human needed clairvoyance;
  now the test will do a basic check of compressibility. Clairvoyant
  talent is still a good idea.
- This is still a long way off a proper unit test.

* src/sys/dev/random/fortuna.c src/sys/dev/random/fortuna.h
- Improve messy union to just uint128_t.
- Remove unneeded 'static struct fortuna_start_cache'.
- Tighten up up arithmetic.
- Provide a method to allow eternal junk to be introduced; harden
  it against blatant by compress/hashing.
- Assert that locks are held correctly.
- Fix the nasty pre- and post-read overloading by providing explictit
  functions to do these tasks.
- Turn into self-sufficient module (no longer requires randomdev_soft.[ch])

* src/sys/dev/random/yarrow.c src/sys/dev/random/yarrow.h
- Improve messy union to just uint128_t.
- Remove unneeded 'staic struct start_cache'.
- Tighten up up arithmetic.
- Provide a method to allow eternal junk to be introduced; harden
  it against blatant by compress/hashing.
- Assert that locks are held correctly.
- Fix the nasty pre- and post-read overloading by providing explictit
  functions to do these tasks.
- Turn into self-sufficient module (no longer requires randomdev_soft.[ch])
- Fix some magic numbers elsewhere used as FAST and SLOW.

Differential Revision: https://reviews.freebsd.org/D2025
Reviewed by: vsevolod,delphij,rwatson,trasz,jmg
Approved by: so (delphij)
2015-06-30 17:00:45 +00:00
Konstantin Belousov
8551285097 Restore the td_cookie value for the tmpfs directory entry which was a
dup entry, upon detach from the parent directory.  If the node is
renamed, the entry is re-attached at the different directory, and
invalud cookie value triggers assert (or corrupts directory rb tree,
it seems).

Reported by:	clusteradm (gjb, antoine)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-06-19 07:25:15 +00:00
Gleb Smirnoff
093ebe1d28 o Un-inline vm_pager_get_pages(), vm_pager_get_pages_async().
o Provide an extensive set of assertions for input array of pages.
o Remove now duplicate assertions from different pagers.

Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix
2015-06-17 22:44:27 +00:00
Mateusz Guzik
4da8456f0a Replace struct filedesc argument in getvnode with struct thread
This is is a step towards removal of spurious arguments.
2015-06-16 13:09:18 +00:00
Gleb Smirnoff
093c7f396d Make KPI of vm_pager_get_pages() more strict: if a pager changes a page
in the requested array, then it is responsible for disposition of previous
page and is responsible for updating the entry in the requested array.
Now consumers of KPI do not need to re-lookup the pages after call to
vm_pager_get_pages().

Reviewed by:	kib
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2015-06-12 11:32:20 +00:00
Mateusz Guzik
f6f6d24062 Implement lockless resource limits.
Use the same scheme implemented to manage credentials.

Code needing to look at process's credentials (as opposed to thred's) is
provided with *_proc variants of relevant functions.

Places which possibly had to take the proc lock anyway still use the proc
pointer to access limits.
2015-06-10 10:48:12 +00:00
Mark Johnston
068a3d319a unionfs: fix suspendability check bugs
- MNTK_SUSPENDABLE is set in mnt_kern_flag, not mnt_flag.
- The lower layer of a unionfs mount is read-only, so the mount should
  be suspendable iff the upper layer is suspendable.
- Remove a couple of superfluous comments.

Differential Revision:	https://reviews.freebsd.org/D2714
Reviewed by:	kib, mjg
2015-06-06 16:36:13 +00:00
John Baldwin
7077c42623 Add a new file operations hook for mmap operations. File type-specific
logic is now placed in the mmap hook implementation rather than requiring
it to be placed in sys/vm/vm_mmap.c.  This hook allows new file types to
support mmap() as well as potentially allowing mmap() for existing file
types that do not currently support any mapping.

The vm_mmap() function is now split up into two functions.  A new
vm_mmap_object() function handles the "back half" of vm_mmap() and accepts
a referenced VM object to map rather than a (handle, handle_type) tuple.
vm_mmap() is now reduced to converting a (handle, handle_type) tuple to a
a VM object and then calling vm_mmap_object() to handle the actual mapping.
The vm_mmap() function remains for use by other parts of the kernel
(e.g. device drivers and exec) but now only supports mapping vnodes,
character devices, and anonymous memory.

The mmap() system call invokes vm_mmap_object() directly with a NULL object
for anonymous mappings.  For mappings using a file descriptor, the
descriptors fo_mmap() hook is invoked instead.  The fo_mmap() hook is
responsible for performing type-specific checks and adjustments to
arguments as well as possibly modifying mapping parameters such as flags
or the object offset.  The fo_mmap() hook routines then call
vm_mmap_object() to handle the actual mapping.

The fo_mmap() hook is optional.  If it is not set, then fo_mmap() will
fail with ENODEV.  A fo_mmap() hook is implemented for regular files,
character devices, and shared memory objects (created via shm_open()).

While here, consistently use the VM_PROT_* constants for the vm_prot_t
type for the 'prot' variable passed to vm_mmap() and vm_mmap_object()
as well as the vm_mmap_vnode() and vm_mmap_cdev() helper routines.
Previously some places were using the mmap()-specific PROT_* constants
instead.  While this happens to work because PROT_xx == VM_PROT_xx,
using VM_PROT_* is more correct.

Differential Revision:	https://reviews.freebsd.org/D2658
Reviewed by:	alc (glanced over), kib
MFC after:	1 month
Sponsored by:	Chelsio
2015-06-04 19:41:15 +00:00
Eric van Gyzen
63e4c6cdf9 Provide vnode in memory map info for files on tmpfs
When providing memory map information to userland, populate the vnode pointer
for tmpfs files.  Set the memory mapping to appear as a vnode type, to match
FreeBSD 9 behavior.

This fixes the use of tmpfs files with the dtrace pid provider,
procstat -v, procfs, linprocfs, pmc (pmcstat), and ptrace (PT_VM_ENTRY).

Submitted by:   Eric Badger <eric@badgerio.us> (initial revision)
Obtained from:  Dell Inc.
PR:             198431
MFC after:      2 weeks
Reviewed by:    jhb
Approved by:    kib (mentor)
2015-06-02 18:37:04 +00:00
Xin LI
6e55e724a6 Clear p_stops upon PROCFS_CTL_DETACH, similar to r283889.
Noticed by:	jhb
Reviewed by:	sef
Sponsored by:	iXsystems, Inc.
MFC after:	2 weeks
2015-06-01 18:49:31 +00:00
Rick Macklem
0c419e226c Make the NFS server use shared vnode locks for a few cases
that are allowed by the VFS/VOP interface instead of using
exclusive locks.

MFC after:	2 weeks
2015-05-29 20:22:53 +00:00
Pedro F. Giffuni
e54a659a26 Provide VOP_GETPAGES_ASYNC() for extfs.
Merge the filesystem specific part from r274914 to ext2fs.

I only did regular testing with the change but UFS and our ext2fs
are similar enough that the code should just work with the new
sendfile.

Discussed with:	glebius
2015-05-28 21:06:59 +00:00
Rick Macklem
1f54e596ad Make the size of the hash tables used by the NFSv4 server tunable.
No appreciable change in performance was observed after increasing
the sizes of these tables and then testing with a single client.
However, there was an email that indicated high CPU overheads for
a heavily loaded NFSv4 and it is hoped that increasing the sizes
of the hash tables via these tunables might help.
The tables remain the same size by default.

Differential Revision:	https://reviews.freebsd.org/D2596
MFC after:	2 weeks
2015-05-27 22:00:05 +00:00
Konstantin Belousov
1bc93bb7b9 Currently, softupdate code detects overstepping on the workitems
limits in the code which is deep in the call stack, and owns several
critical system resources, like vnode locks.  Attempt to wait while
the per-mount softupdate thread cleans up the backlog may deadlock,
because the thread might need to lock the same vnode which is owned by
the waiting thread.

Instead of synchronously waiting for the worker, perform the worker'
tickle and pause until the backlog is cleaned, at the safe point
during return from kernel to usermode.  A new ast request to call
softdep_ast_cleanup() is created, the SU code now only checks the size
of queue and schedules ast.

There is no ast delivery for the kernel threads, so they are exempted
from the mechanism, except NFS daemon threads.  NFS server loop
explicitely checks for the request, and informs the schedule_cleanup()
that it is capable of handling the requests by the process P2_AST_SU
flag.  This is needed because nfsd may be the sole cause of the SU
workqueue overflow.  But, to not cause nsfd to spawn additional
threads just because we slow down existing workers, only tickle su
threads, without waiting for the backlog cleanup.

Reviewed by:	jhb, mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-05-27 09:20:42 +00:00
Dmitry Chagin
63cc3320d9 Hide vfs.pfs.trace variable if it is not used. 2015-05-24 18:11:22 +00:00
Rick Macklem
262a84286d The NFS client generated directory block(s) with d_fileno == 0
so that it would not return less data than requested.
Since returning less directory data than requested is not a problem
for FreeBSD and even UFS no longer returns directory structures
with d_fileno == 0, this patch stops the client from doing this.
Although entries with d_fileno == 0 should not be a problem,
the man pages no longer document that these entries should be
ignored, so there was a concern that these entries might be an
issue in the future.

Suggested by:	trasz
Tested by:	trasz
MFC after:	2 weeks
2015-05-23 21:58:41 +00:00
Jung-uk Kim
fd90e2ed54 CALLOUT_MPSAFE has lost its meaning since r141428, i.e., for more than ten
years for head.  However, it is continuously misused as the mpsafe argument
for callout_init(9).  Deprecate the flag and clean up callout_init() calls
to make them more consistent.

Differential Revision:	https://reviews.freebsd.org/D2613
Reviewed by:	jhb
MFC after:	2 weeks
2015-05-22 17:05:21 +00:00
John Baldwin
312827253b Always set p_oppid when attaching to an existing process via procfs
tracing.  This matches the behavior of ptrace(PT_ATTACH).  Also,
the procfs detach request assumes p_oppid is always set.

Reviewed by:	kib
MFC after:	2 weeks
2015-05-22 11:03:51 +00:00
Rick Macklem
86b9457f5b The NFS client wasn't handling getdirentries(2) requests for sizes
that are not an exact multiple of DIRBLKSIZ correctly. Fortunately
readdir(3) always uses an exact multiple of DIRBLKSIZ, so few applications
were affected. This patch fixes this problem by reducing the size
of the directory read to an exact multiple of DIRBLKSIZ.

Tested by:	trasz
Reported by:	trasz
Reviewed by:	trasz
MFC after:	2 weeks
2015-05-21 23:14:18 +00:00
Alexander Motin
a87627b26b Do not promote large async writes to sync.
Present implementation of large sync writes is too strict and so can be
quite slow.  Instead of doing that, execute large async write in chunks,
syncing each chunk separately.

It would be good to fix large sync writes too, but I leave it to somebody
with more skills in this area.

Reviewed by:	rmacklem
MFC after:	1 week
2015-05-14 10:04:42 +00:00
Rick Macklem
2fbb9d563b Fix the NFS server's handling of a bogus NFSv2 ROOT RPC.
The ROOT RPC is deprecated in the NFSv2 RFC, RFC-1094
and should never be used by a client.

Tested by:	thmu@freenet.de
MFC after:	1 week
2015-04-25 00:58:24 +00:00
Rick Macklem
7cfdc2a7bc MAXBSIZE defines both the largest UFS block size and the
largest size for a buffer in the buffer cache. This patch
defines a new constant MAXBCACHEBUF, which is the largest
size for a buffer in the buffer cache. Having a separate
constant allows MAXBCACHEBUF to be set larger than MAXBSIZE
on a per-architecture basis, so that NFS can do larger read/writes
for these architectures. It modifies sys/param.h so that BKVASIZE
can also be set on a per-architecture basis.
A couple of cases where NFS used MAXBSIZE instead of NFS_MAXBSIZE
is fixed as well.

Differential Revision:	https://reviews.freebsd.org/D2330
Reviewed by:	mav, kib
MFC after:	2 weeks
2015-04-25 00:52:01 +00:00
Pedro F. Giffuni
2f39c91019 Prevent a double free.
This is similar to r281756 so set the ptr NULL after free as a safety belt
against future changes.

Obtained from:	HardenedBSD (b2e77ced9ae213d358b44d98f552d9ae4636ecac)
Submitted by:	Oliver Pinter
Revewed by:	rmacklem
2015-04-20 16:40:13 +00:00
Pedro F. Giffuni
a3a4b110da nfsrpc_createv4: fix double free.
Reported by:	Oliver Pinter, clang static checker
Obtained from:	HardenedBSD (commit 63cac77c42c0c3fc67da62f97d5ab651d52ae707)
Reviewed by:	rmacklem
MFC after:	5 days
2015-04-19 23:55:59 +00:00
Alexander Motin
afdfc9a40d Change wcommitsize default from one empirical value to another.
The new value is more predictable with growing RAM size:

        hibufspace maxvnodes      old      new
i386:
  256MB   32980992     15800  2198732  2097152
    2GB   94027776    107677   878764  4194304
amd64:
  256MB   32980992     15800  2198732  2097152
    1GB  114114560     68062  1678155  4194304
    4GB  217055232    111807  1955452  4194304
   16GB 1717846016    337308  5097465 16777216
   64GB 1734918144   1164427  1490479 16777216
  256GB 1734918144   4426453   391983 16777216

Reviewed by:	rmacklem
MFC after:	2 weeks
2015-04-19 11:34:41 +00:00
Edward Tomasz Napierala
50a220c699 Replace "new NFS" with just "NFS" in some sysctl description strings.
Sponsored by:	The FreeBSD Foundation
2015-04-19 06:18:41 +00:00
Pedro F. Giffuni
f738ee4825 Drop experimental dir_index support.
The htree directory index is a highly desirable feature for research
purposes and was meant to improve performance in our ext2/3 driver.
Unfortunately our implementation has two problems:

- It never really delivered any performance improvement.
- It appears to corrupt the filesystem in undetermined circumstances.

Strictly speaking dir_index is not required for read/write support in
ext2/3 and our limited ext4 support still works fine without it.

Regain stability in the ext2 driver by removing it. We may need it back
(fixed) if we want to support encrypted ext4 support but thanks to the
wonders of version control we can always revert this change and bring it
back.

PR:	191895
PR:	198731
PR:	199309

MFC after:	5 days
2015-04-17 22:26:01 +00:00
Rick Macklem
66e80f77d2 mav@ has found that NFS servers exporting ZFS file systems
can perform better when using a 128K read/write data size.
This patch changes NFS_MAXDATA from 64K to 128K so that
clients can use 128K for NFS mounts to allow this.
The patch also renames NFS_MAXDATA to NFS_SRVMAXIO so
that it is clear that it applies to the NFS server side
only. It also avoids a name conflict with the NFS_MAXDATA
defined in rpcsvc/nfs_prot.h, that is used for userland RPC.

Tested by:	mav
Reviewed by:	mav
MFC after:	2 weeks
2015-04-16 22:35:15 +00:00
Rick Macklem
dda11d4ab9 File systems that do not use the buffer cache (such as ZFS) must
use VOP_FSYNC() to perform the NFS server's Commit operation.
This patch adds a mnt_kern_flag called MNTK_USES_BCACHE which
is set by file systems that use the buffer cache. If this flag
is not set, the NFS server always does a VOP_FSYNC().
This should be ok for old file system modules that do not set
MNTK_USES_BCACHE, since calling VOP_FSYNC() is correct, although
it might not be optimal for file systems that use the buffer cache.

Reviewed by:	kib
MFC after:	2 weeks
2015-04-15 20:16:31 +00:00
Will Andrews
677c3c0c66 tmpfs_getattr(): Return more correct allocated byte counts.
For VREG vnodes, return the resident page count (multiplied by PAGE_SIZE)
for the tmpfs node's anonymous VM object that stores actual file contents.

For all other vnodes, return the tmpfs_node's tn_size, which should not
be rounded to a page.

This change allows using stat(2) to identify a sparse file on tmpfs.

Reviewed by:	kib
MFC after:	1 week
2015-04-10 19:04:39 +00:00
Konstantin Belousov
2359e2dcc3 Do not call msdosfs_sync() on the read-only msdosfs mounts. In fact,
it should be a nop for ro.

PR:	199152
Reviewed by:	bde (PR version of the patch)
Submitted by:	longwitz@incore.de
MFC after:	1 week
2015-04-05 21:10:38 +00:00
Konstantin Belousov
420d65d9e4 Assert that an msdosfs mount is not read-only when FAT modifications
are requested.

PR:	199152
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-04-05 21:08:04 +00:00
Konstantin Belousov
bda2eb9ae8 Refine r280308. Do not completely disable timestamping of devfs nodes
on reads or writes, the time marks are used to display idle time by
w(1) [1].  Instead, use vfs.devfs.dotimes as the selector of default
precision vs. using time_second.  The later gives seconds precision,
which is good enough for the purpose.

Note that timestamp updates are unlocked and the updates itself, as
well as the check in devfs_timestamp, are non-atomic.

Noted by:	truckman [1]
Reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-04-01 08:25:40 +00:00
Konstantin Belousov
c66d17c7bc msdosfs: mark unused compat-mount fields
The magic number MSDOSFS_ARGSMAGIC, which used to distinguish
"old" vs "new" msdosfs mount arguments, has not been used since
2005; it should just go away now.

Likewise, the local-to-Unicode table that changed at the same
time is unused.

Leave the space reserved in the old style mount arguments, though,
since we still support the old mount call (via the cmount entry
point).

Submitted by:	Chris Torek <chris.torek@gmail.com>
MFC after:	2 weeks
2015-03-22 09:09:26 +00:00
Xin LI
4f9343fc7c Disable timestamping on devfs read/write operations by default.
Currently we update timestamps unconditionally when doing read or
write operations.  This may slow things down on hardware where
reading timestamps is expensive (e.g. HPET, because of the default
vfs.timestamp_precision setting is nanosecond now) with limited
benefit.

A new sysctl variable, vfs.devfs.dotimes is added, which can be
set to non-zero value when the old behavior is desirable.

Differential Revision:	https://reviews.freebsd.org/D2104
Reported by:	Mike Tancsa <mike sentex net>
Reviewed by:	kib
Relnotes:	yes
Sponsored by:	iXsystems, Inc.
MFC after:	2 weeks
2015-03-21 01:14:11 +00:00
Gleb Smirnoff
4d6481a4c9 o Enhance vm_pager_free_nonreq() function:
- Allow to call the function with vm object lock held.
  - Allow to specify reqpage that doesn't match any page in the region,
    meaning freeing all pages.
o Utilize the new function in couple more places in vnode pager.

Reviewed by:	alc, kib
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2015-03-17 19:19:19 +00:00
Jung-uk Kim
2d427c524d Fix white spaces. 2015-03-02 19:14:58 +00:00
Edward Tomasz Napierala
ead063e0a2 Make fuse(4) respect FOPEN_DIRECT_IO. This is required for correct
operation of GlusterFS.

PR:		192701
Submitted by:	harsha at harshavardhana.net
Reviewed by:	kib@
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2015-03-02 19:04:27 +00:00
Warner Losh
03afe9ba70 nandfs_meta_bread() calls bread() which can set bp to NULL in some
error cases. Calling brelse() with a NULL pointer is not allowed,
so only call brelse() when the bp is non-NULL.

Reported by: Maxime Villard (reported as uninitialized variable)
2015-03-01 21:41:37 +00:00
Alexander Kabaev
0fd841369a Do not leak 'copy' buffer if bmap_truncate_indirect fails.
Reported by: Brainy Code Scanner, by Maxime Villard.
MFC after: 2 weeks
2015-02-28 22:24:45 +00:00
Konstantin Belousov
4fc4286fbc Some fixes for fdescfs lookup code.
Do not ever return doomed vnode from lookup.  This could happen, if
not checked, since dvp is relocked in the 'looking up ourselves' case.

In the other case, since dvp is relocked, mount point might go away
while fdesc_allocvp() is called.  Prevent the situation by doing
vfs_busy() before unlocking dvp.  Reuse the vn_vget_ino_gen() helper.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-02-28 19:57:22 +00:00
Konstantin Belousov
08189ed667 The VNASSERT in vflush() FORCECLOSE case is trying to panic early to
prevent errors from yanking devices out from under filesystems.  Only
care about special vnodes on devfs, special nodes on other kinds of
filesystems do not have special properties.

Sponsored by:  EMC / Isilon Storage Division
Submitted by:   Conrad Meyer
MFC after:	1 week
2015-02-27 16:43:50 +00:00
Pedro F. Giffuni
e5c356b2a2 ext2fs: Plug small memory leak
free() e2fs_contigdirs upon error.
Undo zeroing of e2fs_gd as this was actually a false positive.

X-MFC with:	278790
2015-02-15 14:25:00 +00:00
Pedro F. Giffuni
6be4cf2244 Reuse value of cursize instead of recalculating.
Reported by:	Clang static checker
MFC after:	1 week
2015-02-15 01:34:00 +00:00
Pedro F. Giffuni
f3ee91ed2b Initialize the allocation of variables related to the ext2 allocator.
The e2fs_gd struct was not being initialized and garbage was
being used for hinting the ext2 allocator variant.
Use malloc to clear the values and also initialize e2fs_contigdirs
during allocation to keep consistency.

While here clean up small style issues.

Reported by:	Clang static analyser
MFC after:	1 week
2015-02-15 01:12:15 +00:00
Edward Tomasz Napierala
29836e077a Restore ABI compatibility, broken in r273127. Note that while this fixes
ABI with 10.1, it breaks ABI for 11-CURRENT, so rebuild of automountd(8)
is neccessary.

MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2015-02-10 16:17:16 +00:00
Konstantin Belousov
bf5fce2bee Remove duplicated assignment.
CID:	1267988
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-02-03 12:09:48 +00:00
Konstantin Belousov
e0a60ae16a Update directory times immediately after an entry is created or
removed.  Postponing it until tmpfs_getattr() is called causes
discordant values reported for file times vs. directory times.

Reported and tested by:	madpilot
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-01-31 21:31:53 +00:00
Konstantin Belousov
f1a90a7bac Remove single-use boolean.
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-01-31 12:58:04 +00:00
Konstantin Belousov
311d39f2ee POSIX states that write(2) "shall mark for update the last data
modification and last file status change timestamps of the file".
Currently, tmpfs only modifies ctime when file was extended.  Since
r277828 followed tmpfs_write(), mmaped writes also do not modify
ctime.

Fix this, by updating both ctime and mtime for writes to tmpfs files.

Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-01-31 12:27:18 +00:00
Dimitry Andric
7f4daa88f1 Fix a -Wcast-qual warning in smbfs_subr.c, by using __DECONST. No
functional change.

MFC after:	3 days
2015-01-30 22:02:32 +00:00
Dimitry Andric
38fc0aa484 Fix a -Wcast-qual warning in udf_vnops.c, by using __DECONST. No
functional change.

MFC after:	3 days
2015-01-30 22:01:45 +00:00
Dimitry Andric
03ce3d7219 Fix a bunch of -Wcast-qual warnings in cd9660_util.c, by using
__DECONST.  No functional change.

MFC after:	3 days
2015-01-29 20:40:25 +00:00
Dimitry Andric
09bb4a314a Fix a bunch of -Wcast-qual warnings in msdosfs_conv.c, by using
__DECONST.  No functional change.

MFC after:	3 days
2015-01-29 20:30:13 +00:00
Jamie Gritton
464aad1407 Add allow.mount.fdescfs jail flag.
PR:		192951
Submitted by:	ruben@verweg.com
MFC after:	3 days
2015-01-28 21:08:09 +00:00
Konstantin Belousov
f40cb1c645 Update mtime for tmpfs files modified through memory mapping. Similar
to UFS, perform updates during syncer scans, which in particular means
that tmpfs now performs scan on sync.  Also, this means that a mtime
update may be delayed up to 30 seconds after the write.

The vm_object' OBJ_TMPFS_DIRTY flag for tmpfs swap object is similar
to the OBJ_MIGHTBEDIRTY flag for the vnode object, it indicates that
object could have been dirtied.  Adapt fast page fault handler and
vm_object_set_writeable_dirty() to handle OBJ_TMPFS_NODE same as
OBJT_VNODE.

Reported by:	Ronald Klop <ronald-lists@klop.ws>
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-01-28 10:37:23 +00:00
Konstantin Belousov
3544b0f68f tmpfs does not use UVM on FreeBSD.
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2015-01-28 10:25:35 +00:00
Konstantin Belousov
3b50dff506 Stop enforcing additional reference on all cdevs, which was introduced
in r277199.  Acquire the neccessary reference in delist_dev_locked()
and inform destroy_devl() about it using CDP_UNREF_DTR flag.

Fix some style nits, add asserts.

Discussed with:	hselasky
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-01-19 17:36:52 +00:00
Konstantin Belousov
a57a934a38 Ignore devfs directory entries for devices either being destroyed or
delisted.  The check is racy.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-01-19 17:24:52 +00:00
Enji Cooper
ae266f893f Fix the build when INVARIANTS is defined by restoring bo's definition in
ext2_truncate(..) and by putting it under INVARIANTS ifdefs

X-MFC with: r277354
MFC after: 2 weeks
2015-01-19 07:10:08 +00:00
Pedro F. Giffuni
9a53618ab2 ext2: Garbage-collect some unused variables
Reported by:	clang static analysis
MFC after:	2 weeks
2015-01-19 03:30:45 +00:00