3065 Commits

Author SHA1 Message Date
Konstantin Belousov
b1dd38f408 Restore the previous sendfile(2) behaviour on the block devices.
Provide valid .fo_sendfile method for several missed struct fileops.

Reviewed by:	glebius
Sponsored by:	The FreeBSD Foundation
2013-08-16 14:22:20 +00:00
Rick Macklem
93c5875b24 Fix several performance related issues in the new NFS server's
DRC for NFS over TCP.
- Increase the size of the hash tables.
- Create a separate mutex for each hash list of the TCP hash table.
- Single thread the code that deletes stale cache entries.
- Add a tunable called vfs.nfsd.tcphighwater, which can be increased
  to allow the cache to grow larger, avoiding the overhead of frequent
  scans to delete stale cache entries.
  (The default value will result in frequent scans to delete stale cache
   entries, analagous to what the pre-patched code does.)
- Add a tunable called vfs.nfsd.cachetcp that can be used to disable
  DRC caching for NFS over TCP, since the old NFS server didn't DRC cache TCP.
It also adjusts the size of nfsrc_floodlevel dynamically, so that it is
always greater than vfs.nfsd.tcphighwater.

For UDP the algorithm remains the same as the pre-patched code, but the
tunable vfs.nfsd.udphighwater can be used to allow the cache to grow
larger and reduce the overhead caused by frequent scans for stale entries.
UDP also uses a larger hash table size than the pre-patched code.

Reported by:	wollman
Tested by:	wollman (earlier version of patch)
Submitted by:	ivoras (earlier patch)
Reviewed by:	jhb (earlier version of patch)
MFC after:	1 month
2013-08-14 21:11:26 +00:00
Pedro F. Giffuni
4a62545173 ext2fs: update format specifiers for ext4 type.
Previous bandaid was not appropriate and didn't really work for
all platforms. While here, cleanup the surrounding code to match
ffs_checkoverlap()

Reported by:	dim, jmallet and bde
MFC after:	3 weeks
2013-08-14 14:22:46 +00:00
Pedro F. Giffuni
88ae190ea0 ext2fs: update format specifiers for ext4 type.
Reported by:	Sam Fourman Jr.
MFC after:	3 weeks
2013-08-13 18:39:36 +00:00
Pedro F. Giffuni
70097aac13 Define ext2fs local types and use them.
Add definitions for e2fs_daddr_t, e4fs_daddr_t in addition
to the already existing e2fs_lbn_t and adjust them for ext4.
Other than making the code more readable these changes should
fix problems related to big filesystems.

Setting the proper types can be tricky so the process was
helped by looking at UFS. In our implementation, logical block
numbers can be negative and the code depends on it. In ext2,
block numbers are unsigned so it is convenient to keep
e2fs_daddr_t unsigned and use the complete 32 bits. In the
case of e4fs_daddr_t, while the value should be unsigned, for
ext4 we only need to support 48 bits so preserving an extra
bit from the sign is not an issue.

While here also drop the ext2_setblock() prototype that was
never used.

Discussed with:	mckusick, bde
MFC after:	3 weeks
2013-08-13 15:40:43 +00:00
Pedro F. Giffuni
d7511a40a7 Add read-only support for extents in ext2fs.
Basic support for extents was implemented by Zheng Liu as part
of his Google Summer of Code in 2010. This support is read-only
at this time.

In addition to extents we also support the huge_file extension
for read-only purposes. This works nicely with the additional
support for birthtime/nanosec timestamps and dir_index that
have been added lately.

The implementation may not work for all ext4 filesystems as
it doesn't support some features that are being enabled by
default on recent linux like flex_bg. Nevertheless, the feature
should be very useful for migration or simple access in
filesystems that have been converted from ext2/3 or don't use
incompatible features.

Special thanks to Zheng Liu for his dedication and continued
work to support ext2 in FreeBSD.

Submitted by:	Zheng Liu (lz@)
Reviewed by:	Mike Ma, Christoph Mallon (previous version)
Sponsored by:	Google Inc.
MFC after:	3 weeks
2013-08-12 21:34:48 +00:00
Attilio Rao
c7aebda8a1 The soft and hard busy mechanism rely on the vm object lock to work.
Unify the 2 concept into a real, minimal, sxlock where the shared
acquisition represent the soft busy and the exclusive acquisition
represent the hard busy.
The old VPO_WANTED mechanism becames the hard-path for this new lock
and it becomes per-page rather than per-object.
The vm_object lock becames an interlock for this functionality:
it can be held in both read or write mode.
However, if the vm_object lock is held in read mode while acquiring
or releasing the busy state, the thread owner cannot make any
assumption on the busy state unless it is also busying it.

Also:
- Add a new flag to directly shared busy pages while vm_page_alloc
  and vm_page_grab are being executed.  This will be very helpful
  once these functions happen under a read object lock.
- Move the swapping sleep into its own per-object flag

The KPI is heavilly changed this is why the version is bumped.
It is very likely that some VM ports users will need to change
their own code.

Sponsored by:	EMC / Isilon storage division
Discussed with:	alc
Reviewed by:	jeff, kib
Tested by:	gavin, bapt (older version)
Tested by:	pho, scottl
2013-08-09 11:11:11 +00:00
Pedro F. Giffuni
95f1f8d262 Small typo.
MFC after:	3 days
2013-08-08 22:07:59 +00:00
Konstantin Belousov
8239a7a878 The tmpfs_alloc_vp() is used to instantiate vnode for the tmpfs node,
in particular, from the tmpfs_lookup VOP method.  If LK_NOWAIT is not
specified in the lkflags, the lookup is supposed to return an alive
vnode whenever the underlying node is valid.

Currently, the tmpfs_alloc_vp() returns ENOENT if the vnode attached
to node exists and is being reclaimed.  This causes spurious ENOENT
errors from lookup on tmpfs and corresponding random 'No such file'
failures from syscalls working with tmpfs files.

Fix this by waiting for the doomed vnode to be detached from the tmpfs
node if sleepable allocation is requested.

Note that filesystems which use vfs_hash.c, correctly handle the case
due to vfs_hash_get() looping when vget() returns ENOENT for sleepable
requests.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-08-05 18:53:59 +00:00
Attilio Rao
be99683637 Revert r253939:
We cannot busy a page before doing pagefaults.
Infact, it can deadlock against vnode lock, as it tries to vget().
Other functions, right now, have an opposite lock ordering, like
vm_object_sync(), which acquires the vnode lock first and then
sleeps on the busy mechanism.

Before this patch is reinserted we need to break this ordering.

Sponsored by:	EMC / Isilon storage division
Reported by:	kib
2013-08-05 08:55:35 +00:00
Attilio Rao
3b6714cacb The page hold mechanism is fast but it has couple of fallouts:
- It does not let pages respect the LRU policy
- It bloats the active/inactive queues of few pages

Try to avoid it as much as possible with the long-term target to
completely remove it.
Use the soft-busy mechanism to protect page content accesses during
short-term operations (like uiomove_fromphys()).

After this change only vm_fault_quick_hold_pages() is still using the
hold mechanism for page content access.
There is an additional complexity there as the quick path cannot
immediately access the page object to busy the page and the slow path
cannot however busy more than one page a time (to avoid deadlocks).

Fixing such primitive can bring to complete removal of the page hold
mechanism.

Sponsored by:	EMC / Isilon storage division
Discussed with:	alc
Reviewed by:	jeff
Tested by:	pho
2013-08-04 21:07:24 +00:00
Attilio Rao
878a788734 Remove unnecessary soft busy of the page before to do vn_rdwr() in
kern_sendfile() which is unnecessary.
The page is already wired so it will not be subjected to pagefault.
The content cannot be effectively protected as it is full of races
already.
Multiple accesses to the same indexes are serialized through vn_rdwr().

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc, jeff
Tested by:	pho
2013-08-04 15:56:19 +00:00
Pedro F. Giffuni
d192e40f77 Add license for the half MD4 algorithm used in ext2_half_md4().
The htree implementation uses code derived from the
RSA Data Security, Inc. MD4 Message-Digest Algorithm.

Add a proper licensing statement for the code and clarify
the corresponding comments.

Approved by:	core (hrs)
2013-08-01 16:04:48 +00:00
Marius Strobl
cd67748bde - Add const-qualifiers to the arguments of isonum_*().
- According to ISO 9660 7.1.2, isonum_712() should return a signed value.
- Try to get isonum_*() closer to style(9).
2013-07-28 12:29:10 +00:00
Andriy Gapon
8e94193e58 make path matching in devfs rules consistent and sane (and safer)
Before this change path matching had the following features:
- for device nodes the patterns were matched against full path
- in the above case '/' in a path could be matched by a wildcard
- for directories and links only the last component was matched

So, for example, a pattern like 're*' could match the following entries:
- re0 device
- responder/u0 device
- zvol/recpool directory

Although it was possible to work around this behavior (once it was spotted
and understood), it was very confusing and contrary to documentation.

Now we always match a full path for all types of devfs entries (devices,
directories, links) and a '/' has to be matched explicitly.
This behavior follows the shell globbing rules.

This change is originally developed by Jaakko Heinonen.
Many thanks!

PR:		kern/122838
Submitted by:	jh
MFC after:	4 weeks
2013-07-26 14:25:58 +00:00
Pedro F. Giffuni
9670f48107 ext2fs: Return EINVAL for negative uio_offset as in UFS.
While here drop old comment that doesn't really apply.

MFC after:	1 month
Discussed with:	gleb
2013-07-25 19:37:49 +00:00
Pedro F. Giffuni
0b54fe540c ext2fs: Drop a check that wan't supposed to be in r253651.
MFC after:	1 month
2013-07-25 16:04:55 +00:00
Pedro F. Giffuni
78d912bbc3 ext2fs: Don't assume that on-disk format of a directory is the same
as in <sys/dirent.h>

ext2_readdir() has always been very fs specific and different
with respect to its ufs_ counterpart. Recent changes from UFS
have made it possible to share more closely the implementation.

MFUFS r252438:
Always start parsing at DIRBLKSIZ aligned offset, skip first entries if
uio_offset is not DIRBLKSIZ aligned. Return EINVAL if buffer is too
small for single entry.

Preallocate buffer for cookies.

Skip entries with zero inode number.

Reviewed by:	gleb, Zheng Liu
MFC after:	1 month
2013-07-25 15:34:20 +00:00
Pedro F. Giffuni
7d20a270cc fuse: revert kernel_header update.
It seems to be causing problems due to the lack of the new features.

Found by:	bapt
Pointed hat:	pfg
2013-07-24 20:21:29 +00:00
Nathan Whitehorn
59169d9156 tmpfs works perfectly fine with -o union -- there is no reason to exclude it
from the list of options.
2013-07-23 14:48:37 +00:00
Rick Macklem
a36b76a787 The NFSv4 server incorrectly assumed that the high order words of
the attribute bitmap argument would be non-zero. This caused an
interoperability problem for a recent patch to the Linux NFSv4 client.
The Linux folks have changed their patch to avoid this, but this
patch fixes the problem on the server.

Reported and tested by:	Andre Heider (a.heider@gmail.com)
MFC after:	3 days
2013-07-20 22:35:32 +00:00
Pedro F. Giffuni
feba8afb59 fuse: revert birthtime support.
The creation time support breaks the data structures used in linux
fuse.  libfuse carries it's own header.

Revert the changes for now. We will try to get an agreement with the
fuse  upstream maintainers to avoid having to patch the library
headers all the time.
2013-07-20 14:50:35 +00:00
Pedro F. Giffuni
77b8f8a998 Adjust outsizes:
Recalculate FUSE_COMPAT_ENTRY_OUT_SIZE and COMPAT_ATTR_OUT_SIZE.
These were wrong in the previous commit. They are actually unused
in FreeBSD though.

Pointed out by:	Jan Beich
2013-07-20 03:55:56 +00:00
Pedro F. Giffuni
05ad761667 Adjust outsizes:
When birthtime was added (r253331) we missed adding the weight
of the new fields in FUSE_COMPAT_ENTRY_OUT_SIZE and
COMPAT_ATTR_OUT_SIZE. Adjust them accordingly.

Pointed out by:	Jan Beich
2013-07-20 03:08:50 +00:00
Pedro F. Giffuni
c230e70881 Update fuse_kernel header.
Bring in the changes from the FUSE kernel interface 7.10
(available under a BSD license).

After 7.10 the linux FUSE developers added support for a
controversial CUSE driver and some linux especific
features that are unlikely to find its way into FreeBSD.

We currently don't implement any of the new features so we
are *not* bumping the FUSE_KERNEL_MINOR_VERSION. The header
should, nevertheless, serve  as a template to add the new
features in a compatible manner.

While here adopt some minor cleanups from the upstream version
like removing FUSE_MAJOR and FUSE_MINOR which were never
used. Also add multiple inclusion header guards,
2013-07-15 00:05:27 +00:00
Pedro F. Giffuni
da7d8f2a65 Add creation timestamp (birthtime) support for fuse.
I was keeping this #ifdef'd for reference with the MacFUSE change[1]
but on second thought, this is a FreeBSD-only header so the SVN
history should be enough.

Add missing padding while here.

Reference [1]:
http://code.google.com/p/macfuse/source/detail?spec=svn1686&r=1360
2013-07-13 22:06:41 +00:00
Pedro F. Giffuni
944d37b123 Add creation timestamp (birthtime) support for fuse.
This is based on similar support in MacFUSE.
2013-07-12 17:22:59 +00:00
Pedro F. Giffuni
c5249f35b8 Implement 1003.1-2001 pathconf() keys.
This is based on r106058 in UFS.

MFC after:	1 month
2013-07-10 22:03:01 +00:00
Pedro F. Giffuni
db20714a87 Reinstate the assertion from r253045.
UFS r232732 reverted the change as the real problem was to be fixed
at the syscall level.

Reported by:	bde
2013-07-09 14:23:00 +00:00
Pedro F. Giffuni
bf3c9330ba Enhancement when writing an entire block of a file.
Merge from UFS r231313:

This change first attempts the uiomove() to the newly allocated
(and dirty) buffer and only zeros it if the uiomove() fails. The
effect is to eliminate the gratuitous zeroing of the buffer in
the usual case where the uiomove() successfully fills it.

MFC after:	3 days
2013-07-09 01:31:04 +00:00
Rick Macklem
88a2437a65 Add support for host-based (Kerberos 5 service principal) initiator
credentials to the kernel rpc. Modify the NFSv4 client to add
support for the gssname and allgssname mount options to use this
capability. Requires the gssd daemon to be running with the "-h" option.

Reviewed by:	jhb
2013-07-09 01:05:28 +00:00
Pedro F. Giffuni
7ce75e5f1f Avoid a panic and return EINVAL instead.
Merge from UFS r232692:
syscall() fuzzing can trigger this panic.

MFC after:	3 days
2013-07-08 20:21:36 +00:00
Pedro F. Giffuni
bdf1d79884 Implement SEEK_HOLE/SEEK_DATA for ext2fs.
Merged from r236044 on UFS.

MFC after:	3 days
2013-07-07 15:51:28 +00:00
Pedro F. Giffuni
d66aed2e76 Fix some typos.
MFC after:	1 week
2013-07-07 01:32:52 +00:00
Pedro F. Giffuni
91f5a4670f Initial implementation of the HTree directory index.
This is a port of NetBSD's GSoC 2012 Ext3 HTree directory indexing
by Vyacheslav Matyushin.  It was cleaned up and enhanced for FreeBSD
by Zheng Liu (lz@).

This is an excellent example of work shared among different projects:
Vyacheslav was able to look at an early prototype from Zheng Liu who
was also able to check the code from Haiku (with permission).

As in linux, the feature is not available by default and must be
enabled explicitly with tune2fs. We still do not support the
workarounds required in readdir for NFS.

Submitted by:	Zheng Liu
Tested by:	Mike Ma
Sponsored by:	Google Inc.
MFC after:	1 week
2013-07-06 18:28:06 +00:00
Konstantin Belousov
18a8d3d7f8 The tvp vnode on rename is usually unlinked. Drop the cached null
vnode for tvp to allow the free of the lower vnode, if needed.

PR:	kern/180236
Tested by:	smh
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-07-04 19:01:18 +00:00
Davide Italiano
9e9421bcdf - Fix double frees/user after free.
- Allocate using smb_rq_alloc() instead of inlining it.

Reported by:	uqs
Found with:	Coverity Scan
2013-07-03 10:31:45 +00:00
Rick Macklem
a820822ec8 A problem with the old NFS client where large writes to large files
would sometimes result in a corrupted file was reported via email.
This problem appears to have been caused by r251719 (reverting
r251719 fixed the problem). Although I have not been able to
reproduce this problem, I suspect it is caused by another thread
increasing np->n_size after the mtx_unlock(&np->n_mtx) but before
the vnode_pager_setsize() call. Since the np->n_mtx mutex serializes
updates to np->n_size, doing the vnode_pager_setsize() with the
mutex locked appears to avoid the problem.
Unfortunately, vnode_pager_setsize() where the new size is smaller,
cannot be called with a mutex held.
This patch returns the semantics to be close to pre-r251719 (actually
pre-r248567, r248581, r248567 for the new client) such that the call to
vnode_pager_setsize() is only delayed until after the mutex is
unlocked when np->n_size is shrinking. Since the file is growing
when being written, I believe this will fix the corruption.
A better solution might be to replace the mutex with a sleep lock,
but that is a non-trivial conversion, so this fix is hoped to be
sufficient in the meantime.

Reported by:	David G. Lawrence (dg@dglawrence.com)
Tested by:	David G. Lawrence (to be done soon)
Reviewed by:	kib
MFC after:	1 week
2013-07-03 00:19:03 +00:00
Pedro F. Giffuni
e2bc2ccec0 ext2fs: Use the complete random() range in i_gen.
i_gen is unsigned in ext2fs so we can handle the complete
32 bits.

MFC after:	1 week
2013-06-30 00:42:51 +00:00
Pedro F. Giffuni
d849f17dca Bring some updates from ufs_lookup to ext2fs.
r156418:

Don't set IN_CHANGE and IN_UPDATE on inodes for potentially suspended
file systems.  This could cause deadlocks when creating snapshots.
(We can't do snapshots on ext2fs but it is useful to keep things in sync).

r183079:

- Only set i_offset in the parent directory's i-node during a lookup for
  non-LOOKUP operations.
- Relax a VOP assertion for a DELETE lookup.

r187528:

Move the code from ufs_lookup.c used to do dotdot lookup, into
the helper function. It is supposed to be useful for any filesystem
that has to unlock dvp to walk to the ".." entry in lookup routine.

MFC after:	5 days
2013-06-29 01:35:28 +00:00
Davide Italiano
bbc6d2c1af Properly use v_data field. This magically worked (even if wrong) until
now because v_data is the first field of the structure, but it's not
something we should rely on.
2013-06-28 20:32:48 +00:00
Davide Italiano
189e41259b Garbage collect an useless check. smp should be never NULL. 2013-06-28 20:14:30 +00:00
Davide Italiano
c7d2e4cf9b Plug a couple of leakages in smbfs_lookup(). 2013-06-28 20:07:24 +00:00
Pedro F. Giffuni
fafb835a0b Minor sorting.
MFC after:	3 days
2013-06-26 19:43:22 +00:00
Pedro F. Giffuni
da057ed2d3 Define and use e2fs_lbn_t in ext2fs.
In line to what is done in UFS, define an internal type
e2fs_lbn_t for the logical block numbers.

This change is basically a no-op as the new type is unchanged
(int32_t) but it may be useful as bumping this may be required
for ext4fs.

Also, as pointed out by Bruce Evans:

-Use daddr_t for daddr in ext2_bmaparray(). This seems to
improve reliability with the reallocblks option.
- Add a cast to the fsbtodb() macro as in UFS.

Reviewed by:	bde
MFC after:	3 days
2013-06-23 02:44:42 +00:00
Rick Macklem
2e6a4b0c55 Fix r252074 so that it builds on 64bit arches. 2013-06-22 21:58:21 +00:00
Rick Macklem
1dd95a046c The NFSv4.1 LayoutCommit operation requires a valid offset and length.
(0, 0 is not sufficient) This patch a loop for each file layout, using
the offset, length of each file layout in a separate LayoutCommit.
2013-06-21 22:46:16 +00:00
Rick Macklem
562395581b When the NFSv4.1 client is writing to a pNFS Data Server (DS), the
file's size attribute does not get updated. As such, it is necessary
to invalidate the attribute cache before clearing NMODIFIED for pNFS.

MFC after:	2 weeks
2013-06-21 22:26:18 +00:00
Rick Macklem
315c38d135 Since some NFSv4 servers enforce the requirement for a reserved port#,
enable use of the (no)resvport mount option for NFSv4. I had thought
that the RFC required that non-reserved port #s be allowed, but I couldn't
find it in the RFC.

MFC after:	2 weeks
2013-06-21 19:41:30 +00:00
Pedro F. Giffuni
3f5747b69d Rename some prefixes in the Block Group Descriptor fields to ext4bgd_
Change prefix to avoid confusion and denote that these fields
are generally only available starting with ext4.

MFC after:	3 days
2013-06-20 00:00:33 +00:00