post-create/mkdir directory attributes. This allows the RPC to
name cache the newly created directory and reduces the lookup RPC
count for applications creating a lot of directories.
MFC after: 2 weeks
post-open/create directory attributes. This allows the RPC to
name cache the newly created file and reduces the lookup RPC
count by about 10% for software builds.
MFC after: 2 weeks
attributes. This allows the client to cache directory names
when they are looked up, reducing the Lookup RPC count by
about 40% for software builds.
MFC after: 2 weeks
worse when filling up a device and then trying to erase files to make
space. Without enough space, you can't do that. Also, ensure that the
metadata writes don't generate ENOSPC. They will be retried later
since the buffers are still dirty...
Submitted by: mjg@
When server doesn't support this request, try to use SMB_INFO_ALLOCATION.
And use SMB_COM_QUERY_INFORMATION_DISK request as fallback.
MFC after: 2 weeks
To reduce the diff struct pcu.cnt field was not renamed, so
PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in
kvm(3) and vmstat(8). The goal was to not affect externally used KPI.
Bump __FreeBSD_version_ in case some out-of-tree module/code relies on the
the global cnt variable.
Exp-run revealed no ports using it directly.
No objection from: arch@
Sponsored by: EMC / Isilon Storage Division
ext2fs: minor update to the dirpref policy.
The change in UFS r254996, reverted the change as the
older code seems to work better. This was not visible
in local testing but we can trust UFS is vastly more
exercised in diferent environments.
Bring in a minor change to the dirpref policy based on r248623.
This is pretty minimal change to keep the implementation in
sync with UFS but other parts from the original change are not
directly applicable so don't expect improvements in fsck times.
MFC after: 2 weeks
further refinement is required as some device drivers intended to be
portable over FreeBSD versions rely on __FreeBSD_version to decide whether
to include capability.h.
MFC after: 3 weeks
lookup cookies to be less obscure.
No functional change.
Since r245115, cnt has not really been needed in tmpfs_dir_getdents(). Keep
it for the MPASS() for now though.
Sponsored by: EMC / Isilon Storage Division
MFC after: 2 weeks
Consistently use a single tab after a #define as mentioned in style(9).
Use tabs instead of space for indenting.
Fix a typo: "hash_vesion".
No functional change.
MFC after: 3 days
The ext4 developers tend to tag Ext4-specific flags as
"incompatible" even when such features are not relevant for
read-only support. This is a consequence of the process
though which this filesystem is implemented without design
and the fact that some new features are not extensible to
ext2/3.
Organize the features according to what we support and sort
them so that we can now read-only mount filesystems with
some features that may be found in newly formatted ext4 fs.
Submitted by: Zheng Liu
Reviewed by: pfg
MFC after: 5 days
The ext4 inode flags do not have equivalents for chflags (1)
and hold information that is private to the implementation.
The i_flag field in the inode is a better place to hold the Ext4
inode flags as it saves us from masking flags while setting or
getting attributes. It should also make things cleaner if we
implement write support for Ext4.
Suggested by: bde
Tested by: Mike Ma
MFC after: 3 days
The IN_* flags should be set in i_flag instead of corrupting
i_flags [1].
Re-enable HTree dirindex as the last series of bug fixes
seems to have fixed the issues.
Reported by: bde [1]
Tested by: kevlo
MFC after: 1 week
Use the bitwise negation instead of bogus boolean negation and move
the flag manipulation with the assignment.
Fix some grammatical errors introduced in the same change.
Reported by: bde
MFC after: 3 days
r260545 cleared the inode flags to fix corruption problems but
we still need to pass some EXT4 flags for the ext4 read-only
mode. None of these attributes has an equivalent in FreeBSD and
are uninteresting for the system utilities so they should be
innaccessible in ext2_getattrib().
Note: we also use EXT4_HUGE_FILE but we use it directly from the
dinode structure so it is not necessary to translate it,
Suggested by: bde
MFC after: 3 days
After r252890 we are naively attempting to pass through the
inode flags. This is technically incorrect as the ext2
inode flags don't match the UFS/system values used in
FreeBSD and a clean conversion is needed.
Some filtering was left in place so the change didn't cause
significant changes in FreeBSD but some of the garbage passed
is likely to be the cause for warning messages in linux.
Fix the issue by resetting the flags before conversion as was
done previously. This also means we will not pass the EXT4_*
inode flags into FreeBSD's inode.
PR: kern/185448
MFC after: 3 days
- Introduce additional hash to group requests by hash of sockref. This
allows to process TCP acknowledgements without looping though all the cache,
and as result allows to do it every time.
- Indroduce additional callbacks to notify application layer about sockets
disconnection. Without this last few requests processed just before socket
disconnection never processed their ACKs and stuck in cache for many hours.
- Implement transport-specific method for tracking reply acknowledgements.
New implementation does not cross multiple stack layers to get the data and
does not have race conditions that previously made some requests stuck
in cache. This could be done more efficiently at sockbuf layer, but that
would broke some KBIs, while I don't know other consumers for it aside NFS.
- Instead of traversing all DRC twice per request, run cleaning only once
per request, and except in some conditions traverse only single hash slot
at a time.
Together this limits NFS DRC growth only to situations of real connectivity
problems. If network is working well, and so all replies are acknowledged,
cache remains almost empty even after hours of heavy load. Without this
change on the same test cache was growing to many thousand requests even
with perfectly working local network.
As another result this reduces CPU time spent on the DRC handling during
SPEC NFS benchmark from about 10% to 0.5%.
Sponsored by: iXsystems, Inc.
- Do not update the histogram for items we are any way deleting from cache.
- Do not update the histogram if nfsrc_tcphighwater is not set.
- Remove some extra math operations.
when a Getattr for a file is done by a client other than the one that
holds the file's delegation. This would only happen when delegations
are enabled and the problem is fixed by this patch.
MFC after: 1 week
reported to the freebsd-fs mailing list. I believe the problem was
caused by the Readdir operation using VFS_VGET() for a snapshot file entry
instead of VOP_LOOKUP(). This would not occur for NFSv3, since it
will do a VFS_VGET() of "." which fails with ENOTSUPP at the beginning
of the directory, whereas NFSv4 does not check "." or "..". This
patch adds a call to VFS_VGET() for the directory being read to check
for ENOTSUPP.
I also observed that the mount_on_fileid and fsid attributes were
not correct at the snapshot's auto mountpoints when looking at packet
traces for the Readdir. This patch fixes the attributes by doing a check
for different v_mount structure, even if the vnode v_mountedhere is not
set.
Reported by: jas@cse.yorku.ca
Tested by: jas@cse.yorku.ca
Reviewed by: asomers
MFC after: 1 week
nfsv4_fillattr() as NULLs for the Getattr callback. This caused
nfsv4_fillattr() to not fill in the Change attribute for the reply.
I believe this was a violation of the RFC, but had little effect on
server behaviour. This patch passes a non-NULL p argument to fix this.
MFC after: 1 week
for the Getattr and Recall callbacks. This patch fixes it.
Since the NFSv4.1 specific error codes would only happen for
abnormal circumstances, this patch has little effect, in practice.
MFC after: 1 week
Instead of taking 8 specific bytes of file handle to identify file during
RPC thread affitinity handling, use trivial hash of the full file handle.
ZFS's struct zfid_short does not have padding field after the length field,
as result, originally picked 8 bytes are loosing lower 16 bits of object ID,
causing many false matches and unneeded requests affinity to same thread.
This fix substantially improves NFS server latency and scalability in SPEC
NFS benchmark by more flexible use of multiple NFS threads.
Sponsored by: iXsystems, Inc.
Besides not making sense, open(O_EXEC) for fifo creates fifoinfo with
zero readers and writers counts, which causes premature free of pipes.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Historically creation of device aliases created symbolic links using only
name of target device as a link target, not considering current directory.
Fix that by adding number of "../" chunks to the terget device name,
required to get out of the current directory to devfs root first.
MFC after: 1 month
synchronous (with FILE_SYNC) writes because non-contiguous
byte ranges in the same buffer cache block are being
written. This patch adds a new mount option "noncontigwr"
which allows the non-contiguous byte ranges to be combined,
with the dirty byte range becoming the superset of the bytes
that are dirty, if the file has not been file locked.
This reduces the number of writes significantly for software
builds. The only case where this change might break existing
applications is where an application is writing
non-overlapping byte ranges within the same buffer cache block
of a file from multiple clients concurrently.
Since such an application would normally do file locking on
the file, avoiding the byte range merge for files that have
been file locked should be sufficient for most (maybe all?) cases.
Submitted by: jhb (earlier version)
Reviewed by: kib
MFC after: 3 weeks
According to online documentation [1], Ext4 has two new "special"
inodes so add the new exclude and replica inodes.
Reference:
[1] https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout
Reported by: Mike Ma
MFC after: 3 weeks
option, unbreak the lock tracing release semantic by embedding
calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined
version of the releasing functions for mutex, rwlock and sxlock.
Failing to do so skips the lockstat_probe_func invokation for
unlocking.
- As part of the LOCKSTAT support is inlined in mutex operation, for
kernel compiled without lock debugging options, potentially every
consumer must be compiled including opt_kdtrace.h.
Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the
dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
is linked there and it is only used as a compile-time stub [0].
[0] immediately shows some new bug as DTRACE-derived support for debug
in sfxge is broken and it was never really tested. As it was not
including correctly opt_kdtrace.h before it was never enabled so it
was kept broken for a while. Fix this by using a protection stub,
leaving sfxge driver authors the responsibility for fixing it
appropriately [1].
Sponsored by: EMC / Isilon storage division
Discussed with: rstone
[0] Reported by: rstone
[1] Discussed with: philip
compiler interprets this as an undefined behaviour. Instead, ensure
that the sum of uio_offset and uio_resid is below OFF_MAX using the
operation which cannot overflow.
Reported and tested by: pho
Discussed with: bde
Approved by: des (pseudofs maintainer)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
sources from uio. Both uio_offset and offset, and uio_resid and resid
have the same types for some time.
Add check for buflen overflow by comparing the buflen with both offset
and resid (vs. comparing with offset only, as it is currently done).
Reported and tested by: pho
Approved by: des (pseudofs maintainer)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
The hang occurred in nfsv4_setsequence() when it couldn't find an
available session slot and is fixed by checking for a forced dismount
in progress and just returning for this case.
MFC after: 1 month
CLNT_CONTROL() would be called on "client" after it was
released via CLNT_RELEASE(). It was unlikely that this
code path gets executed and I have not heard of any problem
report caused by this bug. This patch fixes the code so that
this cannot happen.
MFC after: 2 months
to this event, adding if_var.h to files that do need it. Also, include
all includes that now are included due to implicit pollution via if_var.h
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
di_extsize is the EA size and as such it should be unsigned.
Adjust related types for consistency.
Reviewed by: mckusick (previous version)
MFC after: 3 weeks
devfs_iosize_max_clamp sysctl, which allows/disables SSIZE_MAX-sized
i/o requests on the devfs files.
Sponsored by: The FreeBSD Foundation
Reminded by: Dmitry Sivachenko <trtrmitya@gmail.com>
MFC after: 1 week
The kernel normally didn't unmap/context switch away before we accessed
the buffer most of the time, but under heavy I/O pressure and lots of
mount/unmounting this would cause a fault on nofault panic...
Reviewed by: dteske
Approved by: re (kib)
Sponsored by: Vicor
MFC after: 3 days
transmission which could be tricked into rounding up to the nearest
page size, leaking up to a page of kernel memory. [13:11]
In IPv6 and NetATM, stop SIOCSIFADDR, SIOCSIFBRDADDR, SIOCSIFDSTADDR
and SIOCSIFNETMASK at the socket layer rather than pass them on to the
link layer without validation or credential checks. [SA-13:12]
Prevent cross-mount hardlinks between different nullfs mounts of the
same underlying filesystem. [SA-13:13]
Security: CVE-2013-5666
Security: FreeBSD-SA-13:11.sendfile
Security: CVE-2013-5691
Security: FreeBSD-SA-13:12.ifioctl
Security: CVE-2013-5710
Security: FreeBSD-SA-13:13.nullfs
Approved by: re
Our code does not consider yet the case of hash collisions. This
is a rather annoying situation where two or more files that
happen to have the same hash value will not appear accessible.
The situation is not difficult to work-around but given that things
will just work without enabling htree we will save possible
embarrassments for the next release.
Reported by: Kevin Lo
in the future in a backward compatible (API and ABI) way.
The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides place for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.
The structure definition looks like this:
struct cap_rights {
uint64_t cr_rights[CAP_RIGHTS_VERSION + 2];
};
The initial CAP_RIGHTS_VERSION is 0.
The top two bits in the first element of the cr_rights[] array contain total
number of elements in the array - 2. This means if those two bits are equal to
0, we have 2 array elements.
The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain array index. Only one bit is
used and bit position in this five-bits range defines array index. This means
there can be at most five array elements in the future.
To define new right the CAPRIGHT() macro must be used. The macro takes two
arguments - an array index and a bit to set, eg.
#define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL)
We still support aliases that combine few rights, but the rights have to belong
to the same array element, eg:
#define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL)
#define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL)
#define CAP_FCHMODAT (CAP_FCHMOD | CAP_LOOKUP)
There is new API to manage the new cap_rights_t structure:
cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
void cap_rights_set(cap_rights_t *rights, ...);
void cap_rights_clear(cap_rights_t *rights, ...);
bool cap_rights_is_set(const cap_rights_t *rights, ...);
bool cap_rights_is_valid(const cap_rights_t *rights);
void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);
Capability rights to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions are provided by
separating them with commas, eg:
cap_rights_t rights;
cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);
There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:
#define cap_rights_set(rights, ...) \
__cap_rights_set((rights), __VA_ARGS__, 0ULL)
void __cap_rights_set(cap_rights_t *rights, ...);
Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:
cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);
Providing several rights that belongs to the same array's element this way is
correct, but is not advised. It should only be used for aliases definition.
This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.
Sponsored by: The FreeBSD Foundation
is being shut down which were caused by the nfscbd_pool being
destroyed before the backchannel is disabled. This patch is
believed to fix the problem, by simply avoiding ever destroying
the nfscbd_pool. Since the NFS client module cannot be unloaded,
this should not cause a memory leak.
MFC after: 2 weeks
waiting for an RPC reply from the server while holding the mount
point busy (mnt_lockref incremented). This happens because dounmount()
msleep()s waiting for mnt_lockref to become 0, before calling
VFS_UNMOUNT(). This patch adds a new VFS operation called VFS_PURGE(),
which the NFS client implements as purging RPCs in progress. Making
this call before checking mnt_lockref fixes the problem, by ensuring
that the VOP_xxx() calls will fail and unbusy the mount point.
Reported by: sbruno
Reviewed by: kib
MFC after: 2 weeks
UF_SYSTEM, UF_SPARSE, UF_OFFLINE, UF_REPARSE, UF_ARCHIVE, UF_READONLY,
and UF_HIDDEN.
Sort the file flags tmpfs supports alphabetically. tmpfs now
supports the same flags as UFS, with the exception of SF_SNAPSHOT.
Reported by: bdrewery, antoine
Sponsored by: Spectra Logic
and CIFS file attributes as BSD stat(2) flags.
This work is intended to be compatible with ZFS, the Solaris CIFS
server's interaction with ZFS, somewhat compatible with MacOS X,
and of course compatible with Windows.
The Windows attributes that are implemented were chosen based on
the attributes that ZFS already supports.
The summary of the flags is as follows:
UF_SYSTEM: Command line name: "system" or "usystem"
ZFS name: XAT_SYSTEM, ZFS_SYSTEM
Windows: FILE_ATTRIBUTE_SYSTEM
This flag means that the file is used by the
operating system. FreeBSD does not enforce any
special handling when this flag is set.
UF_SPARSE: Command line name: "sparse" or "usparse"
ZFS name: XAT_SPARSE, ZFS_SPARSE
Windows: FILE_ATTRIBUTE_SPARSE_FILE
This flag means that the file is sparse. Although
ZFS may modify this in some situations, there is
not generally any special handling for this flag.
UF_OFFLINE: Command line name: "offline" or "uoffline"
ZFS name: XAT_OFFLINE, ZFS_OFFLINE
Windows: FILE_ATTRIBUTE_OFFLINE
This flag means that the file has been moved to
offline storage. FreeBSD does not have any special
handling for this flag.
UF_REPARSE: Command line name: "reparse" or "ureparse"
ZFS name: XAT_REPARSE, ZFS_REPARSE
Windows: FILE_ATTRIBUTE_REPARSE_POINT
This flag means that the file is a Windows reparse
point. ZFS has special handling code for reparse
points, but we don't currently have the other
supporting infrastructure for them.
UF_HIDDEN: Command line name: "hidden" or "uhidden"
ZFS name: XAT_HIDDEN, ZFS_HIDDEN
Windows: FILE_ATTRIBUTE_HIDDEN
This flag means that the file may be excluded from
a directory listing if the application honors it.
FreeBSD has no special handling for this flag.
The name and bit definition for UF_HIDDEN are
identical to the definition in MacOS X.
UF_READONLY: Command line name: "urdonly", "rdonly", "readonly"
ZFS name: XAT_READONLY, ZFS_READONLY
Windows: FILE_ATTRIBUTE_READONLY
This flag means that the file may not written or
appended, but its attributes may be changed.
ZFS currently enforces this flag, but Illumos
developers have discussed disabling enforcement.
The behavior of this flag is different than MacOS X.
MacOS X uses UF_IMMUTABLE to represent the DOS
readonly permission, but that flag has a stronger
meaning than the semantics of DOS readonly permissions.
UF_ARCHIVE: Command line name: "uarch", "uarchive"
ZFS_NAME: XAT_ARCHIVE, ZFS_ARCHIVE
Windows name: FILE_ATTRIBUTE_ARCHIVE
The UF_ARCHIVED flag means that the file has changed and
needs to be archived. The meaning is same as
the Windows FILE_ATTRIBUTE_ARCHIVE attribute, and
the ZFS XAT_ARCHIVE and ZFS_ARCHIVE attribute.
msdosfs and ZFS have special handling for this flag.
i.e. they will set it when the file changes.
sys/param.h: Bump __FreeBSD_version to 1000047 for the
addition of new stat(2) flags.
chflags.1: Document the new command line flag names
(e.g. "system", "hidden") available to the
user.
ls.1: Reference chflags(1) for a list of file flags
and their meanings.
strtofflags.c: Implement the mapping between the new
command line flag names and new stat(2)
flags.
chflags.2: Document all of the new stat(2) flags, and
explain the intended behavior in a little
more detail. Explain how they map to
Windows file attributes.
Different filesystems behave differently
with respect to flags, so warn the
application developer to take care when
using them.
zfs_vnops.c: Add support for getting and setting the
UF_ARCHIVE, UF_READONLY, UF_SYSTEM, UF_HIDDEN,
UF_REPARSE, UF_OFFLINE, and UF_SPARSE flags.
All of these flags are implemented using
attributes that ZFS already supports, so
the on-disk format has not changed.
ZFS currently doesn't allow setting the
UF_REPARSE flag, and we don't really have
the other infrastructure to support reparse
points.
msdosfs_denode.c,
msdosfs_vnops.c: Add support for getting and setting
UF_HIDDEN, UF_SYSTEM and UF_READONLY
in MSDOSFS.
It supported SF_ARCHIVED, but this has been
changed to be UF_ARCHIVE, which has the same
semantics as the DOS archive attribute instead
of inverse semantics like SF_ARCHIVED.
After discussion with Bruce Evans, change
several things in the msdosfs behavior:
Use UF_READONLY to indicate whether a file
is writeable instead of file permissions, but
don't actually enforce it.
Refuse to change attributes on the root
directory, because it is special in FAT
filesystems, but allow most other attribute
changes on directories.
Don't set the archive attribute on a directory
when its modification time is updated.
Windows and DOS don't set the archive attribute
in that scenario, so we are now bug-for-bug
compatible.
smbfs_node.c,
smbfs_vnops.c: Add support for UF_HIDDEN, UF_SYSTEM,
UF_READONLY and UF_ARCHIVE in SMBFS.
This is similar to changes that Apple has
made in their version of SMBFS (as of
smb-583.8, posted on opensource.apple.com),
but not quite the same.
We map SMB_FA_READONLY to UF_READONLY,
because UF_READONLY is intended to match
the semantics of the DOS readonly flag.
The MacOS X code maps both UF_IMMUTABLE
and SF_IMMUTABLE to SMB_FA_READONLY, but
the immutable flags have stronger meaning
than the DOS readonly bit.
stat.h: Add definitions for UF_SYSTEM, UF_SPARSE,
UF_OFFLINE, UF_REPARSE, UF_ARCHIVE, UF_READONLY
and UF_HIDDEN.
The definition of UF_HIDDEN is the same as
the MacOS X definition.
Add commented-out definitions of
UF_COMPRESSED and UF_TRACKED. They are
defined in MacOS X (as of 10.8.2), but we
do not implement them (yet).
ufs_vnops.c: Add support for getting and setting
UF_ARCHIVE, UF_HIDDEN, UF_OFFLINE, UF_READONLY,
UF_REPARSE, UF_SPARSE, and UF_SYSTEM in UFS.
Alphabetize the flags that are supported.
These new flags are only stored, UFS does
not take any action if the flag is set.
Sponsored by: Spectra Logic
Reviewed by: bde (earlier version)
DRC for NFS over TCP.
- Increase the size of the hash tables.
- Create a separate mutex for each hash list of the TCP hash table.
- Single thread the code that deletes stale cache entries.
- Add a tunable called vfs.nfsd.tcphighwater, which can be increased
to allow the cache to grow larger, avoiding the overhead of frequent
scans to delete stale cache entries.
(The default value will result in frequent scans to delete stale cache
entries, analagous to what the pre-patched code does.)
- Add a tunable called vfs.nfsd.cachetcp that can be used to disable
DRC caching for NFS over TCP, since the old NFS server didn't DRC cache TCP.
It also adjusts the size of nfsrc_floodlevel dynamically, so that it is
always greater than vfs.nfsd.tcphighwater.
For UDP the algorithm remains the same as the pre-patched code, but the
tunable vfs.nfsd.udphighwater can be used to allow the cache to grow
larger and reduce the overhead caused by frequent scans for stale entries.
UDP also uses a larger hash table size than the pre-patched code.
Reported by: wollman
Tested by: wollman (earlier version of patch)
Submitted by: ivoras (earlier patch)
Reviewed by: jhb (earlier version of patch)
MFC after: 1 month
Previous bandaid was not appropriate and didn't really work for
all platforms. While here, cleanup the surrounding code to match
ffs_checkoverlap()
Reported by: dim, jmallet and bde
MFC after: 3 weeks
Add definitions for e2fs_daddr_t, e4fs_daddr_t in addition
to the already existing e2fs_lbn_t and adjust them for ext4.
Other than making the code more readable these changes should
fix problems related to big filesystems.
Setting the proper types can be tricky so the process was
helped by looking at UFS. In our implementation, logical block
numbers can be negative and the code depends on it. In ext2,
block numbers are unsigned so it is convenient to keep
e2fs_daddr_t unsigned and use the complete 32 bits. In the
case of e4fs_daddr_t, while the value should be unsigned, for
ext4 we only need to support 48 bits so preserving an extra
bit from the sign is not an issue.
While here also drop the ext2_setblock() prototype that was
never used.
Discussed with: mckusick, bde
MFC after: 3 weeks
Basic support for extents was implemented by Zheng Liu as part
of his Google Summer of Code in 2010. This support is read-only
at this time.
In addition to extents we also support the huge_file extension
for read-only purposes. This works nicely with the additional
support for birthtime/nanosec timestamps and dir_index that
have been added lately.
The implementation may not work for all ext4 filesystems as
it doesn't support some features that are being enabled by
default on recent linux like flex_bg. Nevertheless, the feature
should be very useful for migration or simple access in
filesystems that have been converted from ext2/3 or don't use
incompatible features.
Special thanks to Zheng Liu for his dedication and continued
work to support ext2 in FreeBSD.
Submitted by: Zheng Liu (lz@)
Reviewed by: Mike Ma, Christoph Mallon (previous version)
Sponsored by: Google Inc.
MFC after: 3 weeks
Unify the 2 concept into a real, minimal, sxlock where the shared
acquisition represent the soft busy and the exclusive acquisition
represent the hard busy.
The old VPO_WANTED mechanism becames the hard-path for this new lock
and it becomes per-page rather than per-object.
The vm_object lock becames an interlock for this functionality:
it can be held in both read or write mode.
However, if the vm_object lock is held in read mode while acquiring
or releasing the busy state, the thread owner cannot make any
assumption on the busy state unless it is also busying it.
Also:
- Add a new flag to directly shared busy pages while vm_page_alloc
and vm_page_grab are being executed. This will be very helpful
once these functions happen under a read object lock.
- Move the swapping sleep into its own per-object flag
The KPI is heavilly changed this is why the version is bumped.
It is very likely that some VM ports users will need to change
their own code.
Sponsored by: EMC / Isilon storage division
Discussed with: alc
Reviewed by: jeff, kib
Tested by: gavin, bapt (older version)
Tested by: pho, scottl
in particular, from the tmpfs_lookup VOP method. If LK_NOWAIT is not
specified in the lkflags, the lookup is supposed to return an alive
vnode whenever the underlying node is valid.
Currently, the tmpfs_alloc_vp() returns ENOENT if the vnode attached
to node exists and is being reclaimed. This causes spurious ENOENT
errors from lookup on tmpfs and corresponding random 'No such file'
failures from syscalls working with tmpfs files.
Fix this by waiting for the doomed vnode to be detached from the tmpfs
node if sleepable allocation is requested.
Note that filesystems which use vfs_hash.c, correctly handle the case
due to vfs_hash_get() looping when vget() returns ENOENT for sleepable
requests.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
We cannot busy a page before doing pagefaults.
Infact, it can deadlock against vnode lock, as it tries to vget().
Other functions, right now, have an opposite lock ordering, like
vm_object_sync(), which acquires the vnode lock first and then
sleeps on the busy mechanism.
Before this patch is reinserted we need to break this ordering.
Sponsored by: EMC / Isilon storage division
Reported by: kib
- It does not let pages respect the LRU policy
- It bloats the active/inactive queues of few pages
Try to avoid it as much as possible with the long-term target to
completely remove it.
Use the soft-busy mechanism to protect page content accesses during
short-term operations (like uiomove_fromphys()).
After this change only vm_fault_quick_hold_pages() is still using the
hold mechanism for page content access.
There is an additional complexity there as the quick path cannot
immediately access the page object to busy the page and the slow path
cannot however busy more than one page a time (to avoid deadlocks).
Fixing such primitive can bring to complete removal of the page hold
mechanism.
Sponsored by: EMC / Isilon storage division
Discussed with: alc
Reviewed by: jeff
Tested by: pho
kern_sendfile() which is unnecessary.
The page is already wired so it will not be subjected to pagefault.
The content cannot be effectively protected as it is full of races
already.
Multiple accesses to the same indexes are serialized through vn_rdwr().
Sponsored by: EMC / Isilon storage division
Reviewed by: alc, jeff
Tested by: pho
The htree implementation uses code derived from the
RSA Data Security, Inc. MD4 Message-Digest Algorithm.
Add a proper licensing statement for the code and clarify
the corresponding comments.
Approved by: core (hrs)
Before this change path matching had the following features:
- for device nodes the patterns were matched against full path
- in the above case '/' in a path could be matched by a wildcard
- for directories and links only the last component was matched
So, for example, a pattern like 're*' could match the following entries:
- re0 device
- responder/u0 device
- zvol/recpool directory
Although it was possible to work around this behavior (once it was spotted
and understood), it was very confusing and contrary to documentation.
Now we always match a full path for all types of devfs entries (devices,
directories, links) and a '/' has to be matched explicitly.
This behavior follows the shell globbing rules.
This change is originally developed by Jaakko Heinonen.
Many thanks!
PR: kern/122838
Submitted by: jh
MFC after: 4 weeks
as in <sys/dirent.h>
ext2_readdir() has always been very fs specific and different
with respect to its ufs_ counterpart. Recent changes from UFS
have made it possible to share more closely the implementation.
MFUFS r252438:
Always start parsing at DIRBLKSIZ aligned offset, skip first entries if
uio_offset is not DIRBLKSIZ aligned. Return EINVAL if buffer is too
small for single entry.
Preallocate buffer for cookies.
Skip entries with zero inode number.
Reviewed by: gleb, Zheng Liu
MFC after: 1 month
the attribute bitmap argument would be non-zero. This caused an
interoperability problem for a recent patch to the Linux NFSv4 client.
The Linux folks have changed their patch to avoid this, but this
patch fixes the problem on the server.
Reported and tested by: Andre Heider (a.heider@gmail.com)
MFC after: 3 days
The creation time support breaks the data structures used in linux
fuse. libfuse carries it's own header.
Revert the changes for now. We will try to get an agreement with the
fuse upstream maintainers to avoid having to patch the library
headers all the time.
Recalculate FUSE_COMPAT_ENTRY_OUT_SIZE and COMPAT_ATTR_OUT_SIZE.
These were wrong in the previous commit. They are actually unused
in FreeBSD though.
Pointed out by: Jan Beich
When birthtime was added (r253331) we missed adding the weight
of the new fields in FUSE_COMPAT_ENTRY_OUT_SIZE and
COMPAT_ATTR_OUT_SIZE. Adjust them accordingly.
Pointed out by: Jan Beich
Bring in the changes from the FUSE kernel interface 7.10
(available under a BSD license).
After 7.10 the linux FUSE developers added support for a
controversial CUSE driver and some linux especific
features that are unlikely to find its way into FreeBSD.
We currently don't implement any of the new features so we
are *not* bumping the FUSE_KERNEL_MINOR_VERSION. The header
should, nevertheless, serve as a template to add the new
features in a compatible manner.
While here adopt some minor cleanups from the upstream version
like removing FUSE_MAJOR and FUSE_MINOR which were never
used. Also add multiple inclusion header guards,
I was keeping this #ifdef'd for reference with the MacFUSE change[1]
but on second thought, this is a FreeBSD-only header so the SVN
history should be enough.
Add missing padding while here.
Reference [1]:
http://code.google.com/p/macfuse/source/detail?spec=svn1686&r=1360
Merge from UFS r231313:
This change first attempts the uiomove() to the newly allocated
(and dirty) buffer and only zeros it if the uiomove() fails. The
effect is to eliminate the gratuitous zeroing of the buffer in
the usual case where the uiomove() successfully fills it.
MFC after: 3 days
credentials to the kernel rpc. Modify the NFSv4 client to add
support for the gssname and allgssname mount options to use this
capability. Requires the gssd daemon to be running with the "-h" option.
Reviewed by: jhb