by doing most of the work in a new function prison_add_vfs in kern_jail.c
Now a jail-enabled filesystem need only mark itself with VFCF_JAIL, and
the rest is taken care of. This includes adding a jail parameter like
allow.mount.foofs, and a sysctl like security.jail.mount_foofs_allowed.
Both of these used to be a static list of known filesystems, with
predefined permission bits.
Reviewed by: kib
Differential Revision: D14681
global to per-domain state. Protect reservations with the free lock
from the domain that they belong to. Refactor to make vm domains more
of a first class object.
Reviewed by: markj, kib, gallatin
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14000
pathconf(2) and fpathconf(2) both return a long. The kern_[f]pathconf()
functions now accept a pointer to a long value rather than modifying
td_retval directly. Instead, the system calls explicitly store the
returned long value in td_retval[0].
Requested by: bde
Reviewed by: kib
Sponsored by: Chelsio Communications
On the one hand, FIFOs should respect other variables not supported by
the fifofs vnode operation (such as _PC_NAME_MAX, _PC_LINK_MAX, etc.).
These values are fs-specific and must come from a fs-specific method.
On the other hand, filesystems that support FIFOs are required to
support _PC_PIPE_BUF on directory vnodes that can contain FIFOs.
Given this latter requirement, once the fs-specific VOP_PATHCONF
method supports _PC_PIPE_BUF for directories, it is also suitable for
FIFOs permitting a single VOP_PATHCONF method to be used for both
FIFOs and non-FIFOs.
To that end, retire all of the FIFO-specific pathconf methods from
filesystems and change FIFO-specific vnode operation switches to use
the existing fs-specific VOP_PATHCONF method. For fifofs, set it's
VOP_PATHCONF to VOP_PANIC since it should no longer be used.
While here, move _PC_PIPE_BUF handling out of vop_stdpathconf() so that
only filesystems supporting FIFOs will report a value. In addition,
only report a valid _PC_PIPE_BUF for directories and FIFOs.
Discussed with: bde
Reviewed by: kib (part of a larger patch)
MFC after: 1 month
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D12572
Add a new TMPFS_LINK_MAX to use in place of LINK_MAX for link overflow
checks and pathconf() reporting. Rather than storing a full 64-bit
link count, just use a plain int and use INT_MAX as TMPFS_LINK_MAX.
Discussed with: bde
Reviewed by: kib (part of a larger patch)
Sponsored by: Chelsio Communications
Having all filesystems fall through to default values isn't always correct
and these values can vary for different filesystem implementations. Most
of these changes just use the existing default values with a few exceptions:
- Don't report CHOWN_RESTRICTED for ZFS since it doesn't do the exact
permissions check this claims for chown().
- Use NANDFS_NAME_LEN for NAME_MAX for nandfs.
- Don't report a LINK_MAX of 0 on smbfs. Now fail with EINVAL to
indicate hard links aren't supported.
Requested by: bde (though perhaps not this exact implementation)
Reviewed by: kib (earlier version)
MFC after: 1 month
Sponsored by: Chelsio Communications
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
similar to the kernel memory allocator.
This simplifies NUMA allocation because the domain will be known at wait
time and races between failure and sleeping are eliminated. This also
reduces boilerplate code and simplifies callers.
A wait primitive is supplied for uma zones for similar reasons. This
eliminates some non-specific VM_WAIT calls in favor of more explicit
sleeps that may be satisfied without new pages.
Reviewed by: alc, kib, markj
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Clearing the unr in tmpfs_unmount is not correct. In the case of
multiple references to the tmpfs mount (e.g. when there are lookup
threads using it) it will not be the one to finish tmpfs_free_tmp. In
those cases tmpfs_free_node_locked will be the final one to execute
tmpfs_free_tmp, and until then the unr must be valid.
Reported by: pho
Approved/reviewed by: rstone (mentor)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D12749
tmpfs uses unr(9) to allocate inodes. Previously when unmounting it
would individually free the units when it freed each vnode. This is
unnecessary as we can use the newly-added unrhdr_clear function to clear
out the unr in onde go. This measurably reduces the time to unmount a
tmpfs with many files.
Reviewed by: cem, lidl
Approved by: rstone (mentor)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D12591
Such updates consisted of vast majority of modificiations, especially
in tmpfs_reg_resize.
For the case where page count did no change and the size grew we only
need to update tn_size. Use this fact to avoid vm object lock/relock.
MFC after: 1 week
Update filesystems not currently using vop_stdpathconf() in pathconf
VOPs to use vop_stdpathconf() for any configuration variables that do
not have filesystem-specific values. vop_stdpathconf() is used for
variables that have system-wide settings as well as providing default
values for some values based on system limits. Filesystems can still
explicitly override individual settings.
PR: 219851
Reported by: cem
Reviewed by: cem, kib, ngie
MFC after: 1 month
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D11541
The option "nonc" disables using of namecache for the created mount,
by default namecache is used. The rationale for the option is that
namecache duplicates the information which is already kept in memory
by tmpfs. Since it believed that namecache scales better than tmpfs,
or will scale better, do not enable the option by default. On the
other hand, smaller machines may benefit from lesser namecache
pressure.
Discussed with: mjg
Tested by: pho (as part of larger patch)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
For directories, node->tn_spec.tn_dir.tn_parent pointer to the parent
is used. For non-directories, the implementation is naive, all
directory nodes are scanned to find a dirent linking the specified
node. This can be significantly improved by maintaining tn_parent for
all nodes, later.
Tested by: pho (as part of larger patch)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
On dotdot lookup and fhtovp operations, it is possible for the file
represented by tmpfs node to be removed after the thread calculated
the pointer. In this case, tmpfs_alloc_vp() accesses freed memory.
Introduce the reference count on the nodes. The allnodes list from
tmpfs mount owns 1 reference, and threads performing unlocked
operations on the node, add one transient reference. Similarly, since
struct tmpfs_mount maintains the list where nodes are enlisted,
refcount it by one reference from struct mount and one reference from
each node on the list. Both nodes and tmpfs_mounts are removed when
refcount goes to zero.
Note that this means that nodes and tmpfs_mounts might survive some
time after the node is deleted or tmpfs_unmount() finished. The
tmpfs_alloc_vp() in these cases returns error either due to node
removal (tn_nlinks == 0) or because of insmntque1(9) error.
Tested by: pho (as part of larger patch)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Remove TMPFS_ASSERT_ELOCKED(). Its claims are already stated by other
asserts nearby and by VFS guarantees.
Change TMPFS_ASSERT_LOCKED() and one inlined place to use
ASSERT_VOP_(E)LOCKED() instead of hand-rolled imprecise asserts.
Tested by: pho (as part of the larger patch)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Edit comments which explain no longer relevant details, and add
locking annotations to the struct tmpfs_node members.
Tested by: pho (as part of the larger patch)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
If tmpfs vnode is only shared locked, tn_status field still needs
updates to note the access time modification. Use the same locking
scheme as for UFS, protect tn_status with the node interlock + shared
vnode lock.
Fix nearby style.
Noted and reviewed by: mjg
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
truncation, immediately queue the page for asynchronous laundering rather
than making the page pass through inactive queue first.
Reviewed by: kib, markj
longer used. More precisely, they are always zero because the code that
decremented and incremented them no longer exists.
Bump __FreeBSD_version to mark this change.
Reviewed by: kib, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8583
not remove user-space visible fields from vm_cnt or all of the references to
cached pages from comments. Those changes will come later.)
Reviewed by: kib, markj
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8497
The offset of the directory file, passed to getdirentries(2) syscall,
is user-controllable. The value of the offset must not be asserted,
instead the invalid value should be checked and rejected if invalid.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
It is otherwise left dangling, and callers that request cookies always free
the cookie buffer, even when VOP_READDIR(9) returns an error. This results
in a double free if tmpfs_readdir() returns an error to the NFS server or
the Linux getdents(2) emulation code.
Reported by: pho
MFC after: 1 week
Security: double free of malloc(9)-backed memory
Sponsored by: EMC / Isilon Storage Division
o With new KPI consumers can request contiguous ranges of pages, and
unlike before, all pages will be kept busied on return, like it was
done before with the 'reqpage' only. Now the reqpage goes away. With
new interface it is easier to implement code protected from race
conditions.
Such arrayed requests for now should be preceeded by a call to
vm_pager_haspage() to make sure that request is possible. This
could be improved later, making vm_pager_haspage() obsolete.
Strenghtening the promises on the business of the array of pages
allows us to remove such hacks as swp_pager_free_nrpage() and
vm_pager_free_nonreq().
o New KPI accepts two integer pointers that may optionally point at
values for read ahead and read behind, that a pager may do, if it
can. These pages are completely owned by pager, and not controlled
by the caller.
This shifts the UFS-specific readahead logic from vm_fault.c, which
should be file system agnostic, into vnode_pager.c. It also removes
one VOP_BMAP() request per hard fault.
Discussed with: kib, alc, jeff, scottl
Sponsored by: Nginx, Inc.
Sponsored by: Netflix
determining whether a node changed.
Other filesystems, e.g., UFS, only check on seconds, when determining
whether something changed.
This also corrects the birthtime case, where we checked tv_nsec
twice, instead of tv_sec and tv_nsec (PR).
PR: 201284
Submitted by: David Binderman
Patch suggested by: kib
Reviewed by: kib
MFC after: 2 weeks
Committed from: Essen FreeBSD Hackathon
This obviates the need for a MNTK_SUSPENDABLE flag, since passthrough
filesystems like nullfs and unionfs no longer need to inherit this
information from their lower layer(s). This change also restores the
pre-r273336 behaviour of using the presence of a susp_clean VFS method to
request suspension support.
Reviewed by: kib, mjg
Differential Revision: https://reviews.freebsd.org/D2937
* GENERAL
- Update copyright.
- Make kernel options for RANDOM_YARROW and RANDOM_DUMMY. Set
neither to ON, which means we want Fortuna
- If there is no 'device random' in the kernel, there will be NO
random(4) device in the kernel, and the KERN_ARND sysctl will
return nothing. With RANDOM_DUMMY there will be a random(4) that
always blocks.
- Repair kern.arandom (KERN_ARND sysctl). The old version went
through arc4random(9) and was a bit weird.
- Adjust arc4random stirring a bit - the existing code looks a little
suspect.
- Fix the nasty pre- and post-read overloading by providing explictit
functions to do these tasks.
- Redo read_random(9) so as to duplicate random(4)'s read internals.
This makes it a first-class citizen rather than a hack.
- Move stuff out of locked regions when it does not need to be
there.
- Trim RANDOM_DEBUG printfs. Some are excess to requirement, some
behind boot verbose.
- Use SYSINIT to sequence the startup.
- Fix init/deinit sysctl stuff.
- Make relevant sysctls also tunables.
- Add different harvesting "styles" to allow for different requirements
(direct, queue, fast).
- Add harvesting of FFS atime events. This needs to be checked for
weighing down the FS code.
- Add harvesting of slab allocator events. This needs to be checked for
weighing down the allocator code.
- Fix the random(9) manpage.
- Loadable modules are not present for now. These will be re-engineered
when the dust settles.
- Use macros for locks.
- Fix comments.
* src/share/man/...
- Update the man pages.
* src/etc/...
- The startup/shutdown work is done in D2924.
* src/UPDATING
- Add UPDATING announcement.
* src/sys/dev/random/build.sh
- Add copyright.
- Add libz for unit tests.
* src/sys/dev/random/dummy.c
- Remove; no longer needed. Functionality incorporated into randomdev.*.
* live_entropy_sources.c live_entropy_sources.h
- Remove; content moved.
- move content to randomdev.[ch] and optimise.
* src/sys/dev/random/random_adaptors.c src/sys/dev/random/random_adaptors.h
- Remove; plugability is no longer used. Compile-time algorithm
selection is the way to go.
* src/sys/dev/random/random_harvestq.c src/sys/dev/random/random_harvestq.h
- Add early (re)boot-time randomness caching.
* src/sys/dev/random/randomdev_soft.c src/sys/dev/random/randomdev_soft.h
- Remove; no longer needed.
* src/sys/dev/random/uint128.h
- Provide a fake uint128_t; if a real one ever arrived, we can use
that instead. All that is needed here is N=0, N++, N==0, and some
localised trickery is used to manufacture a 128-bit 0ULLL.
* src/sys/dev/random/unit_test.c src/sys/dev/random/unit_test.h
- Improve unit tests; previously the testing human needed clairvoyance;
now the test will do a basic check of compressibility. Clairvoyant
talent is still a good idea.
- This is still a long way off a proper unit test.
* src/sys/dev/random/fortuna.c src/sys/dev/random/fortuna.h
- Improve messy union to just uint128_t.
- Remove unneeded 'static struct fortuna_start_cache'.
- Tighten up up arithmetic.
- Provide a method to allow eternal junk to be introduced; harden
it against blatant by compress/hashing.
- Assert that locks are held correctly.
- Fix the nasty pre- and post-read overloading by providing explictit
functions to do these tasks.
- Turn into self-sufficient module (no longer requires randomdev_soft.[ch])
* src/sys/dev/random/yarrow.c src/sys/dev/random/yarrow.h
- Improve messy union to just uint128_t.
- Remove unneeded 'staic struct start_cache'.
- Tighten up up arithmetic.
- Provide a method to allow eternal junk to be introduced; harden
it against blatant by compress/hashing.
- Assert that locks are held correctly.
- Fix the nasty pre- and post-read overloading by providing explictit
functions to do these tasks.
- Turn into self-sufficient module (no longer requires randomdev_soft.[ch])
- Fix some magic numbers elsewhere used as FAST and SLOW.
Differential Revision: https://reviews.freebsd.org/D2025
Reviewed by: vsevolod,delphij,rwatson,trasz,jmg
Approved by: so (delphij)
dup entry, upon detach from the parent directory. If the node is
renamed, the entry is re-attached at the different directory, and
invalud cookie value triggers assert (or corrupts directory rb tree,
it seems).
Reported by: clusteradm (gjb, antoine)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
in the requested array, then it is responsible for disposition of previous
page and is responsible for updating the entry in the requested array.
Now consumers of KPI do not need to re-lookup the pages after call to
vm_pager_get_pages().
Reviewed by: kib
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
For VREG vnodes, return the resident page count (multiplied by PAGE_SIZE)
for the tmpfs node's anonymous VM object that stores actual file contents.
For all other vnodes, return the tmpfs_node's tn_size, which should not
be rounded to a page.
This change allows using stat(2) to identify a sparse file on tmpfs.
Reviewed by: kib
MFC after: 1 week