mnt_noasync counter to temporarily remove the MNTK_ASYNC mount option,
which is needed to guarantee synchronous completion of the initiated
i/o before the syscall or VOP returns. Global removal of the MNTK_ASYNC
option is harmful because not only does the i/o started from the
corresponding thread become synchronous, but so does all other i/o
initiated on the filesystem during sync(2) or syncer activity.
Instead of removing MNTK_ASYNC from mnt_kern_flag, provide a
thread-local flag to disable async i/o for the current thread only.
Use the opportunity to move the DOINGASYNC() macro into sys/vnode.h
and use it consistently in the places that tested for MNTK_ASYNC.
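The per-thread check described above can be modelled in userland as
follows. This is only a sketch: the struct names, the flag values and
the name MODEL_TDP_SYNCIO are assumptions made for illustration and are
not taken from the kernel sources.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values; the real kernel constants may differ. */
#define MODEL_MNTK_ASYNC 0x1u	/* mount-wide: async i/o allowed */
#define MODEL_TDP_SYNCIO 0x1u	/* per-thread: force synchronous i/o */

struct mount_model { unsigned mnt_kern_flag; };
struct thread_model { unsigned td_pflags; };

/*
 * Model of a DOINGASYNC()-style test: i/o is async only if the mount
 * allows it AND the calling thread has not asked for synchronous i/o.
 * The mount flag itself is never modified.
 */
static bool
doing_async_model(const struct mount_model *mp, const struct thread_model *td)
{
	return ((mp->mnt_kern_flag & MODEL_MNTK_ASYNC) != 0 &&
	    (td->td_pflags & MODEL_TDP_SYNCIO) == 0);
}
```

With this shape, one thread can set its own flag around an operation
that must complete synchronously while other threads on the same mount
keep doing async i/o.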
Some testing demonstrated 60-70% improvements in run time for
metadata-intensive operations on async-mounted UFS volumes, although
the results still deviate greatly for other reasons.
Reviewed by: mckusick
Tested by: scottl
MFC after: 2 weeks
set on the new thread. This prevents the thread from inadvertently
inheriting affinity from a random sibling.
Submitted by: attilio
Tested by: pho
MFC after: 1 week
it is possible that a single AIO event will be reported to multiple
threads. This is not thread-friendly, and the existing API cannot
control this behavior.
Allocate a kevent flags field, sigev_notify_kevent_flags, for AIO event
notification in sigevent, and allow the user to pass EV_CLEAR,
EV_DISPATCH or EV_ONESHOT to the AIO kernel code, so that the user can
control whether the event should be cleared once it is retrieved by a
thread. This change should be compatible with existing applications,
because the field should already have been zero-filled, in which case
the kernel takes no additional action.
PR: kern/156567
about new child not only when doing PT_TO_SCX, but also for PT_CONTINUE.
If TDB_FORK flag is set, always issue a stop, the same as is done for
TDB_EXEC.
Reported by: Dmitry Mikulin <dmitrym juniper net>
MFC after: 1 week
that instead of using direct I/O it allows read-ahead similar to
POSIX_FADV_NORMAL, but invokes VOP_ADVISE(POSIX_FADV_DONTNEED) after the
read(2) has completed to purge just-read data. The write(2) path continues
to use direct I/O for POSIX_FADV_NOREUSE for now. Note that NOREUSE works
optimally if an application reads and writes full fs blocks.
that we have the lock now. This cleans up a locking panic ASSERT when
knlist_empty is called without a lock when INVARIANTS etc. are turned on.
Reviewed by: kib jhb
MFC after: 1 week
primitives by breaking stop_scheduler into a per-thread variable.
Also, store the new td_stopsched very close to the td_*locks members,
as they will mostly be accessed in the same codepaths as td_stopsched,
which possibly avoids further cache-line pollution.
Giving STOP_SCHEDULER() a new 'thread' argument was considered, in
order to take advantage of an already cached curthread, but in the end
there should not really be a performance benefit, and it would
introduce a KPI breakage.
In collaboration with: flo
Reviewed by: avg
MFC after: 3 months (or never)
X-MFC: r228424
share/man/man4/Makefile,
share/man/man4/xnb.4,
sys/dev/xen/netback/netback.c,
sys/dev/xen/netback/netback_unit_tests.c:
Rewrote the netback driver for xen to attach properly via newbus
and work properly in both HVM and PVM mode (only HVM is tested).
Works with the in-tree FreeBSD netfront driver or the Windows
netfront driver from SuSE. Has not been extensively tested with
a Linux netfront driver. Does not implement LRO, TSO, or
polling. Includes unit tests that may be run through sysctl
after compiling with XNB_DEBUG defined.
sys/dev/xen/blkback/blkback.c,
sys/xen/interface/io/netif.h:
Comment elaboration.
sys/kern/uipc_mbuf.c:
Fix page fault in kernel mode when calling m_print() on a
null mbuf. Since m_print() is only used for debugging, there
are no performance concerns for extra error checking code.
sys/kern/subr_scanf.c:
Add the "hh" and "ll" width specifiers from C99 to scanf().
A few callers were already using "ll" even though scanf()
was handling it as "l".
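The C99 semantics of the two width specifiers can be illustrated with
the userland sscanf(3); this sketch does not exercise the kernel
implementation in subr_scanf.c, and the struct and function names are
hypothetical.

```c
#include <assert.h>
#include <stdio.h>

struct parsed_model {
	signed char tiny;	/* filled through "%hhd" */
	long long wide;		/* filled through "%lld" */
	int matched;		/* number of conversions performed */
};

/*
 * Parse "<small int> <large int>".  "hh" stores into a char-sized
 * object and "ll" into a long long; passing a long long pointer to a
 * scanf() that treats "ll" as "l" is only safe where both types have
 * the same size.
 */
static struct parsed_model
parse_pair_model(const char *s)
{
	struct parsed_model p = { 0, 0, 0 };

	p.matched = sscanf(s, "%hhd %lld", &p.tiny, &p.wide);
	return (p);
}
```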
Submitted by: Alan Somers <alans@spectralogic.com>
Submitted by: John Suykerbuyk <johns@spectralogic.com>
Sponsored by: Spectra Logic
MFC after: 1 week
Reviewed by: ken
casted types: to ssize_t in filesystem code and to
int in buf code, thus supplying a negative argument
leads to a kernel panic later. To fix that, check the user-supplied
argument at the beginning of the syscall.
Submitted by: Maxim Dounin <mdounin mdounin.ru>, maxim@
Unmounts do vfs_msync() before calling VFS_UNMOUNT(), but there is
still a race allowing a process to dirty pages after the msync has
finished. Remounts rw->ro just left dirty pages in the system.
Reviewed by: alc, tegge (long time ago)
Tested by: pho
MFC after: 2 weeks
appropriate timestamps. Restore the assertions which verify that
NCF_TS is set when timestamp is asked for.
Reviewed by: jhb (previous version)
MFC after: 2 weeks
consistently, creating some namecache entries without the NCF_TS flag.
This causes a panic due to a failed assertion.
As temporary relief, remove the assert. Return the epoch timestamp for
entries without a timestamp if one is asked for.
While there, consolidate the code which returns timestamps into a
helper, cache_out_ts().
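The fallback behaviour described above can be sketched as a small
userland model. The entry layout, the flag value and the function
signature are assumptions for illustration; only the idea (copy the
stored timestamp if the entry has one, otherwise report the epoch) comes
from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

#define MODEL_NCF_TS 0x04	/* hypothetical value: entry has a timestamp */

struct nc_entry_model {
	unsigned char nc_flag;
	struct timespec nc_time;	/* valid only if MODEL_NCF_TS is set */
};

/*
 * Model of a cache_out_ts()-style helper: a caller that does not want
 * a timestamp passes NULL; entries without a timestamp yield the epoch
 * instead of tripping an assertion.
 */
static void
cache_out_ts_model(const struct nc_entry_model *ncp, struct timespec *tsp)
{
	if (tsp == NULL)
		return;
	if ((ncp->nc_flag & MODEL_NCF_TS) != 0) {
		*tsp = ncp->nc_time;
	} else {
		tsp->tv_sec = 0;
		tsp->tv_nsec = 0;
	}
}
```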
Discussed with: jhb
MFC after: 2 weeks
- retrieve only the one specified limit for a process, not the whole
array, as it did previously (the sysctl has been added recently and
has not been backported to stable yet, so this change is ok);
- allow setting a resource limit for another process.
Submitted by: Andrey Zonov <andrey at zonov.org>
Discussed with: kib
Reviewed by: kib
MFC after: 2 weeks
provide struct namecache_ts which is the old struct namecache. Only
allocate struct namecache_ts if non-null struct timespec *tsp was
passed to cache_enter_time, otherwise use struct namecache.
Change the struct namecache allocation and deallocation macros into
static functions, since the logic becomes somewhat twisty. Provide an
accessor for the nc_name member of struct namecache to hide the
difference between struct namecache and namecache_ts.
The aim of the change is to not waste 20 bytes per small namecache
entry.
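A userland model of the two layouts and the name accessor is sketched
below. The field set is heavily reduced, and every identifier ending in
_model, as well as the flag value, is hypothetical; the sizes in this
model do not correspond to the 20 bytes mentioned above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define MODEL_NCF_TS 0x04	/* hypothetical value: entry carries a timestamp */

/* Common header placed first in both variants. */
struct nc_hdr_model { unsigned char nc_flag; unsigned char nc_nlen; };

/* Small entry: no timestamp, the name follows the header directly. */
struct nc_small_model { struct nc_hdr_model hdr; char nc_name[]; };

/* Timestamped entry: the name comes after the timespec. */
struct nc_ts_model {
	struct nc_hdr_model hdr;
	struct timespec nc_time;
	char nc_name[];
};

/* Allocate the larger variant only when a timestamp is supplied. */
static struct nc_hdr_model *
nc_alloc_model(const char *name, const struct timespec *tsp)
{
	size_t len = strlen(name);
	struct nc_hdr_model *hdr;

	if (len > 255)
		return (NULL);
	if (tsp != NULL) {
		struct nc_ts_model *ts = malloc(sizeof(*ts) + len + 1);
		if (ts == NULL)
			return (NULL);
		ts->hdr.nc_flag = MODEL_NCF_TS;
		ts->nc_time = *tsp;
		memcpy(ts->nc_name, name, len + 1);
		hdr = &ts->hdr;
	} else {
		struct nc_small_model *sm = malloc(sizeof(*sm) + len + 1);
		if (sm == NULL)
			return (NULL);
		sm->hdr.nc_flag = 0;
		memcpy(sm->nc_name, name, len + 1);
		hdr = &sm->hdr;
	}
	hdr->nc_nlen = (unsigned char)len;
	return (hdr);
}

/* Accessor hiding where the name lives in each variant. */
static char *
nc_get_name_model(struct nc_hdr_model *hdr)
{
	if ((hdr->nc_flag & MODEL_NCF_TS) != 0)
		return (((struct nc_ts_model *)hdr)->nc_name);
	return (((struct nc_small_model *)hdr)->nc_name);
}
```

Callers only ever hold a pointer to the common header and go through
the accessor, so they do not need to know which variant was allocated.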
Reviewed by: jhb
MFC after: 2 weeks
X-MFC-note: after r230394
entries on one client when a directory was renamed on another client. The
root cause for the stale entry being trusted is that each per-vnode nfsnode
structure has a single 'n_ctime' timestamp used to validate positive name
cache entries. However, if there are multiple entries for a single vnode,
they all share a single timestamp. To fix this, extend the name cache
to allow filesystems to optionally store a timestamp value in each name
cache entry. The NFS clients now fetch the timestamp associated with
each name cache entry and use that to validate cache hits instead of the
timestamps previously stored in the nfsnode. Another part of the fix is
that the NFS clients now use timestamps from the post-op attributes of
RPCs when adding name cache entries rather than pulling the timestamps out
of the file's attribute cache. The latter is subject to races with other
lookups updating the attribute cache concurrently. Some more details:
- Add a variant of nfsm_postop_attr() to the old NFS client that can return
a vattr structure with a copy of the post-op attributes.
- Handle lookups of "." as a special case in the NFS clients since the name
cache does not store name cache entries for ".", so we cannot get a
useful timestamp. It didn't really make much sense to recheck the
attributes on the directory to validate the namecache hit for "."
anyway.
- ABI compat shims for the name cache routines are present in this commit
so that it is safe to MFC.
MFC after: 2 weeks
to an OBJT_VNODE-specific field of the vm object. The same
information can be just as easily obtained from the struct vattr that
is in struct image_params if the latter is passed to
elf*_load_section(). Moreover, by replacing the vmspace and vm
object parameters to elf*_load_section() with a struct image_params
parameter, we actually reduce the size of the object code.
In collaboration with: kib
to read strings completely to know the actual size.
As a side effect it fixes the issue with kern.proc.args and kern.proc.env
sysctls, which didn't return the size of available data when calling
sysctl(3) with the NULL argument for oldp.
Note, in get_ps_strings(), which does the actual work for proc_getargv()
and proc_getenvv(), we still have a safety limit on the size of data read
in case of a corrupted process stack.
Suggested by: kib
MFC after: 3 days
This function updates the path string to the vnode's full global path and
checks the size of the new path string against the pathlen argument.
In vfs_domount(), sys_unmount() and kern_jail_set() this new function
is used to update the supplied path argument to the respective global path.
Unbreaks jailed zfs(8) with enforce_statfs set to 1.
Reviewed by: kib
MFC after: 1 month
On amd64, link_elf_obj.c must specify KERNBASE rather than
VM_MIN_KERNEL_ADDRESS to vm_map_find() because kernel loadable
modules must be mapped for execution in the same upper region
of the kernel map as the kernel code and data segments.
On MIPS32, KERNBASE lies below the KVA area (it is less than
VM_MIN_KERNEL_ADDRESS), so vm_map_find() effectively gets the whole
KVA to look through. On MIPS64 this is not the case, because
KERNBASE is set to the very end of XKSEG, well out of KVA
bounds, so vm_map_find() always fails. We should use
VM_MIN_KERNEL_ADDRESS as the base for vm_map_find().
Details obtained from: alc@
vfs_busy() comes after the covered vnode lock in the global lock order,
but since quotaon() makes a recursive VFS call to open the quota file,
we usually end up locking the covered vnode after mp is busied in
sys_quotactl().
Change the interface of VFS_QUOTACTL() to require that mp be unbusied
by the fs code, and do not try to pick up a vfs_busy() reference in ufs
quotaon, especially as vfs_busy() cannot succeed if an unmount is being
performed.
Reported and tested by: pho
MFC after: 1 week
operation on POSIX shared memory objects and tmpfs. Previously, neither of
these modules correctly handled the case in which the new size of the object
or file was not a multiple of the page size. Specifically, they did not
handle partial page truncation of data stored on swap. As a result, stale
data might later be returned to an application.
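Why the tail of the last page matters can be shown with a small
in-memory model. This is a sketch only: it works on a plain buffer
rather than on swap-backed pages, and MODEL_PAGE_SIZE and the function
name are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096u	/* assumed page size for the model */

/*
 * "Truncate" a page-backed object to new_size.  If new_size is not a
 * multiple of the page size, the bytes between new_size and the end of
 * that page must be zeroed; otherwise they would reappear as stale
 * data if the object were later extended.  The buffer must cover the
 * whole last page.  Returns the number of bytes zeroed.
 */
static size_t
zero_partial_page_model(unsigned char *pages, size_t new_size)
{
	size_t off = new_size % MODEL_PAGE_SIZE;
	size_t n;

	if (off == 0)
		return (0);	/* page-aligned: nothing to zero */
	n = MODEL_PAGE_SIZE - off;
	memset(pages + new_size, 0, n);
	return (n);
}
```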
Interestingly, a data inconsistency was less likely to occur under tmpfs
than POSIX shared memory objects. The reason is that a different mistake
in the tmpfs truncation operation helped avoid a data inconsistency. If the
data was still resident in memory in a PG_CACHED page, then the tmpfs
truncation operation would reactivate that page, zero the truncated portion,
and leave the page pinned in memory. More precisely, the benevolent error
was that the truncation operation didn't add the reactivated page to any of
the paging queues, effectively pinning the page. This page would remain
pinned until the file was destroyed or the page was read or written. With
this change, the page is now added to the inactive queue.
Discussed with: jhb
Reviewed by: kib (an earlier version)
MFC after: 3 weeks
The wrong structure happened to work since the only argument used was
the vnode which is in the same place in both VOP_SETATTR() and the two
extattr VOPs.
MFC after: 3 days
- Only initialize the per-cpu switchticks and switchtime in sched_throw()
for the very first context switch on APs during boot. This avoids a
small gap between the middle of thread_exit() and sched_throw() where
time is not accounted to any thread.
- In thread_exit(), update the timestamp bookkeeping to track the changes
to mi_switch() introduced by td_rux so that the code once again matches
the comment claiming it is mimicking mi_switch(). Specifically, only
update the per-thread stats directly and depend on ruxagg() to update
p_rux rather than adjusting p_rux directly. While here, move the
timestamp bookkeeping as late in the function as possible.
Reviewed by: bde, kib
MFC after: 1 week