following clang warning:
sys/kern/sys_pipe.c:1556:10: error: promoted type 'int' of K&R function parameter is not compatible with the parameter type 'mode_t'
(aka 'unsigned short') declared in a previous prototype [-Werror]
mode_t mode;
^
sys/kern/sys_pipe.c:155:19: note: previous declaration is here
static fo_chmod_t pipe_chmod;
^
not get syscall exit notification until the child performed exec of
exit. Swap the order of doing ptracestop() and waiting for P_PPWAIT
clearing, by postponing the wait into syscallret after ptracestop()
notification is done.
Reported, tested and reviewed by: Dmitry Mikulin <dmitrym juniper net>
MFC after: 2 weeks
Descriptions are specific to drivers and we don't change drivers on attached
devices. This fixes a few places where we were not clearing the description
when detaching a driver (e.g. with device_attach() failed). While here, fix
a few other nits:
- Remove spurious call to remove a device's driver from
devclass_driver_deleted(). device_detach() removes it already.
- Fix a typo.
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
Uftdi(4) examines (c_iflag & (IXON|IXOFF)) to control hw XON-XOFF support.
This is obviously no good, if changes to those bits are not communicated
down the stack.
allow.mount.zfs:
allow mounting the zfs filesystem inside a jail
This way the permssions for mounting all current VFCF_JAIL filesystems
inside a jail are controlled wia allow.mount.* jail parameters.
Update sysctl descriptions.
Update jail(8) and zfs(8) manpages.
TODO: document the connection of allow.mount.* and VFCF_JAIL for kernel
developers
MFC after: 10 days
socket protocol number. This is useful since the socket type can
be implemented by different protocols in the same protocol family,
e.g. SOCK_STREAM may be provided by both TCP and SCTP.
Submitted by: Jukka A. Ukkonen <jau iki fi>
PR: kern/162352
Discussed with: bz
Reviewed by: glebius
MFC after: 2 weeks
unp->unp_vnode pointer to detect if there is a vnode associated with
(binded to) this socket and does necessary cleanup if there is.
The issue is that after forced unmount this check may be too late as
the unp_vnode is reclaimed and the reference is stale.
To fix this provide a helper function that is called on a socket vnode
reclamation to do necessary cleanup.
Pointed by: kib
Reviewed by: kib
MFC after: 2 weeks
according to POSIX document, the clock ID may be dynamically allocated,
it unlikely will be in 64K forever. To make it future compatible, we
pack all timeout information into a new structure called _umtx_time, and
use fourth argument as a size indication, a zero means it is old code
using timespec as timeout value, but the new structure also includes flags
and a clock ID, so the size argument is different than before, and it is
non-zero. With this change, it is possible that a thread can sleep
on any supported clock, though current kernel code does not have such a
POSIX clock driver system.
a new jail parameter node with the following parameters:
allow.mount.devfs:
allow mounting the devfs filesystem inside a jail
allow.mount.nullfs:
allow mounting the nullfs filesystem inside a jail
Both parameters are disabled by default (equals the behavior before
devfs and nullfs in jails). Administrators have to explicitly allow
mounting devfs and nullfs for each jail. The value "-1" of the
devfs_ruleset parameter is removed in favor of the new allow setting.
Reviewed by: jamie
Suggested by: pjd
MFC after: 2 weeks
to the debugger. When reparenting for debugging, keep the child in
the new orphan list of old parent. When looping over the children in
kern_wait(), iterate over both children list and orphan list to search
for the process by pid.
Submitted by: Dmitry Mikulin <dmitrym juniper.net>
MFC after: 2 weeks
UMTX_OP_WAIT. Upper 16bits is enough to hold a clock id, and lower
16bits is used to pass flags. The change saves a clock_gettime() syscall
from libthr.
Add the sysctl debug.iosize_max_clamp, enabled by default. Setting the
sysctl to zero allows to perform the SSIZE_MAX-sized i/o requests from
the usermode.
Discussed with: bde, das (previous versions)
MFC after: 1 month
Vnode-backed mappings cannot be put into the kernel map, since it is a
system map.
Use exec_map for transient mappings, and remove the mappings with
kmem_free_wakeup() to notify the waiters on available map space.
Do not map the whole executable into KVA at all to copy it out into
usermode. Directly use vn_rdwr() for the case of not page aligned
binary.
There is one place left where the potentially unbounded amount of data
is mapped into exec_map, namely, in the COFF image activator
enumeration of the needed shared libraries.
Reviewed by: alc
MFC after: 2 weeks
messages were printed.
This can be enabled with the kern.msgbuf_show_timestamp sysctl
PR: kern/161553
Reviewed by: avg
Submitted by: Arnaud Lacombe <lacombar@gmail.com>
Approved by: cperciva
MFC after: 1 month
A new jail(8) option "devfs_ruleset" defines the ruleset enforcement for
mounting devfs inside jails. A value of -1 disables mounting devfs in
jails, a value of zero means no restrictions. Nested jails can only
have mounting devfs disabled or inherit parent's enforcement as jails are
not allowed to view or manipulate devfs(8) rules.
Utilizes new functions introduced in r231265.
Reviewed by: jamie
MFC after: 1 month
mnt_noasync counter to temporary remove MNTK_ASYNC mount option, which
is needed to guarantee a synchronous completion of the initiated i/o
before syscall or VOP return. Global removal of MNTK_ASYNC option is
harmful because not only i/o started from corresponding thread becomes
synchronous, but all i/o is synchronous on the filesystem which is
initiated during sync(2) or syncer activity.
Instead of removing MNTK_ASYNC from mnt_kern_flag, provide a local
thread flag to disable async i/o for current thread only. Use the
opportunity to move DOINGASYNC() macro into sys/vnode.h and
consistently use it through places which tested for MNTK_ASYNC.
Some testing demonstrated 60-70% improvements in run time for the
metadata-intensive operations on async-mounted UFS volumes, but still
with great deviation due to other reasons.
Reviewed by: mckusick
Tested by: scottl
MFC after: 2 weeks
set on the new thread. This prevents the thread from inadvertently
inheriting affinity from a random sibling.
Submitted by: attilio
Tested by: pho
MFC after: 1 week
it is possible that a single AIO event will be reported to multiple
threads, it is not threading friendly, and the existing API can not
control this behavior.
Allocate a kevent flags field sigev_notify_kevent_flags for AIO event
notification in sigevent, and allow user to pass EV_CLEAR, EV_DISPATCH
or EV_ONESHOT to AIO kernel code, user can control whether the event
should be cleared once it is retrieved by a thread. This change should
be comptaible with existing application, because the field should have
already been zero-filled, and no additional action will be taken by
kernel.
PR: kern/156567
about new child not only when doing PT_TO_SCX, but also for PT_CONTINUE.
If TDB_FORK flag is set, always issue a stop, the same as is done for
TDB_EXEC.
Reported by: Dmitry Mikulin <dmitrym juniper net>
MFC after: 1 week
that instead of using direct I/O it allows read-ahead similar to
POSIX_FADV_NORMAL, but invokes VOP_ADVISE(POSIX_FADV_DONTNEED) after the
read(2) has completed to purge just-read data. The write(2) path continues
to use direct I/O for POSIX_FADV_NOREUSE for now. Note that NOREUSE works
optimally if an application reads and writes full fs blocks.
that we have the lock now. This cleans up a locking panic ASSERT when
knlist_empty is called without a lock when INVARIANTS etc. are turned.
Reviewed by: kib jhb
MFC after: 1 week
primitives by breaking stop_scheduler into a per-thread variable.
Also, store the new td_stopsched very close to td_*locks members as
they will be accessed mostly in the same codepaths as td_stopsched and
this results in avoiding a further cache-line pollution, possibly.
STOP_SCHEDULER() was pondered to use a new 'thread' argument, in order to
take advantage of already cached curthread, but in the end there should
not really be a performance benefit, while introducing a KPI breakage.
In collabouration with: flo
Reviewed by: avg
MFC after: 3 months (or never)
X-MFC: r228424
share/man/man4/Makefile,
share/man/man4/xnb.4,
sys/dev/xen/netback/netback.c,
sys/dev/xen/netback/netback_unit_tests.c:
Rewrote the netback driver for xen to attach properly via newbus
and work properly in both HVM and PVM mode (only HVM is tested).
Works with the in-tree FreeBSD netfront driver or the Windows
netfront driver from SuSE. Has not been extensively tested with
a Linux netfront driver. Does not implement LRO, TSO, or
polling. Includes unit tests that may be run through sysctl
after compiling with XNB_DEBUG defined.
sys/dev/xen/blkback/blkback.c,
sys/xen/interface/io/netif.h:
Comment elaboration.
sys/kern/uipc_mbuf.c:
Fix page fault in kernel mode when calling m_print() on a
null mbuf. Since m_print() is only used for debugging, there
are no performance concerns for extra error checking code.
sys/kern/subr_scanf.c:
Add the "hh" and "ll" width specifiers from C99 to scanf().
A few callers were already using "ll" even though scanf()
was handling it as "l".
Submitted by: Alan Somers <alans@spectralogic.com>
Submitted by: John Suykerbuyk <johns@spectralogic.com>
Sponsored by: Spectra Logic
MFC after: 1 week
Reviewed by: ken
casted types: to ssize_t in filesystem code and to
int in buf code, thus supplying a negative argument
leads to kernel panic later. To fix that check user
supplied argument in the beginning of syscall.
Submitted by: Maxim Dounin <mdounin mdounin.ru>, maxim@
Unmounts do vfs_msync() before calling VFS_UNMOUNT(), but there is
still a race allowing a process to dirty pages after msync
finished. Remounts rw->ro just left dirty pages in system.
Reviewed by: alc, tegge (long time ago)
Tested by: pho
MFC after: 2 weeks
appropriate timestamps. Restore the assertions which verify that
NCF_TS is set when timestamp is asked for.
Reviewed by: jhb (previous version)
MFC after: 2 weeks
consistently, creating some namecache entries without NCF_TS flag.
This causes panic due to failed assertion.
As a temporal relief, remove the assert. Return epoch timestamp for
the entries without timestamp if asked.
While there, consolidate the code which returns timestamps, into a
helper cache_out_ts().
Discussed with: jhb
MFC after: 2 weeks
- retrive only one, specified limit for a process, not the whole
array, as it was previously (the sysctl has been added recently and
has not been backported to stable yet, so this change is ok);
- allow to set a resource limit for another process.
Submitted by: Andrey Zonov <andrey at zonov.org>
Discussed with: kib
Reviewed by: kib
MFC after: 2 weeks
provide struct namecache_ts which is the old struct namecache. Only
allocate struct namecache_ts if non-null struct timespec *tsp was
passed to cache_enter_time, otherwise use struct namecache.
Change struct namecache allocation and deallocation macros into static
functions, since logic becomes somewhat twisty. Provide accessor for
the nc_name member of struct namecache to hide difference between
struct namecache and namecache_ts.
The aim of the change is to not waste 20 bytes per small namecache
entry.
Reviewed by: jhb
MFC after: 2 weeks
X-MFC-note: after r230394
entries on one client when a directory was renamed on another client. The
root cause for the stale entry being trusted is that each per-vnode nfsnode
structure has a single 'n_ctime' timestamp used to validate positive name
cache entries. However, if there are multiple entries for a single vnode,
they all share a single timestamp. To fix this, extend the name cache
to allow filesystems to optionally store a timestamp value in each name
cache entry. The NFS clients now fetch the timestamp associated with
each name cache entry and use that to validate cache hits instead of the
timestamps previously stored in the nfsnode. Another part of the fix is
that the NFS clients now use timestamps from the post-op attributes of
RPCs when adding name cache entries rather than pulling the timestamps out
of the file's attribute cache. The latter is subject to races with other
lookups updating the attribute cache concurrently. Some more details:
- Add a variant of nfsm_postop_attr() to the old NFS client that can return
a vattr structure with a copy of the post-op attributes.
- Handle lookups of "." as a special case in the NFS clients since the name
cache does not store name cache entries for ".", so we cannot get a
useful timestamp. It didn't really make much sense to recheck the
attributes on the the directory to validate the namecache hit for "."
anyway.
- ABI compat shims for the name cache routines are present in this commit
so that it is safe to MFC.
MFC after: 2 weeks
to an OBJT_VNODE-specific field of the vm object. The same
information can be just as easily obtained from the struct vattr that
is in struct image_params if the latter is passed to
elf*_load_section(). Moreover, by replacing the vmspace and vm
object parameters to elf*_load_section() with a struct image_params
parameter, we actually reduce the size of the object code.
In collaboration with: kib
to read strings completely to know the actual size.
As a side effect it fixes the issue with kern.proc.args and kern.proc.env
sysctls, which didn't return the size of available data when calling
sysctl(3) with the NULL argument for oldp.
Note, in get_ps_strings(), which does actual work for proc_getargv() and
proc_getenvv(), we still have a safety limit on the size of data read in
case of a corrupted procces stack.
Suggested by: kib
MFC after: 3 days
This function updates path string to vnode's full global path and checks
the size of the new path string against the pathlen argument.
In vfs_domount(), sys_unmount() and kern_jail_set() this new function
is used to update the supplied path argument to the respective global path.
Unbreaks jailed zfs(8) with enforce_statfs set to 1.
Reviewed by: kib
MFC after: 1 month
On amd64, link_elf_obj.c must specify KERNBASE rather than
VM_MIN_KERNEL_ADDRESS to vm_map_find() because kernel loadable
modules must be mapped for execution in the same upper region
of the kernel map as the kernel code and data segments.
For MIPS32 KERNBASE lies below KVA area (it's less than
VM_MIN_KERNEL_ADDRESS) so basically vm_map_find got whole
KVA to look through. On MIPS64 it's not the case because
KERNBASE is set to the very end of XKSEG, well out of KVA
bounds, so vm_map_find always fails. We should use
VM_MIN_KERNEL_ADDRESS as a base for vm_map_find.
Details obtained from: alc@
The vfs_busy() is after covered vnode lock in the global lock order, but
since quotaon() does recursive VFS call to open quota file, we usually
end up locking covered vnode after mp is busied in sys_quotactl().
Change the interface of VFS_QUOTACTL(), requiring that mp was unbusied
by fs code, and do not try to pick up vfs_busy() reference in ufs quotaon,
esp. if vfs_busy cannot succeed due to unmount being performed.
Reported and tested by: pho
MFC after: 1 week
operation on POSIX shared memory objects and tmpfs. Previously, neither of
these modules correctly handled the case in which the new size of the object
or file was not a multiple of the page size. Specifically, they did not
handle partial page truncation of data stored on swap. As a result, stale
data might later be returned to an application.
Interestingly, a data inconsistency was less likely to occur under tmpfs
than POSIX shared memory objects. The reason being that a different mistake
by the tmpfs truncation operation helped avoid a data inconsistency. If the
data was still resident in memory in a PG_CACHED page, then the tmpfs
truncation operation would reactivate that page, zero the truncated portion,
and leave the page pinned in memory. More precisely, the benevolent error
was that the truncation operation didn't add the reactivated page to any of
the paging queues, effectively pinning the page. This page would remain
pinned until the file was destroyed or the page was read or written. With
this change, the page is now added to the inactive queue.
Discussed with: jhb
Reviewed by: kib (an earlier version)
MFC after: 3 weeks
The wrong structure happened to work since the only argument used was
the vnode which is in the same place in both VOP_SETATTR() and the two
extattr VOPs.
MFC after: 3 days
- Only initialize the per-cpu switchticks and switchtime in sched_throw()
for the very first context switch on APs during boot. This avoids a
small gap between the middle of thread_exit() and sched_throw() where
time is not accounted to any thread.
- In thread_exit(), update the timestamp bookkeeping to track the changes
to mi_switch() introduced by td_rux so that the code once again matches
the comment claiming it is mimicing mi_switch(). Specifically, only
update the per-thread stats directly and depend on ruxagg() to update
p_rux rather than adjusting p_rux directly. While here, move the
timestamp bookkeeping as late in the function as possible.
Reviewed by: bde, kib
MFC after: 1 week
It seems strchr() and strrchr() are used more often than index() and
rindex(). Therefore, simply migrate all kernel code to use it.
For the XFS code, remove an empty line to make the code identical to
the code in the Linux kernel.
at SCHED_PRI_RANGE to prevent overflows in the priority value. This can
happen due to irregularities with clock interrupts under certain
virtualization environments.
Tested by: Larry Rosenman ler lerctr org
MFC after: 2 weeks
sysclock_getsnapshot() function allows the caller to obtain a snapshot of all
the system clock and timecounter state required to create time stamps at a later
point. The sysclock_snap2bintime() function converts a previously obtained
snapshot into a bintime time stamp according to the specified flags e.g. which
system clock, uptime vs absolute time, etc.
These KPIs enable useful functionality, including direct comparison of the
feedback and feed-forward system clocks and generation of multiple time stamps
with different formats from a single timecounter read.
Committed on behalf of Julien Ridoux and Darryl Veitch from the University of
Melbourne, Australia, as part of the FreeBSD Foundation funded "Feed-Forward
Clock Synchronization Algorithms" project.
For more information, see http://www.synclab.org/radclock/
In collaboration with: Julien Ridoux (jridoux at unimelb edu au)
these to trigger a NOTE_ATTRIB EVFILT_VNODE kevent when the extended
attributes of a vnode are changed.
Note that OS X already implements this behavior.
Reviewed by: rwatson
MFC after: 2 weeks
With the previous code, if the range of priorities for timeshare batch
threads was greater than RQ_NQS, then the threads with low priorities in
the part of the range above RQ_NQS would be scheduled to the run-queues
as if they had high priorities at the beginning of the range.
In other words, threads with a nice level of +N could be scheduled as
if they had a nice level of -M.
Reported by: George Mitchell <george@m5p.com>
Reviewed by: jhb
Tested by: George Mitchell <george@m5p.com> (earlier version)
MFC after: 1 week
locate a process calling pfind() and do some additional checks like
p_candebug(). To reduce this code duplication a new function pget() is
introduced and used.
As the function may be useful not only in kern_proc.c it is in the
kernel name space.
Suggested by: kib
Reviewed by: kib
MFC after: 2 weeks