directory entry, matched by the inode number, is ".".
NFSv4 client might instantiate the distinct vnodes which have the same
inode number, since single v4 export can be combined from several
filesystems on the server. For instance, a case when the nested
server mount point is exactly one directory below the top of the
export, causes directory and its parent to have the same inode number
2. The vop_stdvptocnp() algorithm then returns "." as the name of the
lower directory.
Filtering out the "." entry with ENOENT works around this behaviour,
the error forces getcwd(3) to fall back to usermode implementation,
which compares both st_dev and st_ino.
Based on the submission by: rmacklem
Tested by: rmacklem
MFC after: 1 week
Switch eventtimers(9) from using struct bintime to sbintime_t.
Even before this not a single driver really supported full dynamic range of
struct bintime even in theory, not speaking about practical inexpediency.
This change legitimates the status quo and cleans up the code.
When CPU becomes idle, cpu_idleclock() calculates time to the next timer
event in order to reprogram hw timer. Return that time in sbintime_t to
the caller and pass it to acpi_cpu_idle(), where it can be used as one
more factor (quite precise) to extimate furter sleep time and choose
optimal sleep state. This is a preparatory change for further callout
improvements will be committed in the next days.
The commmit is not targeted for MFC.
intact if getblk() is done on the already owned buffer. Exit from
brelse() early when the lock recursion is detected, otherwise brelse()
might prematurely destroy the buffer under some circumstances.
Sponsored by: The FreeBSD Foundation
Noted by: mckusick
Tested by: pho
MFC after: 2 weeks
changes in r246417 were incomplete as they did not add explicit calls to
sigdeferstop() around all the places that previously passed SBDRY to
_sleep(). In addition, nfs_getcacheblk() could trigger a write RPC from
getblk() resulting in sigdeferstop() recursing. Rather than manually
deferring stop signals in specific places, change the VFS_*() and VOP_*()
methods to defer stop signals for filesystems which request this behavior
via a new VFCF_SBDRY flag. Note that this has to be a VFC flag rather than
a MNTK flag so that it works properly with VFS_MOUNT() when the mount is
not yet fully constructed. For now, only the NFS clients are set this new
flag in VFS_SET().
A few other related changes:
- Add an assertion to ensure that TDF_SBDRY doesn't leak to userland.
- When a lookup request uses VOP_READLINK() to follow a symlink, mark
the request as being on behalf of the thread performing the lookup
(cnp_thread) rather than using a NULL thread pointer. This causes
NFS to properly handle signals during this VOP on an interruptible
mount.
PR: kern/176179
Reported by: Russell Cattelan (sigdeferstop() recursion)
Reviewed by: kib
MFC after: 1 month
write is a disk write request that tells the disk that the buffer
being written must be committed to the media along with any writes
that preceeded it before any future blocks may be written to the drive.
Barrier writes are provided by adding the functions bbarrierwrite
(bwrite with barrier) and babarrierwrite (bawrite with barrier).
Following a bbarrierwrite the client knows that the requested buffer
is on the media. It does not ensure that buffers written before that
buffer are on the media. It only ensure that buffers written before
that buffer will get to the media before any buffers written after
that buffer. A flush command must be sent to the disk to ensure that
all earlier written buffers are on the media.
Reviewed by: kib
Tested by: Peter Holm
blocking modes described in section 3.4.3 of RFC 2783, allowing the caller
to retrieve the most recent values without blocking, to block for a specified
time, or to block forever.
Reviewed by: discussion on hackers@
now disables read-ahead. It used to effectively restore the system default
readahead hueristic if it had been changed; a negative value now restores
the default.
Reviewed by: kib
every architecture's busdma_machdep.c. It is done by unifying the
bus_dmamap_load_buffer() routines so that they may be called from MI
code. The MD busdma is then given a chance to do any final processing
in the complete() callback.
The cam changes unify the bus_dmamap_load* handling in cam drivers.
The arm and mips implementations are updated to track virtual
addresses for sync(). Previously this was done in a type specific
way. Now it is done in a generic way by recording the list of
virtuals in the map.
Submitted by: jeff (sponsored by EMC/Isilon)
Reviewed by: kan (previous version), scottl,
mjacob (isp(4), no objections for target mode changes)
Discussed with: ian (arm changes)
Tested by: marius (sparc64), mips (jmallet), isci(4) on x86 (jharris),
amd64 (Fabian Keil <freebsd-listen@fabiankeil.de>)
Older entries should be 'before' newer entries in the new buffer too
and there should be no zero-filled gap between them.
Pointed out by: jhb
MFC after: 3 days
X-MFC with: r246282
until child performs exec(). The behaviour is reasonable when a
debugger is the real parent, because the parent is stopped until
exec(), and sending a debugging event to the debugger would deadlock
both parent and child.
On the other hand, when debugger is not the parent of the vforked
child, not sending debugging signals makes it impossible to debug
across vfork.
Fix the issue by declining generating debug signals only when vfork()
was done and child called ptrace(PT_TRACEME). Set a new process flag
P_PPTRACE from the attach code for PT_TRACEME, if P_PPWAIT flag is
set, which indicates that the process was created with vfork() and
still did not execed. Check P_PPTRACE from issignal(), instead of
refusing the trace outright for the P_PPWAIT case. The scope of
P_PPTRACE is exactly contained in the scope of P_PPWAIT.
Found and tested by: zont
Reviewed by: pluknet
MFC after: 2 weeks
Posix requires that open(2) is restartable for SA_RESTART.
For non-posix objects, in particular, devfs nodes, still disable
automatic restart of the opens. The open call to a driver could have
significant side effects for the hardware.
Noted and reviewed by: jilles
Discussed with: bde
MFC after: 2 weeks
cpuset mask for the associated interrupt thread.
The text used above is verbatim from r195249 and the code should now be
in line with the intent of that commit.
195702, 195703, and 195821 prevented a thread from suspending while holding
locks inside of NFS by forcing the thread to fail sleeps with EINTR or
ERESTART but defer the thread suspension to the user boundary. However,
this had the effect that stopping a process during an NFS request could
abort the request and trigger EINTR errors that were visible to userland
processes (previously the thread would have suspended and completed the
request once it was resumed).
This change instead effectively masks stop signals while in the NFS client.
It uses the existing TDF_SBDRY flag to effect this since SIGSTOP cannot
be masked directly. Also, instead of setting PBDRY on individual sleeps,
the NFS client now sets the TDF_SBDRY flag around each NFS request and
stop signals are masked for all sleeps during that region (the previous
change missed sleeps in lockmgr locks). The end result is that stop
signals sent to threads performing an NFS request are completely
ignored until after the NFS request has finished processing and the
thread prepares to return to userland. This restores the behavior of
stop signals being transparent to userland processes while still
preventing threads from suspending while holding NFS locks.
Reviewed by: kib
MFC after: 1 month
Only during very early boot, before malloc(9) is functional (SI_SUB_KMEM),
the static ktr_buf_init is used. Size of the static buffer is determined
by a new kernel option KTR_BOOT_ENTRIES. Its default value is 1024.
This commit builds on top of r243046.
Reviewed by: alc
MFC after: 17 days
And provide kernel compiler version as a sysctl as well.
This is useful while we have gcc and clang cohabitation.
This could be even more useful when we have support
for external toolchains.
In cooperation with: mjg
MFC after: 13 days
in kern_wait6(), which is called by kern_wait(). Remove the redundand
check, introduced in r243136, and add a comment noting this, to make
the code less confusing.
The blank lines are added to properly delineate the scope of the
preceeding comments.
Noted by: "Jukka A. Ukkonen" <jau@iki.fi>
MFC after: 1 week
tunable_mbinit() where it is next to where it is used later.
Change the sysinit level of tunable_mbinit() from SI_SUB_TUNABLES
to SI_SUB_KMEM after the VM is running. This allows to use better
methods to determine the effectively available physical and virtual
memory available to the kernel.
Update comments.
In a second step it can be merged into mbuf_init().
When maxusers was unrestricted and maxfiles was allowed to autotune
much higher the result was that ncallout which was based on maxfiles
and maxproc grew much higher than was needed.
To fix this clip autotuning to the same number we would get with
the old maxusers algorithm which would stop scaling at 384
maxusers.
Growing ncalout higher is not likely to be needed since most consumers
of timeout(9) are gone and any higher value for ncallout causes the
callwheel hashes to be much larger than will even be needed for
most applications.
MFC after: 1 month
Reviewed by: mav
Set the v_hash for a new vnode in the getnewvnode() to the value
calculated based on the vnode structure address. Filesystems using
vfs_hash_insert() override the v_hash using the standard formula of
(inode_number + mnt_hashseed). For other filesystems, the
initialization allows the vfs_hash_index() to provide useful hash too.
Suggested, reviewed and tested by: peter
Sponsored by: The FreeBSD Foundation
MFC after: 5 days
pre-masked hash for the given vnode. The function assumes that
vp->v_hash is initialized by the filesystem vnode instantiation
function. At the moment, it is only done if filesystem uses
vfs_hash_insert().
Reviewed by: peter
Tested by: peter, pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 5 days
mount, which means that is must not be called while the snaplock is
owned. The vfs_write_resume(9) does call the function as the
VFS_SUSP_CLEAN() method, which is too early and falls into the region
still protected by snaplock.
Add yet another flag for the vfs_write_resume_flags() to avoid calling
suspension cleanup handler after the suspend is lifted, and use it in
the ffs_snapshot() call to vfs_write_resume.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
the write start, by adding a variation of the vfs_write_resume(9)
which accepts flags.
Use the new function to prevent a deadlock between parallel suspension
and snapshotting a UFS mount. The ffs_snapshot() code performed
vfs_write_resume() followed by vn_start_write() while owning the
snaplock. If the suspension intervene between resume and
vn_start_write(), the deadlock occured after the suspending thread
tried to lock the snaplock, most typically during the write in the
ffs_copyonwrite().
Reported and tested by: Andreas Longwitz <longwitz@incore.de>
Reviewed by: mckusick
MFC after: 2 weeks
X-MFC-note: make the vfs_write_resume(9) function a macro after the MFC,
in HEAD
case. There is no point in optimizing further the code and use a TRUE
litteral for a path that does heavyweight stuff anyway (like lock acq),
at the price of obfuscated code.
Use the appropriate check where necessary and remove a macro.
Sponsored by: EMC / Isilon storage division
MFC after: 3 days
interrupt context can still be idlethread. At that point, without the
panic condition, it can still happen that idlethread then will try to
acquire some locks to carry on some operations.
Skip the idlethread check on block/sleep lock operations when KDB is
active.
Reported by: jh
Tested by: jh
MFC after: 1 week
When kern_yield() was introduced with the possibility to specify
a new priority, the behaviour changed by not lowering priority at all
in the consumers, making the yielding mechanism highly ineffective for
high priority kthreads like bufdaemon, syncer, vlrudaemon, etc.
There are no evidences that consumers could bear with such change in
semantic and this situation could finally lead to bugs similar to the
ones fixed in r244240.
Re-specify userland pri for kthreads involved.
Tested by: pho
Reviewed by: kib, mdf
MFC after: 1 week
manage a set of power-of-2 sized buffers for bus_dmamem_alloc().
This allows the caller to provide the back-end allocator uma allocator,
allowing full control of the memory pages backing the pool. For
convenience, it provides an optional builtin allocator that provides pages
allocated with the VM_MEMATTR_UNCACHEABLE attribute, for managing pools of
DMA buffers for BUS_DMA_COHERENT or BUS_DMA_NOCACHE.
This also allows the caller to specify a minimum alignment, and it ensures
that all buffers start on a boundary and have a length that's a multiple of
that value, to avoid using buffers that trigger partial cache line flushes.
Submitted by: Ian Lepore <freebsd@damnhippie.dyndns.org>
only returns name, but also vnode of corefile to use.
This simplifies the code and closes few races, especially in %I handling.
Reviewed by: kib
Obtained from: WHEEL Systems
- Use this new format to automatically handle syscalls and VOPs. This
changes the earlier format but is still human readable.
Sponsored by: EMC / Isilon Storage Division
calls and turn it on.
- Do not allow to call them inside jail. [1]
Pointed out by: trasz [1]
Reviewed by: avg
Approved by: kib (mentor)
MFC after: 1 week
This fixed panic where we hold mutex (process lock) and try to obtain sleepable
lock (vnode lock in expand_name()). The panic could occur when %I was used
in kern.corefile.
Additionally we avoid expand_name() overhead when coredumps are disabled.
Obtained from: WHEEL Systems
yields, specify the user priority for the yield. Otherwise, a
higher-priority (kernel) thread could fall into the priority-inversion
with the thread owning the mutex lock.
On single-processor machines or UP kernels, do not loop adaptively
when the next vnode cannot be locked, instead yield unconditionally.
Restructure the iteration initializer and the iterator to remove code
duplication. Put the code to fetch and lock a vnode next to the
current marker, into the mnt_vnode_next_active() function, and use it
instead of repeating the loop.
Reported by: hrs, rmacklem
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
was being copied from the wrong place. This patch fixes that.
This could cause access failures for mapped users, when the group
permissions were needed.
PR: 147998
Submitted by: Christopher Key (cjk32 at cam.ac.uk)
MFC after: 2 weeks
This is an ongoing effort to provide runtime debug information
useful in the field that does not panic existing installations.
This gives us the flexibility needed when shipping images to a
potentially large audience with WITNESS enabled without worrying
about formerly non-fatal LORs hurting a release.
Sponsored by: iXsystems
kern_yield() is problematic than.
The owned mutex is the mount interlock, and it is in fact not needed
to guarantee the stability of the mount list of active vnodes, so fix
the the issue by only taking the mount interlock for MNT_REF and
MNT_REL operations.
While there, augment the unconditional yield by some amount of
spinning [1].
Reported and tested by: pho
Reviewed by: attilio
Submitted by: attilio [1]
MFC after: 3 days
call. The function indicates a failure by the TRUE return value. To
be extra safe, assert that the return value from the following
vm_map_insert() indicates success.
Fix style issues in the nearby lines, reformulate the comment.
Reviewed by: alc (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
- unp_zone: kern.ipc.maxsockets limit reached
- socket_zone: kern.ipc.maxsockets limit reached
- zone_mbuf: kern.ipc.nmbufs limit reached
- zone_clust: kern.ipc.nmbclusters limit reached
- zone_jumbop: kern.ipc.nmbjumbop limit reached
- zone_jumbo9: kern.ipc.nmbjumbo9 limit reached
- zone_jumbo16: kern.ipc.nmbjumbo16 limit reached
Note that those warnings are printed not often than every five minutes and can
be globally turned off by setting sysctl/tunable vm.zone_warnings to 0.
Discussed on: arch
Obtained from: WHEEL Systems
MFC after: 2 weeks
This is to allow debug images to be used without taking down the
system when non-fatal asserts are hit.
The following sysctls are added:
debug.kassert.warn_only: 1 = log, 0 = panic
debug.kassert.do_ktr: set to a ktr mask for logging via KTR
debug.kassert.do_log: 1 = log, 0 = quiet
debug.kassert.warnings: stats, number of kasserts hit
debug.kassert.log_panic_at:
number of kasserts before we actually panic, 0 = never
debug.kassert.log_pps_limit: pps limit for log messages
debug.kassert.log_mute_at: stop warning after N kasserts, 0 = never stop
debug.kassert.kassert: set this sysctl to trigger a kassert
Discussed with: scottl, gnn, marcel
Sponsored by: iXsystems
EPROTONOSUPPORT if the address family is not supported.
- introduce pffinddomain() to find a domain by family and use it as
appropriate.
Reviewed by: glebius
- As the comment report, CALLOUT_LOCAL_ALLOC cannot be checked
directly from the callout flags but might be checked by a cached
value. Hence, do so before to actually remove the callout, when
needed, in softclock_call_cc().
- In softclock_call_cc() also add a comment in the waiting and deferred
migration case explaining that the dereference should be safe
because of the migration dereference invariants.
Additively:
- In softclock_call_cc(), for the deferred migration case, move all the
accesses to callout structure after the comment stating the callout
must not be destroyed.
- For consistency with this last tweak, use cached c_flags for the
KASSERT() in the deferred migration case. It is not strictly necessary
but this way all the callout accesses happen after the above mentioned
comment, improving consistency.
Pointy hat to: me
Sponsored by: Isilon Systems / EMC Corporation
Reviewed by: kib
MFC after: 2 weeks
X-MFC: 243901
from the callwheel. Calculate the cc->cc_next before removing the
callout, otherwise the code followed the invalid tailq links. After
this, make softclock_call_cc() return void, since it always return
cc->cc_next, which is immediately available to the softclock()
anyway. This also allows to eliminate a label under #ifdef SMP.
Remove the assignment of cc->cc_next from callout_cc_del(), since the
function is called with the callout already removed from callwheel.
If cancelling the migration, also clear the CALLOUT_DFRMIGRATION flag.
Postpone the free of the timeout(9) allocated callouts after the
migration checks are done.
Add some more strict asserts about the state of the callout in
callout_call_cc().
Reviewed by: attilio
Reported and tested by: pho (previous version)
MFC after: 2 weeks
callout is started before kern_setitimer() acquires process mutex, but
looses a race and kern_setitimer() gets the process mutex before the
callout. Then, assuming that new specified struct itimerval has
it_interval zero, but it_value non-zero, the callout, after it starts
executing again, clears p->p_realtimer.it_value, but kern_setitimer()
already rescheduled the callout.
As the result of the race, both p_realtimer is zero, and the callout
is rescheduled. Then, in the exit1(), the exit code sees that it_value
is zero and does not even try to stop the callout. This allows the
struct proc to be reused and eventually the armed callout is
re-initialized. The consequence is the corrupted callwheel tailq.
Use process mutex to interlock the callout start, which fixes the race.
Reported and tested by: pho
Reviewed by: jhb
MFC after: 2 weeks
over the active list. The mount interlock is not enough to guarantee
the validity of the tailq link pointers. The __mnt_vnode_next_active()
and __mnt_vnode_first_active() active lists iterators helper functions
did not provided the neccessary stability for the list, allowing the
iterators to pick garbage.
This was uncovered after the r243599 made the active list iterators
non-nop.
Since a vnode interlock is before the vnode_free_list_mtx, obtain the
vnode ilock in the non-blocking manner when under vnode_free_list_mtx,
and restart iteration after the yield if the lock attempt failed.
Assert that a vnode found on the list is active, and assert that the
helpers return the vnode with interlock owned.
Reported and tested by: pho
MFC after: 1 week
Fix path handling for *at() syscalls.
Before the change directory descriptor was totally ignored,
so the relative path argument was appended to current working
directory path and not to the path provided by descriptor, thus
wrong paths were stored in audit logs.
Now that we use directory descriptor in vfs_lookup, move
AUDIT_ARG_UPATH1() and AUDIT_ARG_UPATH2() calls to the place where
we hold file descriptors table lock, so we are sure paths will
be resolved according to the same directory in audit record and
in actual operation.
Sponsored by: FreeBSD Foundation (auditdistd)
Reviewed by: rwatson
MFC after: 2 weeks
Remove redundant call to AUDIT_ARG_UPATH1().
Path will be remembered by the following NDINIT(AUDITVNODE1) call.
Sponsored by: FreeBSD Foundation (auditdistd)
MFC after: 2 weeks
variable as they may overflow on i386/PAE and i386 with > 2GB RAM.
Use 64bit quad_t instead. It has broader kernel infrastructure support
with TUNABLE_QUAD_FETCH() and qmin/qmax() than other available types.
Pointed out by: alc, bde
kernel memory, whichever is lower. The overall mbuf related memory
limit must be set so that mbufs (and clusters of various sizes)
can't exhaust physical RAM or KVM.
The limit is set to half of the physical RAM or KVM (whichever is
lower) as the baseline. In any normal scenario we want to leave
at least half of the physmem/kvm for other kernel functions and
userspace to prevent it from swapping too easily. Via a tunable
kern.maxmbufmem the limit can be upped to at most 3/4 of physmem/kvm.
At the same time divorce maxfiles from maxusers and set maxfiles to
physpages / 8 with a floor based on maxusers. This way busy servers
can make use of the significantly increased mbuf limits with a much
larger number of open sockets.
Tidy up ordering in init_param2() and check up on some users of
those values calculated here.
Out of the overall mbuf memory limit 2K clusters and 4K (page size)
clusters to get 1/4 each because these are the most heavily used mbuf
sizes. 2K clusters are used for MTU 1500 ethernet inbound packets.
4K clusters are used whenever possible for sends on sockets and thus
outbound packets. The larger cluster sizes of 9K and 16K are limited
to 1/6 of the overall mbuf memory limit. When jumbo MTU's are used
these large clusters will end up only on the inbound path. They are
not used on outbound, there it's still 4K. Yes, that will stay that
way because otherwise we run into lots of complications in the
stack. And it really isn't a problem, so don't make a scene.
Normal mbufs (256B) weren't limited at all previously. This was
problematic as there are certain places in the kernel that on
allocation failure of clusters try to piece together their packet
from smaller mbufs.
The mbuf limit is the number of all other mbuf sizes together plus
some more to allow for standalone mbufs (ACK for example) and to
send off a copy of a cluster. Unfortunately there isn't a way to
set an overall limit for all mbuf memory together as UMA doesn't
support such a limiting.
NB: Every cluster also has an mbuf associated with it.
Two examples on the revised mbuf sizing limits:
1GB KVM:
512MB limit for mbufs
419,430 mbufs
65,536 2K mbuf clusters
32,768 4K mbuf clusters
9,709 9K mbuf clusters
5,461 16K mbuf clusters
16GB RAM:
8GB limit for mbufs
33,554,432 mbufs
1,048,576 2K mbuf clusters
524,288 4K mbuf clusters
155,344 9K mbuf clusters
87,381 16K mbuf clusters
These defaults should be sufficient for even the most demanding
network loads.
MFC after: 1 month
accept queues a new socket/connection may be added to the queue
due to a race on the ACCEPT_LOCK.
The submitted patch is slightly changed in comments, teardown
and locking order and extended with KASSERT's.
Submitted by: Vijay Singh <vijju.singh-at-gmail-dot-com>
Found by: His team.
MFC after: 1 week
is in capability mode.
- Add VN_OPEN_NOCAPCHECK flag for vn_open_cred() to will ne converted into
NOCAPCHECK namei flag.
This functionality will be used to enable core dumps for sandboxed processes.
Reviewed by: rwatson
Obtained from: WHEEL Systems
MFC after: 2 weeks
to himself. For example abort(3) at first tries to do kill(getpid(), SIGABRT)
which was failing in capability mode, so the code was failing back to exit(1).
Reviewed by: rwatson
Obtained from: WHEEL Systems
MFC after: 2 weeks
There has not been any complaints about the default behavior, so there
is no need to keep a knob that enables the worse alternative.
Now that the hard-stopping of other CPUs is the only behavior, the panic_cpu
spinlock-like logic can be dropped, because only a single CPU is
supposed to win stop_cpus_hard(other_cpus) race and proceed past that
call.
MFC after: 1 month
the unix domain sockets to the next tick, coalescing the serial calls
until the collection fires. The thought is that more work for the
collector could arise in the near time, allowing to clean more and not
spend too much CPU on repeated collection when there is no garbage.
Currently the collection task is fired immediately upon unix domain
socket close if there are any rights in flight, which caused excessive
CPU usage and too long blocking of the threads waiting for
unp_list_lock and unp_link_rwlock in write mode.
Robert noted that it would be nice if we could find some heuristic by
which we decide whether to run GC a bit more quickly. E.g., if the
number of UNIX domain sockets is close to its resource limit, but not
quite.
Reported and tested by: Markus Gebert <markus.gebert@hostpoint.ch>
Reviewed by: rwatson
MFC after: 2 weeks
taskqueue_enqueue_timeout(). Do not rearm the callout if it is
already armed and the ticks is negative. Otherwise rearm it to fire
in abs(ticks) ticks in the future.
The intended use is to call taskqueue_enqueue_timeout() for the given
timeout_task with the same negative ticks argument. As result, the
task is scheduled to execute not further than abs(ticks) ticks in
future, and the consequent enqueues are coalesced until the already
scheduled task is finished.
Reviewed by: rwatson
Tested by: Markus Gebert <markus.gebert@hostpoint.ch>
MFC after: 2 weeks
then:
- assume the lock is held in exclusive mode and remove a moot check
about the lock acquisition.
- in the destructor remove !MPSAFE specific chunk.
Reviewed by: kib
MFC after: 2 weeks
Also remove the checks from vop_lookup_pre and vop_lookup_post, which
are now completely redundant (before this change they were partially
redundant).
Discussed with: kib
MFC after: 10 days
zombie list for the pid. This allows several kern.proc sysctls to
report useful information for zombies.
Hold the allproc_lock around all searches instead of relocking it.
Remove private pfind_locked() from the new nfs client code.
Requested and reviewed by: pjd
Tested by: pho
MFC after: 3 weeks
here is race between decaying the resource usage in containers, and updating
per-process usage; basically, the former may cause per-container usage
to get smaller than per-process usage.
Submitted by: Rudo Tomori
- Implement a function to ensure that all preempted threads have switched
back out at least once. Use this to make sure there are no stale
references to the old ktr_buf or the lock profiling buffers before
updating them.
Reviewed by: marius (sparc64 parts), attilio (earlier patch)
Sponsored by: EMC / Isilon Storage Division
pgrp2 was used only for debugging, but pgrp2 = pgfind(..) was present in compiled code even for kernels without INVARIANTS
Approved by: trasz (mentor)
MFC after: 1 week
designator to select a process which is waited for. The system call
optionally returns siginfo_t which would be otherwise provided to
SIGCHLD handler, as well as extended structure accounting for child
and cumulative grandchild resource usage.
Allow to get the current rusage information for non-exited processes
as well, similar to Solaris.
The explicit WEXITED flag is required to wait for exited processes,
allowing for more fine-grained control of the events the waiter is
interested in.
Fix the handling of siginfo for WNOWAIT option for all wait*(2)
family, by not removing the queued signal state.
PR: standards/170346
Submitted by: "Jukka A. Ukkonen" <jau@iki.fi>
MFC after: 1 month
- Do not try to steal load from other CPUs if there was no contest switches
on this CPU (i.e. it was idle all the time and woke up just for bus mastering
or TLB shutdown). If current CPU was idle, then it is quite unlikely that some
other CPU has load to steal. Under high I/O rate, when TLB shutdowns cause
numerous CPU wakeups, on 24-CPU system load stealing code may consume up to
25% of all CPU time without giving any benefits.
- Change code that implements spinning for load to restart spin in case of
context switch. Previous code periodically called cpu_idle() even under
high interrupt/context switch rate.
- Rise spinning threshold to 10KHz, where it gives at least some effect
that may worth consumed power.
Reviewed by: jeff@
Some hooks are added to clamp down maxusers and nmbclusters for
small address space systems.
VM_MAX_AUTOTUNE_MAXUSERS - the max maxusers that will be autotuned based on
physical memory.
VM_MAX_AUTOTUNE_NMBCLUSTERS - max nmbclusters based on physical memory.
These are set to the old values on i386 to preserve the clamping that was
being done to all arches.
Another macro VM_AUTOTUNE_NMBCLUSTERS is provided to allow an override
for the calculation on a MD basis. Currently no arch defines this.
Reviewed by: peter
MFC after: 2 weeks
to further reduce latency for threads in this queue. This should help
as threads transition from realtime to timeshare. The latency is
bound to a max of sched_slice until we have more than sched_slice / 6
threads runnable. Then the min slice is allotted to all threads and
latency becomes (nthreads - 1) * min_slice.
Discussed with: mav
give rwlock(9) the ability to crunch different type of structures, with
the only constraint that they have a lock cookie named rw_lock.
This name, then, becames reserved from the struct that wants to use
the rwlock(9) KPI and other locking primitives cannot reuse it for
their members.
Namely such structs are the current struct rwlock and the new struct
rwlock_padalign. The new structure will define an object which has the
same layout of a struct rwlock but will be allocated in areas aligned
to the cache line size and will be as big as a cache line.
For further details check comments on above mentioned revisions.
Reviewed by: jimharris, jeff
was still possible to open for write from the lower filesystem. There
is a symmetric situation where the binary could already has file
descriptors opened for write, but it can be executed from the nullfs
overlay.
Handle the issue by passing one v_writecount reference to the lower
vnode if nullfs vnode has non-zero v_writecount. Note that only one
write reference can be donated, since nullfs only keeps one use
reference on the lower vnode. Always use the lower vnode v_writecount
for the checks.
Introduce the VOP_GET_WRITECOUNT to read v_writecount, which is
currently always bypassed to the lower vnode, and VOP_ADD_WRITECOUNT
to manipulate the v_writecount value, which manages a single bypass
reference to the lower vnode. Caling the VOPs instead of directly
accessing v_writecount provide the fix described in the previous
paragraph.
Tested by: pho
MFC after: 3 weeks
cache line in order to avoid manual frobbing but using
struct mtx_padalign.
The sole exception being nvme and sxfge drivers, where the author
redefined CACHE_LINE_SIZE manually, so they need to be analyzed and
dealt with separately.
Reviwed by: jimharris, alc
sharing especially on the default CPU 0 callout_cpu structure.
This will be followed up by attilio@ with a conversion to the new struct
mtx_padalign but doing this manual conversion first gives an easy MFC
candidate since mtx_padalign is a more extensive system change.
Sponsored by: Intel
Reviewed by: jeff, attilio
MFC after: 1 week
only constraint that they have a lock cookie named mtx_lock.
This name, then, becames reserved from the struct that wants to use the
mtx(9) KPI and other locking primitives cannot reuse it for their
members.
Namely such structs are the current struct mtx and the new
struct mtx_padalign. The new structure will define an object which is
the same as the same layout of a struct mtx but will be allocated in
areas aligned to the cache line size and will be as big as a cache line.
This is supposed to give higher performance for highly contented mutexes
both spin or sleep (because of the adaptive spinning), where the cache
line contention results in too much traffic on the system bus.
The struct mtx_padalign can be used in a completely transparent way
with the mtx(9) KPI.
At the moment, a possibility to MFC the patch should be carefully
evaluated because this patch breaks the low level KPI
(not its representation though).
Discussed with: jhb
Reviewed by: jeff, andre
Reviewed by: mdf (earlier version)
Tested by: jimharris
executed. This means past the point where userret() is generally
executed.
Skip the td_pinned check if a callchain tracing is currently happening
and add a more robust check to pmc_capture_user_callchain() in order to
catch td_pinned leak past ast() in hwpmc case.
Reported and tested by: fabient
MFC after: 1 week
X-MFC: r240246
overwriting the return mbuf pointer with newly received data after
a loop. Instead append the new mbuf chain to the existing one.
Fix up sb_lastrecord when dequeuing mbuf's so that sbappend_stream()
doesn't get confused.
For the remainder copy case in the mbuf delivery part deduct the
copied length len instead of the whole mbuf length. Additionally
don't depend on 'n' being being available which isn't true in the
case of MSG_PEEK.
Fix the MSG_WAITALL case by comparing against sb_hiwat. Before
it was looping for every receive as sb_lowat normally is zero.
Add comment about issue with (MSG_WAITALL | MSG_PEEK) which isn't
properly handled.
Submitted by: trociny (except for the change in last paragraph)
mbuf's by doing proper testing with M_WRITABLE().
In m_collapse() replace an incomplete manual check for M_RDONLY
with the M_WRITABLE() macro that also tests for shared buffers
and other cases that make a particular mbuf immutable.
MFC after: 2 weeks
In the old TTY layer, SIGTTIN was correctly handled like this:
while (data should be read) {
send SIGTTIN if not foreground process group
read data
}
In the new TTY layer, however, this behaviour was changed, based on a
false interpretation of the standard:
send SIGTTIN if not foreground process group
while (data should be read) {
read data
}
Correct this by pushing tty_wait_background() into the ttydisc_read_*()
functions.
Reported by: koitsu
PR: kern/173010
MFC after: 2 weeks
A default install on large memory machines with multiple 10gigE interfaces
were not being given enough mbufs to do full bandwidth TCP or NFS traffic.
To keep the value somewhat reasonable, we scale back the number of
maxuers by 1/6 past the 384 point. This gives us enough mbufs for most
of our pretty basic 10gigE line-speed tests to complete.
This enables CPU searches (which read tdq_load) to operate independently
of any contention on the spinlock. Some scheduler-intensive workloads
running on an 8C single-socket SNB Xeon show considerable improvement with
this change (2-3% perf improvement, 5-6% decrease in CPU util).
Sponsored by: Intel
Reviewed by: jeff