This allows it to be better tracked, as well as making it possible to
leverage UMA for more interesting/useful behaviour at a later date.
Sponsored by: Netflix, Inc.
the TTY. In such a case, ttydev_close() is called multiple times and
each time, t_revokecnt is incremented and cv_broadcast() is called for
both the t_outwait and t_inwait condition variables.
Let's say revoke(2) comes in first and gets to call tty_drain() from
ttydev_leave(). Let's say that the revoke comes from init(8) as the
result of running "shutdown -r now". Since shutdown prints various
messages to the console before announcing that the machine will reboot
immediately, let's also say that the output queue is not empty and
that tty_drain() has something to do. Let's assume this all happens
on a 9600 baud serial console, so it takes some time to drain.
The shutdown command will exit(2) and as such will end up closing
stdout. Let's say this close will come in second, bump t_revokecnt
and call tty_wakeup(). This has tty_wait() return prematurely and
the next thing that will happen is that the thread doing revoke(2)
will flush the TTY. Since the drain wasn't complete, the flush will
effectively drop whatever is left in t_outq.
This change takes into account that tty_drain() will return ERESTART
when t_revokecnt has been bumped, and in that case simply calls
tty_drain() again. The thread in question is already performing
the close so it can safely finish draining the TTY before destroying
the TTY structure.
Now all messages from shutdown will be printed on the serial console.
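A minimal sketch of the resulting retry in the close path (with the
tty_drain() signature simplified):

    do {
            error = tty_drain(tp);
            /* ERESTART: t_revokecnt was bumped by a racing close/revoke. */
    } while (error == ERESTART);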
Obtained from: Juniper Networks, Inc.
In the case of a 4K allocation quantum that means allocations up to 128K.
As memory fragmentation grows, these lists may reach quite large
sizes (tens or hundreds of thousands of items). Keeping items of
different sizes in one list may in the worst case require a full linear
list traversal, which can be very expensive. Keeping lists of items of
a single size means that unless the user specifies alignment or
boundary requirements (which are very rare cases), the first item found
on the list will satisfy the request.
While running the SPEC NFS benchmark on top of ZFS on a 24-core machine
with 84GB RAM, this change reduces CPU time spent in vmem_xalloc() from
8% and lock contention spinning around it from 20% to invisible levels.
All of that comes at the cost of just 26 more pointers per vmem
instance.
If at some point our kernel starts to actively use KVA allocations
with odd sizes above 128K, something may need to be done for the bigger
lists as well.
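A rough sketch of the freelist selection idea; the helper and constant
names here (vmem_freehead(), VMEM_QEXACT_MAX) are illustrative, not the
ones used in subr_vmem.c:

    static struct vmem_freelist *
    vmem_freehead(vmem_t *vm, vmem_size_t size)
    {
            vmem_size_t qsize = size >> vm->vm_quantum_shift;

            /* Exact-size list for small requests (up to 128K at a 4K quantum). */
            if (qsize <= VMEM_QEXACT_MAX)
                    return (&vm->vm_freelist[qsize]);
            /* Larger/odd sizes keep using the power-of-two buckets. */
            return (&vm->vm_freelist[VMEM_QEXACT_MAX + flsl(qsize)]);
    }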
the excess code in g_io_check(), bio_resid is also truncated by
g_io_deliver(). As a result, bufdonebio() assigns the truncated value
to the buffer's b_resid field.
Use the residual bio_completed to calculate buffer b_resid from
b_bcount in bufdonebio(), instead of bio_resid, calculated from
bio_length in g_io_deliver().
The issue is seemingly caused by the code rearrangement into
g_io_check(), which is not present in stable/10. The change
nevertheless still looks like a useful one to have in 10.
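The essence of the change in bufdonebio(), per the description above:

    /* Derive the buffer residual from what GEOM actually completed. */
    bp->b_resid = bp->b_bcount - bip->bio_completed;
    /* was: bp->b_resid = bip->bio_resid; (already truncated by g_io_deliver()) */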
Reported by: Stefan Hegnauer <stefan.hegnauer@gmx.ch>
Tested by: pho, Stefan Hegnauer <stefan.hegnauer@gmx.ch>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
defaults to PANIC_REBOOT_WAIT_TIME (a long-existing kernel config
setting). Use this now-variable value in place of the defined constant
to control how long the system waits after a panic before rebooting.
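A sketch of the pattern; the variable and sysctl/tunable names shown
here are assumptions, not necessarily the ones used:

    static int panic_reboot_wait_time = PANIC_REBOOT_WAIT_TIME;
    TUNABLE_INT("kern.panic_reboot_wait_time", &panic_reboot_wait_time);
    SYSCTL_INT(_kern, OID_AUTO, panic_reboot_wait_time, CTLFLAG_RW,
        &panic_reboot_wait_time, 0,
        "Seconds to wait before rebooting after a panic");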
by SCHED_PRI_TICKS should be SCHED_PRI_RANGE - 1 so that the resulting
priority value (before nice adjustment) is between SCHED_PRI_MIN and
SCHED_PRI_MAX, inclusive.
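To spell out the arithmetic, assuming SCHED_PRI_RANGE is the number of
priorities in the band, i.e. SCHED_PRI_MAX - SCHED_PRI_MIN + 1:

    SCHED_PRI_MIN + (SCHED_PRI_RANGE - 1) == SCHED_PRI_MAX

so the largest tick-derived increment must be SCHED_PRI_RANGE - 1;
adding SCHED_PRI_RANGE would land one past SCHED_PRI_MAX.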
Submitted by: kib
Reported by: pho
MFC after: 1 week
MACHINE_ARCH values whose binaries this kernel can run. This patch provides
a feature requested for implementing pkgng ABI identifiers in a robust
way.
The list is designed to indicate whether, say, an i386 package can be run on
the current system. If kern.supported_abis contains "i386", then the answer
is yes. Otherwise, the answer is no.
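For example, a userland check could be sketched with sysctlbyname(3);
the substring test is deliberately rough:

    #include <sys/types.h>
    #include <sys/sysctl.h>
    #include <string.h>

    static int
    can_run_i386(void)
    {
            char abis[128];
            size_t len = sizeof(abis);

            if (sysctlbyname("kern.supported_abis", abis, &len, NULL, 0) != 0)
                    return (0);
            return (strstr(abis, "i386") != NULL);
    }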
At the moment, this only supports MACHINE_ARCH and MACHINE_ARCH32. As we
gain support for more interesting combinations, this needs to become more
flexible, possibly through the sysent framework, along with the
hw.machine_arch emulation immediately preceding this code in kern_mib.c.
Reviewed by: imp
MFC after: 3 days
for extending and reusing it.
The sendfile_sync wrapper is mostly just a "mbuf transaction" wrapper,
used to indicate that the backing store for a group of mbufs has completed.
It's only being used by sendfile for now and it's only implementing a
sleep/wakeup rendezvous. However, there are other potential signaling
paths (kqueue) and other potential uses (socket zero-copy write) where the
same mechanism would also be useful.
So, with that in mind:
* extract the sendfile_sync code out into sf_sync_*() methods
* teach the sf_sync_alloc method about the current config flag -
it will eventually know about kqueue.
* move the sendfile_sync code out of do_sendfile() - the only thing
it now knows about is the sfs pointer. The guts of the sync
rendezvous (setup, rendezvous/wait, free) are now done in the
syscall wrapper.
* .. and teach the 32-bit compat sendfile call the same.
This should be a no-op. It's primarily preparation work for teaching
the sendfile_sync about kqueue notification.
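Illustrative shape of the syscall wrapper after the split; only
sf_sync_alloc() is named by this change, the other helpers and the
do_sendfile() argument list are placeholders:

    sfs = (flags & SF_SYNC) ? sf_sync_alloc(flags) : NULL;
    error = do_sendfile(td, uap, compat, sfs);  /* sfs is all it sees */
    if (sfs != NULL) {
            /* Rendezvous: sleep until the mbufs' backing store completes. */
            sf_sync_wait(sfs);
            sf_sync_free(sfs);
    }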
Tested:
* Peter Holm's sendfile stress / regression scripts
Sponsored by: Netflix, Inc.
requires process descriptors to work, and having PROCDESC in GENERIC
does not seem enough, especially as we hope to have more and more
consumers in the base.
MFC after: 3 days
in one of the many layers of indirection and shims through stable/7
in jail_handle_ips(). When it was cleaned up and unified through
kern_jail() for 8.x, the byte order swap was lost.
This only matters for ancient binaries that call jail(2) themselves
internally.
This API has semantics similar to those of taskqueue_drain, but acts on
all tasks that might be queued or running on a taskqueue.
A caller must ensure that no new tasks are being enqueued, otherwise
this call would be totally meaningless. For example, if the tasks are
enqueued by an interrupt filter, then its interrupt must be disabled.
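Typical use might look like the following; the softc fields are
hypothetical and the tasks are assumed to be enqueued from an interrupt
filter:

    /* Quiesce the enqueue source first, then drain everything. */
    bus_teardown_intr(sc->dev, sc->irq_res, sc->irq_cookie);
    taskqueue_drain_all(sc->tq);    /* returns once all queued and running tasks finish */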
MFC after: 10 days
given process.
Note that the correctness of the trampoline length returned for ABIs
which do not use the shared page depends on the correctness of the
struct sysvec sv_szsigcodebase member, which will be fixed on an
as-needed basis.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
The new macros are implemented in terms of SDT_PROBE_DEFINE and SDT_PROBE.
Probes defined in this way will appear under the SDT provider named "sdt".
Parameter types are exposed via SDT_PROBE_ARGTYPE.
This is something that illumos does not have by default.
This kind of SDT probe is already present in the ZFS code, so those
probes will now be available if the KDTRACE_HOOKS option is enabled.
A potential future illumos compatibility enhancement is to encode a provider
name as a prefix in a probe name.
Reviewed by: markj
MFC after: 3 weeks
X-MFC after: r258622
In its stead use the Solaris / illumos approach of emulating '-' (dash)
in probe names with '__' (two consecutive underscores).
Reviewed by: markj
MFC after: 3 weeks
callable from the kernel.
Right now vn_sendfile() can't be called from anything other than
a syscall handler _and_ still return the number of bytes queued.
This simply moves the copyout() to do_sendfile() so that any kernel
code can initiate vn_sendfile() outside of a syscall context.
Tested:
* tiny little sendfile program spitting things out over a TCP socket
Sponsored by: Netflix, Inc.
option, unbreak the lock tracing release semantic by embedding
calls to LOCKSTAT_PROFILE_RELEASE_LOCK() directly in the inlined
version of the releasing functions for mutex, rwlock and sxlock.
Failing to do so skips the lockstat_probe_func invocation for
unlocking.
- As part of the LOCKSTAT support being inlined in the mutex operations,
for kernels compiled without lock debugging options, potentially every
consumer must be compiled including opt_kdtrace.h.
Fix this by moving KDTRACE_HOOKS into opt_global.h and removing the
dependency on opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
is left there and it is only used as a compile-time stub [0].
[0] immediately shows a new bug: the DTrace-derived debug support in
sfxge is broken and was never really tested. As it was not correctly
including opt_kdtrace.h before, it was never enabled, so it stayed
broken for a while. Fix this by using a protection stub, leaving the
sfxge driver authors the responsibility for fixing it appropriately [1].
Sponsored by: EMC / Isilon storage division
Discussed with: rstone
[0] Reported by: rstone
[1] Discussed with: philip
is chunked into pieces limited by the integer io_hold_cnt tunable, while
vm_fault_quick_hold_pages() takes the integer max_count as the upper bound.
Rearrange the checks to correctly handle overflowing address arithmetic.
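The general pattern behind such a rearrangement, with illustrative
names rather than the exact vn_io_fault code (off <= end is assumed to
have been established already):

    /* Compare against the remaining room instead of summing, which can wrap. */
    if (len > end - off)            /* rather than: off + len > end */
            len = end - off;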
Submitted by: bde
Tested by: pho
Discussed with: alc
MFC after: 1 week
Per POSIX, si_status should contain the value passed to exit() for
si_code==CLD_EXITED and the signal number for other si_code. This was
incorrect for CLD_EXITED and CLD_DUMPED.
This is still not fully POSIX-compliant (Austin group issue #594 says that
the full value passed to exit() shall be returned via si_status, not just
the low 8 bits) but is sufficient for a si_status-related test in libnih
(upstart, Debian/kFreeBSD).
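The expected semantics after the fix, sketched with waitid(2) (child is
a previously forked process):

    siginfo_t si;

    if (waitid(P_PID, child, &si, WEXITED) == 0) {
            if (si.si_code == CLD_EXITED)
                    /* si_status is the exit() value (still only the low 8 bits). */
                    printf("exited with %d\n", si.si_status);
            else
                    /* CLD_KILLED/CLD_DUMPED: si_status is the signal number. */
                    printf("killed by signal %d\n", si.si_status);
    }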
PR: kern/184002
Reported by: Dmitrijs Ledkovs
Tested by: Dmitrijs Ledkovs
a very hard time fully understanding) with much more intuitive rights:
CAP_EVENT - when set on descriptor, the descriptor can be monitored
with syscalls like select(2), poll(2), kevent(2).
CAP_KQUEUE_EVENT - When set on a kqueue descriptor, the kevent(2)
syscall can be called on this kqueue with the eventlist
argument set to a non-NULL value; in other words, the given
kqueue descriptor can be used to monitor other descriptors.
CAP_KQUEUE_CHANGE - When set on a kqueue descriptor, the kevent(2)
syscall can be called on this kqueue with the changelist
argument set to a non-NULL value; in other words, it allows
modifying events monitored with the given kqueue descriptor.
Add alias CAP_KQUEUE, which allows for both CAP_KQUEUE_EVENT and
CAP_KQUEUE_CHANGE.
Add backward compatibility define CAP_POLL_EVENT which is equal to CAP_EVENT.
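For example, a kqueue descriptor limited to CAP_KQUEUE_EVENT can wait
for events but no longer register new ones:

    cap_rights_t rights;
    int kq = kqueue();

    cap_rights_init(&rights, CAP_KQUEUE_EVENT);
    cap_rights_limit(kq, &rights);
    /* kevent(kq, NULL, 0, evs, nevs, NULL) is still allowed, while a
       call with a non-NULL changelist now fails with ENOTCAPABLE. */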
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
This fixes fallout from r256425.
Reported by: Pavel Timofeev <timp87@gmail com>
Tested by: Pavel Timofeev <timp87@gmail com>
Reviewed by: Roger Pau Monné
MFC after: 3 days
shared vnode lock for VOP_PUTPAGES() as well. The only such
filesystem in the tree is ZFS, and it uses
vnode_pager_generic_putpages(), which performs the pageout with
VOP_WRITE().
Reviewed by: alc
Discussed with: avg
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
holding the vnode lock; vp->v_mount is checked first for NULL
equality, and then dereferenced if not NULL. If the vnode is reclaimed
in the meantime, the second dereference could still give NULL. Change
VFS_PROLOGUE() to evaluate the mp once, and convert the
MNTK_SHARED_WRITES and MNTK_EXTENDED_SHARED tests into inline
functions.
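The safe pattern, sketched; the helper name is illustrative, not
necessarily the one added to sys/mount.h:

    static inline bool
    vn_shared_writes(struct vnode *vp)
    {
            struct mount *mp;

            mp = vp->v_mount;       /* evaluate v_mount exactly once */
            return (mp != NULL &&
                (mp->mnt_kern_flag & MNTK_SHARED_WRITES) != 0);
    }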
Reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
words, every architecture is now auto-sizing the kmem arena. This revision
changes kmeminit() so that the definition of VM_KMEM_SIZE_SCALE becomes
mandatory and the definition of VM_KMEM_SIZE becomes optional.
Replace or eliminate all existing definitions of VM_KMEM_SIZE. With
auto-sizing enabled, VM_KMEM_SIZE effectively became an alternate spelling
for VM_KMEM_SIZE_MIN on most architectures. Use VM_KMEM_SIZE_MIN for
clarity.
Change kmeminit() so that the effect of defining VM_KMEM_SIZE is similar to
that of setting the tunable vm.kmem_size. Whereas the macros
VM_KMEM_SIZE_{MAX,MIN,SCALE} have had the same effect as the tunables
vm.kmem_size_{max,min,scale}, the effects of VM_KMEM_SIZE and vm.kmem_size
have been distinct. In particular, whereas VM_KMEM_SIZE was overridden by
VM_KMEM_SIZE_{MAX,MIN,SCALE} and vm.kmem_size_{max,min,scale}, vm.kmem_size
was not. Remedy this inconsistency. Now, VM_KMEM_SIZE can be used to set
the size of the kmem arena at compile-time without that value being
overridden by auto-sizing.
Update the nearby comments to reflect the kmem submap being replaced by the
kmem arena. Stop duplicating the auto-sizing formula in every machine-
dependent vmparam.h and place it in kmeminit() where auto-sizing takes
place.
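With this change, pinning the arena size at compile time can be done
with a single definition, e.g. (the value is only an example):

    /* In the MD vmparam.h, or via an equivalent kernel config option. */
    #define VM_KMEM_SIZE    (512UL * 1024 * 1024)   /* fix the kmem arena at 512MB */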
Reviewed by: kib (an earlier version)
Sponsored by: EMC / Isilon Storage Division