a lookup or created via VOP_CREATE()) into a new vn_open_vnode() function
and use this function in fhopen() instead of duplicating code from
vn_open_cred() directly.
Tested by: pho
Reviewed by: kib
MFC after: 2 weeks
m_cat(), storing a pointer to the last mbuf in the chain in a local
variable and attaching the new mbuf to the end of the chain.
The submitter reports that CPU load dropped by more than 10% on a web
server serving large files with this optimisation.
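As an illustrative sketch (not the committed code), the idea is to keep a
tail pointer while building the chain instead of re-walking it with m_cat()
on every append; get_next_mbuf() is a stand-in for however the caller
obtains the next buffer:

	struct mbuf *top = NULL, *tail = NULL, *m;

	while ((m = get_next_mbuf()) != NULL) {
		if (top == NULL)
			top = m;
		else
			tail->m_next = m;	/* O(1) append via remembered tail */
		for (tail = m; tail->m_next != NULL; tail = tail->m_next)
			;			/* m may itself be a short chain */
	}
	/* 'top' heads the whole chain; m_cat(top, m) would have walked
	 * the chain from 'top' on every iteration. */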
Submitted by: Sergey Budnevitch <sb nginx.com>
to SYSINIT routines if they can be resolved via symbol lookup in DDB.
To avoid false positives, only honor a name if the symbol resolves
exactly to the pointer value (no offset).
MFC after: 1 week
perform copyin/copyout of the file data into the usermode
buffer. Typical filesystems hold the vnode lock and some buffer locks
over the VOP_READ() and VOP_WRITE() operations, and since the page
fault handler may need to recurse into the VFS to get the page
content, a deadlock is possible.
The facility works by disabling page fault handling for the current
thread and attempting to execute the i/o while allowing uiomove() to
access the usermode mapping of the i/o buffer. If all buffer pages are
resident, uiomove() is successful and the request is finished. If
EFAULT is returned from uiomove(), the pages backing the i/o buffer
are faulted in and held, and the copyin/out is performed using
uiomove_fromphys() over the held pages for the second attempt of the
VOP call.
Since pages are held in chunks to prevent large i/o requests from
starving the free pages pool, and since the vnode lock is only taken
for i/o over the current chunk, the vnode lock no longer protects the
atomicity of the whole i/o request. Use the newly added rangelocks to
provide the required atomicity of i/o with regard to other i/o and
truncations.
Filesystems need to explicitly opt in to the scheme by setting the
MNTK_NO_IOPF struct mount flag, and optionally by using the
vn_io_fault_uiomove(9) helper, which takes care of calling uiomove()
or converting the uio into a request for uiomove_fromphys().
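A rough sketch of the two-attempt flow, assuming the existing
vm_fault_disable_pagefaults()/vm_fault_enable_pagefaults() helpers;
hold_user_pages() and drop_user_pages() are hypothetical placeholders for
the page-holding and uiomove_fromphys() conversion, and vnode locking is
omitted:

	static int
	vn_read_sketch(struct vnode *vp, struct uio *uio, struct ucred *cred)
	{
		int error, save;

		/* First attempt: no page faults allowed while uiomove()
		 * touches the usermode buffer. */
		save = vm_fault_disable_pagefaults();
		error = VOP_READ(vp, uio, 0, cred);
		vm_fault_enable_pagefaults(save);
		if (error != EFAULT)
			return (error);

		/* Second attempt: fault in and hold the pages backing the
		 * current chunk, then redo the VOP; uiomove_fromphys() is
		 * used to copy over the held pages. */
		error = hold_user_pages(uio);		/* hypothetical */
		if (error == 0) {
			error = VOP_READ(vp, uio, 0, cred);
			drop_user_pages(uio);		/* hypothetical */
		}
		return (error);
	}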
Reviewed by: bf (comments), mdf, pjd (previous version)
Tested by: pho
Tested by: flo, Gustau Pérez <gperez entel upc edu> (previous version)
MFC after: 2 months
the i/o regions of the vnode data space. The implementation is quite
simple-minded: it uses a list of lock requests ordered by arrival
time. Each request may be for read or for write. The implementation
is a fair FIFO.
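A hedged sketch of how an i/o path might bracket its region with such a
lock, assuming vnode wrappers named vn_rangelock_rlock() and
vn_rangelock_unlock(); the surrounding variables are illustrative:

	void *cookie;
	off_t start, end;

	start = uio->uio_offset;
	end = start + uio->uio_resid;

	cookie = vn_rangelock_rlock(vp, start, end);
	/* ... perform the (possibly chunked) read; writers and truncations
	 * overlapping [start, end) are excluded until we unlock ... */
	vn_rangelock_unlock(vp, cookie);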
MFC after: 2 months
implementation specific vs. the common architecture definition.
Bring PPC4XX defines (PSL, SPR, TLB). Note the new definitions under
BOOKE_PPC4XX are not used in the code yet.
This change set is not supposed to affect existing E500 support; it's
just another reorg step before bringing in support for E500mc, E5500
and PPC465.
Obtained from: AppliedMicro, Freescale, Semihalf
This combination doesn't make sense; unit numbers should be hardwired
only in the context of a known driver. Wildcard devices should have
wildcard unit numbers.
Reviewed by: jhb
MFC after: 2 weeks
this is a VNET kernel or not. gcc used to put the static symbol into
the symbol table; clang does not. This fixes the 'netstat: no namelist'
error seen on clang+VNET systems.
A 'flags' field is added to the end of the bpf_if structure. Currently
the only flag is BPFIF_FLAG_DYING, which is set on bpf detach and
checked by bpf_attachd().
The problem can be easily triggered on SMP stable/[89] by a command like
the following (sort of):
'while true; do ifconfig vlan222 create vlan 222 vlandev em0 up ; tcpdump -pi vlan222 & ; ifconfig vlan222 destroy ; done'
Fix a possible use-after-free when BPF detaches itself from an interface,
freeing the bpf_if memory, while the interface is still UP and there can
be routes via this interface.
Freeing is now delayed until the ifnet_departure_event is received via the
eventhandler(9) API.
Convert the bpfd rwlock back to a mutex due to the lack of a performance
gain (checking whether a packet matches the filter is currently done
without holding the bpfd lock, and we have to acquire the write lock if
the packet matches).
Approved by: kib(mentor)
MFC in: 4 weeks
Most parts are merged from amd64.
- i386/acpica/acpi_wakecode.S
Replaced with amd64 code (from realmode to paging enabling code).
- i386/acpica/acpi_wakeup.c
Replaced with amd64 code (except for wakeup_pagetables stuff).
- i386/include/pcb.h
- i386/i386/genassym.c
Added new PCB members (CR0, CR2, CR4, DS, ES, FS, SS, GDT, IDT, LDT
and TR) needed for suspend/resume, not for context switch.
- i386/i386/swtch.s
Added suspendctx() and resumectx().
Note that savectx() was left unchanged and is not used for suspending
(while the amd64 code uses it).
The BSP and APs execute the same suspend/resume sequence: suspendctx(),
acpi_wakecode() and resumectx() (also in the case of a UP system).
- i386/i386/apic_vector.s
Added cpususpend().
- i386/i386/mp_machdep.c
- i386/include/smp.h
Added cpususpend_handler().
- i386/include/apicvar.h
- kern/subr_smp.c
- sys/smp.h
Added IPI_SUSPEND and suspend_cpus().
- i386/i386/initcpu.c
- i386/i386/machdep.c
- i386/include/md_var.h
- pc98/pc98/machdep.c
Moved initializecpu() declarations to md_var.h.
MFC after: 3 days
Entries with zero inode number are considered placeholders by libc and
UFS. Fix remaining uses of VOP_READDIR in kernel: vop_stdvptocnp,
unionfs.
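For kernel consumers this simply means skipping such entries while walking
the VOP_READDIR() output; a minimal sketch of the convention, where dirbuf
and buflen are assumed to describe a buffer filled by a prior VOP_READDIR()
call:

	struct dirent *dp;

	for (dp = (struct dirent *)dirbuf;
	    (char *)dp < (char *)dirbuf + buflen;
	    dp = (struct dirent *)((char *)dp + dp->d_reclen)) {
		if (dp->d_fileno == 0)
			continue;	/* placeholder entry: ignore it */
		/* ... process the real entry ... */
	}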
Sponsored by: Google Summer of Code 2011
compatible with the sched provider implemented by Solaris and its open-
source derivatives. Full documentation of the sched provider can be found
on Oracle's DTrace wiki pages.
Note that for compatibility with scripts originally written for Solaris,
several probes are defined that will never fire. These probes are defined
to fire when Solaris-specific features perform certain actions. As these
features are not present in FreeBSD, the probes can never fire.
Also, I have added two probes that are not defined in Solaris: lend-pri
and load-change. These probes have been added to make it possible to
collect schedgraph data with DTrace.
Finally, a few probes are defined in Solaris to take a cpuinfo_t *
argument. As it was not immediately clear to me how to translate that to
FreeBSD, currently those probes are passed NULL in place of a cpuinfo_t *.
Sponsored by: Sandvine Incorporated
MFC after: 2 weeks
is running on another CPU, the CALLOUT_PENDING flag is temporarily
cleared. Then callout_stop() on this, in fact active, callout fails
because CALLOUT_PENDING is not set, and callout_stop() returns 0.
Now, in sleepq_check_timeout(), the failed callout_stop() causes the
sleepq code to execute mi_switch() without even setting the wmesg,
since the switch-out is supposed to be transient. In fact, the thread
is taken off the CPU for the full timeout interval, instead of being
put on the runq immediately. Until the timeout fires, the process is
unkillable for obvious reasons.
Fix this by marking the migrating callouts with the CALLOUT_DFRMIGRATION
flag. The flag is cleared by callout_stop_safe() when the function
detects a migration, besides returning success. The softclock()
rechecks the flag for a migrating callout and cancels its execution if
the flag was cleared in the meantime.
PR: misc/166340
Reported, debugging traces provided and tested by:
Christian Esken <christian.esken trivago com>
Reviewed by: avg, jhb
MFC after: 1 week
if the accounting log file is atomically replaced with a new file
(such as during log rotation).
- Simplify accounting log rotation a bit. There is no need to re-run
accton(8) after renaming the new log file to its real name.
PR: kern/167321
Tested by: Jeremy Chadwick
to the process id. It follows the ptrace(2) interface and allows debugging
libraries to use thread ids directly, without the slow and verbose
conversion of a thread id into a pid.
The PGET_NOTID flag is provided to allow a specific sysctl to disallow
this behaviour. All current callers of pget(9) have useful semantics
when operating on a tid and do not need this flag.
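A hedged sketch of a sysctl handler using the extended lookup; the flag
combination is illustrative, and pget(9) is assumed to return the process
locked on success:

	struct proc *p;
	int error;

	/* 'id' may now be either a pid or a tid. */
	error = pget(id, PGET_CANDEBUG | PGET_NOTWEXIT, &p);
	if (error != 0)
		return (error);
	/* ... copy the requested data out of the (locked) process ... */
	PROC_UNLOCK(p);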
Reviewed by: jhb, trocini
MFC after: 1 week
over just the active vnodes associated with a mount point to replace
MNT_VNODE_FOREACH_ALL in the vfs_msync, ffs_sync_lazy, and qsync
routines.
The vfs_msync routine is run every 30 seconds for every writably
mounted filesystem. It ensures that any files mmap'ed from the
filesystem with modified pages have those pages queued to be
written back to the file from which they are mapped.
The ffs_sync_lazy and qsync routines are run every 30 seconds for
every writably mounted UFS/FFS filesystem. The ffs_sync_lazy routine
ensures that any files that have been accessed in the previous
30 seconds have had their access times queued for updating in the
filesystem. The qsync routine ensures that any files with modified
quotas have those quotas queued to be written back to their
associated quota file.
In a system configured with 250,000 vnodes, less than 1000 are
typically active at any point in time. Prior to this change all
250,000 vnodes would be locked and inspected twice every minute
by the syncer. For UFS/FFS filesystems they would be locked and
inspected six times every minute (twice by each of these three
routines since each of these routines does its own pass over the
vnodes associated with a mount point). With this change the syncer
now locks and inspects only the tiny set of vnodes that are active.
Reviewed by: kib
Tested by: Peter Holm
MFC after: 2 weeks
a mount point. Active vnodes are those with a non-zero use or hold
count, e.g., those vnodes that are not on the free list. Note that
this list is in addition to the list of all the vnodes associated
with a mount point.
To avoid adding another set of linkage pointers to the vnode
structure, the active list uses the existing linkage pointers
used by the free list (previously named v_freelist, now renamed
v_actfreelist).
This update adds the MNT_VNODE_FOREACH_ACTIVE interface that loops
over just the active vnodes associated with a mount point (typically
less than 1% of the vnodes associated with the mount point).
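A minimal sketch of the intended iteration pattern, assuming the iterator
hands back each vnode with its interlock held; mp is the mount point being
processed and should_sync() is an illustrative predicate:

	struct vnode *vp, *mvp;

	MNT_VNODE_FOREACH_ACTIVE(vp, mp, mvp) {
		/* Only vnodes with a non-zero use or hold count are visited. */
		if (!should_sync(vp)) {
			VI_UNLOCK(vp);
			continue;
		}
		VI_UNLOCK(vp);
		/* ... take the vnode lock properly and do the work ... */
	}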
Reviewed by: kib
Tested by: Peter Holm
MFC after: 2 weeks
used only as a helper function in that file. Replace the sole call to
vbusy() with inline code in vholdl(). Replace the sole calls to vfree()
and vdestroy() with inline code in vdropl().
The Clang compiler already inlines these functions, so they do not
show up in a kernel backtrace, which is confusing. Also, you cannot
select their frame in kgdb, which means that it is impossible to view
their local variables. So, while the produced code is unchanged,
the debugging should be easier.
Discussed with: kib
MFC after: 2 weeks
The primary changes are that the user of the interface no longer
needs to manage the mount-mutex locking and that the vnode that
is returned has its mutex locked (thus avoiding the need to check
whether it is DOOMED or in other possible end-of-life scenarios).
To minimize compatibility issues for third-party developers, the
old MNT_VNODE_FOREACH interface will remain available so that this
change can be MFC'ed to 9. Following the MFC to 9, MNT_VNODE_FOREACH
will be removed in head.
The reason for this update is to prepare for the addition of the
MNT_VNODE_FOREACH_ACTIVE interface that will loop over just the
active vnodes associated with a mount point (typically less than
1% of the vnodes associated with the mount point).
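As a hedged sketch, a converted caller now looks roughly like this (mp is
the mount point being processed; the filesystem-specific work is elided):

	struct vnode *vp, *mvp;

	/* No MNT_ILOCK()/MNT_IUNLOCK() management is needed around the loop. */
	MNT_VNODE_FOREACH_ALL(vp, mp, mvp) {
		/* vp is returned with its interlock already held. */
		if (vp->v_type != VREG) {
			VI_UNLOCK(vp);
			continue;
		}
		VI_UNLOCK(vp);
		/* ... vget(), process and vput() as the filesystem requires ... */
	}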
Reviewed by: kib
Tested by: Peter Holm
MFC after: 2 weeks
allow the owner to read and write the ACL and file attributes when
there was no entry with a subject matching the owner. In other words,
'getfacl meh' shouldn't fail for the owner if the ACL looks like this:
# file: meh
# owner: trasz
# group: wheel
user:root:------a-------:------:allow
Reported by: kientzle
like the one triggered by this:
# kldload geom_vinum
# pwait `pgrep -S gv_worker` &
# kldunload geom_vinum
or this:
GEOM_JOURNAL: Shutting down geom gjournal 3464572051.
panic: destroying non-empty racct: 1 allocated for resource 6
which were tracked down by jh@ to be caused by checking p->p_flag
while it wasn't initialised yet. Basically, during fork, the code
checked p_flag, concluded the process isn't marked as P_SYSTEM,
incremented the counter, and later on, when exiting, checked that
the process was marked as P_SYSTEM, and thus didn't decrement it.
Also, I believe there wasn't any good reason for checking P_SYSTEM
in the first place.
Tested by: jh
but GNU libc used it without checking the kernel version, e.g., on Fedora 10.
- Move pipe(2) implementation for Linuxulator from MD files to MI file,
sys/compat/linux/linux_file.c. There is no MD code for this syscall at all.
- Correct the argument type for pipe() from l_ulong * to l_int *. This was
probably the source of the MI/MD confusion.
Reviewed by: emulation
backtrace for an arbitrary thread (rather than the calling thread).
A kdb_backtrace_thread() wrapper function uses the configured debugger
if possible; otherwise it falls back to using stack(9) if that is
available.
- Replace a direct call to db_trace_thread() in propagate_priority()
with a call to kdb_backtrace_thread() instead.
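For illustration, a diagnostic path might use the wrapper roughly like
this; the surrounding condition and the 'owner' variable are made up:

	if (owner != NULL && TD_IS_RUNNING(owner))
		kdb_backtrace_thread(owner);	/* dump the blamed thread's stack */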
MFC after: 1 week
fail to load (the MOD_LOAD event fails) during a kldload(2), unload the
linker file and fail the kldload(2) with ENOEXEC.
Reported by: gcooper
MFC after: 1 week
in td_errno. The flag is supposed to be used by syscalls returning
EJUSTRETURN because the errno was already placed into the usermode
frame by a call to set_syscall_retval(9). Both ktrace and dtrace get
the errno value from td_errno if the flag is set.
Use the flag to fix sigsuspend(2) error return ktrace records.
Requested by: bde
MFC after: 1 week
static and declare its prototype in sys/vnode.h) so that it can be
called from process_deferred_inactive() (in ufs/ffs/ffs_snapshot.c)
instead of the body of vinactive() being cut and pasted into
process_deferred_inactive().
Reviewed by: kib
MFC after: 2 weeks
being attached. This is implemented by adding a new DS_ATTACHING state
while a device's DEVICE_ATTACH() method is being invoked. A driver is
required not to fail an attach of a busy device. The device's state
will be promoted to DS_BUSY rather than DS_ACTIVE if the device was
marked busy during DEVICE_ATTACH().
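A hedged sketch of what this permits in a driver; foo_attach() and its
body are illustrative:

	static int
	foo_attach(device_t dev)
	{
		/* ... allocate resources and start the hardware ... */

		/*
		 * Marking the device busy during DEVICE_ATTACH() is now
		 * legal; once attach returns, the state becomes DS_BUSY
		 * instead of DS_ACTIVE.
		 */
		device_busy(dev);
		return (0);
	}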
Reviewed by: kib
MFC after: 1 week
outside the range of valid file descriptors.
PR: kern/164970
Submitted by: Peter Jeremy <peterjeremy@acm.org>
Reviewed by: jilles
Approved by: cperciva
MFC after: 1 week
The SA_PROC signal property indicated whether each signal number is directed
at a specific thread or at the process in general. However, that depends on
how the signal was generated and not on the signal number. SA_PROC was not
used.
Interface locks and descriptor locks are converted from mutex(9) to rwlock(9).
This greatly improves performance: in the most common case we need to acquire
one reader lock instead of two mutexes.
- Remove the filter (descriptor) reader lock in bpf_mtap[2].
This was suggested by glebius@. We protect the filter by requesting the
interface writer lock on filter change.
- Cover struct bpf_if under the BPF_INTERNAL define. This permits including
bpf.h without including rwlock stuff. However, this is a temporary solution;
struct bpf_if should be made opaque to any external caller.
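A hedged sketch of the resulting locking pattern, using plain rwlock(9)
calls and illustrative names in place of the real BPF structures and
macros:

	/* Tap path: a single reader lock on the interface suffices. */
	rw_rlock(&bif_lock);
	LIST_FOREACH(d, &bif_dlist, bd_next) {
		/* The filter is evaluated without the descriptor lock; it
		 * can only change under the interface writer lock below. */
		if (filter_matches(d, pkt, pktlen))
			deliver(d, pkt, pktlen);
	}
	rw_runlock(&bif_lock);

	/* Filter change path: the writer lock excludes all taps. */
	rw_wlock(&bif_lock);
	set_filter(d, new_filter);
	rw_wunlock(&bif_lock);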
Found by: Dmitrij Tejblum <tejblum@yandex-team.ru>
Sponsored by: Yandex LLC
Reviewed by: glebius (previous version)
Reviewed by: silence on -net@
Approved by: (mentor)
MFC after: 3 weeks
a pair of records similar to syscall entry and return that a user can
use to determine how long page faults take. The new ktrace records are
enabled via the 'p' trace type, and are enabled in the default set of
trace points.
Reviewed by: kib
MFC after: 2 weeks
application destroys the semaphore after sem_wait returns. Just enter
the kernel to wake up sleeping threads, and only update _has_waiters
if it is safe. While here, check if the value exceeds SEM_VALUE_MAX
and return EOVERFLOW if so.
a mutex after a thread has unlocked it; it even writes data to the
mutex memory to clear the contention bit. There is a race where other
threads can lock it, unlock it and then destroy it, so the unlocking
thread should not write data to the mutex memory if there isn't any
waiter.
The new operation UMTX_OP_MUTEX_WAKE2 tries to fix the problem. It
requires the thread library to clear the lock word entirely, then
call the WAKE2 operation to check if there is any waiter in the kernel
and try to wake up a thread; if necessary, the contention bit is set
again by the operation. This also mitigates the chance that other
threads find the contention bit and try to enter the kernel to compete
with each other to wake up the sleeping thread, which is unnecessary.
With this change, the mutex owner no longer holds the mutex until it
reaches a point where the kernel umtx queue is locked; it releases the
mutex as soon as possible.
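A hedged userspace sketch of the unlock protocol this enables; the
lock-word handling is simplified and is not the exact libthr code:

	#include <sys/types.h>
	#include <sys/umtx.h>
	#include <machine/atomic.h>

	static void
	mutex_unlock_sketch(struct umutex *m, uint32_t flags)
	{
		uint32_t owner;

		/* Give up the lock word entirely first; the mutex memory
		 * must not be written again after this point. */
		owner = atomic_readandclear_32((volatile uint32_t *)&m->m_owner);

		/* Let the kernel look for waiters, wake one up and, if
		 * needed, set the contention bit again on our behalf. */
		if ((owner & UMUTEX_CONTESTED) != 0)
			_umtx_op(m, UMTX_OP_MUTEX_WAKE2, flags, NULL, NULL);
	}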
Performance is improved when the mutex is heavily contested. On an Intel
i3-2310M, the runtime of a benchmark program is reduced from 26.87 seconds
to 2.39 seconds; it is even better than UMTX_OP_MUTEX_WAKE, which is now
deprecated. http://people.freebsd.org/~davidxu/bench/mutex_perf.c
via procstat(1) and fstat(1):
- Change shm file descriptors to track the pathname they are associated
with and add a shm_path() method to copy the path out to a caller-supplied
buffer.
- Use the fo_stat() method of shared memory objects and shm_path() to
export the path, mode, and size of a shared memory object via
struct kinfo_file.
- Add a struct shmstat to the libprocstat(3) interface along with a
procstat_get_shm_info() function to export the mode and size of a shared
memory object.
- Change procstat to always print out the path for a given object if it
is valid.
- Teach fstat about shared memory objects and to display their path,
mode, and size.
MFC after: 2 weeks
New kernel events can be added at various locations for sampling or counting.
This will, for example, allow easy system profiling with well-known tools
like pmcstat(8), whatever the processor is.
Simultaneous usage of software PMCs and hardware PMCs is possible, for
example looking at lock acquire failures and page faults while sampling on
instructions.
Sponsored by: NETASQ
MFC after: 1 month
loaded and unloaded, also have sdt.ko register callbacks with kern_sdt.c
that will be called when a newly loaded KLD module adds more probes or
a module with probes is unloaded.
This fixes two issues: first, if a module with SDT probes was loaded after
sdt.ko was loaded, those new probes would not be available in DTrace.
Second, if a module with SDT probes was unloaded while sdt.ko was loaded,
the kernel would panic the next time DTrace had cause to try and do
anything with the no-longer-existent probes.
This makes it possible to create SDT probes in KLD modules, although there
are still two caveats: first, any SDT probes in a KLD module must be part
of a DTrace provider that is defined in that module. At present DTrace
only destroys probes when the provider is destroyed, so you can still
panic the system if a KLD module creates new probes in a provider from a
different module (including the kernel) and then unloads the first module.
Second, the system will panic if you unload a module containing SDT probes
while there is an active D script that has enabled those probes.
MFC after: 1 month
The function acquires the reader lock if needed.
Assertions check for the reader or writer lock (RA_LOCKED / RA_UNLOCKED).
- While here, add knlist_init_mtx.9 to MLINKS and fix some style(9) issues
Reviewed by: glebius
Approved by: ae(mentor)
MFC after: 2 weeks
kernel.
When access restrictions are added to a page table entry, we flush the
corresponding virtual address mapping from the TLB. In contrast, when
access restrictions are removed from a page table entry, we do not
flush the virtual address mapping from the TLB. This is exactly as
recommended in AMD's documentation. In effect, when access
restrictions are removed from a page table entry, AMD's MMUs will
transparently refresh a stale TLB entry. In short, this saves us from
having to perform potentially costly TLB flushes. In contrast,
Intel's MMUs are allowed to generate a spurious page fault based upon
the stale TLB entry. Usually, such spurious page faults are handled
by vm_fault() without incident. However, when we are executing
no-fault sections of the kernel, we are not allowed to execute
vm_fault(). This change introduces special-case handling for spurious
page faults that occur in no-fault sections of the kernel.
In collaboration with: kib
Tested by: gibbs (an earlier version)
I would also like to acknowledge Hiroki Sato's assistance in
diagnosing this problem.
MFC after: 1 week
inserted after the priority token, thus cleaning up the output.
- Remove the needless double internal do_add_char function.
- Resolve a possible deadlock if interrupts are
disabled and getnanotime is called
Reviewed by: bde kmacy, avg, sbruno (various versions)
Approved by: cperciva
MFC after: 2 weeks
hardclock() tick should be run on every active CPU, or on only one.
In my tests, avoiding the extra interrupts on an 8-CPU Core i7 system with
HZ=10000 saves about 2% of performance. At the moment the option is
implemented only for global timers, as reprogramming per-CPU timers is
currently too expensive to be compensated by this benefit, especially since
we still have to regularly run hardclock() on at least one active CPU to
update the system uptime. For global timers it is quite trivial: the timer
always runs, but we just skip IPIs to other CPUs when possible.
The option is enabled by default for now, keeping the previous behavior, as
periodic hardclock() calls are still used at least to implement setitimer(2)
with the ITIMER_VIRTUAL and ITIMER_PROF arguments. But since the default
schedulers no longer depend on it as of r232917, we are much freer to
experiment with it.
MFC after: 1 month
with HZ rate through the sched_tick() calls from hardclock().
Potentially it can be used to improve precision, but for now it just removes
one more reason to call hardclock() for every HZ tick on every active CPU.
SCHED_4BSD never used sched_tick(), but keep it in place for now, as at
least SCHED_FBFS, which exists in out-of-tree patches, depends on it.
MFC after: 1 month
Instead of using a 25MHz equality threshold, look for the nearest value when
handling the dev.cpu.0.freq sysctl, and for an exact match when one is
expected.
ACPI may report an extra level with a frequency 1MHz above the nominal to
control Intel Turbo Boost operation. It is not a bug, but a feature:
dev.cpu.0.freq_levels: 2934/106000 2933/95000 2800/82000 ...
In this case the value 2933 means 2.93GHz, but 2934 means 3.2-3.6GHz.
I've found that my Core i7-870 based system has Intel Turbo Boost disabled
by default, and without this change it was absolutely invisible and hard
to control.
MFC after: 2 weeks
- Pass the number of events to the statclock() and profclock() functions,
the same as was done for hardclock() before, to avoid calling them many
times in a loop.
- Rename them into statclock_cnt() and profclock_cnt().
- Turn statclock() and profclock() into compatibility wrappers,
still needed for arm.
- Rename hardclock_anycpu() into hardclock_cnt() for unification.
MFC after: 1 week
the cached name used for KTR_SCHED traces when a thread's name changes.
This way KTR_SCHED traces (and thus schedgraph) will notice when a thread's
name changes, most commonly via execve().
MFC after: 2 weeks
fifo_iseof() condition, allowing the v_fifoinfo to be reset and freed
by fifo_cleanup().
Precalculate EOF at the places where fo_wgen is changed, and cache the
state in a new pipe state flag, PIPE_SAMEWGEN.
Reported and tested by: bf
Submitted by: gianni
MFC after: 1 week (a backport)
does not fit into registers, declare that we do not support this case
using CTASSERT(), and remove endianness-unsafe code that split the
return value into td_retval.
While there, change the style of the sysctl debug.iosize_max_clamp
definition.
Requested by: bde
MFC after: 3 weeks
using the o32 ABI. This mostly follows nwhitehorn's lead in implementing
COMPAT_FREEBSD32 on powerpc64.
o) Add a new type to the freebsd32 compat layer, time32_t, which is time_t in the
32-bit ABI being used. Since the MIPS port is relatively new, even the 32-bit
ABIs use a 64-bit time_t.
o) Because time{spec,val}32 has the same size and layout as time{spec,val} on MIPS
with 32-bit compatibility, disable some code which wrongly assumes otherwise
when built for MIPS. A more general macro to check in this case would
seem like a good idea eventually. If someone adds support for using n32
userland with n64 kernels on MIPS, then they will have to add a variety of
flags related to each piece of the ABI that can vary. That's probably the
right time to generalize further.
o) Add MIPS to the list of architectures which use PAD64_REQUIRED in the
freebsd32 compat code. Probably this should be generalized at some point.
Reviewed by: gonzo
significantly. Upon investigation this was caused by name cache
misses for lookups of "..". For name cache entries for non-".."
directories, the cache entry serves double duty. It maps both the
named directory plus ".." for the parent of the directory. As such,
two ctime values (one for each of the directory and its parent) need
to be saved in the name cache entry.
This patch adds an entry for ctime of the parent directory to the
name cache. It also adds an additional uma zone for large entries
with this time value, in order to minimize memory wastage.
As well, it fixes a couple of cases where the mtime of the parent
directory was being saved instead of ctime for positive name cache
entries. With this patch, Lookup RPC counts return to values similar
to pre-r230394 kernels.
Reported by: bde
Discussed with: kib
Reviewed by: jhb
MFC after: 2 weeks
long for specifying a boundary constraint.
- Change bus_dma tags to use bus_addr_t instead of bus_size_t for boundary
constraints.
These allow boundary constraints to be fully expressed for cases where
sizeof(bus_addr_t) != sizeof(bus_size_t). Specifically, it allows a
driver to properly specify a 4GB boundary in a PAE kernel.
Note that this cannot be safely MFC'd without a lot of compat shims due
to KBI changes, so I do not intend to merge it.
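For example, a driver on a PAE kernel can now express a 4GB boundary
directly; this tag is illustrative and not taken from any particular driver:

	error = bus_dma_tag_create(
	    bus_get_dma_tag(dev),	/* parent */
	    1,				/* alignment */
	    (bus_addr_t)1 << 32,	/* boundary: 4GB, needs bus_addr_t on PAE */
	    BUS_SPACE_MAXADDR,		/* lowaddr */
	    BUS_SPACE_MAXADDR,		/* highaddr */
	    NULL, NULL,			/* filter, filterarg */
	    MAXPHYS, 1, MAXPHYS,	/* maxsize, nsegments, maxsegsz */
	    0,				/* flags */
	    NULL, NULL,			/* lockfunc, lockarg */
	    &sc->dma_tag);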
Reviewed by: scottl
snapshots on UFS filesystems running with journaled soft updates.
This is the first of several bugs that need to be fixed before
removing the restriction added in -r230250 to prevent the use
of snapshots on filesystems running with journaled soft updates.
The deadlock occurs when holding the snapshot lock (snaplk)
and then trying to flush an inode via ffs_update(). We become
blocked by another process trying to flush a different inode
contained in the same inode block that we need. It holds the
inode block for which we are waiting locked. When it tries to
write the inode block, it gets blocked waiting for our
snaplk when it calls ffs_copyonwrite() to see if the inode
block needs to be copied in our snapshot.
The most obvious place that this deadlock arises is in the
ffs_copyonwrite() routine when it updates critical metadata
in a snapshot and tries to write it out before proceeding.
The fix here is to write the data and indirect block pointer
for the snapshot, but to skip the call to ffs_update() to
write the snapshot inode. To ensure that we will never have
to update a pointer in the inode itself, the ffs_snapshot()
routine that creates the snapshot has to ensure that all the
direct blocks are allocated as part of the creation of the
snapshot.
A less obvious place that this deadlock occurs is when we hold
the snaplk because we are deleting a snapshot. In the course of
doing the deletion, we need to allocate various soft update
dependency structures and allocate some journal space. If we
hit a resource limit while doing this we decrease the resources
in use by flushing out an existing dirty file to get it to give
up the soft dependency resources that it holds. The flush can
cause an ffs_update() to be done on the inode for the file that
we have selected to flush resulting in the same deadlock as
described above when the inode that we have chosen to flush
resides in the same inode block as the snapshot inode that we hold.
The fix is to defer cleaning up any time that the inode on which
we are operating is a snapshot.
Help and review by: Jeff Roberson
Tested by: Peter Holm
MFC (to 9 only) after: 2 weeks
operations for setting and accessing a vnode's v_socket field.
The operations are necessary to implement proper unix socket handling
on layered file systems like nullfs(5).
This change fixes the long-standing issue with nullfs(5), namely that
unix sockets did not work between the lower and upper layers: if we
bound to a socket on the lower layer we could connect only to the lower
path; if we bound to the upper layer we could connect only to the
upper path. The new behavior is that one can connect to both the lower
and the upper paths regardless of what layer path one binds to.
PR: kern/51583, kern/159663
Suggested by: kib
Reviewed by: arch
MFC after: 2 weeks
following clang warning:
sys/kern/sys_pipe.c:1556:10: error: promoted type 'int' of K&R function parameter is not compatible with the parameter type 'mode_t'
(aka 'unsigned short') declared in a previous prototype [-Werror]
mode_t mode;
^
sys/kern/sys_pipe.c:155:19: note: previous declaration is here
static fo_chmod_t pipe_chmod;
^
not get the syscall exit notification until the child performed exec or
exit. Swap the order of doing ptracestop() and waiting for P_PPWAIT
clearing, by postponing the wait into syscallret after ptracestop()
notification is done.
Reported, tested and reviewed by: Dmitry Mikulin <dmitrym juniper net>
MFC after: 2 weeks
Descriptions are specific to drivers and we don't change drivers on attached
devices. This fixes a few places where we were not clearing the description
when detaching a driver (e.g. when device_attach() failed). While here, fix
a few other nits:
- Remove spurious call to remove a device's driver from
devclass_driver_deleted(). device_detach() removes it already.
- Fix a typo.
- In sched_pickcpu() be more careful about taking the previous CPU on SMT
systems. Do it only if all other logical CPUs of that physical one are idle,
to avoid extra resource sharing.
- In sched_pickcpu() change the general logic of CPU selection. First
look for an idle CPU sharing the last level cache with the previously used
one, skipping SMT CPU groups. If none is found, search all CPUs for the
least loaded one where the thread, at its priority, can run now. If none is
found, search just for the least loaded CPU.
- Make cpu_search() compare the lowest/highest CPU load when comparing CPU
groups with equal load. That allows differentiating 1+1 and 2+0 loads.
- Make cpu_search() prefer the specified (previous) CPU or group if the load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if the above factors are equal. The previous code
tended to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework the periodic balancer in sched_balance_group(). With cpu_search()
now more intelligent, make the balancing process flat, removing recursion
over the topology tree. That fixes the double-swap problem and makes load
distribution more even and predictable.
All together this gives a 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, when the number of threads is less than the
number of logical CPUs. In some tests it also gives a positive effect for
systems without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
Uftdi(4) examines (c_iflag & (IXON|IXOFF)) to control hardware XON/XOFF
support. This is obviously no good if changes to those bits are not
communicated down the stack.
allow.mount.zfs:
allow mounting the zfs filesystem inside a jail
This way the permissions for mounting all current VFCF_JAIL filesystems
inside a jail are controlled via allow.mount.* jail parameters.
Update sysctl descriptions.
Update jail(8) and zfs(8) manpages.
TODO: document the connection of allow.mount.* and VFCF_JAIL for kernel
developers
MFC after: 10 days
socket protocol number. This is useful since the socket type can
be implemented by different protocols in the same protocol family,
e.g. SOCK_STREAM may be provided by both TCP and SCTP.
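Assuming the option is exposed as SO_PROTOCOL at the SOL_SOCKET level, a
caller could query it like this ('s' is an existing socket descriptor):

	#include <sys/types.h>
	#include <sys/socket.h>
	#include <stdio.h>

	int proto;
	socklen_t len = sizeof(proto);

	if (getsockopt(s, SOL_SOCKET, SO_PROTOCOL, &proto, &len) == 0)
		printf("protocol: %d\n", proto);	/* e.g. IPPROTO_TCP or IPPROTO_SCTP */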
Submitted by: Jukka A. Ukkonen <jau iki fi>
PR: kern/162352
Discussed with: bz
Reviewed by: glebius
MFC after: 2 weeks
unp->unp_vnode pointer to detect whether there is a vnode associated
with (bound to) this socket, and does the necessary cleanup if there is.
The issue is that after a forced unmount this check may be too late, as
the unp_vnode is reclaimed and the reference is stale.
To fix this, provide a helper function that is called on socket vnode
reclamation to do the necessary cleanup.
Pointed out by: kib
Reviewed by: kib
MFC after: 2 weeks