Move the xpt_run_devq() call before the request completion callback, where
it was originally.
I am not sure why exactly I moved it during one of the many refactorings of
the camlock project, but it obviously opens a race window that may cause
use-after-free panics during SIM (in reported cases umass(4)) detach.
- Fix a minor deadlock.
- Fix a possible memory use-after-free and leak situation associated
with USB device detach when using character device handles. This also
includes LibUSB. It turns out that "usb_close()" cannot always get a
reference to clean up its USB transfers and such if called during
kernel USB device detach.
- Separate I/O errors from reception of STALL PID.
- Implement better error recovery for Transaction Translators (TTs),
found in High Speed USB hubs, which translate from High Speed USB into
FULL or LOW speed USB. In some rare cases SPLIT transactions might get
lost, which might leave the TT in an unknown state. Whenever we detect
such an error, try to issue either a clear-TT-buffer request or, if that
is not possible, reset the whole TT.
Remove the inapplicable PI_SDTR_ABLE and PI_WIDE_16 hba_inquiry flags, so
that CAM does not try to negotiate unsupported settings, and suppress the
resulting warnings.
While there, enable command queuing on pass-through devices; it was
announced in hba_inquiry but disabled. Even though the queue size is very
small, it seems to work well enough.
Several enhancements to the I/O APIC support in bhyve, including:
- Move the I/O APIC device model from userspace into vmm.ko and add
ioctls to assert and deassert I/O APIC pins.
- Add HPET device emulation including a single timer block with 8 timers.
- Remove the 'vdev' abstraction.
Approved by: neel
Add the Raspberry Pi BSC (I2C compliant) controller driver.
Reviewed by: rpaulo
MFC r256961:
Enable the build of OFW I2C bus for FDT systems.
MFC r258045:
As all the IIC controllers on the system use the same 'iichb' prefix, we
cannot rely only on checking the device unit to identify the BSC unit we
are attaching to. Make use of the device base address to identify our BSC
unit.
MFC r259127:
Bring the RPi I2C driver in line with ti_i2c. Make it treat any slave
address as a 7-bit address.
Approved by: adrian (mentor)
Rework NFS Duplicate Request Cache cleanup logic.
- Introduce an additional hash to group requests by the hash of their
sockref. This allows TCP acknowledgements to be processed without looping
through the whole cache, and as a result allows doing it every time (see
the sketch after this entry).
- Introduce additional callbacks to notify the application layer about
socket disconnection. Without this, the last few requests processed just
before a socket disconnection never had their ACKs processed and were
stuck in the cache for many hours.
- Implement a transport-specific method for tracking reply
acknowledgements. The new implementation does not cross multiple stack
layers to get the data and does not have the race conditions that
previously left some requests stuck in the cache. This could be done more
efficiently at the sockbuf layer, but that would break some KBIs, and I
don't know of any consumer for it besides NFS.
- Instead of traversing the whole DRC twice per request, run cleaning only
once per request, and except under some conditions traverse only a single
hash slot at a time.
Together this limits NFS DRC growth to situations with real connectivity
problems. If the network is working well, so that all replies are
acknowledged, the cache remains almost empty even after hours of heavy
load. Without this change, on the same test the cache grew to many
thousands of requests even with a perfectly working local network.
As another result, this reduces CPU time spent on DRC handling during the
SPEC NFS benchmark from about 10% to 0.5%.
Sponsored by: iXsystems, Inc.
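A minimal sketch of the sockref-hash idea, with hypothetical names and
structures (the real NFS DRC code differs): cached replies are chained by
a hash of their socket reference, so a TCP acknowledgement touches only
one chain instead of the whole cache.

    #include <sys/queue.h>
    #include <stdint.h>

    #define DRC_HASHSIZE 256

    struct drc_entry {
            LIST_ENTRY(drc_entry) de_sockchain; /* chain keyed by sockref */
            void        *de_sockref;            /* owning connection */
            uint64_t     de_ackpos;             /* stream position of reply end */
    };

    static LIST_HEAD(, drc_entry) drc_sockhash[DRC_HASHSIZE];

    static unsigned
    drc_hash_sockref(const void *sockref)
    {
            /* Pointer-derived hash; the low bits carry little entropy. */
            return (((uintptr_t)sockref >> 6) % DRC_HASHSIZE);
    }

    /* On a TCP ACK for one socket, scan only its chain. */
    static void
    drc_ack(void *sockref, uint64_t acked)
    {
            struct drc_entry *de, *tde;
            unsigned slot = drc_hash_sockref(sockref);

            LIST_FOREACH_SAFE(de, &drc_sockhash[slot], de_sockchain, tde)
                    if (de->de_sockref == sockref && de->de_ackpos <= acked)
                            LIST_REMOVE(de, de_sockchain);  /* reply acked */
    }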
Move most of the NFS file handle affinity code out from under the heavily
congested global RPC thread pool lock and protect it with its own set of
locks.
On synthetic benchmarks this improves peak NFS request rate by 40%.
Introduce xprt_inactive_self() -- a variant for use when we are sure that
the port is assigned to a thread, for example within receive handlers. In
that case the function reduces to a single assignment and can avoid
locking.
Slightly simplify expiration logic introduced in r254337.
- Do not update the histogram for items we are deleting from the cache
anyway.
- Do not update the histogram if nfsrc_tcphighwater is not set.
- Remove some extra math operations.
Fix RPC server threads file handle affinity to work better with ZFS.
Instead of taking 8 specific bytes of the file handle to identify a file
during RPC thread affinity handling, use a trivial hash of the full file
handle (sketched after this entry).
ZFS's struct zfid_short has no padding field after the length field; as a
result, the originally picked 8 bytes lose the lower 16 bits of the object
ID, causing many false matches and needlessly binding unrelated requests
to the same thread.
This fix substantially improves NFS server latency and scalability in the
SPEC NFS benchmark through more flexible use of multiple NFS threads.
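A sketch of the hashing fix under stated assumptions (names are
hypothetical; the real affinity code differs): fold every byte of the file
handle into the hash, so handle layouts such as ZFS's zfid_short cannot
drop meaningful object ID bits.

    #include <stddef.h>
    #include <stdint.h>

    static uint32_t
    fha_hash_fh(const uint8_t *fh, size_t len)
    {
            uint32_t h = 0;

            while (len-- > 0)
                    h = h * 31 + *fh++;     /* trivial byte-wise fold */
            return (h);
    }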
Remove several linear list traversals per request from RPC server code.
Do not insert active ports into the pool->sp_active list if they are
successfully assigned to some thread. This makes the list include only
ports that really require attention, so its traversal can be reduced to
simply taking the first one.
Remove an idle thread from the pool->sp_idlethreads list when assigning
some work (a port with requests) to it. That again makes it possible to
replace list traversals with simply taking the first element.
Rework flow control for connection-oriented (TCP) RPC server.
When processing the receive buffer, write the amount of data expected in
the present request record into the socket's so_rcv.sb_lowat to make the
stack aware of our needs (see the sketch after this entry). When
processing the following upcalls, ignore them until the socket has
collected enough data to be read and processed in one turn.
This change reduces the number of context switches and other operations in
the RPC stack during large NFS writes (especially via non-Jumbo networks)
by an order of magnitude.
After processing the current packet, take another look into the pending
buffer to find out whether the next packet has already been received. If
not, deactivate this port right there, without making the RPC code push
the port to another thread just to find that there is nothing to do. If
the next packet has been received only partially, also deactivate the
port, but additionally update the socket's so_rcv.sb_lowat so we are not
woken up prematurely.
This change additionally reduces the number of context switches per NFS
request by about half.
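A kernel-style sketch of the low-watermark trick (simplified, with a
hypothetical helper name; not the literal svc_vc code): once the RPC
record marker says how many bytes the current record still needs, publish
that in so_rcv.sb_lowat so the socket layer stops waking us until a whole
record is available.

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>
    #include <sys/socket.h>
    #include <sys/socketvar.h>

    static void
    svc_vc_note_record_resid(struct socket *so, size_t record_resid)
    {
            SOCKBUF_LOCK(&so->so_rcv);
            /* Wake us only when the whole pending record has arrived. */
            so->so_rcv.sb_lowat = (record_resid > 0) ?
                (int)record_resid : 1;
            SOCKBUF_UNLOCK(&so->so_rcv);
    }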
Some minor tuning to rpc/svc.c:
- close a cosmetic race in svc_exit();
- do not set a wait timeout for idle threads if we have no use for wakeups;
- create a newly requested thread sooner, not only after some other
thread's wakeup, which may happen later under constant load.
Fine tune filesystem block allocations under low free-space
conditions (r254995) based on further operational experience.
Submitted by: Dmitry Sivachenko
Tested by: Dmitry Sivachenko
r256543:
Add fasttrap for PowerPC. This is the last piece of the DTrace/ppc puzzle.
It's incomplete: it doesn't contain full instruction emulation, but it
should be sufficient for most cases.
r259245,r259421: (FBT)
FBT now works fully on PowerPC.
Save r3 before using it for the trap check, else we end up saving the new
r3, containing the trap instruction encoding (0x7c810808), and restoring
it with the frame on return. This caused a panic on my ppc32 machine.
r259668,r259674:
Fix a typo in the FBT code.
r259394:
Rebase the PMC indices at 1, since PMC_SOFT is at 0.
r259395,r259699:
Add userland PMC backtracing, and use the PMC trapframe macros for kernel
backtraces.
ext2fs: fix inode flag conversion.
After r252890 we were naively attempting to pass through the
inode flags. This is technically incorrect, as the ext2
inode flags don't match the UFS/system values used in
FreeBSD, and a clean conversion is needed.
Some filtering was left in place so the change didn't cause
significant changes in FreeBSD, but some of the garbage passed
through is likely the cause of warning messages in Linux.
Fix the issue by resetting the flags before conversion, as was
done previously. This also means we will not pass the EXT4_*
inode flags into FreeBSD's inode.
PR: kern/185448
Take an additional reference on the SCSI probe periph to cover its freeze
count. Otherwise the periph may be invalidated and freed before the
single-stepping freeze is dropped, causing a use-after-free panic.
MFV r258373:
4168 ztest assertion failure in dbuf_undirty
4169 verbatim import causes zdb to segfault
4170 zhack leaves pool in ACTIVE state
illumos/illumos-gate@7fdd916c47
Fix a braino with r259730: we cannot currently use CFLAGS.gcc or
CFLAGS.clang in sys/conf/Makefile.arm, since the main kernel build does
not use <bsd.sys.mk>. So revert that particular change for now.
Pointy hat to: me
Noticed by: zbb
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
When a da or ada device disappears, outstanding I/Os fail with
ENXIO, not EIO. The check for EIO was probably copied from Illumos,
where that is indeed the correct errno.
Without this change, pulling a busy drive from a zpool would usually
turn it into UNAVAIL, even though pulling an idle drive would turn
it into REMOVED. With this change, it is REMOVED every time.
Also, vdev_geom_io_intr shouldn't do zfs_post_remove, because that
results in devd getting two resource.fs.zfs.removed events. The
comment said that the event had to be sent directly instead of
through the async removal thread because "the DE engine is using
this information to discard previous I/O errors". However, the facts
that vdev_geom_io_intr was never actually sending the events until
now, that vdev_geom_orphan never sent them at all, and that
vdev_geom_orphan usually gets called about 2 seconds after the
actual removal, mean that FreeBSD's userland can cope with a late
event just fine. A sketch of the resulting check follows.
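The gist of the errno handling, sketched (not the literal diff; the field
and function names are from the ZFS vdev code as I understand it): an
ENXIO completion marks the vdev for removal through the usual async path,
so devd sees a single resource.fs.zfs.removed event.

    /*
     * In the GEOM I/O completion path: on FreeBSD a pulled disk
     * fails outstanding BIOs with ENXIO rather than EIO.
     */
    if (bp->bio_error == ENXIO && !vd->vdev_remove_wanted) {
            vd->vdev_remove_wanted = B_TRUE;
            spa_async_request(vd->vdev_spa, SPA_ASYNC_REMOVE);
    }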
Use an RLOCK here instead of a WLOCK, matching all the other calls
to lla_lookup().
This drastically reduces the very high lock contention when doing parallel
TCP throughput tests (> 1024 sockets) with IPv6.
MFC r260187:
lla_lookup() performs a modification only when LLE_CREATE is specified.
Thus we can use IF_AFDATA_RLOCK() instead of IF_AFDATA_LOCK() when calling
lla_lookup() without the LLE_CREATE flag (see the sketch after these
entries).
MFC r260217:
Add IF_AFDATA_WLOCK_ASSERT() for the case where lla_lookup() is called
with the LLE_CREATE flag.
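A sketch of the resulting locking pattern (IPv6 neighbour lookup shown;
simplified, not the literal code): read-only lookups take the AFDATA read
lock, while only LLE_CREATE callers need the write lock.

    struct llentry *lle;

    IF_AFDATA_RLOCK(ifp);
    lle = lla_lookup(LLTABLE6(ifp), 0, (struct sockaddr *)&sin6);
    IF_AFDATA_RUNLOCK(ifp);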
Prevent users from deactivating the last component of a mirror.
MFC r259929:
Add an ability to stop gmirror and clear its metadata in one command.
This fixes the problem of gmirror starting again right after being
stopped. The problem occurs when a gmirror component has a geom label of
equal size, e.g. gpt and gptid labels have the same size as the partition,
and diskid has the same size as the entire disk. When gmirror's geom is
destroyed, glabel creates its providers and this initiates a retaste.
Now a "gmirror destroy" command is available. It destroys the geom and
also erases gmirror's metadata.
PR: 184985
Add "resize" verb to gmirror(8) and such functionality to geom_mirror(4).
Now it is easy to expand the size of the mirror when all its components
have been replaced. Also add a g_resize method to the geom_mirror class;
it will write updated metadata to the new last sector when the parent
provider is resized.
Split the last gcc-specific flags off into CFLAGS.gcc. This also
removes the need to use -Qunused-arguments for clang throughout the
tree.
MFC r260369:
Apply band-aid for 32-bit compat libs failures after r260334: put back
-Qunused-arguments for clang for now, until I can figure out a way to
make it unneeded in all scenarios. Sorry about the breakage.
Similar to r260020, only use -fms-extensions with gcc, for all other
modules which require this flag to compile. Use a GCC_MS_EXTENSIONS
variable, defined in kern.pre.mk, which can be used to easily supply the
flag (or not), depending on the compiler type.
MFC r260322:
In addition to r260102, also define GCC_MS_EXTENSIONS in bsd.sys.mk,
since kernel module builds do not use kern.pre.mk.
Add an OFW SPI compatible bus. Fix the spibus probe to return
BUS_PROBE_GENERIC and not BUS_PROBE_SPECIFIC (0) so the OFW SPI bus can
attach when enabled. Export the spibus devclass_t and driver_t
declarations.
Submitted by: ray
Approved by: adrian (mentor)
Implement automatic live resize support for GEOM MULTIPATH class.
In "manual" mode just automatically resize provider in any direction.
In "automatic" mode allow growth (with new metadata write); in case of
shrinking check if there is already valid metadata found at the new
location. This should allow easy transparent recovery if first resize
was done by mistake.
While there, unify metadata write code and fix minor memory leak.
Do not DELAY() for a P-state transition unless we want to see the result.
The Intel manual says: "If a transition is already in progress, transition
to a new value will subsequently take effect. Reads of IA32_PERF_CTL
determine the last targeted operating point." So it seems fine to just
trigger the wanted transition and move on. Linux does the same.
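The shape of the fast path, sketched (simplified; the variable names are
hypothetical, the MSR constants come from x86 specialreg.h): write the
target P-state and return, polling the status MSR only when the caller
asked to verify the transition.

    wrmsr(MSR_PERF_CTL, val);       /* request the new operating point */
    if (!verify)
            return (0);             /* transition completes on its own */
    /* otherwise DELAY() and compare rdmsr(MSR_PERF_STATUS) as before */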
Update the description of pmap_remove_pages() to match modern
times. Assert that the pmap passed to pmap_remove_pages() is only
active on the current CPU.
Add a new sysctl / loader tunable kern.panic_reboot_wait_time which
defaults to PANIC_REBOOT_WAIT_TIME (a long-existing kernel config
setting). Use this now-variable value in place of the defined constant
to control how long the system waits after a panic before rebooting.
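A minimal sketch of the declaration (the exact form in the tree may
differ): the old compile-time constant becomes the default of a variable
exposed both as a loader tunable and as a read/write sysctl.

    static int panic_reboot_wait_time = PANIC_REBOOT_WAIT_TIME;
    TUNABLE_INT("kern.panic_reboot_wait_time", &panic_reboot_wait_time);
    SYSCTL_INT(_kern, OID_AUTO, panic_reboot_wait_time, CTLFLAG_RW,
        &panic_reboot_wait_time, 0,
        "Seconds to wait before rebooting after a panic");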
Bring back the old size of the kinfo_file structure to preserve ABI.
Keep only one uint64_t spare for further cap_rights_t expansion.
Add a comment clarifying that if the size of this structure changes,
a new sysctl MIB has to be allocated for it and the old structure has
to be returned by the old sysctl MIB.
Requested by: re
Don't check for fd limits in fdgrowtable_exp.
Callers do that already, and the additional check races with the process
decreasing its limits, which can result in not growing the table at all;
that case is currently not handled.
Locking support for CAM:
r256826:
Fix several target mode SIMs to not blindly clear the ccb_h.flags field of
ATIO CCBs. Not all CCB flags there belong to them.
r256836:
Remove the hard limit on the number of BIOs handled by one ATA TRIM request.
r256843:
Merge CAM locking changes from the projects/camlock branch to radically
reduce lock congestion and improve SMP scalability of the SCSI/ATA stack,
preparing the ground for the upcoming GEOM direct dispatch support.
r256888:
Unconditionally acquire periph reference on CCB allocation failure.
r256895:
Fix memory and reference leaks due to an unfreed path.
r256960:
Move the CAM_UNQUEUED_INDEX setting to the last moment and under the
periph lock. This fixes a race condition with cam_periph_ccbwait() that
caused a use-after-free.
r256975:
Minor (mostly cosmetic) addition to r256960.
r257054:
Some micro-optimizations for the da and ada drivers:
- Replace the ordered_tag_count counter with a single flag;
- Remove from da the outstanding_cmds counter, which duplicated the
pending_ccbs list;
- Remove from da_softc the unused links field.
r257482:
Fix lock recursion triggered by `smartctl -a /dev/adaX`.
r257501:
Make the getenv_*() functions, and respectively the TUNABLE_*_FETCH()
macros, not allocate memory and so not require a sleepable environment.
getenv() already used on-stack temporary storage, so just use it more
rationally. getenv_string() receives a buffer as an argument, so it
doesn't need another one.
r257914:
Some CAM locking polishing:
- Fix a LOR and possible lock recursion when handling high-power commands.
Introduce a new lock to protect the remaining power quota and the list of
frozen devices.
- Correct locking around xpt periph creation.
- Remove the seemingly never used XPT_FLAG_OPEN xpt periph flag.
Again, Netflix assisted with testing the merge, but all of the credit goes
to Alexander and iX Systems.
Submitted by: mav
Sponsored by: iX Systems
r256603:
Introduce a new function, devstat_end_transaction_bio_bt(), adding a new
argument to specify the present time. Use this function to move
binuptime() out of the lock, substantially reducing lock congestion when a
slow timecounter is used.
r256606:
Move g_io_deliver() out of the lock, as required for direct dispatch.
Move g_destroy_bio() out too to reduce lock scope even more.
r256607:
Fix passing uninitialized bio_resid argument to g_trace().
r256610:
Add unmapped I/O support to GEOM RAID.
r256830:
Restore BIO_UNMAPPED and BIO_TRANSIENT_MAPPING in biodone() when unmapping
a temporarily mapped buffer. That fixes a double unmap if biodone() is
called twice for the same BIO (but with different done methods).
r256880:
Merge GEOM direct dispatch changes from the projects/camlock branch.
When safety requirements are met, this allows I/O requests to bypass the
GEOM g_up/g_down threads and be executed directly in the caller's context.
That avoids CPU bottlenecks in the g_up/g_down threads, plus several
context switches per I/O.
r259247:
Fix a bug introduced in r256607: we have to recalculate bp_resid here,
since the sizes of the original and completed requests may differ due to
end of media.
Testing of the stable/10 merge was done by Netflix, but all of the credit
goes to Alexander and iX Systems.
Submitted by: mav
Sponsored by: iX Systems
- Take the BIO lock in biodone() only when there is no completion callback
set, and so we should wake up a thread waiting in biowait().
- Remove the msleep() timeout from biowait(). It was added 11 years ago,
when no locks were used, and it should not be needed any more.
Handle the case when ACPI reports an HPET device but does not provide a
memory resource for it; in that case, take the address range from the HPET
table. This fixes hpet(4) driver attach on the ASRock C2750D4I board.
Use relaxed (write-only) memory barriers when writing some of the queue
index registers (for now, on ISP2400+). We never read those registers
back, and AFAIK their semantics do not require any immediate reaction on
write.
Some more register access optimizations:
- Process the ATIO queue only if the interrupt status says so;
- Do not update queue out pointers after each processed command; do it
only once at the end of the loop.
Save one more register read per command by not reading the rqstoutrp
register every time. The purpose of that register is detection of the
unlikely output queue overflow, so read it only when its last known (and
probably stale by now) value signals an overflow.
Optimize isp(4) to reduce CPU usage, especially in target mode:
- Remove two excessive and slow register reads from isp_intr(). Instead of
rereading the value every time, assume that the registers contain what we
last wrote there.
- Avoid a sequential search through 4096 array elements when looking for a
command tag. Use a hash of lists to store active tags separately from free
ones, greatly speeding up the searches.
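A sketch of the tag-lookup idea, with hypothetical names (the isp(4)
internals differ): active commands live in a small hash of lists keyed by
tag, so finding one is a short chain walk instead of a scan over 4096
slots.

    #include <sys/queue.h>
    #include <stdint.h>

    #define TAG_HASHSIZE 64

    struct isp_cmd {
            LIST_ENTRY(isp_cmd) c_hash;
            uint32_t c_tag;
    };

    static LIST_HEAD(, isp_cmd) tag_hash[TAG_HASHSIZE];

    static struct isp_cmd *
    tag_find(uint32_t tag)
    {
            struct isp_cmd *c;

            LIST_FOREACH(c, &tag_hash[tag % TAG_HASHSIZE], c_hash)
                    if (c->c_tag == tag)
                            return (c);
            return (NULL);          /* tag not active */
    }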
Don't even try to read vdev labels from devices smaller than
SPA_MINDEVSIZE (64MB); even if we somehow found one, the ZFS kernel code
rejects such devices. It is funny to watch attempts to read four 256K vdev
labels from a 1.44MB floppy, though it is not very practical and quite
slow.
Reenable vfs.zfs.zio.use_uma for amd64, disabled at r209261.
On machines with several CPUs and enough RAM this can easily double ZFS
performance or cut CPU usage in half. It was disabled three years ago due
to memory and KVA exhaustion reports, but our VM subsystem has improved a
lot since that time, hopefully enough to make another try.
Introduce an allocation cache to store LZ4 compression contexts without
kicking the VM subsystem twice for every written record (sketched after
this entry).
Tests on a 24-core system show a twofold reduction in CPU time spent
copying a single large, well-compressed file.
This patch is not really needed on illumos (while it would not hurt
either), since their memory allocator by default uses caching for all
requests up to 128K.
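A minimal sketch of such an allocation cache, with hypothetical names and
no locking (the kernel version protects the list with a mutex): freed
contexts go onto a list and are handed back out before the allocator is
asked for fresh memory.

    #include <stdlib.h>

    struct lz4_ctx {
            struct lz4_ctx *next;   /* ~16K of hash-table state follows */
    };

    static struct lz4_ctx *ctx_cache;

    static struct lz4_ctx *
    ctx_get(size_t sz)
    {
            struct lz4_ctx *c = ctx_cache;

            if (c != NULL)
                    ctx_cache = c->next;    /* reuse a cached context */
            else
                    c = malloc(sz);         /* cold path: real allocation */
            return (c);
    }

    static void
    ctx_put(struct lz4_ctx *c)
    {
            c->next = ctx_cache;            /* keep for the next record */
            ctx_cache = c;
    }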
Make UMA not blindly force offpage slab header allocation for large
(> PAGE_SIZE) zones. If the zone size is not a multiple of PAGE_SIZE,
there may be enough space for the header in the last page, so we may avoid
the extra header memory allocation and hash table update/lookup.
ZFS creates a bunch of odd-sized UMA zones (5120, 6144, 7168, 10240,
14336). This change puts at least some of the otherwise lost memory there
to good use.
Don't count bucket allocation failures for UMA zones as their own failures.
There are good reasons for this to happen, such as recursion prevention,
etc., and they are not fatal since buckets are just an optimization
mechanism. Real bucket allocation failures are counted by the bucket zones
themselves anyway, and we don't need double accounting there.
Implement a mechanism to safely but slowly purge UMA per-CPU caches.
This is a last resort for very low memory conditions, in case other
measures to free memory were ineffective. Sequentially cycle through all
CPUs and extract per-CPU cache buckets into the zone cache, from where
they can be freed.
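The shape of that purge, sketched in kernel style (the drain helper is
hypothetical; the binding dance is the standard sched_bind() pattern):
visit every CPU in turn so its private buckets can be moved into the
shared zone cache without racing their owner.

    int cpu;

    CPU_FOREACH(cpu) {
            thread_lock(curthread);
            sched_bind(curthread, cpu);     /* run on the target CPU */
            thread_unlock(curthread);
            cache_drain_cpu(zone);          /* hypothetical: move this
                                               CPU's buckets into the
                                               zone cache */
    }
    thread_lock(curthread);
    sched_unbind(curthread);
    thread_unlock(curthread);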
Grow the UMA zone bucket size also on lock congestion during item free.
Lock congestion is the same whether it happens on alloc or free, so handle
it equally. Now that we have back pressure, there is no problem with
growing buckets a bit faster; in any case, growth is much slower than in
9.x.
Add two new UMA bucket zones to store 3 and 9 items per bucket.
These new buckets make bucket size self-tuning softer and more precise.
Without them there are buckets for 1, 5, 13, 29, ... items. While at
bigger sizes a difference of about 2x is fine, at the smallest ones it is
5x and 2.6x respectively. The new buckets make that series look like 1, 3,
5, 9, 13, 29, reducing the jumps between steps, making the algorithm work
more gently, and allocating and freeing memory in better-fitting chunks.
Otherwise there is quite a big gap between allocating 128K and 5x128K of
RAM at once.
Implement soft pressure on UMA cache bucket sizes.
Every time the system detects a low memory condition, decrease the bucket
size of each zone by one item. As a result, higher memory pressure pushes
toward smaller bucket sizes, and so smaller per-CPU caches and more
efficient memory use.
Before this change there was no force opposing bucket growth from the
practically inevitable zone lock conflicts, and after some run time the
per-CPU caches could consume enough RAM to kill the system.
Create a separate free list for each of the first 32 possible allocation
sizes. With a 4K allocation quantum, that covers allocations up to 128K.
As memory fragmentation grows, these lists may become quite large (tens
and hundreds of thousands of items). Keeping items of different sizes on
one list may, in the worst case, require a full linear traversal, which
can be very expensive. Keeping items of a single size on each list means
that, unless the user specifies some alignment or boundary requirements
(which are very rare cases), the first item found on the list should
satisfy the request (see the sketch after this entry).
While running the SPEC NFS benchmark on top of ZFS on a 24-core machine
with 84GB RAM, this change reduces CPU time spent in vmem_xalloc() from
8%, and lock congestion spinning around it from 20%, to invisible levels.
All that comes at the cost of just 26 more pointers per vmem instance.
If at some point our kernel starts to actively use KVA allocations with
odd sizes above 128K, something may need to be done for the bigger lists
too.
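A sketch of the list selection, with hypothetical names (the real vmem
code differs): sizes up to 32 quanta each get an exact-size list, and
anything larger falls back to logarithmically sized lists. Here size is
assumed to be a positive multiple of the quantum.

    #include <stddef.h>

    #define QUANTUM         4096
    #define EXACT_LISTS     32

    static unsigned
    freelist_index(size_t size)
    {
            size_t q = size / QUANTUM;      /* allocation size in quanta */
            unsigned log2q = 0;

            if (q <= EXACT_LISTS)
                    return ((unsigned)q - 1);       /* exact-size list */
            while ((q >>= 1) != 0)          /* log2 bucket for big sizes */
                    log2q++;
            return (EXACT_LISTS + log2q);
    }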
For sys/boot/i386 and sys/boot/pc98, separate flags to be passed
directly to the linker (LD_FLAGS) from flags passed indirectly, via the
compiler driver (LDFLAGS).
This is because several Makefiles under sys/boot/i386 and sys/boot/pc98
use ${LD} directly to link, and the normal LDFLAGS value should not be
used in these cases.
In sys/dev/scc, remove unused static function scc_setmreg(). While
here, invoke scc_getmreg() in two more places where it can be used.
Reviewed by: marcel
Fix a bug introduced in r252226, where the udata argument passed to
bucket_alloc() was used without first making sure it was really intended
for us. On some of my systems this bug made the user argument passed by
ZFS code to uma_zalloc_arg() unexpectedly block UMA per-CPU caches for
those zones.
In sys/dev/mcd/mcd.c, mark the static const COPYRIGHT string as __used,
so it ends up in the object file, and no warnings are emitted about it
being actually unused.
For sys/dev/drm2/radeon, only use -fms-extensions with gcc. This flag
is only to stop gcc complaining about anonymous unions, which clang does
not do. For clang 3.4 however, -fms-extensions enables the Microsoft
__wchar_t type, which clashes with our own types.h.
MFC r260102:
Similar to r260020, only use -fms-extensions with gcc, for all other
modules which require this flag to compile. Use a GCC_MS_EXTENSIONS
variable, defined in kern.pre.mk, which can be used to easily supply the
flag (or not), depending on the compiler type.
Changes:
- Reinit uio_resid and flags before every call to soreceive().
- Set the maximum acceptable packet size to IP_MAXPACKET. For now the
module doesn't support INET6.
- Properly handle an MSG_TRUNC return from soreceive().
PR: 184601
Multi-queue NIC drivers and multi-port lagg tend to use the same lower
bits of the flowid as each other, resulting in a poor distribution of
packets among queues in certain cases. Work around this by adding a
set of sysctls for controlling a bit-shift on the flowid when doing
multi-port aggregation in lagg and lacp. By default, lagg/lacp will
now use bits 16 and higher instead of 0 and higher.
Obtained from: Netflix
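The workaround in miniature, with hypothetical names (the driver keeps the
shift in a sysctl-backed field): shift the flowid before reducing it over
the number of ports, so lagg/lacp consume different flowid bits than the
NIC's own queue selection did.

    #include <stdint.h>

    static unsigned
    lagg_pick_port(uint32_t flowid, unsigned shift, unsigned nports)
    {
            /*
             * With the default shift of 16, ports are chosen by the
             * upper flowid bits that multi-queue NICs rarely use.
             */
            return ((flowid >> shift) % nports);
    }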