illumos/illumos-gate@bd9d3f9046
https://www.illumos.org/issues/8661
The "zil-cw1" dtrace probe was previously removed in 8558, and the "zil-cw2"
probe should have been removed in that patch as well. Unfortunately, the
"zil-cw2" probe was not removed in 8558, so this bug is to track its removal.
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Prakash Surya <prakash.surya@delphix.com>
MFC after: 1 week
Bump the FreeBSD version to force recompilation of external
kernel modules due to structure change.
PR: 222504
Submitted by: Greg V <greg@unrelenting.technology>
MFC after: 1 week
Sponsored by: Mellanox Technologies
The device index, partition index and reference counter are all positive
numbers. However, since our internal partition number may be negative
to indicate a GPT table, the compare expression needs to take care when
comparing pdinfo_t and partition data.
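A minimal sketch of the pitfall, assuming the partition number is a signed int and the value it is compared against is unsigned. The function names are hypothetical and are not taken from the loader source.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Naive comparison. The cast spells out what C's usual arithmetic
 * conversions do implicitly when an int meets an unsigned int:
 * a negative partition number becomes a very large unsigned value.
 */
bool
sketch_naive_less(int part, unsigned int index)
{
    return ((unsigned int)part < index);
}

/* Careful comparison: handle the negative marker value explicitly. */
bool
sketch_safe_less(int part, unsigned int index)
{
    if (part < 0)
        return (true);  /* a negative marker sorts before any real index */
    return ((unsigned int)part < index);
}
```

With these definitions, `sketch_naive_less(-1, 1u)` is false even though -1 is numerically smaller than 1, while `sketch_safe_less(-1, 1u)` is true.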
Prevent cam_iosched_iops_tick() from discarding 'unspent' ios unless
it's a new accounting interval.
Previously ios that weren't used between ticks were lost; as a result
the iops limiter could enforce a limit below the configured maximum.
Obtained from: ElectroBSD
Submitted by: Fabian Keil
PR: 221974
Previously the iops limiter would always allow at least
quanta ios per second as cam_iosched_iops_tick() never set
ios->l_value1 below 1.
Submitted by: Fabian Keil <fk@fabiankeil.de>
Obtained from: ElectroBSD
PR: 221974
In particular, support chaining an AES cipher with an HMAC for a request
including AAD. This permits submitting requests from userland to encrypt
objects like IPSec packets using these algorithms.
In the non-GCM case, the authentication crypto descriptor covers both the
AAD and the ciphertext. The GCM case remains unchanged. This matches
the requests created internally in IPSec. For the non-GCM case, the
COP_F_CIPHER_FIRST is also supported since the ordering matters.
Note that while this can be used to simulate IPSec requests from userland,
this ioctl cannot currently be used to perform TLS requests using AES-CBC
and MAC-before-encrypt.
Reviewed by: cem
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D11759
This requests that the cipher be performed before rather than after
the HMAC when both are specified for a single operation.
Reviewed by: cem
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D11757
Software crypto implementations don't care how the buffer is laid out,
but hardware implementations may assume that the AAD is always before
the plain/cipher text and that the hash/tag is immediately after the end
of the plain/cipher text.
In particular, this arrangement matches the layout of both IPSec packets
and TLS frames. Linux's crypto framework also assumes this layout for
AEAD requests.
Reviewed by: cem
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D11758
It doesn't appear to be safe to use gtask->gt_name.
Reported by: Mark Johnston, Jenkins
Reviewed by: sbruno
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12448
Move handling of these three pathconf() variables out of vop_stdpathconf()
and into devfs_pathconf() as TTY devices can only be devfs files. In
addition, only return settings for these three variables for devfs devices
whose device switch has the D_TTY flag set.
Discussed with: bde, kib
Sponsored by: Chelsio Communications
Check the return code of intr_setaffinity() and log any errors
it returns. When a qid is not located, log an error before returning
failure. Also, use __func__ rather than hardcoding the function name.
Reviewed by: sbruno
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12436
Previously had the same short and long description as taskqueues.
This could cause problems with memguard(9) and vmstat -m which use
the short description as a unique identifier.
Reviewed by: sbruno
Approved by: sbruno (mentor)
MFC after: 3 days
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12438
- Use HWRM_FUNC_VF_CFG instead of HWRM_FUNC_CFG on VFs
- Fix NPAR/VF detection
- Clean up flag definitions
- Don't allow WoL on VFs
Although the bnxt driver doesn't support SR-IOV, and so can't create VFs yet,
the PF could be running Linux or ESXi with a VF passed through to a
FreeBSD guest. This fixes the driver for that use case.
Submitted by: Siva Kallam <siva.kallam@broadcom.com>
Reviewed by: shurd, sbruno
Approved by: sbruno (mentor)
Sponsored by: Broadcom Limited
Differential Revision: https://reviews.freebsd.org/D12410
accepts PQ_NONE as the specified queue and returns a Boolean indicating
whether the page's wire count transitioned to zero. Use these features
in dev/drm2.
Reviewed by: kib, markj
MFC after: 1 week
This ensures that the loader will not load the module if it's also built in to
the kernel.
PR: 220860
Submitted by: Eugene Grosbein <eugen@freebsd.org>
Reported by: Marie Helene Kvello-Aune <marieheleneka@gmail.com>
initialize nvp on every loop iteration and the code under 'fail'(!) label
detects success by checking nvp != NULL.
Submitted by: pjd@
MFC after: 1 month
Sponsored by: Wheel Systems
to 'fail' on error it was treated as success, because nvp!=NULL. Fix this
by not handling success under 'fail' label and by using separate variable
for parent nvpair.
If we succeeded in allocating the nvlist but failed to allocate the nvpair,
we would leak nvls[ii] on return. Destroy it when we cannot allocate the
nvpair, before we goto fail.
Submitted by: pjd@ and oshogbo@ (minor changes)
Found by: scan-build
MFC after: 1 month
Sponsored by: Wheel Systems
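A simplified sketch of the cleanup pattern described above: the success path returns before the 'fail' label, and everything allocated so far is released on failure. It uses plain malloc() rather than the nvlist API, and the function names and the 'fail_at' failure injection are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Allocate 'n' items. 'fail_at' simulates an allocation failure at that
 * index (use -1 for no simulated failure). Returns the array or NULL.
 */
int **
sketch_alloc_all(int n, int fail_at)
{
    int **items;
    int i;

    assert(n > 0);
    items = calloc((size_t)n, sizeof(*items));
    if (items == NULL)
        return (NULL);
    for (i = 0; i < n; i++) {
        items[i] = (i == fail_at) ? NULL : malloc(sizeof(**items));
        if (items[i] == NULL)
            goto fail;
    }
    return (items);     /* success never falls through to 'fail' */

fail:
    /* Release every item allocated before the failing index. */
    while (i-- > 0)
        free(items[i]);
    free(items);
    return (NULL);
}

/* Release an array returned by sketch_alloc_all(). */
void
sketch_free_all(int **items, int n)
{
    for (int i = 0; i < n; i++)
        free(items[i]);
    free(items);
}
```

Because success is signalled only by the early return, the 'fail' label never has to guess the outcome from the state of a pointer.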
initially NULL, which is not possible. Change the loop to
'do {} while (array != NULL)' to satisfy scan-build and assert that
array really cannot be NULL just in case.
Submitted by: pjd@
Found by: scan-build
MFC after: 1 month
Sponsored by: Wheel Systems
Make scan-build happy by casting to 'void *' instead of 'void **'.
Submitted by: pjd@
MFC after: 1 month
Found by: scan-build and cppcheck
Sponsored by: Wheel Systems
Acquiring IPFW_WLOCK is required when we are going to
change data that can be accessed during processing of the packet flow.
When we create a new named object, there are not yet any rules that
reference it, thus holding IPFW_UH_WLOCK is enough to safely update
the needed structures. When we destroy an object, we do this only when its
reference counter becomes zero. It is then safe to not acquire IPFW_WLOCK,
because no one references it. The other case is when we failed to finish
some action and thus are doing a rollback and destroying an object; in
this case it is still not referenced by rules and there is no need to
acquire IPFW_WLOCK.
This also fixes a panic with INVARIANTS due to recursive IPFW_WLOCK acquisition.
MFC after: 1 week
Sponsored by: Yandex LLC
1/4 of the number of queues times queue entries is too limiting. It
works up to about 4k IOPS / 3.0GB/s for hardware that can do
4.4k/3.2GB/s with nvd. 3/4 works better, though it highlights issues
in the fairness of nda's choice of TRIM vs READ. That will be fixed
separately.
Previously ios->current was set to 0 until the first
cam_iosched_cl_maybe_steer() call.
PR: 221954
Obtained from: ElectroBSD
Submitted by: Fabian Keil
Differential Revision: https://reviews.freebsd.org/D12349
Previously callout_reset() was called with a "ticks" value that was
off by one. As a result cam_iosched_ticker() was called a bit too
frequently: On systems with hz=1000 a quanta value of 200 resulted in
~250 calls and a value of 100 in ~111 calls.
For the "queue_depth" and "bandwidth" limiters the difference doesn't
matter but the "iops" limiter depends on the scheduling to enforce the
correct maximum.
PR: 221956
Obtained from: ElectroBSD
Submitted by: Fabian Keil
Differential Revision: https://reviews.freebsd.org/D12350
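The figures quoted for the callout_reset() off-by-one can be reproduced with a small worked example. This is a sketch that assumes the interval was exactly one tick shorter than hz / quanta; the helper names are hypothetical.

```c
#include <assert.h>

/* Intended number of ticks between callouts for a given quanta. */
int
sketch_ticks_per_callout(int hz, int quanta)
{
    assert(hz > 0 && quanta > 0 && quanta <= hz);
    return (hz / quanta);
}

/* Callouts per second that result from a given tick interval. */
int
sketch_callouts_per_second(int hz, int ticks)
{
    assert(hz > 0 && ticks > 0);
    return (hz / ticks);
}
```

With hz=1000 and quanta=200 the intended interval is 5 ticks; an interval of 4 ticks gives 1000 / 4 = 250 calls per second. With quanta=100 the intended interval is 10 ticks; 9 ticks gives 1000 / 9, roughly 111 calls per second.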
Invalid values can result in division-by-zero panics or other
undefined behaviour, so let's not allow them.
PR: 221957
Obtained from: ElectroBSD
Submitted by: Fabian Keil
Differential Revision: https://reviews.freebsd.org/D12351
Use the write queue for BIO_ZONE commands so they can't get executed
ahead of writes that were sent after them. More generally, since they
introduce strong ordering into the list, they need to go to the write
queue (which is the only queue that BIO_ORDERED is honored for at the
moment). In fact, fix mismatch between queueing and dequeueing code by
changing this to queue all non-reads (and non-trims) to the write
queue.
As a side effect this prevents the kernel message:
kernel: Found bio_cmd = 0x9
which cam_iosched_next_bio() emits when finding commands
other than BIO_READ in the read queue.
PR: 221973
Obtained from: ElectroBSD
Submitted by: Fabian Keil
Differential Revision: https://reviews.freebsd.org/D12353
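The queueing rule for BIO_ZONE and other non-read, non-trim commands can be sketched as below. The enum names are hypothetical stand-ins and not the kernel's BIO_* constants or the real cam_iosched queue structures.

```c
#include <assert.h>

/* Hypothetical command and queue names, for illustration only. */
enum sketch_cmd {
    SK_CMD_READ, SK_CMD_WRITE, SK_CMD_TRIM, SK_CMD_FLUSH, SK_CMD_ZONE,
    SK_CMD_COUNT
};
enum sketch_queue { SK_Q_READ, SK_Q_WRITE, SK_Q_TRIM };

/* Reads and trims get their own queues; everything else is a "write". */
enum sketch_queue
sketch_pick_queue(enum sketch_cmd cmd)
{
    switch (cmd) {
    case SK_CMD_READ:
        return (SK_Q_READ);
    case SK_CMD_TRIM:
        return (SK_Q_TRIM);
    default:
        return (SK_Q_WRITE);
    }
}

/* Self-check: the read queue only ever receives read commands. */
void
sketch_check_read_queue(void)
{
    for (int c = 0; c < SK_CMD_COUNT; c++) {
        if (sketch_pick_queue((enum sketch_cmd)c) == SK_Q_READ)
            assert(c == SK_CMD_READ);
    }
}
```

Routing by exclusion (anything that is neither a read nor a trim) keeps the enqueue and dequeue sides in agreement even when new command types are added.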
RXQ setup for netmap was broken because netmap_rxq_init was getting called
before IFDI_INIT - thus we ended up with ring tail pointer being reset to zero.
Reviewed by: sbruno
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12140
In ql_hw_send() return EINVAL when TSO frame length exceeds max
supported length by HW. (davidcs)
2. ql_os.c:
In qla_send() call bus_dmamap_unload() before freeing mbuf or
recreating dmamap. (davidcs)
In qla_fp_taskqueue() add additional checks for IFF_DRV_RUNNING.
Fix qla_clear_tx_buf() to call bus_dmamap_sync() before freeing
mbuf.
Submitted by: David.Bachu@netapp.com
MFC after: 5 days
illumos/illumos-gate@554675eee7
https://www.illumos.org/issues/8473
Scrubbing is supposed to detect and repair all errors in the pool. However,
it wrongly ignores active spare devices. The problem can easily be
reproduced in OpenZFS at git rev 0ef125d with these commands:
truncate -s 64m /tmp/a /tmp/b /tmp/c
sudo zpool create testpool mirror /tmp/a /tmp/b spare /tmp/c
sudo zpool replace testpool /tmp/a /tmp/c
/bin/dd if=/dev/zero bs=1024k count=63 oseek=1 conv=notrunc of=/tmp/c
sync
sudo zpool scrub testpool
zpool status testpool # Will show 0 errors, which is wrong
sudo zpool offline testpool /tmp/a
sudo zpool scrub testpool
zpool status testpool # Will show errors on /tmp/c,
# which should've already been fixed
FreeBSD head is partially affected: the first scrub will detect some errors, but the second scrub will detect more.
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
MFC after: 1 week
Sponsored by: Spectra Logic Corp
It is reported that the default value of 4KB results in a substantial
memory use overhead (at least, on some configurations). Using 1KB seems
to reduce the overhead significantly.
PR: 222377
Reported by: Sean Chittenden <sean@chittenden.org>
MFC after: 1 week
I overlooked the fact that ZIO_IOCTL_PIPELINE does not include the
ZIO_STAGE_VDEV_IO_DONE stage. We do allocate a struct bio for an ioctl
zio (a disk cache flush), but we never freed it.
This change splits bio handling into two groups, one for normal
read/write i/o that passes data around and, thus, needs the abd data
transform; the other group is for "data-less" i/o such as trim and cache
flush.
PR: 222288
Reported by: Dan Nelson <dnelson@allantgroup.com>
Tested by: Borja Marcos <borjam@sarenet.es>
MFC after: 10 days
illumos/illumos-gate@2bcb545854
https://www.illumos.org/issues/8602
When I landed the fix for 8558, I incorrectly added the "dp_early_sync_tasks"
field to the "dsl_pool" structure. This field is used in DelphixOS, but not in
illumos. It was incorrectly pulled into illumos, so this bug is to remove it
from the structure.
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Prakash Surya <prakash.surya@delphix.com>
MFC after: 1 week
indicating whether the page's wire count transitioned to zero. Use that
return value in zbuf_page_free() rather than checking the wire count.
MFC after: 1 week
Exploit r288122 to address a cosmetic issue. Since PV chunk pages don't
belong to a vm object, they can't be paged out. Since they can't be paged
out, they are never enqueued in a paging queue. Nonetheless, passing
PQ_INACTIVE to vm_page_unwire() creates the appearance that these pages
are being enqueued in the inactive queue. As of r288122, we can avoid
this false impression by passing PQ_NONE.
MFC after: 1 week
Make the NFSv4 pNFS client function nfsrpc_layoutget() a static, since it
is only used in sys/fs/nfsclient/nfs_clrpcops.c.
This prepares the code for future patches that add Flex File layout
support.
This patch adds a new function called nfsm_uiombuflist(), which is
similar to nfsm_uiombuf(), but does not use the fields in
struct nfsrv_descript. This new function will be used by the pNFS client
for writing to mirrors using Flex Files layout.
The function is not yet called anywhere.
Also, get rid of #ifndef APPLE, which is ancient cruft left over from
the Mac OSX port of the NFSv4 client.
Simplify nfsrpc_layoutreturn() args. in preparation for the addition
of Flex File layout support, since File layout uses a 0 length field.
Flex Files does use a longer field, but that will be added in a
subsequent commit.
Care must be taken when updating the active LDT, since parallel
threads might try to load a segment descriptor which is currently
updated. Since the results are undefined, this cannot be ignored by
claiming to be an application race.
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D12413
If vrele() changes the hold count to zero, it needs to acquire the
vnode lock.
Sponsored by: The FreeBSD Foundation
Discussed with: avg
X-MFC with: r323578
One consequence of the patch is that msyncing unlinked file mappings
no longer reduces the amount of the dirty memory in the system, but I
do not think that there are users of msync(2) that utilize it for such
side-effect.
Reported and tested by: tjil
PR: 222356
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D12411
- Add size of an ethernet header to the value configured to NVS. This
does not seem to have any effects if MTU is 1500, but fix hypervisor
side's setting if MTU > 1500.
- Override the MTU setting according to the view from the hypervisor
side.
MFC after: 1 week
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D12352
Since in Azure SYN and SYN|ACK go through the synthetic parts while the
rest of the same TCP flow goes through the VF, apply VF's RSS settings
to synthetic parts to have a consistent hash value/type for the same TCP
flow.
MFC after: 1 week
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D12333
ino64 expanded nlink_t to 64 bits, but the on-disk format for UFS is still
limited to 16 bits. This is a nop currently but will matter if LINK_MAX is
increased in the future.
Reviewed by: kib
Sponsored by: Chelsio Communications
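A minimal sketch of the kind of range check that becomes relevant once the in-memory link count is wider than the on-disk field. The limit below assumes a signed 16-bit on-disk field; that choice and all names are assumptions made for illustration, not taken from the UFS sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: the on-disk link count is a signed 16-bit field. */
#define SKETCH_ONDISK_NLINK_MAX INT16_MAX

/* True when a 64-bit link count can be stored on disk unchanged. */
bool
sketch_nlink_fits(uint64_t nlink)
{
    return (nlink <= (uint64_t)SKETCH_ONDISK_NLINK_MAX);
}

/* Narrow to the on-disk type; the caller must have checked the range. */
int16_t
sketch_nlink_to_disk(uint64_t nlink)
{
    assert(sketch_nlink_fits(nlink));
    return ((int16_t)nlink);
}
```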
Suppose that userspace is executing with the non-standard segment
descriptors. Then, until exception or interrupt handler executed
SET_KERNEL_SEGS, kernel is still executing with user %ds, %es and %fs.
If an interrupt occurs in this window, the interrupt handler is
executed unsafely, relying on usability of the usermode registers. If
the interrupt results in the context switch on return, the
contamination of the kernel state spreads to the thread we switched
to. As a result, kernel data accesses might fault or, if only the base
is changed, be completely messed up.
Moreover, if the user segment was allocated in the LDT, another thread
might mark the descriptor as invalid before the doreti code tried to
reload them. In this case the kernel panics.
The issue exists for all exception entry points which use trap gate,
and thus do not automatically disable interrupts on entry, and for
lcall_handler.
Fix is two-fold: first, we need to disable interrupts for all kernel
entries, changing the IDT descriptor types from trap gate to interrupt
gate. Interrupts are re-enabled not earlier than the kernel segments
are loaded into the segment registers. Second, we only load the
segment registers from the trap frame when returning to usermode. For
the latter, all interrupt return paths must happen through the doreti
common code.
There is no way to disable interrupts on call gate, which is the
supposed mode of servicing for lcall $7,$0 syscalls. Change the LDT
descriptor 0 into a code segment type and point it to the userspace
trampoline which redirects the syscall to int $0x80.
All the measures make the segment register handling similar to that of
amd64. We do not apply amd64 optimizations of not reloading segment
registers on return from the syscall.
Reported by: Maxime Villard <max@m00nbsd.net>
Tested by: pho (the non-lcall part)
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D12402
kern.features.mmcam will be present and equal to 1 if the kernel has been
compiled with option MMCCAM.
This will help sdio-related userland tools to fail-fast if running on the kernel
without MMCCAM enabled.
Approved by: imp (mentor)
Differential Revision: https://reviews.freebsd.org/D12386
1. Swap the order of device_get_ivars with device_get_devclass and devclass
name validation. This bug was introduced in r323692.
2. Error check device_get_children and free the returned list. This bug was
introduced in the original linsysfs commit.
Reported by: Oleg V. Nauman <oleg AT theweb.org.ua>, hselasky (1); hselasky (2)
Reviewed by: hselasky
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D12407
Expose more information about PCI devices (and GPUs in particular) via
linsysfs to libdrm.
This allows unmodified modern 64-bit Linux libdrm to work, which allows
modern Linux Mesa to work. The submitter reports that he tested the change
with an Ubuntu 16.04 chroot + amdgpu from graphics/drm-next-kmod.
PR: 222375
Submitted by: Greg V <greg AT unrelenting.technology>
When it was added in r314636, AMD Thresholding was hardcoded to only
bank 4 (Northbridge) for some reason. However, even on family 10h the
MCAx_MISC register Valid/Present bits determine whether thresholding is
supported on that bank.
Expand thresholding support to monitor all monitorable banks. This
simplifies some of the logic and makes it more consistent with our Intel
CMCI support.
Reviewed by: markj (earlier version)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D12321
The code in nfscl_doflayoutio() bogusly used FREAD instead of
NFSV4OPEN_ACCESSREAD. Since both happen to be defined as "1", this
worked and the patch doesn't result in a functional change.
Found by inspection during development of Flex File Layout support.
MFC after: 2 weeks
__builtin_frame_address with a non-zero argument is unsafe and rejected by
newer gcc. Since it doesn't seem to impact the stacktrace, don't bother
with gymnastics to unwind to a different frame for starting.
PR: kern/220118
MFC after: 2 weeks
All the Book-E world is no longer e500v{1,2}. e500mc and the 64-bit derivatives do
not use the DOZE/NAP bits with MSR[WE], instead using the `wait' instruction to
wait for interrupts, and SoC plane controls (via CCSR) for power management.
MFC after: 1 week
Modify blst_leaf_alloc to find allocations that cross the boundary between
one leaf node and the next when those two leaves descend from the same
meta node.
Update the hint field for leaves so that it represents a bound on how
large an allocation can begin in that leaf, where it currently represents
a bound on how large an allocation can be found within the boundaries of
the leaf.
The first phase of blst_leaf_alloc currently shrinks sequences of
consecutive 1-bits in mask until each has been shrunken by count-1 bits,
so that any bits remaining show where an allocation can begin, or until
all the bits have disappeared, in which case the allocation fails. This
change amends that so that the high-order bit is copied, as if, when the
last block was free, it was followed by an endless stream of free
blocks. It also amends the early stopping condition, so that the shrinking
of 1-sequences stops early when there are none, or there is only one
unbounded one remaining.
The search for the first set bit is unchanged, and the code path
thereafter is mostly unchanged unless the first set bit is in a position
that makes some of those copied sign bits matter. In that case, we look
for a next leaf, and at what blocks it can provide, to see if a
cross-boundary allocation is possible.
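The shrinking phase can be modelled on a single 64-bit mask. This is a simplified sketch, not the code of blst_leaf_alloc: it shrinks one bit per step, and it assumes the low-order bit corresponds to the first block of the leaf. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_HIGH_BIT (UINT64_C(1) << 63)

/*
 * Bounded variant: bit j of the result is set iff bits j..j+count-1 of
 * the input are all set, i.e. an allocation of 'count' can begin at j.
 */
uint64_t
sketch_starts_bounded(uint64_t mask, int count)
{
    assert(count >= 1 && count <= 64);
    for (int i = 1; i < count; i++)
        mask &= mask >> 1;
    return (mask);
}

/*
 * Unbounded variant: the high-order bit is copied on every step, as if a
 * free last block were followed by an endless stream of free blocks.
 */
uint64_t
sketch_starts_unbounded(uint64_t mask, int count)
{
    assert(count >= 1);
    for (int i = 1; i < count; i++)
        mask &= (mask >> 1) | (mask & SKETCH_HIGH_BIT);
    return (mask);
}
```

For a mask with only the top two bits set and count = 4, the bounded variant returns 0, while the unbounded variant keeps both bits set. Those surviving bits mark positions where a cross-boundary allocation could begin, provided the next leaf supplies the remaining blocks.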
The hint is updated on a successful allocation that clears the last bit,
but it is not updated on a failed allocation that leaves the last bit
set. So, as long as the last block is free, the hint value for the leaf is
large. As long as the last block is free, and there's a next leaf, a large
allocation can begin here, perhaps. A stricter rule than this would mean
that allocations and frees in one leaf could require hint updates to the
preceding leaf, and this change seeks to leave the freeing code
unmodified.
Define BLIST_BMAP_MASK, and use it for bit masking in blst_leaf_free and
blist_leaf_fill, as well as in blst_leaf_alloc.
Correct a panic message in blst_leaf_free.
Submitted by: Doug Moore <dougm@rice.edu>
Reviewed by: markj (an earlier version)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D11819
The usbphy node for Allwinner has two kinds of resources, one for the
phy_ctrl and one per phy. Instead of blindly allocating resources, alloc
the phy_ctrl and pmu ones separately.
Also add a configuration struct for all the different phys that holds the
differences between them (number of phys, unknown needed register write, etc.).
While here remove A83T code as upstream and FreeBSD dts don't have
nodes for USB.
This (plus 323640) re-enables OHCI on Pine64 on the bottom USB port.
The top USB port is routed to the OHCI0/EHCI0 which is by default in OTG mode.
While the phy code can handle the re-route to standard OHCI/EHCI we still need
a driver for musb to probe and configure it in host mode.
EHCI is still buggy on Pine64 (it hangs the board) so do not enable it for now.
Tested On: Bananapi (A20), BananapiM2 (A31S), OrangePi One (H3), Pine64 (A64)
r323392 introduced gpio_pin_get/gpio_pin_set for the a10_gpio driver.
When called via the gpio methods they must acquire the device lock, while
when they are called via gpio_pin_configure the lock is already acquired.
Introduce a10_gpio_pin_{s,g}et_locked and call them in pin_gpio_configure
instead.
Tested On: BananaPi (A20)
Reported by: Richard Puga <richard@puga.net>
This was really too big of a commit even if everything worked, but there
are multiple new issues introduced in the one huge commit, so it's not
worth keeping this until it's fixed.
I'll work on splitting this up into logical chunks and introduce them one
at a time over the next week or two.
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks
To optimize the case of ping-ponging between two buffers, the DDP code
caches the last two buffers used keeping the pages wired and page pods
stored in the NIC's RAM. If a new aio_read() request uses one of the
same buffers, then the work of holding pages, etc. can be avoided.
However, the starting virtual address of an aio buffer was not saved,
only the page count, length, and initial page offset. Thus, an
aio_read() request could match a different buffer in the address
space. (Earlier during development vm_fault_hold_quick_pages() was
always called and the vm_page_t values were compared, but that was
eventually removed without being adequately replaced.) Fix by storing
the starting virtual address and comparing that (along with other
fields) to determine if a buffer can be reused.
MFC after: 3 days
Sponsored by: Chelsio Communications
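The buffer-reuse check described above can be sketched as follows. The structure layout and names are hypothetical and far simpler than the driver's real data structures; the point is only that the starting virtual address takes part in the comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of a cached or requested AIO buffer. */
struct sketch_ddp_buf {
    uintptr_t start;    /* starting virtual address */
    size_t len;         /* length in bytes */
    size_t pgoff;       /* offset into the first page */
    int npages;         /* number of pages spanned */
};

/* Old-style match: ignores the starting address. */
bool
sketch_match_old(const struct sketch_ddp_buf *c, const struct sketch_ddp_buf *r)
{
    assert(c != NULL && r != NULL);
    return (c->len == r->len && c->pgoff == r->pgoff &&
        c->npages == r->npages);
}

/* Fixed match: the starting address must agree as well. */
bool
sketch_match_fixed(const struct sketch_ddp_buf *c, const struct sketch_ddp_buf *r)
{
    return (sketch_match_old(c, r) && c->start == r->start);
}
```

Two page-aligned buffers of equal length at different addresses satisfy the old check but not the fixed one.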
Don't call cam_iosched_trim_done or cam_iosched_submit_trim for nda
since its hardware can handle almost an arbitrary number of TRIMs and
we don't have to be careful to only ever do one.
Sponsored by: Netflix
It's intended only for those situations where the periph driver
wants to limit the number of trims active to one and only one.
Also update comments on associated functions.
Sponsored by: Netflix
* Demote the level of several debug messages to CAM_DEBUG_TRACE
* Add detection for SDHC cards that can do 1.8V. No voltage switch sequence
is issued yet;
* Don't create a separate LUN for each SDIO function. We need just one to make
pass(4) attach;
* Remove obsolete mmc_sdio* files. SDIO functionality will be moved into the
separate device that will manage a new sdio(4) bus;
* Terminate probing if got no reply to CMD0;
* Make bcm2835 SDHCI host controller driver compile with 'option MMCCAM'.
Approved by: imp (mentor)
Differential Revision: https://reviews.freebsd.org/D12109
free queue mutex lock owning session, same as it was done for the
object termination in r323561.
Reported and tested by: mjg
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
In theory, all data access errors mean that a member is out of sync
at most. But they were treated as more serious errors to avoid the
situation where a flaky disk gets repeatedly disconnected, re-synchronized,
reconnected and then disconnected again.
ENXIO is a special error that means that the member disk disappeared,
so it should get the same handling as the GEOM orphaning event.
There is a better chance that when the disk is reconnected, it will be
a good member again.
When ENXIO happens on a read we use the existing G_MIRROR_BUMP_SYNCID
mechanism which means that the mirror's syncid is increased as soon
as there is a write to the mirror. That's because no data has got out
of sync yet, but the problematic member is disconnected, so the future
write will make it stale.
When ENXIO happens on a write we use a new G_MIRROR_BUMP_SYNCID_NOW
mechanism which means that we update the mirror metadata as soon as
possible because the problematic member is already behind.
Reviewed by: markj, imp
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D9463
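The ENXIO policy above reduces to a small decision function. This sketch uses hypothetical enum and function names; only ENXIO and the read/write distinction come from the description, and the real gmirror code paths are more involved.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical outcomes, named after the mechanisms in the description. */
enum sketch_action {
    SK_OTHER_ERROR_HANDLING,    /* not ENXIO: handled as before */
    SK_BUMP_SYNCID,             /* bump syncid on the next write */
    SK_BUMP_SYNCID_NOW          /* update metadata as soon as possible */
};

enum sketch_action
sketch_on_io_error(int error, bool is_write)
{
    assert(error != 0);
    if (error != ENXIO)
        return (SK_OTHER_ERROR_HANDLING);
    /* A failed write means the member is already behind. */
    return (is_write ? SK_BUMP_SYNCID_NOW : SK_BUMP_SYNCID);
}
```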
The bad_session, sglist_error, and process_error sysctl nodes were
returning the value of the pad_error node instead of the appropriate
error counters.
Sponsored by: Chelsio Communications
When a newborn socket moves from the incomplete queue to the complete
one, we need to obtain the listening socket lock after the child's,
which is the wrong order. The old code did that in a potentially
endless loop of mtx_trylock(). The new code makes only one attempt
at mtx_trylock() and, in case of failure, references the listening
socket, unlocks the child and locks everything in the right order. If
the listening socket shuts down during that, just bail out.
Reported & tested by: Jason Eggleston <jeggleston llnw.com>
Reported & tested by: Jason Wolfe <jason llnw.com>
kernel. We can register callbacks to perform the required operation on the
saved registers before returning.
This is initially used to work around a bug in old versions of QEMU that
trigger such an exception when reading from an ID register when it should
load a zero value.
I expect this could be used with other exception types, e.g. to emulate
special register access from userland.
Sponsored by: DARPA, AFRL
An eventual devd(8) or other component should be able to scan buses and
automatically load drivers that match device ids described in this metadata.
Reviewed by: imp
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D12364
The core note matches the format and layout of NT_ARM_VFP on Linux.
Debuggers use the AT_HWCAP flags to determine how many VFP registers
are actually used and their format.
Reviewed by: mmel (earlier version w/o gcore)
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D12293
Future changes will use these functions to fetch and store VFP state for
threads other than curthread.
Reviewed by: andrew, stevek, Michal Meloun <meloun-miracle-cz>
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D12292
These flags match the meaning and value of flags in Linux, though
Linux has many more flags.
Reviewed by: stevek, Michal Meloun <meloun-miracle-cz> (earlier version)
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D12291
A new 'u_long *sv_hwcap' field is added to 'struct sysentvec'. A
process ABI can set this field to point to a value holding a mask of
architecture-specific CPU feature flags. If an ABI does not wish to
supply AT_HWCAP to processes the field can be left as NULL.
The support code for AT_EHDRFLAGS was already present on all systems,
just the #define was not present. This is a step towards unifying the
AT_* constants across platforms.
Reviewed by: kib
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D12290
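The optional sv_hwcap pointer described above can be illustrated with a reduced structure. Only the field name and type come from the description; the surrounding structure and helper are hypothetical, and the real struct sysentvec has many more members.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Reduced stand-in for struct sysentvec. */
struct sketch_sysentvec {
    unsigned long *sv_hwcap;    /* NULL when the ABI supplies no AT_HWCAP */
};

/*
 * Fetch the hwcap mask if the ABI provides one. Returns false and leaves
 * '*out' untouched when sv_hwcap is NULL.
 */
bool
sketch_get_hwcap(const struct sketch_sysentvec *sv, unsigned long *out)
{
    assert(sv != NULL && out != NULL);
    if (sv->sv_hwcap == NULL)
        return (false);
    *out = *sv->sv_hwcap;
    return (true);
}
```

Storing a pointer rather than a value lets the mask be filled in after the structure is defined, for example once CPU features have been probed; that motivation is an assumption, not something the description states.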
As long as mnt_ref is not zero there can be a consumer that might try
to access mnt_vnodecovered. For this reason the covered vnode must not
be freed until mnt_ref goes to zero.
So, move the release of the covered vnode to vfs_mount_destroy.
Reviewed by: kib
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D12329
The problem is that fdrop() requires syscall context, as it may
enter sleep in some cases. The reason to use it in the original
non-blocking sendfile implementation, was to avoid use of global
ACCEPT_LOCK() on every I/O completion. Now in head sorele() no
longer requires this lock.
16 bits is only wide enough for kegs with an item size of up to 64KB.
At that size or larger, slab headers are typically offpage because the
item size is a multiple of the page size, but there is no requirement
that this be the case.
We can widen the field without affecting the layout of struct uma_keg
since the removal of uk_slabsize in r315077 left an adjacent hole.
PR: 218911
MFC after: 2 weeks
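The width limit can be shown directly: a value of 64KB no longer fits in 16 bits and wraps to zero when narrowed. This is a generic sketch with hypothetical helper names, not the UMA code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when a size can be stored in a 16-bit field without loss. */
bool
sketch_fits_u16(uint32_t value)
{
    return (value <= UINT16_MAX);
}

/* Unchecked narrowing: values of 65536 and above silently wrap. */
uint16_t
sketch_narrow_u16(uint32_t value)
{
    return ((uint16_t)value);
}

/* Checked narrowing: refuses values that would wrap. */
uint16_t
sketch_narrow_u16_checked(uint32_t value)
{
    assert(sketch_fits_u16(value));
    return ((uint16_t)value);
}
```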
object' page queue under the single mutex lock.
First, all pages on the queue are prepared for free by calls to
vm_page_free_prep(), and pages which should not be returned to the
physical allocator (e.g. wired or fictitious) are simply removed from
the queue. On the second pass, vm_page_free_phys_pglist() inserts all
pages from the queue without relocking the mutex.
The change improves object termination, e.g. on process exit,
where large anonymous memory objects otherwise cause relocking of the free
queue mutex for each page. Moreover, if several such processes are
exiting or execing in parallel, the mutex is highly contended during
the address space demolition.
Diagnosed and tested by: mjg (previous version)
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
and insertion into the phys allocator free queues vm_page_free_phys().
Also provide a wrapper vm_page_free_phys_pglist() for batched free.
Reviewed by: alc, markj
Tested by: mjg (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
other kernel infrastructure changes.
Note that this doesn't affect the base cxgb(4) NIC driver for T3 at all.
MFC after: No MFC.
Sponsored by: Chelsio Communications
On some AMD FCH devices driven by intpm(4) (read: mine), the SMBus I/O port
range is split in two and the low range is only 0x10 wide. intpm(4) does
not access any registers above 0x0f, so there is no need for the wider
range.
Discussed with: avg
Sponsored by: Dell EMC Isilon
generate-fat.sh does the following:
- create an 800kb zero-filled file
- create an md device backed by this file
- format the device fat12
- mount the filesystem
- create the EFI ESP directory structure
- create the EFI boot file (BOOTx64 for amd64, BOOTaa64 for aarch64, etc)
- Adds a marker to the beginning of the file, and pad it to 384kb
- 384kb was chosen as it is less than half of 800kb, thus allowing
users to keep a backup of their older boot file in the small partition
- Unmount the filesystem
- Scan the image and find the offset where the marker was inserted
- The process requires root, so to make image generation easier, images for
each architecture are pregenerated, compressed with xz, and checked
into svn.
The Makefile that generates boot1.efifat does the following:
- Ensure the compiled boot1.efi file is no larger than the generated image
- Decompress the template created by generate-fat.sh
- dd the contents of boot1.efi into boot1.efifat starting at the offset
where the marker is found. This allows any file less than the maximum
size to be written into the fat filesystem without having to mount it,
so no root privileges are required.
Later work by imp and myself makes bsdinstall create a 200mb fat16 instead
of using this process, but it is retained to make image generation easier.
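The marker-and-offset technique described above can be illustrated with a small Python sketch. The marker string, the function names and the toy image sizes are assumptions made for the example; the real build uses the template images and dd.

```python
MARKER = b"Boot1 START"  # hypothetical marker, not taken from the scripts


def find_marker_offset(image: bytes, marker: bytes = MARKER) -> int:
    """Scan the template image for the marker and return its byte offset."""
    offset = image.find(marker)
    if offset < 0:
        raise ValueError("marker not found in template image")
    return offset


def patch_image(image: bytes, payload: bytes, offset: int, max_size: int) -> bytes:
    """Overwrite the placeholder region in place, without mounting anything."""
    if len(payload) > max_size:
        raise ValueError("payload is larger than the reserved region")
    return image[:offset] + payload + image[offset + len(payload):]


# Toy template: some filesystem metadata, then a 4096-byte placeholder file
# that starts with the marker, then more filesystem data.
reserved = 4096
template = bytes(1024) + MARKER + bytes(reserved - len(MARKER)) + bytes(1024)

offset = find_marker_offset(template)
patched = patch_image(template, b"pretend this is boot1.efi", offset, reserved)
print(offset, len(patched) == len(template))
```

Because the write only replaces bytes inside the reserved region, the image size and the surrounding filesystem structures are unchanged, which is why no mount (and hence no root privilege) is needed.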
Submitted by: Eric McCorkle (original version)
Reviewed by: emaste, tsoome, Eric McCorkle
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D9680
available, in i2c controller drivers that require interrupts for transfers.
This is the result of auditing all 22 existing drivers that attach iicbus.
These drivers were the only ones remaining that require interrupts and were
not using config_intrhook to defer attachment. That has led, over the
years, to various i2c slave device drivers needing to use config_intrhook
themselves rather than performing bus transactions in their probe() and
attach() methods, just in case they were attached too early.
in UNIX sockets.
o Check that socket is still connected in uipc_ready(). If not
we are responsible for freeing the mbufs.
o In uipc_send() if socket appears to be disconnected, but we
are sending data with pending I/Os, don't free mbufs.
Reported by: Kevin Bowling <kbowling llnw.com>
Tested by: Kevin Bowling <kbowling llnw.com>
PR: 222259
Reported by: Mark Martinec <Mark.Martinec ijs.si>
MFC after: 3 days
in favor of just rendering the manpage instead of relying on pre-formatted
catpages. Note, this does not impede the ability to use existing catpages,
it just removes the utility to generate them.
Reviewed by: imp, allanjude
Approved by: emaste (mentor)
Differential Revision: https://reviews.freebsd.org/D12317
Kegs for internal zones always keep the slab header in the slab itself.
Therefore, when determining the allocation size, we need to take the
slab header size into account.
Reported and tested by: ae, rakuco
Reviewed by: avg
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D12342
The new IDs are taken from the hardware to which I have access
and from open datasheets.
Also, the hardware probing is moved to the device probe method.
Reviewed by: rpokala
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D11730
CAM_DEBUG_TRACE results in far more debug output than is needed now.
When debugging, it's always possible to turn on trace level using camcontrol.
Approved by: imp (mentor)
Differential Revision: https://reviews.freebsd.org/D12110
by Matt Macy as well as other changes which he has accepted via pull
request to his github repo at https://github.com/mattmacy/networking/
This should bring -CURRENT and the github repo into close enough sync to
allow small feature branches rather than a large chain of interdependent
patches being developed out of tree. The rest of the synchronization
should be able to be completed on github by splitting the remaining
changes that are not yet ready into short feature branches for later
review as smaller commits.
Here is a summary of changes included in this patch:
1) More checks when INVARIANTS are enabled for earlier problem
detection
2) Group Task Queue cleanups
- Fix use of duplicate shortdesc for gtaskqueue malloc type.
Some interfaces such as memguard(9) use the short description to
identify malloc types, so duplicates should be avoided.
3) Allow gtaskqueues to use ithreads in addition to taskqueues
- In some cases, this can improve performance
4) Better logging when taskqgroup_attach*() fails to set interrupt
affinity.
5) Do not start gtaskqueues until they're needed
6) Have mp_ring enqueue function enter the ABDICATED rather than BUSY
state. This moves the TX to the gtaskq and allows processing to
continue faster as well as make TX batching more likely.
7) Add an ift_txd_errata function to struct if_txrx. This allows
drivers to inspect/modify mbufs before transmission.
8) Add a new IFLIB_NEED_ZERO_CSUM for drivers to indicate they need
checksums zeroed for checksum offload to work. This avoids modifying
packet data in the TX path when possible.
9) Use ithreads for iflib I/O instead of taskqueues
10) Clean up ioctl and support async ioctl functions
11) Prefetch two cache lines from each mbuf instead of one up to 128B. We
often need to parse packet header info beyond 64B.
12) Fix potential memory corruption due to fence post error in
bit_nclear() usage.
13) Improved hang detection and handling
14) If the packet is smaller than MTU, disable the TSO flags.
This avoids extra packet parsing when not needed.
15) Move TCP header parsing inside the IS_TSO?() test.
This avoids extra packet parsing when not needed.
16) Pass chains of mbufs that are not consumed by lro to if_input()
rather than calling if_input() for each mbuf.
17) Re-arrange packet header loads to get as much work as possible done
before a cache stall.
18) Lock the context when calling IFDI_ATTACH_PRE()/IFDI_ATTACH_POST()/
IFDI_DETACH();
19) Attempt to distribute RX/TX tasks across cores more sensibly,
especially when RX and TX share an interrupt. RX will attempt to
take the first threads on a core, and TX will attempt to take
successive threads.
20) Allow iflib_softirq_alloc_generic() to request affinity to the same
cpus an interrupt has affinity with. This allows TX queues to
ensure they are serviced by the socket the device is on.
21) Add new iflib sysctls to net.iflib:
- timer_int - interval at which to run per-queue timers in ticks
- force_busdma
22) Add new per-device iflib sysctls to dev.X.Y.iflib
- rx_budget allows tuning the batch size on the RX path
- watchdog_events Count of watchdog events seen since load
23) Fix error where netmap_rxq_init() could get called before
IFDI_INIT()
24) e1000: Fixed version of r323008: post-cold sleep instead of DELAY
when waiting for firmware
- After interrupts are enabled, convert all waits to sleeps
- Eliminates e1000 software/firmware synchronization busy waits after
startup
25) e1000: Remove special case for budget=1 in em_txrx.c
- Premature optimization which may actually be incorrect with
multi-segment packets
26) e1000: Split out TX interrupt rather than share an interrupt for
RX and TX.
- Allows better performance by keeping RX and TX paths separate
27) e1000: Separate igb from em code where suitable
Much easier to understand separate functions and "if (is_igb)" than
previous tests like "if (reg_icr & (E1000_ICR_RXSEQ | E1000_ICR_LSC))"
#blamebruno
Reviewed by: sbruno
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12235
Normally after receiving a packet, a vlan(4) interface sends the packet
back through its parent interface's rx routine so that it can be
processed as an untagged frame. It does this by using the parent's
ifp->if_input. This is incompatible with netmap(4), which replaces the
vlan(4) interface's if_input with a netmap(4) hook. Fix this by using
the vlan(4) interface's ifp instead of the parent's directly.
Reported by: Harry Schmalzbauer <freebsd@omnilan.de>
Reviewed by: rstone
Approved by: rstone (mentor)
MFC after: 3 days
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D12191
The cam_iosched_ticker() can't be scheduled more than once per tick.
Some limiters depend on quanta matching the number of calls per second
to enforce the proper limits. Limit the quanta to no faster than 1 per
clock tick. This fixes some features when running in VMs where the
default HZ is 100.
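A rough model of why the clamp matters, as a Python sketch. The function names and the simple per-call budget are assumptions for illustration and do not mirror the cam_iosched code.

```python
def effective_iops(max_iops: int, quanta: int, hz: int) -> int:
    """Limit actually enforced when the ticker cannot run more than hz times/s."""
    per_call_budget = max_iops // quanta  # budget assumes `quanta` calls per second
    calls_per_second = min(quanta, hz)    # but the ticker fires at most once per tick
    return per_call_budget * calls_per_second


def clamp_quanta(quanta: int, hz: int) -> int:
    """Limit the quanta to no faster than one per clock tick."""
    return min(quanta, hz)


hz = 100  # the VM case mentioned above
print(effective_iops(1000, 200, hz))                    # quanta faster than HZ
print(effective_iops(1000, clamp_quanta(200, hz), hz))  # quanta clamped to HZ
```

In this model an unclamped quanta of 200 with HZ of 100 enforces only half of the configured limit, while the clamped value restores it.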
PR: 221953
Obtained from: ElectroBSD
Differential Revision: https://reviews.freebsd.org/D12337
Submitted by: Fabian Keil
It's awkward to have spaces in CAM device serial numbers. That leads to
such things as device nodes named "/dev/diskid/MYSERIAL%20%20%201". Better
to replace the spaces with "0"s. This change only affects the default
serial numbers for users who don't provide their own.
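As a small illustration in Python, with URL-style quoting standing in for however the device node name gets encoded (an assumption), and a hypothetical helper name:

```python
from urllib.parse import quote


def pad_default_serial(serial: str) -> str:
    """Replace the spaces in a default serial number with '0' characters."""
    return serial.replace(" ", "0")


before = "MYSERIAL   1"
after = pad_default_serial(before)
print("/dev/diskid/" + quote(before))  # /dev/diskid/MYSERIAL%20%20%201
print("/dev/diskid/" + quote(after))   # /dev/diskid/MYSERIAL0001
```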
Reviewed by: ken, mav
MFC after: Never
Relnotes: Yes
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D12263
Newer binutils supports extensions to the MIPS ABI for non-PIC code
that is used when compiling O32 binaries with clang 5 (but not used
for N64 oddly enough). These extensions require support for
R_MIPS_COPY relocations as well as a second PLT GOT using
R_MIPS_JUMP_SLOT relocations.
For R_MIPS_COPY, use the same approach as on other architectures where
fixups are deferred to the MD do_copy_relocations.
The additional PLT GOT for jump slots is located in a .got.plt section
which is identified by a DT_MIPS_PLTGOT dynamic entry. This GOT also
requires fixups for the first two GOT entries just as the normal GOT.
However, the entry point for this second GOT uses a different calling
convention. Rather than passing an offset into the GOT, it passes an
offset into the .rel.plt section. This requires a second entry point
(_rtld_pltbind_start) which calls the normal _rtld_bind() rather than
_mips_rtld_bind(). This also means providing a real version of
reloc_jmpslot() which is used by _rtld_bind().
In addition, add real implementations of reloc_plt() and
reloc_jmpslots() which walk .rel.plt handling R_MIPS_JUMP_SLOT
relocations.
Reviewed by: kib
Sponsored by: DARPA / AFRL
Differential Revision: https://reviews.freebsd.org/D12326
The zfsonlinux feature large_dnode is not yet supported by the loader.
Reviewed by: avg, allanjude
Differential Revision: https://reviews.freebsd.org/D12288
"probe" method of those drivers to mean we're on a TI SoC. Introduce a new
function, ti_soc_is_supported(), and use it to be sure we're really a TI
system.
PR: 222250
The uncovered vnode is possible because there is no guarantee that
its hold count would go to zero (and it would be inactivated and reclaimed)
immediately after a covering filesystem is unmounted.
So, such a vnode should be expected and it is possible to re-use it
without any trouble.
MFC after: 3 weeks
Sponsored by: Panzura
The only consumer of zfs_get_vfs, zfs_unmount_snap, does not need
the filesystem to be busy, it just needs a reference that it can pass
to dounmount.
Also, previously the code was racy as it unbusied the filesystem
before taking a reference on it.
Now the code should be simpler and safer.
MFC after: 2 weeks
Sponsored by: Panzura
stop, read, and write methods. Some controllers don't implement these
individual operations and have only a transfer method. In that case, we
should return an indication that the device is present but doesn't support
the method, as opposed to the kobj default error ENXIO which makes it
look like the whole device is missing. Userland tools such as i2c(8) can
use the differing return values to switch between the two different i2c
IO mechanisms.
On AMD, the MCG_CAP feature bit is reserved -- not explicitly zero. Do not
use it to determine CMCI support.
Reviewed by: avg, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D12320
This Makefile relies on Makefile.fat providing the correct value for
BOOT1_MAXSIZE and BOOT1_OFFSET. Since BOOT1_OFFSET had no default value
here the build would already fail if Makefile.fat did not provide
correct values.
Sponsored by: The FreeBSD Foundation
illumos/illumos-gate@37e84ab74e
https://www.illumos.org/issues/8569
C [C99] has peculiar rules for inline functions that are different from the
C++ rules. Unlike C++ where inline is "fire and forget", in C a programmer
must pay attention to the function's storage class / visibility. The main
problem is with the case where a compiler decides to not inline a call to the
function declared as inline.
Some relevant links:
- http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka15831.html
- http://www.drdobbs.com/the-new-c-inline-functions/184401540
The summary is that either the inline functions should be declared 'static
inline' or one of the compilation units (.c files) must provide a callable
externally visible function definition. In the former case, the compiler would
automatically create a local non-inlined function instance in every compilation
unit where it's needed. In the latter case the single external definition is
used to satisfy any non-inlined calls in all compilation units. As things
stand right now, we can get an undefined reference error under certain
combinations of compilers and compiler options. For example, this is what I
get on FreeBSD when compiling with clang 4.0.0 and -O1:
In function `abd_free': /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/abd.c:385:
undefined reference to `abd_is_linear'
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Andriy Gapon <avg@FreeBSD.org>
MFC after: 1 week
illumos/illumos-gate@216d7723a1
https://www.illumos.org/issues/8558
On a system with more than 80K ZFS filesystems, we've seen cases where
lwp_create() will start to fail by returning EAGAIN. The problem is that
a taskq is created for each of those 80K ZFS filesystems (datasets) as
part of that dataset's ZIL.
For each of these taskq's, a kernel thread will be created which results
in 24KB being allocated for each thread. With enough of these 24KB
allocations, we eventually exhaust the memory region set aside for these
allocations. Currently, segkpsize is set to a value of 2GB, which means
we can only support about 80K filesystems; 2GB / 24KB = ~80K.
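The estimate can be checked with a few lines of Python, assuming binary units (2GB = 2 * 2^30 bytes, 24KB = 24 * 2^10 bytes):

```python
SEGKP_SIZE = 2 * 1024**3     # 2GB region set aside for these allocations
PER_THREAD_ALLOC = 24 * 1024  # 24KB allocated per kernel thread

max_threads = SEGKP_SIZE // PER_THREAD_ALLOC
print(max_threads)  # 87381
```

That is a little under 90K threads in total, consistent with the rough figure of about 80K filesystems given that other threads and LWPs draw from the same region.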
The lwp_create() failure comes into play due to the fact that LWP
creation also allocates 24KB from this same region of memory. Thus, if
we've exhausted this region of memory due to the number of ZIL taskq's,
there won't be any memory available to allow the call to lwp_create() to
succeed.
FreeBSD note: I haven't created sysctl-s for the new ZIL clean
parameters. Let's add them if anyone requires to tune them.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Prakash Surya <prakash.surya@delphix.com>
MFC after: 3 weeks
This patch adds an hwtype parameter which keeps information about the
hardware revision of the Marvell EHCI controller. It allows replacing
multiple calls to ofw_bus_is_compatible with a comparison of the hwtype
value during driver initialization.
Submitted by: Patryk Duda <pdk@semihalf.com>
Suggested by: ian
Obtained from: Semihalf
Sponsored by: Semihalf
In advance of other changes to the fat template generation process, have
generate-fat.sh create all template files at the same time so that they
cannot get out of sync.
Also correct a longstanding bug where BOOT1_OFFSET was overwritten on
each invocation. A previous version of this patch stored a per-arch
offset (e.g. BOOT1_arm64_OFFSET) but that was deemed unnecessary.
Instead just hardcode the known offset that applies to all archs (0x2d)
and fail if the offset happens to be different.
Ongoing work (using newfs_msdos in bsdinstall and adding msdosfs support
to makefs) will eventually allow us to do away with this fat template
hack altogether, but in the near term we have a few improvements that
will build on this.
Reviewed by: allanjude, imp, Eric McCorkle
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D10931
In the case of running newvers.sh on a git tree w/o git-svn-id notes we
previously piped the entire 'git log' to grep. Add --grep to the log
invocation to avoid processing log entries of no interest.
This saves about 2-3 seconds of newvers.sh run time on my SSD laptop.
Later changes will bring further speedups.
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
This prevents incorrect subversion revision detection when "git svn" is
not being used to get the sources but git is available. Previously old
subversion revisions included in commit messages were favoured over the
more recent and correct revisions in git notes.
For example cf1f355747 represents r315395 but was treated as r313908
which is referenced in the commit message. Commits following
r315395/cf1f35574722 but before another commit with a git-svn-id
reference in the commit message would be treated as r313908 as well.
Patch from PR updated to accommodate the initial four space indent in
`git log` output.
PR: 221848
Submitted by: Fabian Keil
Obtained from: ElectroBSD
MFC after: 2 weeks
Prior to the change they were subject to extreme false sharing.
In particular this change shaves about 3 seconds of real time off a -j 80 buildkernel.
Reviewed by: alc, markj
Differential Revision: https://reviews.freebsd.org/D12281
Sometimes it is necessary to combine several gpio pins into an ad-hoc bus
and manipulate the pins as a group. In such cases manipulating the pins
individually is not an option, because the value on the "bus" assumes
potentially-invalid intermediate values as each pin is changed in turn. Note
that the "bus" may be something as simple as a bi-color LED where changing
colors requires changing both gpio pins at once, or something as complex as
a bitbanged multiplexed address/data bus connected to a microcontroller.
In addition to the absolute requirement of simultaneously changing the
output values of driven pins, a desirable feature of these new methods is to
provide a higher-performance mechanism for reading and writing multiple
pins, especially from userland where pin-at-a-time access incurs a noticeable
syscall time penalty.
These new interfaces are NOT intended to abstract away all the ugly details
of how gpio is implemented on any given platform. In fact, to use these
properly you absolutely must know something about how the gpio hardware is
organized. Typically there are "banks" of gpio pins controlled by registers
which group several pins together. A bank may be as small as 2 pins or as
big as "all the pins on the device, hundreds of them." In the latter case, a
driver might support this interface by allowing access to any 32 adjacent
pins within the overall collection. Or, more likely, any 32 adjacent pins
starting at any multiple of 32. Whatever the hardware restrictions may be,
you would need to understand them to use this interface.
In addition to defining the interfaces, two example implementations are
included here, for imx5/6, and allwinner. These represent the two primary
types of gpio hardware drivers. imx6 has multiple gpio devices, each
implementing a single bank of 32 pins. Allwinner implements a single large
gpio number space from 1-n pins, and the driver internally translates that
linear number space to a bank+pin scheme based on how the pins are grouped
into control registers. The allwinner implementation imposes the restriction
that the first_pin argument to the new functions must always be pin 0 of a
bank.
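The bank arithmetic and the "first_pin must be pin 0 of a bank" restriction can be sketched in Python. The 32-pin bank width, the function names and the mask-based update are assumptions for illustration; they are not the new driver methods themselves.

```python
PINS_PER_BANK = 32  # assumed bank width for this sketch


def pin_to_bank(pin: int) -> tuple[int, int]:
    """Translate a linear pin number into (bank, pin-within-bank)."""
    return divmod(pin, PINS_PER_BANK)


def check_group_access(first_pin: int, num_pins: int) -> int:
    """Accept only groups that start at pin 0 of a bank and fit inside it."""
    bank, offset = pin_to_bank(first_pin)
    if offset != 0:
        raise ValueError("first_pin must be pin 0 of a bank")
    if not 1 <= num_pins <= PINS_PER_BANK:
        raise ValueError("group must fit within one bank")
    return bank


def update_bank(register: int, clear_mask: int, set_mask: int) -> int:
    """Compute the new register value so all pins change in a single write."""
    return ((register & ~clear_mask) | set_mask) & 0xFFFFFFFF


# Bi-color LED on pins 0 and 1 of bank 2: switch colour with one update,
# so the pins never pass through the both-on or both-off state.
bank = check_group_access(first_pin=64, num_pins=2)
print(bank, bin(update_bank(0b01, clear_mask=0b01, set_mask=0b10)))
```

The single combined update is the point: changing the two LED pins one at a time would briefly expose an intermediate value on the "bus".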
Differential Revision: https://reviews.freebsd.org/D11810
for analyzing the radix tree structures and reporting on the number, and
sizes, of maximal intervals of free blocks. The report includes the number
of maximal intervals, and also the number of them in each of several size
ranges, from small (size 1, or 3 to 4) to large (28657 to 46367) with size
boundaries defined by Fibonacci numbers. The report is written in the test
tool with the 's' command, or in a running kernel by sysctl.
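The shape of such a report can be approximated in Python. This sketch works on a flat list of free/allocated flags rather than the radix tree, and the function names are hypothetical; only the Fibonacci-bounded size ranges follow the description above.

```python
from bisect import bisect_right
from collections import Counter
from itertools import groupby


def fib_bounds(limit: int) -> list[int]:
    """Lower bounds 1, 2, 3, 5, 8, ... up to the first value above limit."""
    bounds = [1, 2]
    while bounds[-1] <= limit:
        bounds.append(bounds[-1] + bounds[-2])
    return bounds


def size_range(size: int, bounds: list[int]) -> tuple[int, int]:
    """Return the (low, high) range that a free interval of this size falls in."""
    i = bisect_right(bounds, size) - 1
    return bounds[i], bounds[i + 1] - 1


def free_interval_histogram(free_map: list[int]) -> Counter:
    """Count maximal runs of free blocks (1 = free) per size range."""
    runs = [len(list(group)) for is_free, group in groupby(free_map) if is_free]
    bounds = fib_bounds(max(runs, default=1))
    return Counter(size_range(run, bounds) for run in runs)


free_map = [int(c) for c in "1011100111101"]
print(free_interval_histogram(free_map))
```

For the sample map the maximal free intervals have sizes 1, 3, 4 and 1, so two fall in the range of size 1 and two in the range 3 to 4.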
The analysis of the radix tree frequently computes the position of the lone
bit set in a u_daddr_t, a computation that also appears in leaf allocation.
That computation has been moved into a function of its own, and optimized
for cases where an inlined machine instruction can replace the usual binary
search.
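For comparison, here is the binary search next to a single-operation equivalent, as a Python sketch. Python's int.bit_length() stands in for the inlined machine instruction, and the function names and the 64-bit width are assumptions.

```python
def lone_bit_pos_search(mask: int, width: int = 64) -> int:
    """Binary search for the index of the single set bit in mask."""
    if mask == 0 or mask & (mask - 1):
        raise ValueError("mask must have exactly one bit set")
    lo, hi = 0, width  # invariant: the set bit lies in [lo, hi)
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if mask >> mid:  # some bit at or above mid is set
            lo = mid
        else:
            hi = mid
    return lo


def lone_bit_pos_fast(mask: int) -> int:
    """Same result in one step, where a suitable primitive is available."""
    if mask == 0 or mask & (mask - 1):
        raise ValueError("mask must have exactly one bit set")
    return mask.bit_length() - 1


print(lone_bit_pos_search(1 << 37), lone_bit_pos_fast(1 << 37))
```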
Submitted by: Doug Moore <dougm@rice.edu>
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D11906
it to a random value between 100 and 1123, rather than 0 as before.
Submitted by: Marie Helene Kvello-Aune <marieheleneka@gmail.com>
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D5336