This used to work by accident with ld.bfd even though always_keepalive
was marked as static. LLD honors static more correctly, so export this
variable properly (including moving it into the tcp_* namespace).
Reviewed by: bz, emaste
MFC after: 2 weeks
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D14129
Some ABIs have large gaps in syscall numbers. Allow gaps to be filled
as ranges of UNIMPL, with an entry like:
248-1023 AUE_NULL UNIMPL unimplemented
Reviewed by: jhb, gnn
Sponsored by: Turing Robotic Industries Inc.
Differential Revision: https://reviews.freebsd.org/D14122
checks to recognize own network devices when using mlx5ib. This patch fixes
an issues where mlx5ib fails to recognize mceX network devices for use with
RoCE.
MFC after: 1 week
Sponsored by: Mellanox Technologies
host to reprobe the bus by switching the USB pull up resistors off and
back on. In other words - when FreeBSD is configured as a USB device,
changing the sysctl will be immediately noticed by the machine it's
connected to.
Reviewed by: hselasky@
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
net80211/ieee80211_ageq.c was present twice in sys/conf/files so leave the
correctly sorted one. dev/wpi/if_wpi.c was present in sys/conf/files as well
as sys/conf/files.amd64 and sys/conf/files.i386 so prefer the sys/conf/files
entry.
Reviewed by: allanjude, rstone
Restore state 6. Many of the UNH tests end up exercising this
state, where we have a new neighbor cache entry and a new link-layer
entry is being created for it. The link-layer address is currently
unknown so the initial state of the "llentry" should remain initialized
to ND6_LLINFO_NOSTATE so that the ND code will send a solicitation.
Setting this to ND6_LLINFO_STALE implies that the link-level entry
is valid and can be used (but needs to be refreshed via the Neighbor
Unreachability state machine).
https://forums.freebsd.org/threads/64287/
Submitted by: Farrell Woods <Farrell_Woods@Dell.com>
Reviewed by: mjoras, dab, ae
MFC after: 1 week
Sponsored by: Dell EMC
Differential Revision: https://reviews.freebsd.org/D14059
When mbuf has M_FASTFWD_OURS flag, this means that a destination address
is our local, but we still need to pass scope zone violation check,
because protocol level expects that IPv6 link-local addresses have
embedded scope zone indexes. This should fix the problem, when ipfw is
used to forward packets to local address and source address of a packet
is IPv6 LLA.
Reported by: sbruno
MFC after: 3 weeks
When an interface has IFF_LOOPBACK flag in6_ifattach() tries to assing
IPv6 loopback address to this interface. It uses in6ifa_ifpwithaddr()
to check, that interface doesn't already have given address and then
uses in6_ifattach_loopback(). If in6_ifattach_loopback() fails, it just
exits and thus skips assignment of IPv6 LLA.
Fix this using in6ifa_ifwithaddr() function. If IPv6 loopback address is
already assigned in the system, do not call in6_ifattach_loopback().
PR: 138678
MFC after: 3 weeks
It turns out that under some circumstances we can get DSI or DSE before we set
LPCR and LPID so we should set it as early as possible.
Authored by: Patryk Duda <pdk@semihalf.com>
Submitted by: Wojciech Macek <wma@semihalf.com>
Obtained from: Semihalf
Sponsored by: IBM, QCM Technologies
On CHRP and PowerNV, use the interrupt server number in the cpuref and pcpu
hwref field instead of the device-tree phandle and make the CPU IDs reported
to the scheduler dense and with the BSP at 0.
Submitted by: Wojciech Macek <wma@semihalf.com>
Obtained from: Semihalf
Sponsored by: IBM, QCM Technologies
Differential revision: https://reviews.freebsd.org/D14011
Cleaning up AP startup routines. This is a mix of changes
required to make PowerNV running and to modify the code
to be more robust. Previously, some races were seen if more
than 90CPUs were online.
Authored by: Patryk Duda <pdk@semihalf.com>
Submitted by: Wojciech Macek <wma@semihalf.com>
Obtained from: Semihalf
Sponsored by: IBM, QCM Technologies
Differential revision: https://reviews.freebsd.org/D14026
used with hashed page tables on AIM and place it into a new, modular pmap
function called pmap_decode_kernel_ptr(). This function is the inverse
of pmap_map_user_ptr(). With POWER9 radix tables, which mapping to use
becomes more complex than just AIM/BOOKE and it is best to have it in
the same place as pmap_map_user_ptr().
Reviewed by: jhibbits
gone_in(majar, msg); If we're running in FreeBSD major, tell
the user this code may be deleted soon.
If we're running in FreeBSD major - 1,
the the user is deprecated and will
be gone in major.
Otherwise say nothing.
gone_in_dev(dev, major, msg) Just like gone_in, except use device_printf.
New tunable / sysctl debug.oboslete_panic: 0 - don't panic,
1 - panic in major or newer , 2 - panic in major - 1 or newer
default: 0
if NO_OBSOLETE_CODE is defined, then both of these turn into compile
time errors when building for major. Add options NO_OBSOLETE_CODE to
kernel build system.
This lets us tag code that's going away so users know it will be gone,
as well as automatically manage things.
Differential Review: https://reviews.freebsd.org/D13818
optimize away these loops. Change boolean to int to match what atomic
API supplies. Remove wmb() since the atomic_store_rel() on status.done
ensure the prior writes to status. It also fixes the fact that there
wasn't a rmb() before reading done. This should also be more efficient
since wmb() is fairly heavy weight.
Sponsored by: Netflix
Reviewed by: kib@, jim harris
Differential Revision: https://reviews.freebsd.org/D14053
There's no reason not to build modules for 64-bit QorIQ devices. This
config has evolved to be analogous to the AIM GENERIC64 kernel, so will grow
to match it in more ways as well.
Summary:
Rather than duplicating the checks for programmatic traps all over the code, put
it all in one function. This helps to remove some of the #ifdefs between AIM
and Book-E.
Reviewed By: nwhitehorn
Differential Revision: https://reviews.freebsd.org/D14082
- pmap_enter_object() can be used for mapping of executable pages, so it's
necessary to handle I-cache synchronization within it.
- Fix race in I-cache synchronization in pmap_enter(). The current code firstly
maps given page to target VA and then do I-cache sync on it. This causes
race, because this mapping become visible to other threads, before I-cache
is synced.
Do sync I-cache firstly (by using DMAP VA) and then map it to target VA.
- ARM64 ARM permits implementation of aliased (AIVIVT, VIPT) I-cache, but we
can use different that final VA for flushing it. So we should use full
I-cache flush on affected platforms. For now, and as temporary solution,
use full flush always.
Previously, MAKESYSPATH as well as '-m' directives in MAKEFLAGS would cause
any port rebuilt during the PORTS_MODULES stage to consume system makefiles
from $(SRCROOT)/share/mk instead of those installed under /usr/share/mk.
For kernel modules that need to build against an updated src tree this
makes sense; less so for <bsd.port.mk> or any userspace library or utility
the port may also happen to install.
Before 11.0, this probably didn't matter much in practice. But the addition
of src.libnames.mk under $(SRCROOT)/share/mk in 11.0 breaks any consumer of
bsd.prog.mk and DPADD/LDADD during PORTS_MODULES.
Address the build breakage by removing MAKESYSPATH and any occurrence of
'-m' from MAKEFLAGS in the environment created for the port build.
Instead set SYSDIR so that any kmod built by the port will still consume
conf/kmod.mk from the updated src tree, assuming it uses <bsd.kmod.mk>
Reviewed by: bdrewery
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D13053
Sanitize the values that will be assigned to ncookies so that we ensure
they are sane and we can handle them.
Let ncookies signed as it was before r328346. The valid range is such
that unsigned values are not required and we are not able to avoid at
least one cast anyways.
Hinted by: bde
Use PCID to avoid complete TLB shootdown when switching between user
and kernel mode with PTI enabled.
I use the model close to what I read about KAISER, user-mode PCID has
1:1 correspondence to the kernel-mode PCID, by setting bit 11 in PCID.
Full kernel-mode TLB shootdown is performed on context switches, since
KVA TLB invalidation only works in the current pmap. User-mode part of
TLB is flushed on the pmap activations as well.
Similarly, IPI TLB shootdowns must handle both kernel and user address
spaces for each address. Note that machines which implement PCID but
do not have INVPCID instructions, cause the usual complications in the
IPI handlers, due to the need to switch to the target PCID temporary.
This is racy, but because for PCID/no-INVPCID we disable the
interrupts in pmap_activate_sw(), IPI handler cannot see inconsistent
state of CPU PCID vs PCPU pmap/kcr3/ucr3 pointers.
On the other hand, on kernel/user switches, CR3_PCID_SAVE bit is set
and we do not clear TLB.
I can imagine alternative use of PCID, where there is only one PCID
allocated for the kernel pmap. Then, there is no need to shootdown
kernel TLB entries on context switch. But copyout(3) would need to
either use method similar to proc_rwmem() to access the userspace
data, or (in reverse) provide a temporal mapping for the kernel buffer
into user mode PCID and use trampoline for copy.
Reviewed by: markj (previous version)
Tested by: pho
Discussed with: alc (some aspects)
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
Differential revision: https://reviews.freebsd.org/D13985
When PTI is enabled, empty IDT slots point to rsvd_pti.
Reported by: Dexuan-BSD Cui <dexuan.bsd@gmail.com>
Sponsored by: The FreeBSD Foundation
MFC after: 5 days
Similarly as we already do for arm64, for mitigation is necessary to
flush branch predictor when we:
- do task switch
- receive prefetch abort on non-userspace address
The user can disable this mitigation by setting 'machdep.disable_bp_hardening'
sysctl variable, or it can check actual system status by reading
'machdep.spectre_v2_safe'
The situation is complicated by fact that:
- for Cortex-A8, the BPIALL instruction is effectively NOP until the IBE bit
in ACTLR is set.
- for Cortex-A15, the BPIALL is always NOP. The branch predictor can be
only flushed by doing ICIALLU with special bit (Enable invalidates of BTB)
set in ACTLR.
Since access to the ACTLR register is locked to secure monitor/firmware on
most boards, they will also need update of firmware / U-boot.
In worst case, when secure monitor is on-chip ROM (e.g. PandaBoard),
the board is unfixable.
MFC after: 2 weeks
Reviewed by: imp, emaste
Differential Revision: https://reviews.freebsd.org/D13931
- special fault handling for break-before-make mechanism should be also
applied for instruction translation faults, not only for data translation
faults.
- since arm64_address_translate_...() functions are not atomic,
use these with disabled interrupts.
Apply r328361 to duplicate copy of ccr_gcm_soft in ccp(4).
Properly honor the lack of the CRD_F_IV_PRESENT flag in the GCM software
fallback case for encryption requests.
Create a struct cryptop_data which contains state needed for a single
symmetric crypto operation and move that state out of the session. This
closes a race with the CRYPTO_F_DONE flag that can result in use after
free.
While here, remove the 'cse->error' member. It was just a copy of
'crp->crp_etype' and cryptodev_op() and cryptodev_aead() checked both
'crp->crp_etype' and 'cse->error'. Similarly, do not check for an
error from mtx_sleep() since it is not used with PCATCH or a timeout
so cannot fail with an error.
PR: 218597
Reviewed by: kib
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D13928
here. Return ENOMEM when we can't malloc a buffer for the DSM
TRIM. This should fix the WITNESS warnings similar to the following:
uma_zalloc_arg: zone "16" with the following non-sleepable locks held:
exclusive sleep mutex CAM device lock (CAM device lock) r = 0 (0xfffff800080c34d0) locked @ /usr/src/sys/cam/nvme/nvme_da.c:351
Reviewed by: scottl@
Sponsored by: Netflix
I suppose it should make this code NUMA-aware with recent NUMA drop-in,
trying to allocate shared memory buffers from domain closer to NT-bridge.
MFC after: 2 weeks
appeared on UFS/FFS filesystems. In some cases it was promptly followed
by a panic of "softdep_deallocate_dependencies: dangling deps". This fix
should eliminate both of these occurences.
Submitted by: Andreas Longwitz <longwitz at incore.de>
Reviewed by: kib
Tested by: Peter Holm (pho)
PR: 225423
MFC after: 1 week
the "power down" watchdog used by the ROM boot code is still active when the
regular watchdog is activated, turn off the power-down watchdog.
This adds support for the "fsl,ext-reset-output" FDT property. When
present, that property indicates that a chip reset is accomplished by
asserting the WDOG1_B external signal, which is supposed to trigger some
external component such as a PMIC to ready the hardware for reset (for
example, adjusting voltages from idle to full-power levels), and assert the
POR signal to SoC when ready. To guard against misconfiguation leading to a
non-rebootable system, the external reset signal is backstopped by code
that asserts a normal internal chip reset if nothing responds to the
external reset signal within one second.
in the LinuxKPI. This is done by calling finit() just before returning a magic
value of ENXIO in the "linux_dev_fdopen" function.
The Linux file structure should mimic the BSD file structure as much as
possible. This patch decouples the Linux file structure from the belonging
character device right after the "linux_dev_fdopen" function has returned.
This fixes an issue which allows a Linux file handle to exist after a
character device has been destroyed and removed from the directory index
of /dev. Only when the reference count of the BSD file handle reaches zero,
the Linux file handle is destroyed. This fixes use-after-free issues related
to accessing the Linux file structure after the character device has been
destroyed.
While at it add a missing NULL check for non-present file operation.
Calling a NULL pointer will result in a segmentation fault.
Reviewed by: kib @
MFC after: 1 week
Sponsored by: Mellanox Technologies
In a corner case we could fall into OOB error.
Authored by: Patryk Duda <pdk@semihalf.com>
Submitted by: Wojciech Macek <wma@semihalf.com>
Obtained from: Semihalf
Sponsored by: IBM, QCM Technologies
Specifically reading is done if ffs_sbget() and writing is done
in ffs_sbput(). These functions are exported to libufs via the
sbget() and sbput() functions which then used in the various
filesystem utilities. This work is in preparation for adding
subperblock check hashes.
No functional change intended.
Reviewed by: kib
virtual address sizes
Summary:
Some architectures use physical addresses larger than virtual. This is the
minimal changeset needed to get CAM/CTL to build on these targets. No
functional changes. More changes would likely be needed for this to be fully
functional on said platforms, but they can be made when needed.
Reviewed By: mav, chuck
Differential Revision: https://reviews.freebsd.org/D14041
than virtual
Summary:
Some architectures have physical/bus addresses that are much larger
than virtual addresses. This change just quiets a warning, as DMAP is not used
on those architectures, and on 64-bit platforms uintptr_t is the same size as
vm_paddr_t and void *.
Reviewed By: hselasky
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D14043
Mechanically replace uses of MALLOC/FREE with appropriate invocations of
malloc(9) / free(9) (a series of sed expressions). Something like:
* MALLOC(a, b, ... -> a = malloc(...
* FREE( -> free(
* free((caddr_t) -> free(
No functional change.
For now, punt on modifying contrib ipfilter code, leaving a definition of
the macro in its KMALLOC().
Reported by: jhb
Reviewed by: cy, imp, markj, rmacklem
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14035
leaks. We assume each source can be taken / dropped only once and
don't recurse. These are only enabled via DA_TRACK_REFS or
INVARIANTS. There appreas to be a reference leak under extreme load,
and these should help us colaberatively work it out. It also documents
better the reference / holding protocol better.
Reviewed by: ken@, scottl@
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D14040
device still wind up in xpt_done after the path has been
invalidated. Since we don't always need sim or devq, add some guard
rails to only fail if we have to use them.
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D14040
bottom of the file, where it is in most imx5/6 drivers. Switch from an RD2
macro using bus_space_read_2() to an inline function using bus_read_2();
likewise for WR2. Use RESOURCE_SPEC_END to end the resource_spec list.
Net effect should be no functional changes.
The header passed to these probes has some fields converted to host
order by tcp_fields_to_host(), so the tcpinfo_t translator doesn't do
what we want.
Submitted by: Hannes Mehnert
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D12647
Sometimes 32 bit and 64 bit ioctls are represented by the same number.
It causes unnecessary switch to 32 bit commpatible mode.
This patch prevents switching when we are dealing with 64 bit executable.
It fixes issue mentioned here
Authored by: Patryk Duda <pdk@semihalf.com>
Submitted by: Wojciech Macek <wma@semihalf.com>
Reviewed by: andrew, wma
Obtained from: Semihalf
Sponsored by: IBM, QCM Technologies
Differential revision: https://reviews.freebsd.org/D14023
Build with rtl8366rb has been broken due to incorrect retrieval of pointer
to device_t.
Reported by: lwhsu
Differential Revision: https://reviews.freebsd.org/D14044
This patch is cosmetic. It checks if allocation of ifnet structure failed.
It's better to have this check rather than assume positive scenario.
Submitted by: Dmitry Luhtionov <dmitryluhtionov@gmail.com>
Reported by: Dmitry Luhtionov <dmitryluhtionov@gmail.com>
Properly honor the lack of the CRD_F_IV_PRESENT flag in the GCM
software fallback case for encryption requests.
Submitted by: Harsh Jain @ Chelsio
Sponsored by: Chelsio Communications
In particular, this avoids edge cases where a generated IV might be
written into the output buffer even though the request is failed with
an error.
Sponsored by: Chelsio Communications
- Extend ccr_gcm_soft() to handle requests with a non-empty payload.
While here, switch to allocating the GMAC context instead of placing
it on the stack since it is over 1KB in size.
- Allow ccr_gcm() to return a special error value (EMSGSIZE) which
triggers a fallback to ccr_gcm_soft(). Move the existing empty
payload check into ccr_gcm() and change a few other cases
(e.g. large AAD) to fallback to software via EMSGSIZE as well.
- Add a new 'sw_fallback' stat to count the number of requests
processed via the software fallback.
Submitted by: Harsh Jain @ Chelsio (original version)
Sponsored by: Chelsio Communications
This works around an issue in the T6 that can result in DMA engine
stalls if an error occurs while processing a DSGL entry with a length
larger than 2KB.
Submitted by: Harsh Jain @ Chelsio
Sponsored by: Chelsio Communications
Most crypto requests will not trigger this condition, but a request
with a highly-fragmented data buffer (and a resulting "large" S/G
list) could trigger it.
Sponsored by: Chelsio Communications
The T6 can hang when processing certain AEAD requests if the request
sets a flag asking the crypto engine to discard the input IV and AAD
rather than copying them into the output buffer. The existing driver
always discards the IV and AAD as we do not need it. As a workaround,
allocate a single "dummy" buffer when the ccr driver attaches and
change all AEAD requests to write the IV and AAD to this scratch
buffer. The contents of the scratch buffer are never used (similar to
"bogus_page"), and it is ok for multiple in-flight requests to share
this dummy buffer.
Submitted by: Harsh Jain @ Chelsio (original version)
Sponsored by: Chelsio Communications
The T6 crypto engine's control messages only support a total AAD
length (including the prefixed IV) of 511 bytes. Reject requests with
large AAD rather than returning incorrect results.
Sponsored by: Chelsio Communications
Combined authentication-encryption and GCM requests already stored the
IV in the immediate explicitly. This extends this behavior to block
cipher requests to work around a firmware bug. While here, simplify
the AEAD and GCM handlers to not include always-true conditions.
Submitted by: Harsh Jain @ Chelsio
Sponsored by: Chelsio Communications
Fix a vulnerability in IPsec-IPv6-AH, that allows an attacker to remotely
crash the kernel with a single packet.
In this loop we need to increment 'ad' by two, because the length field
of the option header does not count the size of the option header itself.
If the length is zero, then 'count' is incremented by zero, and there's
an infinite loop. Beyond that, this code was written with the assumption
that since the IPv6 packet already went through the generic IPv6 option
parser, several fields are guaranteed to be valid; but this assumption
does not hold because of the missing '+2', and there's as a result a
triggerable buffer overflow (write zeros after the end of the mbuf,
potentially to the next mbuf in memory since it's a pool).
Add the missing '+2', this place will be reinforced in separate commits.
Reported by: Maxime Villard <maxv at NetBSD.org>
MFC after: 1 week
When allocating memory through malloc(9), we always expect the amount of
memory requested to be unsigned as a negative value would either stand for
an error or an overflow.
Unsign some values, found when considering the use of mallocarray(9), to
avoid unnecessary casting. Also consider that indexes should be of
at least the same size/type as the upper limit they pretend to index.
MFC after: 2 weeks
in the LinuxKPI. The old implementation assumed only one IDR layer was present.
Take additional IDR layers into account when computing the "id" value.
MFC after: 1 week
Found by: Karthik Palanichamy <karthikp@chelsio.com>
Tested by: Karthik Palanichamy <karthikp@chelsio.com>
Sponsored by: Mellanox Technologies
specified in the arg1 into ICMPv6 destination unreachable code according
to RFC7915.
Obtained from: Yandex LLC
MFC after: 2 weeks
Sponsored by: Yandex LLC
Added CTLFLAG_VNET to net.link.lagg.lacp.default_strict_mode which was missed
in r290450.
Reported by: julian@
MFC after: 1 week
Sponsored by: Multiplay
Fix a bug when the system has no CPU 0. When created, threads were implicitly assigned to CPU 0.
This had no practical effect since a real CPU was chosen immediately by the scheduler. However,
on systems without a CPU 0, sched_ule attempted to access the scheduler queue of the "old" CPU
when assigned the initial choice of the old one. This caused an attempt to use illegal memory
and a crash (or, more usually, a deadlock). Fix this by assigned new threads to the BSP
explicitly and add some asserts to see that this problem does not recur.
Authored by: Nathan Whitehorn <nwhitehorn@freebsd.org>
Submitted by: Wojciech Macek <wma@semihalf.com>
Obtained from: Semihalf
Differential revision: https://reviews.freebsd.org/D13932
the first mbuf of the reassembled datagram should have a pkthdr.
This was discovered with cxgbe(4) + IPSEC + ping with payload more than
interface MTU. cxgbe can generate !M_WRITEABLE mbufs and this results
in m_unshare being called on the reassembled datagram, and it complains:
panic: m_unshare: m0 0xfffff80020f82600, m 0xfffff8005d054100 has M_PKTHDR
PR: 224922
Reviewed by: ae@
MFC after: 1 week
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D14009
pf_unlink_state() releases a reference to the state without checking if
this is the last reference. It can't be, because pf_state_insert()
initialises it to two. KASSERT() that this is always the case.
CID: 1347140
The driver now ensures only one thread at a time is running in the API
functions (clock_gettime() and clock_settime()) by specifically requesting
ownership of the i2c bus without using IIC_RECURSIVE, then it does all IO
using IIC_RECURSIVE so that each individual IO operation doesn't try to
re-acquire the bus.
The other IO done by the driver happens at attach or intr_config_hooks time,
when there can't be multiple threads running with the same device instance.
So, the IIC_RECURSIVE flag can be safely ORed into the wait flags for all IO
done by the driver, because it's all either done in a single-threaded
environment, or protected within a block bounded by explict
iicbus_acquire_bus() and iicbus_release_bus() calls.
The driver now ensures only one thread at a time is running in the API
functions (clock_gettime() and clock_settime()) by specifically requesting
ownership of the i2c bus without using IIC_RECURSIVE, then it does all IO
using IIC_RECURSIVE so that each individual IO operation doesn't try to
re-acquire the bus.
The other IO done by the driver happens at attach or intr_config_hooks time,
when there can't be multiple threads running with the same device instance.
So, the IIC_RECURSIVE flag can be safely ORed into the wait flags for all IO
done by the driver, because it's all either done in a single-threaded
environment, or protected within a block bounded by explict
iicbus_acquire_bus() and iicbus_release_bus() calls.
The recursive ownership support added in r321584 was unconditionally in
effect all the time -- whenever a given i2c slave device instance tried to
lock the i2c bus for exclusive use when it already owned the bus, the call
returned immediately without waiting. However, many i2c slave drivers use
bus ownership to enforce that only a single thread at a time can be using
the slave device. The recursive locking changes broke this use case.
Now there is a new flag, IIC_RECURSIVE, which can be mixed in with the
other flags passed to iicbus_acquire_bus() to allow drivers to indicate
when recursive locking is desired. Using the flag implies that the driver
is managing concurrent access to the device by different threads in some way.
This immediately fixes all existing i2c slave drivers except for the two
i2c RTC drivers which use the recursive locking feature; those will be
fixed in a followup commit.
These files previously had a 3-clause license and 'THE REGENTS' text.
Switch to standard 2-clause text with kib's approval, and add the SPDX
tag.
Approved by: kib
MSI/MSI-x interrupts are edge-triggered. If an interrupt
arrives when IRQ line is masked, it will be lost and will
never recover. Perform MSI_EOI always after unmask to give
a chance for PHB/XICS to send an interrupt again if MSI/MSI-x
pending bit is set in MSI/MSI-x BAR space.
Submitted by: Wojciech Macek <wma@semihalf.org>
Obtained from: Semihalf
Sponsored by: IBM, QCM Technologies
Increment the route table generation count after modifying a
route. This signals back to TCP connections that they need to
update their L2 caches as the gateway for their route may have
changed. This is a heavier hammer than is needed, strictly
speaking, but route changes will be unlikely enough that the
performance effects of invalidating all connection route caches
should be negligible.
MFC after: 1 week
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13990
Reviewed by: karels
Add a new macro to clear both the L3 and L2 route caches, to
hopefully prevent future instances where only the L3 cache was
cleared when both should have been.
MFC after: 1 week
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13989
Reviewed by: karels
When the inpcb route cache is invalidated after a change to the
routing tables, we need to invalidate the LLE cache as well.
Previous to this change packets for the connection would continue
to use the old L2 information from the old L3 gateway, and the
packets for the connection would likely be blackholed.
MFC after: 1 week
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13988
Reviewed by: karels
Commits r326203 and r326978 broke 64-bit booke kernels by introducing a 1MB
zero-pad between the ELF header and the start of the kernel. This didn't
cause a build failure, but caused kernels to need to be loaded into memory
1MB lower, which could easily break scripts expecting previous behavior.
This change matches the similar change made to AIM in r327358.
In iflib, the device-specific init() function isn't supposed to edit
the struct ifnet driver flags. If it does, it'll cause an MPASS() assert
in iflib to fail.
PR: 225312
Reported by: bhughes@
Route messages are aligned to the host long type alignment, which
breaks 32bit.
Reported and tested by: lwhsu
Diagnosed by: Yuri Pankov <yuripv@icloud.com>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
These functions deal the same type of overflows we do with mallocarray(9).
Using our mallocarray will panic, which different from the previous
behavior (returning NULL), but neither behavior is more correct.
As a sidenote, drm_calloc_large() is not currently used at all.
Reviewed by: dumbbell
Differential Revision: https://reviews.freebsd.org/D13835
instead of frobbing the registers directly.
As a hack the bcm2835_pwm kmod presently ignores the 'status="disabled"'
in the RPI3 DTB, assuming that if you load the kld you probably
want the PWM to work.
illumos/illumos-gate@5cb8d943bchttps://www.illumos.org/issues/8835:
Sequential reads not aligned to block size are not detected by ZFS
prefetcher as sequential, killing prefetch and severely hurting
performance. It is caused by dmu_zfetch() in case of misaligned
sequential accesses being called with overlap of one block.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Allan Jude <allanjude@freebsd.org>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Alexander Motin <mav@FreeBSD.org>
illumos/illumos-gate@4ae5f5f06chttps://www.illumos.org/issues/8652:
Clang and GCC prefer to use unsigned ints to store enums. With Clang, that
causes tautological comparison warnings when comparing a zfs_prop_t or
zpool_prop_t variable to the macro ZPROP_INVAL. It's likely that error
handling code is being silently removed as a result.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Alan Somers <asomers@gmail.com>
illumos/illumos-gate@301fd1d6f2
Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Sean Eric Fagan <sef@ixsystems.com>
illumos/illumos-gate@01a059ee0chttps://www.illumos.org/issues/8856:
arc_cksum_is_equal() calls zio_push_transform() that requires abd_t*
(second arg), but a void* is passed.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Roman Strashkin <roman.strashkin@nexenta.com>
When allocating memory through malloc(9), we always expect the amount of
memory requested to be unsigned as a negative value would either stand for
an error or an overflow.
Unsign some values, found when considering the use of mallocarray(9), to
avoid unnecessary casting. Also consider that indexes should be of
at least the same size/type as the upper limit they pretend to index.
MFC after: 3 weeks
8930 zfs_zinactive: do not remove the node if the filesystem is readonly
illumos/illumos-gate@93c618e0f4https://www.illumos.org/issues/8930:
We normally remove an unlinked node when its last user goes away and the
node becomes inactive. However, we should not do that if the filesystem
is mounted read-only including the case where it has its readonly
property set. The node will remain on the unlinked queue, so it will
not be leaked.
One particular scenario is when we receive an incremental stream into a
mounted read-only filesystem and that stream contains an unlinked file
(still on the unlinked queue). If that file is opened before the
receive and some time later after the receive it becomes inactive we
would remove it and, thus, modify the read-only filesystem. As a
result, the filesystem would diverge from its source and further
incremental receives would not be possible (without forcing a rollback).
Another related scenario, that may or may not be possible depending on an
OS / VFS policy, is when an open file is unlinked, then the filesystem is
remounted read-only, and then the file is closed.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Andriy Gapon <avg@FreeBSD.org>
illumos/illumos-gate@94ddd0900ahttps://www.illumos.org/issues/8909:
There's a race condition that exists if `zil_free_lwb` races with either
`zil_commit_waiter_timeout` and/or `zil_lwb_flush_vdevs_done`.
Here's an example panic due to this bug:
> ::status
debugging crash dump vmcore.0 (64-bit) from ip-10-110-205-40
operating system: 5.11 dlpx-5.2.2.0_2017-12-04-17-28-32b6ba51fb (i86pc)
image uuid: 4af0edfb-e58e-6ed8-cafc-d3e9167c7513
panic message:
BAD TRAP: type=e (#pf Page fault) rp=ffffff0010555970 addr=60 occurred in mo
dule "zfs" due to a NULL pointer dereference
dump content: kernel pages only
> $c
zio_shrink+0x12()
zil_lwb_write_issue+0x30d(ffffff03dcd15cc0, ffffff03e0730e20)
zil_commit_waiter_timeout+0xa2(ffffff03dcd15cc0, ffffff03d97ffcf8)
zil_commit_waiter+0xf3(ffffff03dcd15cc0, ffffff03d97ffcf8)
zil_commit+0x80(ffffff03dcd15cc0, 9a9)
zfs_write+0xc34(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0)
fop_write+0x5b(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0)
write+0x250(42, fffffd7ff4832000, 2000)
sys_syscall+0x177()
If there's an outstanding lwb that's in `zil_commit_waiter_timeout`
waiting to timeout, waiting on it's waiter's CV, we must be sure not to
call `zil_free_lwb`. If we end up calling `zil_free_lwb`, then that LWB
may be freed and can result in a use-after-free situation where the
stale lwb pointer stored in the `zil_commit_waiter_t` structure of the
thread waiting on the waiter's CV is used.
A similar situation can occur if an lwb is issued to disk, and thus in
the `LWB_STATE_ISSUED` state, and `zil_free_lwb` is called while the
disk is servicing that lwb. In this situation, the lwb will be freed by
`zil_free_lwb`, which will result in a use-after-free situation when the
lwb's zio completes, and `zil_lwb_flush_vdevs_done` is called.
This race condition is prevented in `zil_close` by calling `zil_commit`
before `zil_free_lwb` is called, which will ensure all outstanding (i.e.
all lwb's in the `LWB_STATE_OPEN` and/or `LWB_STATE_ISSUED` states)
reach the `LWB_STATE_DONE` state before the lwb's are freed
(`zil_commit` will not return untill all the lwb's are
`LWB_STATE_DONE`).
Further, this race condition is prevented in `zil_sync` by only calling
`zil_free_lwb` for lwb's that do not have their `lwb_buf` pointer set.
All lwb's not in the `LWB_STATE_DONE` state will have a non-null value
for this pointer; the pointer is only cleared in
`zil_lwb_flush_vdevs_done`, at which point the lwb's state will be
changed to `LWB_STATE_DONE`.
This race is present in `zil_suspend`, leading to this bug.
At first glance, it would appear as though this would not be true
because `zil_suspend` will call `zil_commit`, just like `zil_close`, but
the problem is that `zil_suspend` will set the zilog's `zl_suspend`
field prior to calling `zil_commit`. Further, in `zil_commit`, if
`zl_suspend` is set, `zil_commit` will take a special branch of logic
and use `txg_wait_synced` instead of performing the normal `zil_commit`
logic.
This call to `txg_wait_synced` might be good enough for the data to
reach disk safely before it returns, but it does not ensure that all
outstanding lwb's reach the `LWB_STATE_DONE` state before it returns.
This is because, if there's an lwb "stuck" in
`zil_commit_waiter_timeout`, waiting for it's lwb to timeout, it will
maintain a non-null value for it's `lwb_buf` field and thus `zil_sync`
will not free that lwb. Thus, even though the lwb's data is already on
disk, the lwb will be left lingering, waiting on the CV, and will
eventually timeout and be issued to disk even though the write is
unnesseary.
So, after `zil_commit` is called from `zil_suspend`, we incorrectly
assume that there are not outstanding lwb's, and proceed to free all
lwb's found on the zilog's lwb list. As a result, we free the lwb that
will later be used `zil_commit_waiter_timeout`.
Reviewed by: John Kennedy <jwk404@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Prakash Surya <prakash.surya@delphix.com>
illumos/illumos-gate@cf07d3da99https://www.illumos.org/issues/8603:
To help make the ZIL's code more understandable, it was suggested that
the zilog_t's "zl_writer_lock" field should be renamed to "zl_issuer_lock".
Reviewed by: C Fraire <cfraire@me.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Prakash Surya <prakash.surya@delphix.com>
illumos/illumos-gate@a3b2868063https://www.illumos.org/issues/8677
We want to be able to run channel programs outside of synching context.
This would greatly improve performance of channel program that just gather
information, as we won't have to wait for synching context anymore.
This feature should introduce the following:
- A new command line flag in "zfs program" to specify our intention to
run in open context.
- A new flag/option within the channel program ioctl which selects the
context.
- Appropriate error handling whenever we try a channel program in
open-context that contains zfs.sync* expressions.
- Documentation for the new feature in the manual pages.
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>
At least on GCC7 calling __alloc_size(x) twice is not equivalent to
calling using the attribute once with two arguments. The later is the
documented use in GCC documentation so add a new alloc_size(n, x)
alternative to cover for the few places where it is used: basically:
calloc(3), reallocarray(3) and mallocarray(9).
Submitted by: Mark Millard
MFC after: 3 days
Reference:
http://docs.freebsd.org/cgi/mid.cgi?F227842D-6BE2-4680-82E7-07906AF61CD7