Do this even for non-transparent mode VF. Better safe than sorry.
MFC after: 3 days
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D11981
- Update hn(4)'s stats properly for non-transparent mode VF.
- Allow BPF tapping to hn(4) for non-transparent mode VF.
- Don't setup mbuf hash, if 'options RSS' is set.
In Azure, when VF is activated, TCP SYN and SYN|ACK go through hn(4)
while the rest of segments and ACKs belonging to the same TCP 4-tuple
go through the VF. So don't setup mbuf hash, if a VF is activated
and 'options RSS' is not enabled. hn(4) and the VF may use neither
the same RSS hash key nor the same RSS hash function, so the hash
value for packets belonging to the same flow could be different!
- Disable LRO.
hn(4) will only receive broadcast packets, multicast packets, TCP
SYN and SYN|ACK (in Azure), LRO is useless for these packet types.
For non-transparent, we definitely _cannot_ enable LRO at all, since
the LRO flush will use hn(4) as the receiving interface; i.e.
hn_ifp->if_input(hn_ifp, m).
While I'm here, remove unapplied comment and minor style change.
MFC after: 3 days
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D11978
While, I'm here add comment about why updating VF's imcast stat is
not necessary.
MFC after: 3 days
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D11948
setting up the timer fails, because on some types of chips that's the
first attempt to access the device. If the chip is missing/non-responsive
then you'd get a driver that attached and didn't register the rtc, with
no clue about why. On other chip types there are inits that come before
timer setup, and they already print messages about errors.
- Add FDT probe code.
- Do i2c transfers with exclusive bus ownership.
- Use config_intrhook_oneshot() to defer chip setup because some i2c
busses can't do transfers without interrupts.
- Add a detach() routine.
- Add to module build.
This driver supports only basic timekeeping functionality. It completely
replaces the ds133x driver. It can also replace the ds1374 driver, but that
will take a few other changes in MIPS code and config, and will be committed
separately. It does NOT replace the existing ds1307 driver, which provides
access to some of the extended features on the 1307 chip, such as controlling
the square wave output signal. If both ds1307 and ds13rtc drivers are
present, the ds1307 driver will outbid and win control of the device.
This driver can be configured with FDT data, or by using hints on non-FDT
systems. In addition to the standard hints for i2c devices, it requires
a "chiptype" string of the form "dallas,ds13xx" where 'xx' is the chip id
(i.e., the same format as FDT compat strings).
and sc_irq_length to the softc to handle the base number of IRQs available,
make gicv3_get_nirqs return the number of available interrupt IDs, and
limit which CPUs we send interrupts to based on the numa domain.
The last point is only strictly needed on a dual socket ThunderX where we
are unable to send MSI/MSI-X interrupts between sockets.
Sponsored by: DARPA, AFRL
it automatically after it runs.
The config_intrhook mechanism allows a driver to stall the boot process
until device(s) required for booting are available, by not allowing system
inits to proceed until all intrhook functions have been unregistered.
Virtually all existing code simply unregisters from within the hook function
when it gets called.
This new function makes that common usage more convenient. Instead of
allocating and filling in a struct, passing it to a function that might (in
theory) fail, and checking the return code, now a driver can simply call
this cannot-fail routine, passing just the intrhook function and its arg.
Differential Revision: https://reviews.freebsd.org/D11963
the g_journal level needs to check whether it is holding a newer
copy of the block than that which exists on the disk. If so, it
needs to return its copy. If not, it should pass the request down
to the disk to fulfill. It currently considers six queues:
0) delayed queue,
1) unsent (current queue),
2) in-flight to the journal (flush queue),
3) active journal (active queue),
4) inactive journal (inactive queue), and
5) inflight to the disk (copy queue).
Checking on two of these queues is unnecessary:
0) The delayed requests should not be used for reads because they
have not yet been entered into the journal, so their value should
reflect the disk contents, not the future contents that are not
yet committed.
2) Because all the bio's in the flush queue are also found on the
active queue, there is no need to inspect the flush queue for
reads since they will be found when searching the active queue.
Submitted by: Dr. Andreas Longwitz <longwitz@incore.de>
Discussed with: kib
MFC after: 1 week
another parameter that identifies a starting point in the memory address
block. Radix is a power of two, blk is a multiple of radix, and the
starting point is in the range [blk, blk+radix), so that blk can always be
computed from the other two. This change drops the blk parameter from the
meta functions and computes it instead. (On amd64, for example, this
change reduces subr_blist.o's text size by 7%.)
It also makes the radix parameters unsigned to address concerns that the
calculation of '-radix' might overflow without the -fwrapv option. (See
https://reviews.freebsd.org/D11819.)
Submitted by: Doug Moore <dougm@rice.edu>
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D11964
to being called through the newbus DEVICE_SHUTDOWN() path. This ensures that
the NVME controller gets shut down before the device and bus disappear
and prevents data corruption on shutdown on at least Samsung EVO 960 SSDs.
PR: kern/211852
Reviewed by: imp
MFC after: 2 weeks
Previously, debug exceptions were only enabled on the boot CPU if
DDB was enabled in the dbg_monitor_init() function. APs also called
this function, but since mp_machdep.c doesn't include opt_ddb.h, the
APs ended up calling an empty stub defined in <machine/debug_monitor.h>
instead of the real function. Also, if DDB was not enabled in the kernel,
the boot CPU would not enable debug exceptions.
Fix this by adding a new dbg_init() function that always clears the OS
lock to enable debug exceptions which the boot CPU and the APs call.
This function also calls dbg_monitor_init() to enable hardware breakpoints
from DDB on all CPUs if DDB is enabled. Eventually base support for
hardware breakpoints/watchpoints will need to move out of the DDB-only
debug_monitor.c for use by userland debuggers.
Reviewed by: andrew
Differential Revision: https://reviews.freebsd.org/D12001
Only fetch the VFP state from the CPU if the thread whose registers are
being requested is the current thread. If a stopped thread's registers
are being fetched by a debugger, the saved state in the PCB is already
valid.
Reviewed by: andrew
MFC after: 1 week
generic driver with minimal feature support for a large number of chips.
More featureful per-chip drivers might exist (especially out-of-tree) and
those should win the bidding even if they use BUS_PROBE_DEFAULT.
the current state to determine whether to generate a link-state change
notification. This fixes a bug introduced in r321063 that caused the
driver to sometimes skip these notifications.
Reported by: Jason Eggleston @ LLNW
MFC after: 3 days
Sponsored by: Chelsio Communications
Added in r238189, ARM_MANY_BOARD adds support for multiple ARM boards in
a single kernel. Include it for LINT builds to avoid duplicate symbol
errors when linking with lld.
Sponsored by: The FreeBSD Foundation
The encrypted kernel dump code writes data in blocks of this size. A buffer
size of 4096 allows encrypted dumps to work with 4Kn drives.
Reviewed by: cem
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D11870
removes the only reference to atrtc_set() from outside of atrtc.c, so make
it static.
The xen timer driver registers as a realtime clock with 1us resolution. In
the past that resulted in only the xen timer's clock_settime() getting
called, so it would call atrtc_set() to set the hardware clock as well. As
of r32090, the clock_settime() method of all registered realtime clocks gets
called, so the xen driver no longer needs to chain-call the lower-resolution
driver.
Thanks to royger@ for talking me through the xen stuff, and for testing.
M_PMC is defined in sys/dev/hwpmc/hwpmc_mod.c, and the LINT kernel build
fails when linking with lld due to a duplicate symbol error.
Sponsored by: The FreeBSD Foundation
TCP connections (order of tens of thousands), with predominantly Transmits.
Choice to perform receive operations either in IThread or Taskqueue Thread.
Submitted by:Vaishali.Kulkarni@cavium.com
MFC after:5 days
cpumask it probably means we are unable to sent interrupts to CPUs outside
the map. As such only return the current CPU when it's within the mask
otherwise return the first valid CPU.
This is needed on ThunderX as, in a dual socket configuration, we are
unable to send MSI/MSI-X interrupts between sockets.
Reviewed by: mmel
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D11957
in the VM area structure in the LinuxKPI when doing mmap() and that
unsupported bits are masked away.
While at it fix some redundant use of parenthesing inside some related
macros.
Found by: KrishnamRaju ErapaRaju <Krishna2@chelsio.com>
MFC after: 1 week
Sponsored by: Mellanox Technologies
Such drivers attach to a vgapci bus rather than directly to a pci bus. For
the rest of the LinuxKPI to work correctly in this case, we override the
vgapci bus' ivars with those of the grandparent.
Reviewed by: hselasky
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D11932
We can remove some unnecessary object radix tree lookups by using the
object memq to iterate over pages in the specified range. This does not,
however, eliminate the lookup needed in vm_page_free_toq() to remove each
tree entry.
Reviewed by: alc, kib (previous revision)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D11945
When the mps(4) and mpr(4) drivers need to reinitialize the
firmware, they sometimes need to reallocate all of the memory
allocated by the driver. The reallocation happens whenever the IOC
Facts change. That should only happen after a firmware upgrade.
If the reinitialization happens as a result of a timed out command
sent to the card, the command that timed out and triggered the
reinit may have been freed if iocfacts_allocate() reallocated all
memory. If the caller attempts to access the command after that,
the kernel will panic because the caller will be dereferencing
freed memory.
The solution is to set a flag in the softc when we reallocate,
and avoid dereferencing the command strucure if we've reallocated.
The changes are largely the same in both drivers, since mpr(4) is a
derivative of mps(4).
o In iocfacts_allocate(), if the IOC Facts have changed and we
need to reallocate, set the REALLOCATED flag in the softc.
o Change wait_command() to take a struct mps_command ** instead of
a struct mps_command *. This allows us to NULL out the caller's
command pointer if we have to reinit the controller and the data
structures get reallocated. (The REALLOCATED flag will be set
in the softc if that has happened.)
o In every place that calls wait_command(), make sure we handle
the case where the command is NULL after the call.
o The mpr(4) driver has mpr_request_polled() which can also
reinitialize the card. Also check for reallocation there.
Reviewed by: scottl, slm
MFC after: 1 week
Sponsored by: Spectra Logic
New version is not compatible on supervisor mode with v1.9.1
(previous version).
Highlights:
o BBL (Berkeley Boot Loader) provides no initial page tables
anymore allowing us to choose VM, to build page tables manually
and enable MMU in S-mode.
o SBI interface changed.
o GENERIC kernel.
FDT is now chosen standard for RISC-V hardware description.
DTB is now provided by Spike (golden model simulator). This
allows us to introduce GENERIC kernel. However, description
for console and timer devices is not provided in DTB, so move
these devices temporary to nexus bus.
o Supervisor can't access userspace by default. Solution is to
set SUM (permit Supervisor User Memory access) bit in sstatus
register.
o Compressed extension is now turned on by default.
o External GCC 7.1 compiler used.
o _gp renamed to __global_pointer$
o Compiler -march= string is now in use allowing us to choose
required extensions (compressed, FPU, atomic, etc).
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D11800
This patch modifies function ofw_fdt_setprop (called by OF_setprop),
so that it can add property, when replacing is not possible.
Adding property is needed to fixup FDT's that have missing
properties.
Submitted by: Patryk Duda <pdk@semihalf.com>
Reviewed by: nwhitehorn, cognet (mentor)
Approved by: cognet (mentor)
Obtained from: Semihalf
Differential Revision: https://reviews.freebsd.org/D11879
after r319757.
1) Correct the return value from __wait_event_common() from 1 to 0 in
case the timeout is specified as MAX_SCHEDULE_TIMEOUT. In the other
case __ret is zero and will be substituted in the last part of the
macro with the appropriate value before return.
2) Make sure the "timeout" argument is casted to "int" before
evaluating negativity. Else the signedness of a "long" might be
checked instead of the signedness of an integer.
3) The wait_event() function should not have a return value.
Found by: KrishnamRaju ErapaRaju <Krishna2@chelsio.com>
MFC after: 1 week
Sponsored by: Mellanox Technologies
handles a timeout value of MAX_SCHEDULE_TIMEOUT which basically means there
is no timeout. This is a regression issue after r319757.
While at it change the type of returned variable from "long" to "int" to
match the actual return type.
MFC after: 1 week
Sponsored by: Mellanox Technologies
Introduce a new define to take int account the xAPIC ID limit, for
systems where x2APIC is not available/reliable.
Also change some of the usages of the APIC ID to use an unsigned int
(which is the correct storage type to deal with x2APIC IDs as found in
x2APIC MADT entries).
This allows booting FreeBSD on a box with 256 CPUs and APIC IDs up to
295:
FreeBSD/SMP: Multiprocessor System Detected: 256 CPUs
FreeBSD/SMP: 1 package(s) x 64 core(s) x 4 hardware threads
Package HW ID = 0
Core HW ID = 0
CPU0 (BSP): APIC ID: 0
CPU1 (AP/HT): APIC ID: 1
CPU2 (AP/HT): APIC ID: 2
CPU3 (AP/HT): APIC ID: 3
[...]
Core HW ID = 73
CPU252 (AP): APIC ID: 292
CPU253 (AP/HT): APIC ID: 293
CPU254 (AP/HT): APIC ID: 294
CPU255 (AP/HT): APIC ID: 295
Submitted by: kib (previous version)
Relnotes: yes
MFC after: 1 month
Reviewed by: kib
Differential revision: https://reviews.freebsd.org/D11913
So that MAX_APIC_ID can be bumped without wasting memory.
Note that the usage of MAX_APIC_ID in the SRAT parsing forces the
parser to allocate memory directly from the phys_avail physical memory
array, which is not the best approach probably, but I haven't found
any other way to allocate memory so early in boot. This memory is not
returned to the system afterwards, but at least it's sized according
to the maximum APIC ID found in the MADT table.
Sponsored by: Citrix Systems R&D
MFC after: 1 month
Reviewed by: kib
Differential revision: https://reviews.freebsd.org/D11912
Populate the lapics arrays and call cpu_add/lapic_create in the setup
phase instead. Also store the max APIC ID found in the newly
introduced max_apic_id global variable.
This is a requirement in order to make the static arrays currently
using MAX_LAPIC_ID dynamic.
Sponsored by: Citrix Systems R&D
MFC after: 1 month
Reviewed by: kib
Differential revision: https://reviews.freebsd.org/D11911
The amd64 build of boot2 was failing with gcc 6.3.0 due to being more
than 1 kB too large. It was apparently generating a .eh_frame section
which was not being removed by objcopy -S. The .eh_frame section seems
to be mandatory per the amd64 ABI, but boot2 is compiled for i386 (uses
-m32), and therefore should be optional in this context. Suppress
generation of .eh_frame with the -fno-asynchronous-unwind-tables flag to
gcc. This saves 1348 bytes (the limit is 7680 bytes).
Reviewed by: dim, imp
Approved by: markj (mentor)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D11928
key_msg2sp() is used for parsing data from setsockopt(IP[V6]_IPSEC_POLICY)
call. This socket option is usually used to configure IPsec bypass for
socket. Only privileged user can set this socket option.
The message syntax is described here
http://www.kame.net/newsletter/20021210/
and our libipsec is usually used to create the correct request.
Add additional checks:
* that sadb_x_ipsecrequest_len is not out of bounds of user supplied buffer
* that src/dst's sa_len is the same
* that 2*sa_len is not out of bounds of user supplied buffer
* that 2*sa_len fits into bounds of sadb_x_ipsecrequest
Reported by: Ilja van Sprundel
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D11796
reduces diff between amd64 and i386. Also, it fixes a regression introduced
in r322076, i.e., identify_hypervisor() failed to identify some hypervisors.
This function assumes cpu_feature2 is already initialized.
Reported by: dexuan
Tested by: dexuan
geom_bsd, geom_mbr and geom_sunlabel have been obsolete since Marcel
Moolenaar's geom_part was in FreeBSD 7. They haven't been in GENERIC
since FreeBSD 8. Add warning when used.
geom_vol_ffs has been obsolete since ufs support to geom_label was
committed in FreeBSD 5. It hasn't been in GENERIC since FreeBSD 5.
Add warning when used.
geom_fox has been obsolete since gmultipath was committed in FreeBSD 7.
(no warning added, since this is a very obscure class).
These will all be removed in FreeBSD 12.
MFC After: 3 days
Differential Revision: https://reviews.freebsd.org/D11935
Note: Classes will be removed after MFC
New flag 0x4 can be configured in net.enc.[in|out].ipsec_bpf_mask.
When it is set, if_enc(4) additionally captures a packet via BPF after
invoking pfil hook. This may be useful for debugging.
MFC after: 2 weeks
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D11804
is in VF or SRIOV mode typically in a virtual machine environment.
Submitted by: Sepherosa Ziehau <sephe@dragonflybsd.org>
MFC after: 3 days
Sponsored by: Mellanox Technologies
How network VF works with hn(4) on Hyper-V in transparent mode:
- Each network VF has a cooresponding hn(4).
- The network VF and the it's cooresponding hn(4) have the same hardware
address.
- Once the network VF is attached, the cooresponding hn(4) waits several
seconds to make sure that the network VF attach routing completes, then:
o Set the intersection of the network VF's if_capabilities and the
cooresponding hn(4)'s if_capabilities to the cooresponding hn(4)'s
if_capabilities. And adjust the cooresponding hn(4) if_capable and
if_hwassist accordingly. (*)
o Make sure that the cooresponding hn(4)'s TSO parameters meet the
constraints posed by both the network VF and the cooresponding hn(4).
(*)
o The network VF's if_input is overridden. The overriding if_input
changes the input packet's rcvif to the cooreponding hn(4). The
network layers are tricked into thinking that all packets are
neceived by the cooresponding hn(4).
o If the cooresponding hn(4) was brought up, bring up the network VF.
The transmission dispatched to the cooresponding hn(4) are
redispatched to the network VF.
o Bringing down the cooresponding hn(4) also brings down the network
VF.
o All IOCTLs issued to the cooresponding hn(4) are pass-through'ed to
the network VF; the cooresponding hn(4) changes its internal state
if necessary.
o The media status of the cooresponding hn(4) solely relies on the
network VF.
o If there are multicast filters on the cooresponding hn(4), allmulti
will be enabled on the network VF. (**)
- Once the network VF is detached. Undo all damages did to the
cooresponding hn(4) in the above item.
NOTE:
No operation should be issued directly to the network VF, if the
network VF transparent mode is enabled. The network VF transparent mode
can be enabled by setting tunable hw.hn.vf_transparent to 1. The network
VF transparent mode is _not_ enabled by default, as of this commit.
The benefit of the network VF transparent mode is that the network VF
attachment and detachment are transparent to all network layers; e.g. live
migration detaches and reattaches the network VF.
The major drawbacks of the network VF transparent mode:
- The netmap(4) support is lost, even if the VF supports it.
- ALTQ does not work, since if_start method cannot be properly supported.
(*)
These decisions were made so that things will not be messed up too much
during the transition period.
(**)
This does _not_ need to go through the fancy multicast filter management
stuffs like what vlan(4) has, at least currently:
- As of this write, multicast does not work in Azure.
- As of this write, multicast packets go through the cooresponding hn(4).
MFC after: 3 days
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D11803
unable to automatically find alternate superblocks. This checkin
places the information needed to find alternate superblocks to the
end of the area reserved for the boot block.
Filesystems created with a newfs of this vintage or later will
create the recovery information. If you have a filesystem created
prior to this change and wish to have a recovery block created for
your filesystem, you can do so by running fsck in forground mode
(i.e., do not use the -p or -y options). As it starts, fsck will
ask ``SAVE DATA TO FIND ALTERNATE SUPERBLOCKS'' to which you should
answer yes.
Discussed with: kib, imp
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D11589
vm_page_grab() on consecutive page indices. Besides simplifying the code
in the caller, vm_page_grab_pages() allows for batching optimizations.
For example, the current implementation replaces calls to vm_page_lookup()
on consecutive page indices by cheaper calls to vm_page_next().
Reviewed by: kib, markj
Tested by: pho (an earlier version)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D11926
Since the cache controller nodes fixup is added to the platform code,
this patch aligns it to the Linux device tree representation.
Submitted by: Patryk Duda <pdk@semihalf.com>
Reviewed by: cognet (mentor)
Approved by: cognet (mentor)
Obtained from: Semihalf
Differential Revision: https://reviews.freebsd.org/D11884
Updating PL310 sotfware context sc_io_coherent field in
platform_pl310_init() routine for Armada 38x helps to avoid
using 'arm,io-coherent' property, which is by default not present
in the device tree node in Linux.
This way another step for DT unification between two operating
systems is done. The improvemnt will also work after enabling
PLATFORM for Marvell ARMv7 SoCs.
Reviewed by: andrew, cognet (mentor)
Approved by: cognet (mentor)
Obtained from: Semihalf
Differential Revision: https://reviews.freebsd.org/D11883
Since the timers' base frequency setting is added to the platform code,
this patch removes clock-frequency properties from global
and twd timers, aligning both to the Linux device tree.
Submitted by: Patryk Duda <pdk@semihalf.com>
Reviewed by: cognet (mentor)
Approved by: cognet (mentor)
Obtained from: Semihalf
Differential Revision: https://reviews.freebsd.org/D11882
Instead of using 'clock-frequency' device tree property for global/twd
mpcore timers of Armada 38x SoCs, set it in platform_late_init stage
with arm_tmr_change_frequency() function.
Reviewed by: cognet (mentor)
Approved by: cognet (mentor)
Obtained from: Semihalf
Differential Revision: https://reviews.freebsd.org/D11881
Before this patch function ofw_bus_find_compatible was using
memory allocations in order to find compatible node and the property's
length. This way there was always a suited buffer for property,
however this approach had also disadvantages - ofw_bus_find_compatible
couldn't be used when malloc is not available, e.g. during fdt fixup stage.
In order to remove the usage limitation of ofw_bus_find_compatible(),
this patch modifies the function to use ofw_bus_node_is_compatible()
(instead of the one without _int suffix), which uses a fixed
buffer on stack instead of dynamic allocations.
Submitted by: Patryk Duda <pdk@semihalf.com>
Reviewed by: nwhitehorn, cognet (mentor)
Approved by: cognet (mentor)
Obtained from: Semihalf
Differential Revision: https://reviews.freebsd.org/D11880
Sometimes it's convenient to provide fixup to many boards
that use the same SoC family (eg. Marvell Armada 38x).
Instead of putting multiple entries in fdt_fixup_table,
use one entry which refers to all boards with given SoC.
Submitted by: Patryk Duda <pdk@semihalf.com>
Reviewed by: nwhitehorn, cognet (mentor)
Approved by: cognet (mentor)
Obtained from: Semihalf
Differential Revision: https://reviews.freebsd.org/D11878
Because fdt_get_ranges can process now multiple 'ranges' entries,
restoring the ranges from original Linux device trees is possible.
Submitted by: Patryk Duda <pdk@semihalf.com>
Reviewed by: cognet (mentor)
Approved by: cognet (mentor)
Obtained from: Semihalf
Differential Revision: https://reviews.freebsd.org/D11877
This patch makes possible to boot with up to 8 ranges in soc.
Dynamic allocation cannot be used, because ftd_get_ranges
function is called early, when malloc is not available.
Change is required for the alignment of Marvell Armada 38x
device trees present in sys/gnu/dts/arm - originally
the platform has 6 entries in simple-bus 'ranges'.
Submitted by: Patryk Duda <pdk@semihalf.com>
Reviewed by: manu, nwhitehorn, cognet (mentor)
Approved by: cognet (mentor)
Obtained from: Semihalf
Differential Revision: https://reviews.freebsd.org/D11876
transfers in their probe() or attach() routines, and that doesn't work
when the low-level controller requires interrupts to be functional.
The DS133x family of chips is nearly identical to the DS1307 and support
for them should be added to that driver, then the ds133x driver can be
deleted. The s35390a driver just needs a non-trivial workover. In both
cases that work will be done and committed separately.
This is an import of Alexander Bluhm's OpenBSD commit r1.60,
the first chunk had to be modified because on OpenBSD the
'cut' declaration is located elsewhere.
Upstream report by Jingmin Zhou:
https://marc.info/?l=openbsd-pf&m=150020133510896&w=2
OpenBSD commit message:
Use a 32 bit variable to detect integer overflow when searching for
an unused nat port. Prevents a possible endless loop if high port
is 65535 or low port is 0.
report and analysis Jingmin Zhou; OK sashan@ visa@
Quoted from: https://cvsweb.openbsd.org/cgi-bin/cvsweb/src/sys/net/pf_lb.c
PR: 221201
Submitted by: Fabian Keil <fk@fabiankeil.de>
Obtained from: OpenBSD via ElectroBSD
MFC after: 1 week
libefivar expects opening /dev/efi to indicate if the we can make efi
runtime calls. With a null routine, it was always succeeding leading
efi_variables_supported() to return the wrong value. Only succeed if
we have an efi_runtime table. Also, while I'm hear, out of an
abundance of caution, add a likely redundant check to make sure
efi_systbl is not NULL before dereferencing it. I know it can't be
NULL if efi_cfgtbl is non-NULL, but the compiler doesn't.
LinuxKPI workqueue wrappers reported "successful" cancellation for works
already completed in normal way. This change brings reported status and
real cancellation fact into sync. This required for drm-next operation.
Reviewed by: hselasky (earlier version)
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D11904
If mly_user_command fails to allocate a command slot it jumps to an 'out'
label used for error handling. The error handling code checks for a data
buffer in 'mc->mc_data' to free before checking if 'mc' is NULL. Fix by
just returning directly if we fail to allocate a command and only using
the 'out' label for subsequent errors when there is actual cleanup to
perform.
PR: 217747
Reported by: PVS-Studio
Reviewed by: emaste
MFC after: 1 week
p1003_1b.aio_listio_max is now a tunable. Its value is reflected in the
sysctl of the same name, and the sysconf(3) variable _SC_AIO_LISTIO_MAX.
Its value will be bounded from below by the compile-time constant
AIO_LISTIO_MAX and from above by the compile-time constant
MAX_AIO_QUEUE_PER_PROC and the tunable vfs.aio.max_aio_queue.
Reviewed by: jhb, kib
MFC after: 3 weeks
Relnotes: yes
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D11601
Also improve the formatting of the corresponding KASSERT message.
Based on the submission by: Svyatoslav <razmyslov@viva64.com>
Found by: PVS-Studio
PR: 217741
Reviewed by: emaste
Sponsored by: The FreeBSD Foundation (kib)
MFC after: 1 week
input errors in the mlx5en(4) driver. This improves the sysadmin view of
physical port errors.
Submitted by: gallatin@
MFC after: 1 week
Sponsored by: Mellanox Technologies
The m_defrag() function can only defrag mbuf chains which have a valid
mbuf packet header. In r291699 when the mlx4en(4) driver was converted
into using BUSDMA(9), the call to m_defrag() was moved after the part
of the transmit routine which strips the header from the mbuf chain.
This effectivly disabled the mbuf defrag mechanism and such packets
simply got dropped.
This patch removes the stripping of mbufs from a chain and loads all
mbufs using busdma. If busdma finds there are no segments, unload
the DMA map and free the mbuf right away, because that means all
data in the mbuf has been inlined in the TX ring. Else proceed
as usual.
Add a per-ring rounter for the number of defrag attempts and
make sure the oversized_packets counter gets zeroed while at it.
The counters are per-ring to avoid excessive cache misses in the
TX path.
Submitted by: mjoras@
Differential Revision: https://reviews.freebsd.org/D11683
MFC after: 1 week
Sponsored by: Mellanox Technologies
illumos/illumos-gate@d28671a3b0d28671a3b0https://www.illumos.org/issues/8373
The code that writes ZIL blocks uses dmu_tx_assign(TXG_WAIT) to assign
a transaction to a transaction group. That seems to be logically
incorrect as writing of the ZIL block does not introduce any new dirty
data. Also, when there is a lot of dirty data, the call can introduce
significant delays into the ZIL commit path, thus affecting all
synchronous writes. Additionally, ARC throttling may affect the ZIL
writing.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Andriy Gapon <avg@FreeBSD.org>
MFC after: 2 weeks
illumos/illumos-gate@79c2b812ee79c2b812eehttps://www.illumos.org/issues/8491
The zpool checkpoint feature in DxOS added a new field in the uberblock.
The Multi-Modifier Protection Pull Request from ZoL adds two new fields in the
uberblock (Reference: https://github.com/zfsonlinux/zfs/pull/6279).
As these two changes come from two different sources and once upstreamed and
deployed will introduce an incompatibility with each other we want
to upstream a change that will reserve the padding for both of them so
integration goes smoothly and everyone gets both features.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Olaf Faaland <faaland1@llnl.gov>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>
MFC after: 3 weeks
illumos/illumos-gate@267ae6c3a8267ae6c3a8https://www.illumos.org/issues/7915
l2arc_evict() is strictly serialized with respect to
l2arc_write_buffers() and l2arc_write_done(). Normally, l2arc_evict()
and l2arc_write_buffers() are called from the same thread, so they can
not be concurrent. Also, l2arc_write_buffers() uses zio_wait() on the
parent zio of all cache zio-s. That ensures that l2arc_write_done()
is completed before l2arc_write_buffers() returns. Finally, if a
cache device is removed, then l2arc_evict() is called under SCL_ALL in
the exclusive mode. That ensures that it can not be concurrent with
the normal L2ARC accesses to the device (including writing and
evicting buffers). Given the above, some checks and actions in
l2arc_evict() do not make sense. For instance, it must never
encounter the write head header let alone remove it from the buffer
list.
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Author: Andriy Gapon <avg@FreeBSD.org>
MFC after: 2 weeks
illumos/illumos-gate@dcb6872c56dcb6872c56https://www.illumos.org/issues/8126
The sync thread is concurrently modifying dn_phys->dn_nlevels
while dbuf_dirty() is trying to assert something about it, without
holding the necessary lock. We need to move this assertion further down
in the function, after we have acquired the dn_struct_rwlock.
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
MFC after: 2 weeks
illumos/illumos-gate@9b195260e29b195260e2https://www.illumos.org/issues/8426
abd_copy_from_buf and abd_cmp_buf do not modify their void *buf arguments, so
qualify them with const.
abd_copy_from_buf_off and abd_cmp_buf_off already had that type for the
corresponding arguments.
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Andriy Gapon <avg@FreeBSD.org>
MFC after: 2 weeks
illumos/illumos-gate@77b171372e77b171372ehttps://www.illumos.org/issues/7600
At present, the kernel side code seems to blindly rollback to whatever happens
to be the latest snapshot at the time when the rollback task is processed.
The expected target's name should be passed to the kernel driver and the sync
task should validate that the target exists and that it is the latest snapshot
indeed.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Andriy Gapon <avg@FreeBSD.org>
MFC after: 3 weeks
illumos/illumos-gate@42418f9e7342418f9e73https://www.illumos.org/issues/8377
The problem is that when dsl_bookmark_destroy_check() is executed from open
context (the pre-check), it fills in dbda_success based on the existence of the
bookmark.
But the bookmark (or containing filesystem as in this case) can be destroyed
before we get to syncing context. When we re-run dsl_bookmark_destroy_check()
in syncing
context, it will not add the deleted bookmark to dbda_success, intending for
dsl_bookmark_destroy_sync() to not process it. But because the bookmark is
still in dbda_success
from the open-context call, we do try to destroy it.
The fix is that dsl_bookmark_destroy_check() should not modify dbda_success
when called from open context.
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
MFC after: 2 weeks
illumos/illumos-gate@b7edcb9408b7edcb9408https://www.illumos.org/issues/8378
The problem is that zfs_get_data() supplies a stale zgd_bp to dmu_sync(), which
we then nopwrite against.
zfs_get_data() doesn't hold any DMU-related locks, so after it copies db_blkptr
to zgd_bp, dbuf_write_ready()
could change db_blkptr, and dbuf_write_done() could remove the dirty record.
dmu_sync() then sees the stale
BP and that the dbuf it not dirty, so it is eligible for nop-writing.
The fix is for dmu_sync() to copy db_blkptr to zgd_bp after acquiring the
db_mtx. We could still see a stale
db_blkptr, but if it is stale then the dirty record will still exist and thus
we won't attempt to nopwrite.
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
MFC after: 2 weeks
FreeBD note: the essence of this change was committed to FreeBSD in
r314274. This commit catches up with differences between what was
committed to FreeBSD and what was committed to OpenZFS, mainly more
logical variable names.
illumos/illumos-gate@16a7e5ac1116a7e5ac11https://www.illumos.org/issues/7910
It seems that the change in issue #6950 resurrected the problem that was
earlier fixed by the change in issue #5219.
Please also see the following FreeBSD bug report:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=216178
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Andriy Gapon <avg@FreeBSD.org>
MFC after: 2 weeks
This also involves adding a quirk table as TRIM is broken for some
Kingston eMMC devices, though. Compared to ERASE (declared "legacy"
in the eMMC specification v5.1), TRIM has the advantage of operating
on write sectors rather than on erase sectors, which typically are
of a much larger size. Thus, employing TRIM, we don't need to fiddle
with coalescing BIO_DELETE requests that are also of (write) sector
units into erase sectors, which might not even add up in all cases.
- For some SanDisk iNAND devices, the CMD38 argument, e. g. ERASE,
TRIM etc., has to be specified via EXT_CSD[113], which now is also
handled via a quirk.
- My initial understanding was that for eMMC partitions, the granularity
should be used as erase sector size, e. g. 128 KB for boot partitions.
However, rereading the relevant parts of the eMMC specification v5.1,
this isn't actually correct. So drop the code which used partition
granularities for delmaxsize and stripesize. For the most part, this
change is a NOP, though, because a) for ERASE, mmcsd_delete() used
the erase sector size unconditionally for all partitions anyway and
b) g_disk_limit() doesn't actually take the stripesize into account.
- Take some more advantage of mmcsd_errmsg() in mmcsd(4) for making
error codes human readable.
No need to set any fields in the cloned device. devfs uses symlinks,
so the adev entries returned won't be presented to the drivers. Since
we don't save copies, nothing else will see them. This code came from
the old compat code, and it appears to be obsolete or never needed.
Submitted by: kib@
Differential Review: https://reviews.freebsd.org/D11919
All ndaX and ndaXpY nodes will appear as nvdX and nvdXpY as well
(through symlinks in devfs via the normal disk aliasing mechanism in
GEOM).
Differential Revision: https://reviews.freebsd.org/D11873
Implement disk_add_alias to allow aliases to be added to disks. All
disk have a primary name (say "foo") can also have secondary names
(say "bar") such that all instances of "foo" also have a "bar"
alias. So if you have foo0, foo0p1, foo1, foo1s1 and foo1s1a nodes
created by the foo driver and gpart, device nodes bar0, bar0p1, bar1,
bar1s1 and bar1s1a will appear as symlinks back to the original nodes.
This generalizes to multiple aliases. However, since the unit number
follows the primary name, multiple device drivers can't create the
same aliases unless those drives coorinate the unit number space (eg
you couldn't add an alias 'disk' to both 'da' and 'ada' because it's
possible to have da0 and ada0, because 'disk0' is ambiguous).
Differential Revision: https://reviews.freebsd.org/D11873
When we're creating new providers for each of the partitions, add
aliases to the geom before we create the provider so when geom_dev
tastes the provider, the aliases are in place so the proper /dev
entries are created. So foo5p6 gets created as an alias for bar5p6
when foo is an alias for bar in the geom we're partitioning with
g_part. This also copies aliases from the container geom (eg disk) to
the label geom (the disk with GPT partitioning) so that aliases nest
properly.
Differential Revision: https://reviews.freebsd.org/D11873
Add an alias name list to geoms. Use them in geom_dev to create
aliases. Previously, geom_dev would create an device node for the name
of the geom. Now, additional nodes are created pointing back to the
primary node with make_dev_alias_p. Aliases must be in place on the
geom before any tasting occurs.
Differential Revision: https://reviews.freebsd.org/D11873
in the flush_queue:
1 2 3 4 5 6 7 8 9 10
and another 10 bio's go into the flush queue after only the first five
bio's are removed from the flush queue, the queue should look like:
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20,
but because of the bug we end up with
6 11 12 13 14 15 16 17 18 19 20 7 8 9 10.
So the sequence of the bio's is damaged in the flush queue (and
therefore in the journal on disk !). This error can be triggered by
ffs_snapshot() when a block is read with readblock() and gjournal finds
this block in the broken flush queue before it goes to the correct
active queue.
The fix is to place all new blocks at the end of the queue.
Submitted by: Dr. Andreas Longwitz <longwitz@incore.de>
Discussed with: kib
MFC after: 1 week
system having over 4GB RAM. That's due to:
1) the limit being u_int instead of u_long like vm.kmem_size (the limit is
half of vm.kmem_size by default for amd64);
2) sysctl handler g_journal_cache_limit_sysctl() using u_int instead of u_long.
The fix is to replace u_int with u_long for the kern.geom.journal.cache.limit
sysctl variable.
PR: 198500
Submitted by: Dr. Andreas Longwitz <longwitz@incore.de>
Reported by: Eugene Grosbein
Discussed with: kib
MFC after: 1 week
While there, switch to FreeBSD internal callout active status.
Reviewed by: markj, hselasky
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D11900
o Replace __riscv64 with (__riscv && __riscv_xlen == 64)
This is required to support new GCC 7.1 compiler.
This is compatible with current GCC 6.1 compiler.
RISC-V is extensible ISA and the idea here is to have built-in define
per each extension, so together with __riscv we will have some subset
of these as well (depending on -march string passed to compiler):
__riscv_compressed
__riscv_atomic
__riscv_mul
__riscv_div
__riscv_muldiv
__riscv_fdiv
__riscv_fsqrt
__riscv_float_abi_soft
__riscv_float_abi_single
__riscv_float_abi_double
__riscv_cmodel_medlow
__riscv_cmodel_medany
__riscv_cmodel_pic
__riscv_xlen
Reviewed by: ngie
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D11901
handle cases where they can only run on a single domain.
To allow all devices access to this set we need to move reading the domain
earlier in the boot as it was previously handled in the CPU driver, however
this is too late for the GICv3 ITS driver.
Sponsored by: DARPA, AFRL
but it was broken since r273800 (and r278522, its MFC to stable/10) because
identify_cpu() is called too late, i.e., after init_param1().
MFC after: 3 days
libefi/time.c is mix of different styles, this update does cleanup.
Also fix 0 versus NULL, and zero the tv structure for case we get error
from UEFI firmware.
Reviewed by: imp
Differential Revision: https://reviews.freebsd.org/D11861
This patch moves code necessary for the fmtdev functionality from
loader to libefi, allowing other applications to make use of it
Submitted by: Eric McCorkle
Differential Revision: https://reviews.freebsd.org/D11862
precision. These timers are already displayed in microseconds in the
sysctl MIB. Add variables to track these tunables while here.
MFC after: 3 days
Sponsored by: Chelsio Communications
multiple ITS devices, however we only want a single ITS device to be
configured on each CPU. To fix this only enable ITS when the node matches
the CPUs node.
Sponsored by: DARPA, AFRL
used to support the dual package ThunderX where we need to send MSI/MSI-X
interrupts to the same package as the device the interrupt came from.
Sponsored by: DARPA, AFRL
They are defined by XSI or newer SUS.
This is a follow-up to r318780.
Reported by: jbeich
Obtained from: DragonflyBSD commit e08b3836c962
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
This patch adds additional EFI utility functions to convert errno
values to EFI_STATUS errors, as well as EFI times to UNIX times.
Submitted by: Eric McCorkle
Differential Revision: https://reviews.freebsd.org/D11858
This patch moves some EFI ZFS functions from loader to libefi,
allowing them to be used by anything that links against libefi.
Submitted by: Eric McCorkle
Differential Revision: https://reviews.freebsd.org/D11855
This patch adds definitions and utility code for creating EFI drivers
using the EFI_DRIVER_BINDING_PROTOCOL.
Submitted by: Eric McCorkle
Differential Revision: https://reviews.freebsd.org/D11852
Introduce hw.nvme.use_nvd tunable. This tunable allows both nvd and
nda to be installed in the kernel, while allowing only one of them to
create devices. This is an all-or-nothing setting, and you can't
change it after boot-time. However, it will allow easier A/B testing.
Differential Revision: https://reviews.freebsd.org/D11825
Some C wrappers for x86 instructions do not touch global memory and only act
on their arguments; they can be marked __pure2, aka __const__. Without this
annotation, Clang 3.9.1 is not intelligent enough on its own to grok that
these functions are __const__.
Submitted by: Anton Rang <anton.rang AT isilon.com>
Sponsored by: Dell EMC Isilon
The previous limit of 24 was somewhat restrictive, and with this change
ceil(log2(sizeof(struct pfs_node))) is the same as before in both the ILP32
and LP64 models, so the malloc zone used for allocations of struct pfs_node
is the same as before.
Approved by: des
This was submitted by Rogiel Sulzbach (thank you!) but has a few last-minute
changes by me, mostly where the code interfaces to my still-utterly-deficient
imx6_ccm clocks implementation. So blame me for any mistakes.
Submitted by: Rogiel Sulzbach <rogiel@rogiel.com>
Differential Revision: https://reviews.freebsd.org/D11177
debug (cudbg) code, hooked up to the main driver via an ioctl.
The ioctl can be used to collect the chip's internal state in a
compressed dump file. These dumps can be decoded with the "view"
component of cudbg.
Obtained from: Chelsio Communications
MFC after: 2 months
Sponsored by: Chelsio Communications
My change had good intentions, but the implementation was incorrect:
- printf was returning the number of characters in the format string
plus the NUL, but failed in two regards implementation wise:
-- the pathological case, printf(""), wasn't being handled properly since
the pointer is always incremented, so the value returned would be
off-by-one.
-- printf(3) reports the number of characters printed post-conversion via
vfprintf, etc.
- putchar(3) should return the character printed or EOF, not the number
of characters output to the screen.
My goal in making the change (again) was to increase parity, but as bde
pointed out these are freestanding functions, so they don't have to
conform to libc/POSIX. I argued that the functions should be named
differently since the implementation is different enough to warrant it
and to allow boot2 code to be usable when linked against sys/boot and
libstand and other libraries in base. I have no interest in pushing
this change forward more though, as the original concern I had behind
the change with zfsboottest was resolved in r321849 and r321852. The
next person that updates the toolchain gets to deal with the
inconsistency if it's flagged by a newer compiler.
MFC after: 1 month
Reported by: ed, markj
This patch fixes an interopability issue between FreeBSD and non-FreeBSD
systems when the connection establishment is aborted. Refer to the
initial commit in Linux, drivers/infiniband/core/cm.c,
for a more detailed description.
Obtained from: Linux
MFC after: 3 days
Sponsored by: Mellanox Technologies
Code inspection reveals the busdma unload and free functions
do not write to the belonging dma tag and does not need to be
serialized. This allows mlx5_fwp_free() to be called from
software interrupt context.
MFC after: 3 days
Sponsored by: Mellanox Technologies