This driver seems to have a bug. The bug was carefully saved during
conversion. In the al_eth_mac_table_unicast_add() the argument 'addr',
which is the actual address is unused. So, the function is called as
many times as we have addresses, but with the exactly same argument
list. This doesn't make any sense, but was preserved.
net.link.tap.user_open has historically allowed non-root users to do devfs
cloning and open /dev/tap* nodes based on permissions. Loosen this up to
make it only allow users to do devfs cloning -- we no longer check it in
tunopen.
This allows tap devices to be created that can actually be opened by a user,
rather than swiftly restricting them to root because the magic sysctl has
not been set.
The sysctl has not yet been completely deprecated, because more thought is
needed for how to handle the devfs cloning case. There is not an easy
suitable replacement for the sysctl there, and more care needs to be placed
in determining whether that's OK or not.
PR: 200185
In order to ensure that changing the frag6 code does not change behaviour
or break code a set of test cases were implemented.
Like some other test cases these use Scapy to generate packets and possibly
wait for expected answers. In most cases we do check the global and
per interface (netstat) statistics output using the libxo output and grep
to validate fields and numbers. This is a bit hackish but we currently have
no better way to match a selected number of stats only (we have to ignore
some of the ND6 variables; otherwise we could use the entire list).
Test cases include atomic fragments, single fragments, multi-fragments,
and try to cover most error cases in the code currently.
In addition vnet teardown is tested to not panic.
A separate set (not in-tree currently) of probes were used in order to
make sure that the test cases actually test what they should.
The "sniffer" code was copied and adjusted from the netpfil version
as we sometimes will not get packets or have longer timeouts to deal with.
Sponsored by: Netflix
When shutting down a VNET we did not cleanup the fragmentation hashes.
This has multiple problems: (1) leak memory but also (2) leak on the
global counters, which might eventually lead to a problem on a system
starting and stopping a lot of vnets and dealing with a lot of IPv6
fragments that the counters/limits would be exhausted and processing
would no longer take place.
Unfortunately we do not have a useable variable to indicate when
per-VNET initialization of frag6 has happened (or when destroy happened)
so introduce a boolean to flag this. This is needed here as well as
it was in r353635 for ip_reass.c in order to avoid tripping over the
already destroyed locks if interfaces go away after the frag6 destroy.
While splitting things up convert the TRY_LOCK to a LOCK operation in
now frag6_drain_one(). The try-lock was derived from a manual hand-rolled
implementation and carried forward all the time. We no longer can afford
not to get the lock as that would mean we would continue to leak memory.
Assert that all the buckets are empty before destroying to lock to
ensure long-term stability of a clean shutdown.
Reported by: hselasky
Reviewed by: hselasky
MFC after: 3 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D22054
Add a read-only sysctl exporting the global number of fragments
(base system and all vnets). This is helpful to (a) know how many
fragments are currently being processed, (b) if there are possible
leaks, (c) if vnet teardown is not working correctly, and lastly
(d) it can be used as part of test-suits to ensure (a) to (c).
MFC after: 3 weeks
Sponsored by: Netflix
Notices appear both in picobsd(8) (near the top for easy notice) and are
also printed to stderr on every invocation of picobsd for visibility.
The tentative date for removal is October 31st, as no volunteers have
stepped forward at all from postings to -arch@ at least.
No objection from: -arch@
MFC after: 3 days
cdevpriv dtors will be called when the reference count on the associated
struct file drops to 0, while d_close can be unreliable for cleaning up
state at "last close" for a number of reasons. As far as tunclose/tundtor is
concerned the difference is minimal, so make the switch.
This allows us to avoid some dance in tunopen for dealing with the
possibility of dev->si_drv1 being NULL as it's set prior to the devfs node
being created in all cases.
There's still the possibility that the tun device hasn't been fully
initialized, since that's done after the devfs node was created. Alleviate
this by returning ENXIO if we're not to that point of tuncreate yet.
This work is what sparked r353128, full initialization of cloned devices
w/ specified make_dev_args.
This corrects an oversight from r351423.
Submitted by: Ján Sučan <sucanjan@gmail.com>
MFC after: Never
Differential Revision: https://reviews.freebsd.org/D22093
- In em_msix_link(), properly handle IGB-class devices after the iflib(4)
conversion again by only setting EM_MSIX_LINK for the EM-class 82574
and by re-arming link interrupts unconditionally, i. e. not only in
case of spurious interrupts. This fixes the interface link state change
detection for the IGB-class. [1]
- In em_if_update_admin_status(), only re-arm the link state change
interrupt for 82574 and also only if such a device uses MSI-X, i. e.
takes advantage of autoclearing. In case of INTx and MSI as well as
for LEM- and IGB-class devices, re-arming isn't appropriate here and
setting EM_MSIX_LINK isn't either.
While at it, consistently take advantage of the hw variable.
PR: 236724 [1]
Differential Revision: https://reviews.freebsd.org/D21924
MAS8 is hypervisor privileged, defining the logical partition (VM) to
operate on for TLB accesses. It's already guaranteed to be cleared when
booting bare metal (bootloader needs it zeroed to work), and we can't touch
it from a guest. Assume that if/when we eventually port bhyve to PowerPC
(and Book-E) the hypervisor module will take care of managing MAS8. This
saves several (tens) of clocks on each TLB miss.
MFC after: 2 weeks
in simple multifunction driver.
- follow interrupt changes in DT. Split old ICU driver to function oriented
parts and add drivers for newly defined parts (system error interrupts).
- Many drivers are children of simple multifunction driver. But after r349596
simple MF driver doesn't longer exports memory resources, and all children
must use syscon interface to access their registers. Adapt affected
drivers to this fact.
MFC after: 3 weeks
Thanks to cem@ for discussing the issue which resulted in this patch.
Reviewed by: cem@
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D22089
When zfs probe did fail and no spa was created, but zfs_fmtdev() is called,
we will crash while dereferencing spa (NULL pointer dereference).
MFC after: 1 week
This should change nothing for kernel configurations at the standard
locations in the source tree. However, if KERNCONFDIR is used to
specify a custom location for a kernel configuration file (e.g., out of
tree), then both the custom location and the standard location, in this
order, will be used as include paths for config(8). This will allow the
kernel configuration to include files from both locations.
Reviewed by: bdrewery
MFC after: 16 days
Differential Revision: https://reviews.freebsd.org/D22057
The rationale is pretty much the same as in r353747.
There is no subsequent dependent store.
The store is to the regular (TSO) memory anyway.
MFC after: 23 days
After removing wmb(), vm_set_rendezvous_func() became super trivial, so
there was no point in keeping it.
The wmb (sfence on amd64, lock nop on i386) was not needed. This can be
explained from several points of view.
First, wmb() is used for store-store ordering (although, the primitive
is undocumented). There was no obvious subsequent store that needed the
barrier.
Second, x86 has a memory model with strong ordering including total
store order. An explicit store barrier may be needed only when working
with special memory (device, special caching mode) or using special
instructions (non-temporal stores). That was not the case for this
code.
Third, I believe that there is a misconception that sfence "flushes" the
store buffer in a sense that it speeds up the propagation of stores from
the store buffer to the global visibility. I think that such
propagation always happens as fast as possible. sfence only makes
subsequent stores wait for that propagation to complete. So, sfence is
only useful for ordering of stores and only in the situations described
above.
Reviewed by: jhb
MFC after: 23 days
Differential Revision: https://reviews.freebsd.org/D21978
This patch is part of an effort to make bhyve networking (in particular TCP)
faster. The key strategy to enhance TCP throughput is to let the whole packet
datapath work with TSO/LRO packets (up to 64KB each), so that the per-packet
overhead is amortized over a large number of bytes.
This capability is supported in the guest by means of the vtnet(4) driver,
which is able to handle TSO/LRO packets leveraging the virtio-net header
(see struct virtio_net_hdr and struct virtio_net_hdr_mrg_rxbuf).
A bhyve VM exchanges packets with the host through a network backend,
which can be vale(4) or if_tap(4).
While vale(4) supports TSO/LRO packets, if_tap(4) does not.
This patch extends if_tap(4) with the ability to understand the virtio-net
header, so that a tapX interface can process TSO/LRO packets.
A couple of ioctl commands have been added to configure and probe the
virtio-net header. Once the virtio-net header is set, the tapX interface
acquires all the IFCAP capabilities necessary for TSO/LRO.
Reviewed by: kevans
Differential Revision: https://reviews.freebsd.org/D21263
They're formatted into the device name like unit numbers, anyway; store the
number in mda_unit => si_drv0 like dev2unit() expects.
No functional change intended.
Sponsored by: Dell EMC Isilon
rename the WITH_LOADER_VERIEXEC_PASS_MANFIEST description to its correct
name. Also correct a bunch of spelling errors in that description.
MFC after: 3 days
In low memory conditions a significant number of pages may end up stuck
in the caches, and currently these caches cannot be reaped, leading to
spurious memory allocation failures and OOM kills. So:
- Take into account the fact that we may cache up to two full buckets
of pages per CPU, not just one.
- Increase the amount of RAM required per CPU to enable the caches.
This is a temporary measure until the page cache management policy is
improved.
PR: 241048
Reported and tested by: Kevin Oberman <rkoberman@gmail.com>
Reviewed by: alc, kib
Discussed with: jeff
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22040
The softdep lock names were unusually long and tended to stick out in
lock profiling reports. Abbreviate them and make them consistent with
our conventional style for lock names.
Reviewed by: mckusick
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22042