Commit Graph

4358 Commits

Author SHA1 Message Date
Chuck Tuffli
94c15665a5 Fix a typo in r349969
OUI_FRREBSD_NVME_HIGH should have been OUI_FREEBSD_NVME_HIGH

Caught by:	Gary Jennejohn
2019-07-14 03:49:48 +00:00
Chuck Tuffli
409a80e5a4 bhyve: Create EUI64 for NVMe namespaces
Accept an IEEE Extended Unique Identifier (EUI-64) from the command
line for each NVMe namespace. If one isn't provided, it will create one
based on the CRC16 of:
 - the FreeBSD IEEE OUI
 - PCI bus, device/slot, function values
 - Namespace ID

Reviewed by:	imp, araujo, jhb, rgrimes
Approved by:	imp (mentor), jhb (maintainer)
MFC after:	2 weeks
Differential Revision: https://reviews.freebsd.org/D19905
2019-07-13 12:48:28 +00:00
Mark Johnston
eeacb3b02f Merge the vm_page hold and wire mechanisms.
The hold_count and wire_count fields of struct vm_page are separate
reference counters with similar semantics.  The remaining essential
differences are that holds are not counted as a reference with respect
to LRU, and holds have an implicit free-on-last unhold semantic whereas
vm_page_unwire() callers must explicitly determine whether to free the
page once the last reference to the page is released.

This change removes the KPIs which directly manipulate hold_count.
Functions such as vm_fault_quick_hold_pages() now return wired pages
instead.  Since r328977 the overhead of maintaining LRU for wired pages
is lower, and in many cases vm_fault_quick_hold_pages() callers would
swap holds for wirings on the returned pages anyway, so with this change
we remove a number of page lock acquisitions.

No functional change is intended.  __FreeBSD_version is bumped.

Reviewed by:	alc, kib
Discussed with:	jeff
Discussed with:	jhb, np (cxgbe)
Tested by:	pho (previous version)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D19247
2019-07-08 19:46:20 +00:00
John Baldwin
66d0c056be Support IFCAP_NOMAP in vlan(4).
Enable IFCAP_NOMAP for a vlan interface if it is supported by the
underlying trunk device.

Reviewed by:	gallatin, hselasky, rrs
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20616
2019-06-29 00:51:38 +00:00
John Baldwin
82334850ea Add an external mbuf buffer type that holds multiple unmapped pages.
Unmapped mbufs allow sendfile to carry multiple pages of data in a
single mbuf, without mapping those pages.  It is a requirement for
Netflix's in-kernel TLS, and provides a 5-10% CPU savings on heavy web
serving workloads when used by sendfile, due to effectively
compressing socket buffers by an order of magnitude, and hence
reducing cache misses.

For this new external mbuf buffer type (EXT_PGS), the ext_buf pointer
now points to a struct mbuf_ext_pgs structure instead of a data
buffer.  This structure contains an array of physical addresses (this
reduces cache misses compared to an earlier version that stored an
array of vm_page_t pointers).  It also stores additional fields needed
for in-kernel TLS such as the TLS header and trailer data that are
currently unused.  To more easily detect these mbufs, the M_NOMAP flag
is set in m_flags in addition to M_EXT.

Various functions like m_copydata() have been updated to safely access
packet contents (using uiomove_fromphys()), to make things like BPF
safe.

NIC drivers advertise support for unmapped mbufs on transmit via a new
IFCAP_NOMAP capability.  This capability can be toggled via the new
'nomap' and '-nomap' ifconfig(8) commands.  For NIC drivers that only
transmit packet contents via DMA and use bus_dma, adding the
capability to if_capabilities and if_capenable should be all that is
required.

If a NIC does not support unmapped mbufs, they are converted to a
chain of mapped mbufs (using sf_bufs to provide the mapping) in
ip_output or ip6_output.  If an unmapped mbuf requires software
checksums, it is also converted to a chain of mapped mbufs before
computing the checksum.

Submitted by:	gallatin (earlier version)
Reviewed by:	gallatin, hselasky, rrs
Discussed with:	ae, kp (firewalls)
Relnotes:	yes
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20616
2019-06-29 00:48:33 +00:00
Hans Petter Selasky
0dbdf04125 Need to wait for epoch callbacks to complete before detaching a
network interface.

This particularly manifests itself when an INP has multicast options
attached during a network interface detach. Then the IPv4 and IPv6
leave group call which results from freeing the multicast address, may
access a freed ifnet structure. These are the steps to reproduce:

service mdnsd onestart # installed from ports

ifconfig epair create
ifconfig epair0a 0/24 up
ifconfig epair0a destroy

Tested by:	pho @
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2019-06-28 10:49:04 +00:00
Marius Strobl
c2c5d1e787 o In iflib_txq_drain():
- Remove desc_used, which is only ever written to.
  - Remove a dead store to reclaimed.
  - Don't recycle avail.
  - Sort variables according to style(9).
  These changes will make a subsequent commit easier to read.
o In iflib_tx_credits_update(), don't bother checking whether the
  ift_txd_credits_update method pointer is NULL; _iflib_pre_assert()
  asserts upfront that this method has been assigned and functions
  like iflib_{fast_intr_rxtx,netmap_timer_adjust,txq_can_drain}()
  and _task_fn_tx() were already unconditionally relying on the
  method being callable.
2019-06-26 15:28:21 +00:00
Leandro Lupori
e2edff4167 [PowerPC64] Don't mark module data as static
Fixes panic when loading ipfw.ko and if_epair.ko built with modern compiler.

Similar to arm64 and riscv, when using a modern compiler (!gcc4.2), code
generated tries to access data in the wrong location, causing kernel panic
(data storage interrupt trap) when loading if_epair and ipfw.

Issue was reproduced with kernel/module compiled using gcc8 and clang8. It
affects both ELFv1 and ELFv2 ABI environments.

PR:		232387
Submitted by:	alfredo.junior_eldorado.org.br
Reported by:	Mark Millard
Reviewed by:	jhibbits
Differential Revision:	https://reviews.freebsd.org/D20461
2019-06-25 17:15:44 +00:00
Hans Petter Selasky
59854ecf55 Convert all IPv4 and IPv6 multicast memberships into using a STAILQ
instead of a linear array.

The multicast memberships for the inpcb structure are protected by a
non-sleepable lock, INP_WLOCK(), which needs to be dropped when
calling the underlying possibly sleeping if_ioctl() method. When using
a linear array to keep track of multicast memberships, the computed
memory location of the multicast filter may suddenly change, due to
concurrent insertion or removal of elements in the linear array. This
in turn leads to various invalid memory access issues and kernel
panics.

To avoid this problem, put all multicast memberships on a STAILQ based
list. Then the memory location of the IPv4 and IPv6 multicast filters
become fixed during their lifetime and use after free and memory leak
issues are easier to track, for example by: vmstat -m | grep multi

All list manipulation has been factored into inline functions
including some macros, to easily allow for a future hash-list
implementation, if needed.

This patch has been tested by pho@ .

Differential Revision: https://reviews.freebsd.org/D20080
Reviewed by:	markj @
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2019-06-25 11:54:41 +00:00
Marko Zec
188adcb7e4 V_ip6_forwarding and V_ipforwarding have been defined in ip6_var.h /
ip_var.h since at least 2008, so make use of those definitions here.

MFC after:	3 days
2019-06-19 08:49:24 +00:00
Marko Zec
6aee0bfa85 Evaluating htons() at compile time is more efficient than doing ntohs()
at runtime.  This change removes a dependency on a barrel shifter pass
before branch resolution, while reducing the instruction stream size
by 9 bytes on amd64.

MFC after:	3 days
2019-06-19 08:39:19 +00:00
Marius Strobl
d49e83eac3 - Replace unused and only ever written to members of public iflib(9)
structs with placeholders (in the latter case, IFLIB_MAX_TX_BYTES
  etc. are also only ever used for these write-only members if at all,
  so both these macros and members can just go). Using these spares
  may render it possible to merge certain iflib(9) fixes to stable/12.
  Otherwise, changes extending struct if_irq or struct if_shared_ctx
  in any way would break KBI as instances of these are allocated by
  the driver front-ends (by contrast, struct if_pkt_info as well as
  struct if_softc_ctx instances are provided by iflib(9) and, thus,
  may grow at least at the end without breaking KBI).
- Make the pvi_name in struct pci_vendor_info const char * as device
  identifiers in hardware lookup tables aren't to be expected to ever
  change at runtime.
- Similarly, make the pci_vendor_info_t of struct if_shared_ctx which
  is used to point to the struct pci_vendor_info arrays provided by
  the driver front-ends const.
- Remove the ETH_ADDR_LEN macro from iflib.h; this was duplicating
  ETHER_ADDR_LEN of <net/ethernet.h> with iflib(9) actually only
  consuming the latter macro.
- Make the name argument of iflib_io_tqg_attach(9) const, matching
  the taskqgroup_attach_cpu(9) this function wraps as well as e. g.
  iflib_config_gtask_init(9).
- Remove the orphaned iflib_qset_lock_get() prototype.
- Remove some extraneous empty lines.
2019-06-15 11:07:41 +00:00
Mark Johnston
ce7fb386d8 Restore the comment removed in r348745.
LAGG_RLOCK() enters an epoch section, so the comment wasn't stale.

Reported by:	jhb
MFC with:	r348745
2019-06-06 17:20:35 +00:00
Mark Johnston
9995dfd364 Conditionalize an in_epoch() call on INVARIANTS.
Its result is only used to determine whether to perform further
INVARIANTS-only checks.  Remove a stale comment while here.

Submitted by:	Sebastian Huber <sebastian.huber@embedded-brains.de>
MFC after:	1 week
2019-06-06 16:22:29 +00:00
Eric Joyner
668d6dbb4c iflib: provide probe wrapper for vendor drivers
From Jake:
Vendor drivers that exist out-of-tree generally should return
BUS_PROBE_VENDOR from their device probe functions. This helps ensure
that a vendor replacement driver will supersede the in-kernel driver for
a given device.

Currently, if a vendor wants to implement a driver based on iflib, it
will always report BUS_PROBE_DEFAULT.

Add a wrapper function, iflib_device_probe_vendor() which can be used in
place of iflib_device_probe(). This function will just return
BUS_PROBE_VENDOR whenever iflib_device_probe() would return
BUS_PROBE_DEFAULT.

While vendor drivers can already implement such a wrapper themselves,
providing it in the iflib.h header makes it easier for the vendor driver
to do the right thing.

Submitted by:	Jacob Keller <jacob.e.keller@intel.com>
Reviewed by:	erj@, gallatin@, marius@
MFC after:	1 week
Sponsored by:	Intel Corporation
Differential Revision:	https://reviews.freebsd.org/D20221
2019-05-29 22:24:10 +00:00
Kyle Evans
d8b985430c if_bridge(4): Complete bpf auditing of local traffic over the bridge
There were two remaining "gaps" in auditing local bridge traffic with
bpf(4):

Locally originated outbound traffic from a member interface is invisible to
the bridge's bpf(4) interface. Inbound traffic locally destined to a member
interface is invisible to the member's bpf(4) interface -- this traffic has
no chance after bridge_input to otherwise pass it over, and it wasn't
originally received on this interface.

I call these "gaps" because they don't affect conventional bridge setups.
Alas, being able to establish an audit trail of all locally destined traffic
for setups that can function like this is useful in some scenarios.

Reviewed by:	kp
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D19757
2019-05-29 01:08:30 +00:00
Andrey V. Elsukov
de25327313 Rework r348303 to reduce the time of holding global BPF lock.
It appeared that using NET_EPOCH_WAIT() while holding global BPF lock
can lead to another panic:

spin lock 0xfffff800183c9840 (turnstile lock) held by 0xfffff80018e2c5a0 (tid 100325) too long
panic: spin lock held too long
...
#0  sched_switch (td=0xfffff80018e2c5a0, newtd=0xfffff8000389e000, flags=<optimized out>) at /usr/src/sys/kern/sched_ule.c:2133
#1  0xffffffff80bf9912 in mi_switch (flags=256, newtd=0x0) at /usr/src/sys/kern/kern_synch.c:439
#2  0xffffffff80c21db7 in sched_bind (td=<optimized out>, cpu=<optimized out>) at /usr/src/sys/kern/sched_ule.c:2704
#3  0xffffffff80c34c33 in epoch_block_handler_preempt (global=<optimized out>, cr=0xfffffe00005a1a00, arg=<optimized out>)
    at /usr/src/sys/kern/subr_epoch.c:394
#4  0xffffffff803c741b in epoch_block (global=<optimized out>, cr=<optimized out>, cb=<optimized out>, ct=<optimized out>)
    at /usr/src/sys/contrib/ck/src/ck_epoch.c:416
#5  ck_epoch_synchronize_wait (global=0xfffff8000380cd80, cb=<optimized out>, ct=<optimized out>) at /usr/src/sys/contrib/ck/src/ck_epoch.c:465
#6  0xffffffff80c3475e in epoch_wait_preempt (epoch=0xfffff8000380cd80) at /usr/src/sys/kern/subr_epoch.c:513
#7  0xffffffff80ce970b in bpf_detachd_locked (d=0xfffff801d309cc00, detached_ifp=<optimized out>) at /usr/src/sys/net/bpf.c:856
#8  0xffffffff80ced166 in bpf_detachd (d=<optimized out>) at /usr/src/sys/net/bpf.c:836
#9  bpf_dtor (data=0xfffff801d309cc00) at /usr/src/sys/net/bpf.c:914

To fix this add the check to the catchpacket() that BPF descriptor was
not detached just before we acquired BPFD_LOCK().

Reported by:	slavash
Tested by:	slavash
MFC after:	1 week
2019-05-28 11:45:00 +00:00
Andrey V. Elsukov
44a514745c Fix possible NULL pointer dereference.
bpf_mtap() can invoke catchpacket() for already detached descriptor.
And this can lead to NULL pointer dereference, since bd_bif pointer
was reset to NULL in bpf_detachd_locked(). To avoid this, use
NET_EPOCH_WAIT() when descriptor is removed from interface's descriptors
list. After the wait it is safe to modify descriptor's content.

Submitted by:	kib
Reported by:	slavash
MFC after:	1 week
2019-05-27 12:41:41 +00:00
John Baldwin
fb3bc59600 Restructure mbuf send tags to provide stronger guarantees.
- Perform ifp mismatch checks (to determine if a send tag is allocated
  for a different ifp than the one the packet is being output on), in
  ip_output() and ip6_output().  This avoids sending packets with send
  tags to ifnet drivers that don't support send tags.

  Since we are now checking for ifp mismatches before invoking
  if_output, we can now try to allocate a new tag before invoking
  if_output sending the original packet on the new tag if allocation
  succeeds.

  To avoid code duplication for the fragment and unfragmented cases,
  add ip_output_send() and ip6_output_send() as wrappers around
  if_output and nd6_output_ifp, respectively.  All of the logic for
  setting send tags and dealing with send tag-related errors is done
  in these wrapper functions.

  For pseudo interfaces that wrap other network interfaces (vlan and
  lagg), wrapper send tags are now allocated so that ip*_output see
  the wrapper ifp as the ifp in the send tag.  The if_transmit
  routines rewrite the send tags after performing an ifp mismatch
  check.  If an ifp mismatch is detected, the transmit routines fail
  with EAGAIN.

- To provide clearer life cycle management of send tags, especially
  in the presence of vlan and lagg wrapper tags, add a reference count
  to send tags managed via m_snd_tag_ref() and m_snd_tag_rele().
  Provide a helper function (m_snd_tag_init()) for use by drivers
  supporting send tags.  m_snd_tag_init() takes care of the if_ref
  on the ifp meaning that code alloating send tags via if_snd_tag_alloc
  no longer has to manage that manually.  Similarly, m_snd_tag_rele
  drops the refcount on the ifp after invoking if_snd_tag_free when
  the last reference to a send tag is dropped.

  This also closes use after free races if there are pending packets in
  driver tx rings after the socket is closed (e.g. from tcpdrop).

  In order for m_free to work reliably, add a new CSUM_SND_TAG flag in
  csum_flags to indicate 'snd_tag' is set (rather than 'rcvif').
  Drivers now also check this flag instead of checking snd_tag against
  NULL.  This avoids false positive matches when a forwarded packet
  has a non-NULL rcvif that was treated as a send tag.

- cxgbe was relying on snd_tag_free being called when the inp was
  detached so that it could kick the firmware to flush any pending
  work on the flow.  This is because the driver doesn't require ACK
  messages from the firmware for every request, but instead does a
  kind of manual interrupt coalescing by only setting a flag to
  request a completion on a subset of requests.  If all of the
  in-flight requests don't have the flag when the tag is detached from
  the inp, the flow might never return the credits.  The current
  snd_tag_free command issues a flush command to force the credits to
  return.  However, the credit return is what also frees the mbufs,
  and since those mbufs now hold references on the tag, this meant
  that snd_tag_free would never be called.

  To fix, explicitly drop the mbuf's reference on the snd tag when the
  mbuf is queued in the firmware work queue.  This means that once the
  inp's reference on the tag goes away and all in-flight mbufs have
  been queued to the firmware, tag's refcount will drop to zero and
  snd_tag_free will kick in and send the flush request.  Note that we
  need to avoid doing this in the middle of ethofld_tx(), so the
  driver grabs a temporary reference on the tag around that loop to
  defer the free to the end of the function in case it sends the last
  mbuf to the queue after the inp has dropped its reference on the
  tag.

- mlx5 preallocates send tags and was using the ifp pointer even when
  the send tag wasn't in use.  Explicitly use the ifp from other data
  structures instead.

- Sprinkle some assertions in various places to assert that received
  packets don't have a send tag, and that other places that overwrite
  rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer.

Reviewed by:	gallatin, hselasky, rgrimes, ae
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20117
2019-05-24 22:30:40 +00:00
Alexander V. Chernikov
563ab4e400 Fix gateway setup for the interface routes.
Currently rinit1() and its IPv6 counterpart
  nd6_prefix_onlink_rtrequest() uses dummy null_sdl gateway address
  during route insertion and change it afterwards. This behaviour
  brings complications to the routing stack and the users of its
  upcoming notification system.

This change fixes both rinit1() and nd6_prefix_onlink_rtrequest()
  by filling in proper gateway in the beginning. It does not change any
  of the userland notifications as in both cases, they happen after
  the insertion and fixup process (rt_newaddrmsg_fib() and nd6_rtmsg()).

MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D20328
2019-05-22 21:20:15 +00:00
Conrad Meyer
e2e050c8ef Extract eventfilter declarations to sys/_eventfilter.h
This allows replacing "sys/eventfilter.h" includes with "sys/_eventfilter.h"
in other header files (e.g., sys/{bus,conf,cpu}.h) and reduces header
pollution substantially.

EVENTHANDLER_DECLARE and EVENTHANDLER_LIST_DECLAREs were moved out of .c
files into appropriate headers (e.g., sys/proc.h, powernv/opal.h).

As a side effect of reduced header pollution, many .c files and headers no
longer contain needed definitions.  The remainder of the patch addresses
adding appropriate includes to fix those files.

LOCK_DEBUG and LOCK_FILE_LINE_ARG are moved to sys/_lock.h, as required by
sys/mutex.h since r326106 (but silently protected by header pollution prior
to this change).

No functional change (intended).  Of course, any out of tree modules that
relied on header pollution for sys/eventhandler.h, sys/lock.h, or
sys/mutex.h inclusion need to be fixed.  __FreeBSD_version has been bumped.
2019-05-20 00:38:23 +00:00
Alexander V. Chernikov
2ad7ed6e4a Fix rt_ifa selection during loopback route insertion process.
Currently such routes are added with a link-level IFA, which is
  plain wrong. Only after the insertion they get fixed by the special
  link_rtrequest() ifa handler. This behaviour complicates routing code
  and makes ifa selection more complex.
Streamline this process by explicitly moving link_rtrequest() logic
  to the pre-insertion rt_getifa_fib() ifa selector. Avoid calling all
  this logic in the loopback route case by explicitly specifying
  proper rt_ifa inside the ifa_maintain_loopback_route().§

MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D20076
2019-05-19 21:49:56 +00:00
Kyle Evans
db226f0d8e tuntap: Defer clearing if_softc until after if_detach
r346670 added an sx to close a race between the ifioctl handler and
interface destruction. Unfortunately, it clears if_softc immediately after
the interface is closed, but before if_detach has been invoked.

Any time before detachment, an interface that's part of a bridge may still
receive traffic that's pushed through tunstart/tunstart_l2 and promptly
lead to a panic because if_softc is now NULL.

Fix it by deferring the clearing of if_softc until after the interface has
detached and thus been removed from the bridge. if_softc still gets cleared
in case another thread has already entered the ioctl handler before it's
replaced with ifdead_ioctl.

Reported by:	markj
MFC after:	3 days
2019-05-14 20:32:29 +00:00
Andrey V. Elsukov
82d7bf6b1b Avoid possible recursion on BPF_LOCK() in bpfwrite().
Release BPF_LOCK() before invoking if_output() and if_input().
Also enter epoch section before releasing lock, this should prevent
access to ifnet that may be freed on interface detach.

Reported by:	markj
2019-05-13 20:17:55 +00:00
Andrey V. Elsukov
af1f58df99 Do not leak memory used for binary filter. 2019-05-13 14:07:02 +00:00
Andrey V. Elsukov
699281b545 Rework locking in BPF code to remove rwlock from fast path.
On high packets rate the contention on rwlock in bpf_*tap*() functions
can lead to packets dropping. To avoid this, migrate this code to use
epoch(9) KPI and ConcurrencyKit's lists.

* all lists changed to use CK_LIST;
* reference counting added to bpf_if and bpf_d;
* now bpf_if references ifnet and releases this reference on destroy;
* each bpf_d descriptor references bpf_if when it is attached;
* new struct bpf_program_buffer introduced to keep BPF filter programs;
* bpf_program_buffer, bpf_d and bpf_if structures are freed by
  epoch_call();
* bpf_freelist and ifnet_departure event are no longer needed, thus
  both are removed;

Reviewed by:	melifaro
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D20224
2019-05-13 13:45:28 +00:00
Kyle Evans
81b3b91e6b tuntap: Improve style
No functional change.

tun_flags of the tuntap_driver was renamed to ident_flags to reflect the
fact that it's a subset of the tun_flags that identifies a tuntap device.
This maps more easily (visually) to the TUN_DRIVER_IDENT_MASK that masks off
the bits of tun_flags that are applicable to tuntap driver ident. This is a
purely cosmetic change.
2019-05-11 04:18:06 +00:00
Eric Joyner
afb7737237 iflib: use default ntxd and nrxd when user value is not power of 2
From Jake:
A user may set a sysctl to override the default number of Tx or Rx
descriptors. However, certain calculations in the iflib core expect the
number of descriptors to be a power of 2.

Update _iflib_assert to verify that all of the shared context parameters
for the number of descriptors are powers of 2.

Modify iflib_reset_qvalues to check that the provided isc_nrxd value is
a power of 2. If it's not, print a warning message and then use the
default value.

An alternative might be to try rounding the number down instead.
However, this creates problems in case the rounded down value is below
the minimum value that the driver would support.

Submitted by:	Jacob Keller <jacob.e.keller@intel.com>
Reviewed by:	marius@
MFC after:	1 week
Sponsored by:	Intel Corporation
Differential Revision:	https://reviews.freebsd.org/D19880
2019-05-10 00:41:42 +00:00
Kyle Evans
16760d8e28 tuntap: Don't down tap interfaces if LINK0 is set 2019-05-09 18:54:29 +00:00
Kyle Evans
a6fa049545 tuntap: Properly detach tap ifp 2019-05-09 14:06:24 +00:00
Marius Strobl
14e0010729 - Merge r338254 from cxgbe(4):
Use fcmpset instead of cmpset when appropriate.
- Revert r277226 of cxgbe(4), obsolete since r334320.
2019-05-09 11:34:46 +00:00
Gleb Smirnoff
6ca363eb7b Existense of PCB route caching doesn't allow us to use new fast route
lookup KPI in ip_output() like it is already used in ip_forward().
However, when there is no PCB provided we can use fast KPI, gaining
performance advantage.

Typical case when ip_output() is called without a PCB pointer is a
sendto(2) on a not connected UDP socket. In practice DNS servers do
this.

Reviewed by:	melifaro
Differential Revision:	https://reviews.freebsd.org/D19804
2019-05-08 23:39:24 +00:00
Marius Strobl
007b804fc7 Allow to build without INET and INET6 again after r347221.
Submitted by:	cam
2019-05-08 09:03:43 +00:00
Kyle Evans
251a32b5b2 tun/tap: merge and rename to tuntap
tun(4) and tap(4) share the same general management interface and have a lot
in common. Bugs exist in tap(4) that have been fixed in tun(4), and
vice-versa. Let's reduce the maintenance requirements by merging them
together and using flags to differentiate between the three interface types
(tun, tap, vmnet).

This fixes a couple of tap(4)/vmnet(4) issues right out of the gate:
- tap devices may no longer be destroyed while they're open [0]
- VIMAGE issues already addressed in tun by kp

[0] emaste had removed an easy-panic-button in r240938 due to devdrn
blocking. A naive glance over this leads me to believe that this isn't quite
complete -- destroy_devl will only block while executing d_* functions, but
doesn't block the device from being destroyed while a process has it open.
The latter is the intent of the condvar in tun, so this is "fixed" (for
certain definitions of the word -- it wasn't really broken in tap, it just
wasn't quite ideal).

ifconfig(8) also grew the ability to map an interface name to a kld, so
that `ifconfig {tun,tap}0` can continue to autoload the correct module, and
`ifconfig vmnet0 create` will now autoload the correct module. This is a
low overhead addition.

(MFC commentary)

This may get MFC'd if many bugs in tun(4)/tap(4) are discovered after this,
and how critical they are. Changes after this are likely easily MFC'd
without taking this merge, but the merge will be easier.

I have no plans to do this MFC as of now.

Reviewed by:	bcr (manpages), tuexen (testing, syzkaller/packetdrill)
Input also from:	melifaro
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D20044
2019-05-08 02:32:11 +00:00
Marius Strobl
3d10e9ed62 o Use iflib_fast_intr_rxtx() also for "legacy" interrupts, i. e. INTx and
MSI. Unlike as with iflib_fast_intr_ctx(), the former will also enqueue
  _task_fn_tx() in addition to _task_fn_rx() if appropriate, bringing TCP
  TX throughput of EM-class devices on par with the MSI-X case and, thus,
  close to wirespeed/pre-iflib(4) times again. [1]
  Note that independently of the interrupt type, the UDP performance with
  these MACs still is abysmal and nowhere near to where it was before the
  conversion of em(4) to iflib(4).
o In iflib_init_locked(), announce which free list failed to set up.
o In _task_fn_tx() when running netmap(4), issue ifdi_intr_enable instead
  of the ifdi_tx_queue_intr_enable method in case of a "legacy" interrupt
  as the latter is valid with MSI-X only.
o Instead of adding the missing - and apparently convoluted enough that a
  DBG_COUNTER_INC was put into a wrong spot in _task_fn_rx() - checks for
  ifdi_{r,t}x_queue_intr_enable being available in the MSI-X case also to
  iflib_fast_intr_rxtx(), factor these out to iflib_device_register() and
  make the checks fail gracefully rather than panic. This avoids invoking
  the checks at runtime over and over again in iflib_fast_intr_rxtx() and
  _task_fn_{r,t}x() - even if it's just in case of INVARIANTS - and makes
  these functions more readable.
o In iflib_rx_structures_setup(), only initialize LRO resources if device
  and driver have LRO capability in order to not waste memory. Also, free
  the LRO resources again if setting them up fails for one of the queues.
  However, don't bother invoking iflib_rx_sds_free() in that case because
  iflib_rx_structures_setup() doesn't call iflib_rxsd_alloc() either (and
  iflib_{device,pseudo}_register() will issue iflib_rx_sds_free() in case
  of failure via iflib_rx_structures_free(), but there definitely is some
  asymmetry left to be fixed, though).
o Similarly, free LRO resources again in iflib_rx_structures_free().
o In iflib_irq_set_affinity(), handle get_core_offset() errors gracefully
  instead of panicing (but only in case of INVARIANTS). This is a follow-
  up to r344132, as such driver bugs shouldn't be fatal.
o Likewise, handle unknown iflib_intr_type_t in iflib_irq_alloc_generic()
  gracefully, too.
o Bring yet more sanity to iflib_msix_init():
  - If the device doesn't provide enough MSI-X vectors or not all vectors
    can be allocate so the expected number of queues in addition to admin
    interrupts can't be supported, try MSI next (and then INTx) as proper
    MSI-X vector distribution can't be assured in such cases. In essence,
    this change brings r254008 forward to iflib(4). Also, this is the fix
    alluded to in the commit message of r343934.
  - If the MSI-X allocation has failed, don't prematurely announce MSI is
    going to be used as the latter in fact may not be available either.
  - When falling back to MSI, only release the MSI-X table resource again
    if it was allocated in iflib_msix_init(), i. e. isn't supplied by the
    driver, in the first place.
o In mp_ndesc_handler(), handle unknown type arguments gracefully, too.

PR:		235031 (likely) [1]
Reviewed by:	shurd
Differential Revision:	https://reviews.freebsd.org/D20175
2019-05-07 08:28:35 +00:00
Marius Strobl
1722eeac95 - Remove the unused ifc_link_irq and ifc_mtx_name members of struct iflib_ctx.
- Remove the only ever written to ift_db_mtx_name member of struct iflib_txq.
- Remove the unused or only ever written to ifr_size, ifr_cq_pidx, ifr_cq_gen
  and ifr_lro_enabled members of struct iflib_rxq.
- Consistently spell DMA, RX and TX uppercase in comments, messages etc.
  instead of mixing with some lowercase variants.
- Consistently use if_t instead of a mix of if_t and struct ifnet pointers.
- Bring the function comments of _iflib_fl_refill(), iflib_rx_sds_free() and
  iflib_fl_setup() in line with reality.
- Judging problem reports, people are wondering what on earth messages like:
  "TX(0) desc avail = 1024, pidx = 0"
  are trying to indicate. Thus, extend this string to be more like that of
  non-iflib(4) Ethernet MAC drivers, notifying about a watchdog timeout due
  to which the interface will be reset.
- Take advantage of the M_HAS_VLANTAG macro.
- Use false/true rather than FALSE/TRUE for variables of type bool.
- Use FALLTHROUGH as advocated by style(9).
2019-05-06 20:56:41 +00:00
Matt Macy
e2621d9657 Allow iflib drivers to pass a pointer to their own ifmedia structure.
Tested by: emaste@

Differential Revision:	https://reviews.freebsd.org/D19946
2019-05-03 20:05:31 +00:00
Andrew Gallatin
35961dce98 Select lacp egress ports based on NUMA domain
This change creates an array of port maps indexed by numa domain
for lacp port selection. If we have lacp interfaces in more than
one domain, then we select the egress port by indexing into the
numa port maps and picking a port on the appropriate numa domain.

This is behavior is controlled by the new ifconfig use_numa flag
and net.link.lagg.use_numa sysctl/tunable (both modeled after the
existing use_flowid), which default to enabled.

Reviewed by:	bz, hselasky, markj (and scottl, earlier version)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20060
2019-05-03 14:43:21 +00:00
Ed Maste
ce3da455e9 iflib: remove assertion that isc_capabilities is nonzero
It's atypical, but not invalid, for a driver to pass no capabilities.

Submitted by:	Gerald Aryeetey <aryeeteygerald_rogers.com>
Reviewed by:	shurd
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20142
2019-05-02 19:13:31 +00:00
Stephen Hurd
f154ece02e iflib: Better control over queue core assignment
By default, cores are now assigned to queues in a sequential
manner rather than all NICs starting at the first core. On a four-core
system with two NICs each using two queue pairs, the nic:queue -> core
mapping has changed from this:

0:0 -> 0, 0:1 -> 1
1:0 -> 0, 1:1 -> 1

To this:

0:0 -> 0, 0:1 -> 1
1:0 -> 2, 1:1 -> 3

Additionally, a device can now be configured to use separate cores for TX
and RX queues.

Two new tunables have been added, dev.X.Y.iflib.separate_txrx and
dev.X.Y.iflib.core_offset. If core_offset is set, the NIC is not part
of the auto-assigned sequence.

Reviewed by:	marius
MFC after:	2 weeks
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D20029
2019-04-25 21:24:56 +00:00
Kyle Evans
e3a883c386 tap(4): Correct driver name...
Reported by:	rgrimes
Pointy hat to:	kevans
MFC after:	3 days
X-MFC-With:	r346688
2019-04-25 18:26:34 +00:00
Kyle Evans
9ea63b2caa tap(4): Add a MODULE_VERSION
Otherwise tap(4) can be loaded by loader despite being compiled into the
kernel, causing a panic as things try to double-initialize.

PR:		220867
MFC after:	3 days
2019-04-25 18:22:22 +00:00
Kyle Evans
c83651445b tun(4): Don't allow open of open or dying devices
Previously, a pid check was used to prevent open of the tun(4); this works,
but may not make the most sense as we don't prevent the owner process from
opening the tun device multiple times.

The potential race described near tun_pid should not be an issue: if a
tun(4) is to be handed off, its fd has to have been sent via control message
or some other mechanism that duplicates the fd to the receiving process so
that it may set the pid. Otherwise, the pid gets cleared when the original
process closes it and you have no effective handoff mechanism.

Close up another potential issue with handing a tun(4) off by not clobbering
state if the closer isn't the controller anymore. If we want some state to
be cleared, we should do that a little more surgically.

Additionally, nothing prevents a dying tun(4) from being "reopened" in the
middle of tun_destroy as soon as the mutex is unlocked, quickly leading to a
bad time. Return EBUSY if we're marked for destruction, as well, and the
consumer will need to deal with it. The associated character device will be
destroyed in short order.

MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D20033
2019-04-25 13:46:12 +00:00
Kyle Evans
d91262603b tun/tap: close race between destroy/ioctl handler
It seems that there should be a better way to handle this, but this seems to
be the more common approach and it should likely get replaced in all of the
places it happens... Basically, thread 1 is in the process of destroying the
tun/tap while thread 2 is executing one of the ioctls that requires the
tun/tap mutex and the mutex is destroyed before the ioctl handler can
acquire it.

This is only one of the races described/found in PR 233955.

PR:		233955
Reviewed by:	ae
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D20027
2019-04-25 12:44:08 +00:00
Andrew Gallatin
6d49b41ee8 iflib: Add pfil hooks
As with mlx5en, the idea is to drop unwanted traffic as early
in receive as possible, before mbufs are allocated and anything
is passed up the stack.  This can save considerable CPU time
when a machine is under a flooding style DOS attack.

The major change here is to remove the unneeded abstraction where
callers of rxd_frag_to_sd() get back a pointer to the mbuf ring, and
are responsible for NULL'ing that mbuf themselves. Now this happens
directly in rxd_frag_to_sd(), and it returns an mbuf. This allows us
to use the decision (and potentially mbuf) returned by the pfil
hooks. The driver can now recycle mbufs to avoid re-allocation when
packets are dropped.

Reviewed by:	marius  (shurd and erj also provided feedback)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D19645
2019-04-24 13:32:04 +00:00
Andrey V. Elsukov
aee793eec9 Add GRE-in-UDP encapsulation support as defined in RFC8086.
This GRE-in-UDP encapsulation allows the UDP source port field to be
used as an entropy field for load-balancing of GRE traffic in transit
networks. Also most of multiqueue network cards are able distribute
incoming UDP datagrams to different NIC queues, while very little are
able do this for GRE packets.

When an administrator enables UDP encapsulation with command
`ifconfig gre0 udpencap`, the driver creates kernel socket, that binds
to tunnel source address and after udp_set_kernel_tunneling() starts
receiving of all UDP packets destined to 4754 port. Each kernel socket
maintains list of tunnels with different destination addresses. Thus
when several tunnels use the same source address, they all handled by
single socket.  The IP[V6]_BINDANY socket option is used to be able bind
socket to source address even if it is not yet available in the system.
This may happen on system boot, when gre(4) interface is created before
source address become available. The encapsulation and sending of packets
is done directly from gre(4) into ip[6]_output() without using sockets.

Reviewed by:	eugen
MFC after:	1 month
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D19921
2019-04-24 09:05:45 +00:00
Kyle Evans
e8de0c3bda tun(4): Defer clearing TUN_OPEN until much later
tun destruction will not continue until TUN_OPEN is cleared. There are brief
moments in tunclose where the mutex is dropped and we've already cleared
TUN_OPEN, so tun_destroy would be able to proceed while we're in the middle
of cleaning up the tun still. tun_destroy should be blocked until these
parts (address/route purges, mostly) are complete.

PR:		233955
MFC after:	2 weeks
2019-04-23 17:28:28 +00:00
Andrew Gallatin
7687707dd4 Track device's NUMA domain in ifnet & alloc ifnet from NUMA local memory
This commit adds new if_alloc_domain() and if_alloc_dev() methods to
allocate ifnets.  When called with a domain on a NUMA machine,
ifalloc_domain() will record the NUMA domain in the ifnet, and it will
allocate the ifnet struct from memory which is local to that NUMA
node.  Similarly, if_alloc_dev() is a wrapper for if_alloc_domain
which uses a driver supplied device_t to call ifalloc_domain() with
the appropriate domain.

Note that the new if_numa_domain field fits in an alignment pad in
struct ifnet, and so does not alter the size of the structure.

Reviewed by:	glebius, kib, markj
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D19930
2019-04-22 19:24:21 +00:00
Kyle Evans
1fd8c72c0a iflib: Use new ether_gen_addr, restricting addresses to that subset
Differential Revision:	https://reviews.freebsd.org/D19587
2019-04-17 17:19:54 +00:00
Kyle Evans
3c3aa8c170 net: adjust randomized address bits
Give devices that need a MAC a 16-bit allocation out of the FreeBSD
Foundation OUI range. Change the name ether_fakeaddr to ether_gen_addr now
that we're dealing real MAC addresses with a real OUI rather than random
locally-administered addresses.

Reviewed by:	bz, rgrimes
Differential Revision:	https://reviews.freebsd.org/D19587
2019-04-17 17:18:43 +00:00
Michael Tuexen
e6481fd4c4 When sending a routing message, don't allow the user to set the
RTF_RNH_LOCKED flag in rtm_flags, since this flag is used only
internally.

Reported by:		syzbot+65c676f5248a13753ea0@syzkaller.appspotmail.com
Reviewed by:		ae@
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D19898
2019-04-14 10:18:14 +00:00
Conrad Meyer
a8a16c7128 Replace read_random(9) with more appropriate arc4rand(9) KPIs
Reviewed by:	ae, delphij
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D19760
2019-04-04 01:02:50 +00:00
Mark Johnston
ca1163bd5f Do not perform DAD on stf(4) interfaces.
stf(4) interfaces are not multicast-capable so they can't perform DAD.
They also did not set IFF_DRV_RUNNING when an address was assigned, so
the logic in nd6_timer() would periodically flag such an address as
tentative, resulting in interface flapping.

Fix the problem by setting IFF_DRV_RUNNING when an address is assigned,
and do some related cleanup:
- In in6if_do_dad(), remove a redundant check for !UP || !RUNNING.
  There is only one caller in the tree, and it only looks at whether
  the return value is non-zero.
- Have in6if_do_dad() return false if the interface is not
  multicast-capable.
- Set ND6_IFF_NO_DAD when an address is assigned to an stf(4) interface
  and the interface goes UP as a result. Note that this is not
  sufficient to fix the problem because the new address is marked as
  tentative and DAD is started before in6_ifattach() is called.
  However, setting no_dad is formally correct.
- Change nd6_timer() to not flag addresses as tentative if no_dad is
  set.

This is based on a patch from Viktor Dukhovni.

Reported by:	Viktor Dukhovni <ietf-dane@dukhovni.org>
Reviewed by:	ae
MFC after:	3 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D19751
2019-03-30 18:00:44 +00:00
John Baldwin
841613dcdc Use a dedicated malloc type for lagg(4)'s structures.
Reviewed by:	gallatin
MFC after:	1 month
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D19719
2019-03-28 21:00:54 +00:00
Eric Joyner
225eae1bb7 iflib: return ENETDOWN when the network device is down
From Jake:
iflib_if_transmit returns ENOBUFS when the device is down, or when the
link isn't active.

This was changed in r308792 from return (0), so that the function
correctly reports an error that it was unable to transmit.

However, using ENOBUFS can cause some network applications to produce
the following or similar errors:

"ping: sendto: No buffer space available"

This is a bit confusing as the real cause of the issue is that the
network device is down.

Replace the ENOBUFS return with ENETDOWN to indicate more clearly that
the reason for the failure to send is due to the network device is
offline.

This will cause the error message to be reported as

"ping: sendto: Network is down"

Submitted by:	Jacob Keller <jacob.e.keller@intel.com>
Reviewed by:	shurd@, sbruno@, bz@
MFC after:	1 week
Sponsored by:	Intel Corporation
Differential Revision:	https://reviews.freebsd.org/D19652
2019-03-28 20:46:45 +00:00
Eric Joyner
aac9c817af iflib: hold the CTX lock in iflib_pseudo_register
From Jake:
The iflib_device_register function takes the CTX lock before calling
IFDI_ATTACH_PRE, and releases it upon finishing the registration.

Mirror this process in iflib_pseudo_register, so that we always hold the
CTX lock during the attach process when registering a pseudo interface
or a regular interface.

This was caught by code inspection while attempting to analyze where the
CTX lock was held.

Submitted by:	Jacob Keller <jacob.e.keller@intel.com>
Reviewed by:	shurd@, erj@
MFC after:	1 week
Sponsored by:	Intel Corporation
Differential Revision:	https://reviews.freebsd.org/D19604
2019-03-28 20:43:47 +00:00
John Baldwin
2f59b04af1 Remove nested epochs from lagg(4).
lagg_bcast_start appeared to have a bug in that was using the last
lagg port structure after exiting the epoch that was keeping that
structure alive.  However, upon further inspection, the epoch was
already entered by the caller (lagg_transmit), so the epoch enter/exit
in lagg_bcast_start was actually unnecessary.

This commit generally removes uses of the net epoch via LAGG_RLOCK to
protect the list of ports when the list of ports was already protected
by an existing LAGG_RLOCK in a caller, or the LAGG_XLOCK.

It also adds a missing epoch enter/exit in lagg_snd_tag_alloc while
accessing the lagg port structures.  An ifp is still accessed via an
unsafe reference after the epoch is exited, but that is true in the
current code and will be fixed in a future change.

Reviewed by:	gallatin
MFC after:	1 month
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D19718
2019-03-28 20:25:36 +00:00
Kyle Evans
93c9d31918 if_bridge(4): ensure all traffic passing over the bridge is accounted for
Consider a bridge0 with em0 and em1 members. Traffic rx'd by em0 and
transmitted by bridge0 through em1 gets accounted for in IPACKETS/IBYTES
and bridge0 bpf -- assuming it's not unicast traffic destined for em1.
Unicast traffic destined for em1 traffic is not accounted for by any
mechanism, and isn't pushed through bridge0's bpf machinery as any other
packets that pass over the bridge do.

Fix this and simplify GRAB_OUR_PACKETS by bailing out early if it was rx'd
by the interface that it was addressed for. Everything else there is
relevant for any traffic that came in from one member that's being directed
at another member of the bridge.

Reviewed by:	kp
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D19614
2019-03-28 03:31:51 +00:00
Eric Joyner
10a1e981d4 iflib: mark isc_driver_version as constant
From Jake:
The iflib core never modifies the isc_driver_version string. Allow
drivers to safely assign pointers to constant buffers by marking this
parameter const.

Submitted by:	Jacob Keller <jacob.e.keller@intel.com>
Reviewed by:	erj@, gallatin@, jhb@
MFC after:	1 week
Sponsored by:	Intel Corporation
Differential Revision:	https://reviews.freebsd.org/D19577
2019-03-19 23:44:26 +00:00
Eric Joyner
1b9d93948a iflib: expose the Rx mbuf buffer size to drivers
From Jake:
iflib_fl_setup calculates a suitable buffer size for the Rx mbufs based
on the isc_max_frame_size value that drivers setup. This calculation is
repeated by drivers when programming their hardware with the size of
each Rx buffer.

This can lead to a mismatch where the iflib mbuf size is different from
the expected size of the buffer as programmed by the hardware. This can
lead to unexpected results.

If iflib ever wants to support mbuf sizes larger than one page, every
driver must be updated to account for the new possible buffer sizes.

Fix this by calculating the mbuf size prior to calling IFDI_INIT, and
adding the iflib_get_rx_mbuf_sz function which will expose this value to
drivers, so that they do not repeat the same calculation.

Submitted by:	Jacob Keller <jacob.e.keller@intel.com>
Reviewed by:	shurd@, erj@
MFC after:	1 week
Sponsored by:	Intel Corporation
Differential Revision:	https://reviews.freebsd.org/D19489
2019-03-19 17:59:56 +00:00
Eric Joyner
3e8d1bae5f iflib: prevent possible infinite loop in iflib_encap
From Jake:
iflib_encap calls bus_dmamap_load_mbuf_sg. Upon it returning EFBIG, an
m_collapse and an m_defrag are attempted to shrink the mbuf cluster to
fit within the DMA segment limitations.

However, if we call m_defrag, and then bus_dmamap_load_mbuf_sg returns
EFBIG on the now defragmented mbuf, we will continuously re-call
bus_dmamap_load_mbuf_sg over and over.

This happens because m_head isn't NULL, and remap is >1, so we don't try
to m_collapse or m_defrag again. The only way we exit the loop is if
m_head is NULL. However, m_head can't be modified by the call to
bus_dmamap_load_mbuf_sg, because we don't pass it as a double pointer.

I believe this will be an incredibly rare occurrence, because it is
unlikely that bus_dmamap_load_mbuf_sg will actually fail on the second
defragment with an EFBIG error. However, it still seems like
a possibility that we should account for.

Fix the exit check to ensure that if remap is >1, we will also exit,
even if m_head is not NULL.

Submitted by:	Jacob Keller <jacob.e.keller@intel.com>
Reviewed by:	shurd@, gallatin@
MFC after:	1 week
Sponsored by:	Intel Corporation
Differential Revision:	https://reviews.freebsd.org/D19468
2019-03-19 17:49:03 +00:00
Andrey V. Elsukov
c5be49da01 Convert allocation of bpf_if in bpfattach2 from M_NOWAIT to M_WAITOK
and remove possible panic condition.

It is already allowed to sleep in bpfattach[2], since BPF_LOCK was
converted to SX lock in r332388. Also move KASSERT() to the top of
function and make full initialization before bpf_if will be linked
to BPF's list of interfaces.

MFC after:	2 weeks
2019-03-19 10:29:32 +00:00
Vincenzo Maffione
d12354a56c netmap: add support for multiple host rings
Some applications forward from/to host rings most or all the
traffic received or sent on a physical interface. In this
cases it is desirable to have more than a pair of RX/TX host
rings, and use multiple threads to speed up forwarding.
This change adds support for multiple host rings. On registering
a netmap port, the user can specify the number of desired receive
and transmit host rings in the nr_host_tx_rings and nr_host_rx_rings
fields of the nmreq_register structure.

MFC after:	2 weeks
2019-03-18 12:22:23 +00:00
Kyle Evans
4920f9a348 if_bridge(4): Drop pointless rtflush
At this point, all routes should've already been dropped by removing all
members from the bridge. This condition is in-fact KASSERT'd in the line
immediately above where this nop flush was added.
2019-03-15 17:19:36 +00:00
Kyle Evans
6e6b93fe1d Revert r345192: Too many trees in play for bridge(4) bits
An accidental appendage was committed that has not undergone review yet.
2019-03-15 17:18:19 +00:00
Kyle Evans
4b4b284d95 if_bridge(4): Drop pointless rtflush
At this point, all routes should've already been dropped by removing all
members from the bridge. This condition is in-fact KASSERT'd in the line
immediately above where this nop flush was added.
2019-03-15 17:13:05 +00:00
Kristof Provost
43d3127ca7 bridge: Fix STP-related panic
After r345180 we need to have the appropriate vnet context set to delete an
rtnode in bridge_rtnode_destroy().
That's usually the case, but not when it's called by the STP code (through
bstp_notify_rtage()).

We have to set the vnet context in bridge_rtable_expire() just as we do in the
other STP callback bridge_state_change().

Reviewed by:	kevans
2019-03-15 15:52:36 +00:00
Kyle Evans
a87407ff85 if_bridge(4): Fix module teardown
bridge_rtnode_zone still has outstanding allocations at the time of
destruction in the current model because all of the interface teardown
happens in a VNET_SYSUNINIT, -after- the MOD_UNLOAD has already been
processed.  The SYSUNINIT triggers destruction of the interfaces, which then
attempts to free the memory from the zone that's already been destroyed, and
we hit a panic.

Solve this by virtualizing the uma_zone we allocate the rtnodes from to fix
the ordering. bridge_rtable_fini should also take care to flush any
remaining routes that weren't taken care of when dynamic routes were flushed
in bridge_stop.

Reviewed by:	kp
Differential Revision:	https://reviews.freebsd.org/D19578
2019-03-15 13:19:52 +00:00
Kristof Provost
d6747eafa9 bridge: Fix panic if the STP root is removed
If the spanning tree root interface is removed from the bridge we panic
on the next 'ifconfig'.
While the STP code is notified whenever a bridge member interface is
removed from the bridge it does not clear the bs_root_port. This means
bs_root_port can still point at an bridge_iflist which has been free()d.
The next access to it will panic.

Explicitly check if the interface we're removing in bstp_destroy() is
the root, and if so re-assign the roles, which clears bs_root_port.

Reviewed by:	philip
MFC after:	2 weeks
2019-03-15 11:21:20 +00:00
Kristof Provost
5904868691 pf :Use counter(9) in pf tables.
The counters of pf tables are updated outside the rule lock. That means state
updates might overwrite each other. Furthermore allocation and
freeing of counters happens outside the lock as well.

Use counter(9) for the counters, and always allocate the counter table
element, so that the race condition cannot happen any more.

PR:		230619
Submitted by:	Kajetan Staszkiewicz <vegeta@tuxpowered.net>
Reviewed by:	glebius
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D19558
2019-03-15 11:08:44 +00:00
Kyle Evans
521b05ea52 ether_fakeaddr: Use 'b' 's' 'd' for the prefix
This has the advantage of being obvious to sniff out the designated prefix
by eye and it has all the right bits set. Comment stolen from ffec.

I've removed bryanv@'s pending question of using the FreeBSD OUI range --
no one has followed up on this with a definitive action, and there's no
particular reason to shoot for it and the administrative overhead that comes
with deciding exactly how to use it.
2019-03-14 19:48:43 +00:00
Kyle Evans
6b7e0c1cca ether: centralize fake hwaddr generation
We currently have two places with identical fake hwaddr generation --
if_vxlan and if_bridge. Lift it into if_ethersubr for reuse in other
interfaces that may also need a fake addr.

Reviewed by:	bryanv, kp, philip
Differential Revision:	https://reviews.freebsd.org/D19573
2019-03-14 17:18:00 +00:00
Gleb Smirnoff
c93410229c Most Ethernet drivers that potentially can run a pfil(9) hook with
PFIL_MEMPTR flag are intentionally providing a memory address that
isn't aligned to pointer alignment. This is done to align an IPv4
or IPv6 header that is expected to follow Ethernet header.

When we return PFIL_REALLOCED we store a pointer to allocated mbuf
at this address. With this change the KPI changes to store the pointer
at aligned address, which usually yields in +2 bytes.

Provide two inlines:

pfil_packet_align() to get aligned pfil_packet_t for a misaligned one
pfil_mem2mbuf() to read out mbuf pointer from misaligned pfil_packet_t

Provide function pfil_realloc(), not used yet, that would convert a
memory pfil_packet_t to an mbuf one.

Reported by:	hps
Reviewed by:	hps, gallatin
2019-03-10 17:20:09 +00:00
Gleb Smirnoff
b9fdb4b3a3 Properly handle a case when a first filter returns PFIL_REALLOCED, then
second one returns PFIL_PASS.
2019-03-10 17:08:05 +00:00
Bjoern A. Zeeb
b25d74e06c Improve ARP logging.
r344504 added an extra ARP_LOG() call in case of an if_output() failure.
It turns out IPv4 can be noisy. In order to not spam the console by default:
(a) add a counter for these events so people can keep better track of how
    often it happens, and
(b) add a sysctl to select the default ARP_LOG log level and set it to
    INFO avoiding the one (the new) DEBUG level by default.

Claim a spare (1st one after 10 years since the stats were added) in order
to not break netstat from FreeBSD 12->13 updates in the future.

Reviewed by:		karels
Differential Revision:	https://reviews.freebsd.org/D19490
2019-03-09 01:12:59 +00:00
Bjoern A. Zeeb
21231a7aa6 Update for IETF draft-ietf-6man-ipv6only-flag.
All changes are hidden behind the EXPERIMENTAL option and are not compiled
in by default.

Add ND6_IFF_IPV6_ONLY_MANUAL to be able to set the interface into no-IPv4-mode
manually without router advertisement options.  This will allow developers to
test software for the appropriate behaviour even on dual-stack networks or
IPv6-Only networks without the option being set in RA messages.
Update ifconfig to allow setting and displaying the flag.

Update the checks for the filters to check for either the automatic or the manual
flag to be set.  Add REVARP to the list of filtered IPv4-related protocols and add
an input filter similar to the output filter.

Add a check, when receiving the IPv6-Only RA flag to see if the receiving
interface has any IPv4 configured.  If it does, ignore the IPv6-Only flag.

Add a per-VNET global sysctl, which is on by default, to not process the automatic
RA IPv6-Only flag.  This way an administrator (if this is compiled in) has control
over the behaviour in case the node still relies on IPv4.
2019-03-06 23:31:42 +00:00
Eric Joyner
bc408c7d61 Remove references to CONTIGMALLOC_WORKS in iflib and em
From Jake:
"The iflib_fl_setup() function tries to pick various buffer sizes based
on the max_frame_size value defined by the parent driver. However, this
code was wrapped under CONTIGMALLOC_WORKS, which was never actually
defined anywhere.

This same code pattern was used in if_em.c, likely trying to match
what iflib uses.

Since CONTIGMALLOC_WORKS is not defined, remove this dead code from
iflib_fl_setup and if_em.c

Given that various iflib drivers appear to be using a similar
calculation, it might be worth making this buffer size a value that the
driver can peek at in the future."

Submitted by:	Jacob Keller <jacob.e.keller@intel.com>
Reviewed by:	shurd@
MFC after:	1 week
Sponsored by:	Intel Corporation
Differential Revision:	https://reviews.freebsd.org/D19199
2019-03-05 19:12:51 +00:00
Kristof Provost
5ea5849a7b tun: VIMAGE fix for if_tun cloner
The if_tun cloner is not virtualised, but if_clone_attach() does use a
virtualised list of cloners.
The result is that we can't find the if_tun cloner when we try to remove
a renamed tun interface. Virtualise the cloner, and move the final
cleanup into a sysuninit so that we're sure this happens after all of
the vnet_sysuninits

Note that we need unit numbers to be system-unique (rather than unique
per vnet, as is done by if_clone_simple()). The unit number is used to
create the corresponding /dev/tunX device node, and this node must match
with the interface.
Switch to if_clone_advanced() so that we have control over the unit
numbers.

Reproduction scenario:
	jail -c -n foo persist vnet
	jexec test ifconfig tun create
	jexec test ifconfig tun0 name wg0
	jexec test ifconfig wg0 destroy

PR:		235704
Reviewed by:	bz, hrs, hselasky
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D19248
2019-03-05 13:21:07 +00:00
Alexander Motin
c3c93809f6 bridge: Fix spurious warnings about capabilities
Mask off the bits we don't care about when checking that capabilities
of the member interfaces have been disabled as intended.

Submitted by:	Ryan Moeller <ryan@ixsystems.com>
Reviewed by:	kristof, mav
MFC after:	1 week
Sponsored by:	iXsystems, Inc.
Differential Revision:	https://reviews.freebsd.org/D18924
2019-03-04 22:01:09 +00:00
Stephen Hurd
ca62461bc6 iflib: Improve return values of interrupt handlers.
iflib was returning FILTER_HANDLED, in cases where FILTER_STRAY was more
correct. This potentially caused issues with shared legacy interrupts.

Driver filters returning FILTER_STRAY are now properly handled.

Submitted by:	Augustin Cavalier <waddlesplash@gmail.com>
Reviewed by:	marius, gallatin
Obtained from:	Haiku (a84bb9, 4947d1)
MFC after:	1 week
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D19201
2019-02-15 18:51:43 +00:00
Randall Stewart
fa91f84502 This commit adds the missing release mechanism for the
ratelimiting code. The two modules (lagg and vlan) did have
allocation routines, and even though they are indirect (and
vector down to the underlying interfaces) they both need to
have a free routine (that also vectors down to the actual interface).

Sponsored by:	Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D19032
2019-02-13 14:57:59 +00:00
Marius Strobl
a6611c938b Fix the build with ALTQ after r344060. 2019-02-12 22:33:17 +00:00
Marius Strobl
f855ec814d Make taskqgroup_attach{,_cpu}(9) work across architectures
So far, intr_{g,s}etaffinity(9) take a single int for identifying
a device interrupt. This approach doesn't work on all architectures
supported, as a single int isn't sufficient to globally specify a
device interrupt. In particular, with multiple interrupt controllers
in one system as found on e. g. arm and arm64 machines, an interrupt
number as returned by rman_get_start(9) may be only unique relative
to the bus and, thus, interrupt controller, a certain device hangs
off from.
In turn, this makes taskqgroup_attach{,_cpu}(9) and - internal to
the gtaskqueue implementation - taskqgroup_attach_deferred{,_cpu}()
not work across architectures. Yet in turn, iflib(4) as gtaskqueue
consumer so far doesn't fit architectures where interrupt numbers
aren't globally unique.
However, at least for intr_setaffinity(..., CPU_WHICH_IRQ, ...) as
employed by the gtaskqueue implementation to bind an interrupt to a
particular CPU, using bus_bind_intr(9) instead is equivalent from
a functional point of view, with bus_bind_intr(9) taking the device
and interrupt resource arguments required for uniquely specifying a
device interrupt.
Thus, change the gtaskqueue implementation to employ bus_bind_intr(9)
instead and intr_{g,s}etaffinity(9) to take the device and interrupt
resource arguments required respectively. This change also moves
struct grouptask from <sys/_task.h> to <sys/gtaskqueue.h> and wraps
struct gtask along with the gtask_fn_t typedef into #ifdef _KERNEL
as userland likes to include <sys/_task.h> or indirectly drags it
in - for better or worse also with _KERNEL defined -, which with
device_t and struct resource dependencies otherwise is no longer
as easily possible now.
The userland inclusion problem probably can be improved a bit by
introducing a _WANT_TASK (as well as a _WANT_MOUNT) akin to the
existing _WANT_PRISON etc., which is orthogonal to this change,
though, and likely needs an exp-run.

While at it:
- Change the gt_cpu member in the grouptask structure to be of type
  int as used elswhere for specifying CPUs (an int16_t may be too
  narrow sooner or later),
- move the gtaskqueue_enqueue_fn typedef from <sys/gtaskqueue.h> to
  the gtaskqueue implementation as it's only used and needed there,
- change the GTASK_INIT macro to use "gtask" rather than "task" as
  argument given that it actually operates on a struct gtask rather
  than a struct task, and
- let subr_gtaskqueue.c consistently use __func__ to print functions
  names.

Reported by:	mmel
Reviewed by:	mmel
Differential Revision:	https://reviews.freebsd.org/D19139
2019-02-12 21:23:59 +00:00
Marius Strobl
95dcf343b7 Further correct and optimize the bus_dma(9) usage of iflib(4):
o Correct the obvious bugs in the netmap(4) parts:
  - No longer check for the existence of DMA maps as bus_dma(9)
    is used unconditionally in iflib(4) since r341095.
  - Supply the correct DMA tag and map pairs to bus_dma(9)
    functions (see also the commit message of r343753).
  - In iflib_netmap_timer_adjust(), add synchronization of the
    TX descriptors before calling the ift_txd_credits_update
    method as the latter evaluates the TX descriptors possibly
    updated by the MAC.
  - In _task_fn_tx(), wrap the netmap(4)-specific bits in
    #ifdef DEV_NETMAP just as done in _task_fn_admin() and
    _task_fn_rx() respectively.
o In iflib_fast_intr_rxtx(), synchronize the TX rather than
  the RX descriptors before calling the ift_txd_credits_update
  method (see also above).
o There's no need to synchronize an RX buffer that is going to
  be recycled in iflib_rxd_pkt_get(), yet; it's sufficient to
  do that as late as passing RX buffers to the MAC via the
  ift_rxd_refill method. Hence, combine that synchronization
  with the synchronization of new buffers into a common spot
  in _iflib_fl_refill().
o There's no need to synchronize the RX descriptors of a free
  list in preparation of the MAC updating their statuses with
  every invocation of rxd_frag_to_sd(); it's enough to do this
  once before handing control over to the MAC, i. e. before
  calling ift_rxd_flush method in _iflib_fl_refill(), which
  already performs the necessary synchronization.
o Given that the ift_rxd_available method evaluates the RX
  descriptors which possibly have been altered by the MAC,
  synchronize as appropriate beforehand. Most notably this
  is now done in iflib_rxd_avail(), which in turn means that
  we don't need to issue the same synchronization yet again
  before calling the ift_rxd_pkt_get method in iflib_rxeof().
o In iflib_txd_db_check(), synchronize the TX descriptors
  before handing them over to the MAC for transmission via
  the ift_txd_flush method.
o In iflib_encap(), move the TX buffer synchronization after
  the invocation of the ift_txd_encap() method. If the MAC
  driver fails to encapsulate the packet and we retry with
  a defragmented mbuf chain or finally fail, the cycles for
  TX buffer synchronization have been wasted. Synchronizing
  afterwards matches what non-iflib(4) drivers typically do
  and is sufficient as the MAC will not actually start with
  the transmission before - in this case - the ift_txd_flush
  method is called.
  Moreover, for the latter reason the synchronization of the
  TX descriptors in iflib_encap() can go as it's enough to
  synchronize them before passing control over to the MAC by
  issuing the ift_txd_flush() method (see above).
o In iflib_txq_can_drain(), only synchronize TX descriptors
  if the ift_txd_credits_update method accessing these is
  actually called.

Differential Revision:	https://reviews.freebsd.org/D19081
2019-02-12 21:08:44 +00:00
Patrick Kelsey
8f2ac65690 Reduce the time it takes the kernel to install a new PF config containing a large number of queues
In general, the time savings come from separating the active and
inactive queues lists into separate interface and non-interface queue
lists, and changing the rule and queue tag management from list-based
to hash-bashed.

In HFSC, a linear scan of the class table during each queue destroy
was also eliminated.

There are now two new tunables to control the hash size used for each
tag set (default for each is 128):

net.pf.queue_tag_hashsize
net.pf.rule_tag_hashsize

Reviewed by:	kp
MFC after:	1 week
Sponsored by:	RG Nets
Differential Revision:	https://reviews.freebsd.org/D19131
2019-02-11 05:17:31 +00:00
Marius Strobl
bfce461ee9 o As illustrated by e. g. figure 7-14 of the Intel 82599 10 GbE
controller datasheet revision 3.3, in the context of Ethernet
  MACs the control data describing the packet buffers typically
  are named "descriptors". Each of these descriptors references
  one buffer, multiple of which a packet can be composed of.
  By contrast, in comments, messages and the names of structure
  members, iflib(4) refers to DMA resources employed for RX and
  TX buffers (rather than control data) as "desc(riptors)".
  This odd naming convention of iflib(4) made reviewing r343085
  and identifying wrong and missing bus_dmamap_sync(9) calls in
  particular way harder than it already is. This convention may
  also explain why the netmap(4) part of iflib(4) pairs the DMA
  tags for control data with DMA maps of buffers and vice versa
  in calls to bus_dma(9) functions.
  Therefore, change iflib(4) to refer to buf(fers) when buffers
  and not the usual understanding of descriptors is meant. This
  change does not include corrections to the DMA resources used
  in the netmap(4) parts. However, it revises error messages to
  state which kind of allocation/creation failed. Specifically,
  the "Unable to allocate tx_buffer (map) memory" copy & pasted
  inappropriately on several occasions was replaced with proper
  messages.
o Enhance some other error messages to indicate which half - RX
  or TX - they apply to instead of using identical text in both
  cases and generally canonicalize them.
o Correct the descriptions of iflib_{r,t}xsd_alloc() to reflect
  reality; current code doesn't use {r,t}x_buffer structures.
o In iflib_queues_alloc():
  - Remove redundant BUS_DMA_NOWAIT of iflib_dma_alloc() calls,
  - change the M_WAITOK from malloc(9) calls into M_NOWAIT. The
    return values are already checked, deferred DMA allocations
    not being an option at this point, BUS_DMA_NOWAIT has to be
    used anyway and prior malloc(9) calls in this function also
    specify M_NOWAIT.

Reviewed by:	shurd
Differential Revision:	https://reviews.freebsd.org/D19067
2019-02-04 20:46:57 +00:00
Gleb Smirnoff
3ca1c423aa Teach pfil_ioctl() about VIMAGE.
Submitted by:	gallatin
2019-02-03 08:28:02 +00:00
Vincenzo Maffione
5faab77822 netmap: upgrade sync-kloop support
Add SYNC_KLOOP_MODE option, and add support for direct mode, where application
executes the TXSYNC and RXSYNC in the context of the ioeventfd wake up callback.

MFC after:	5 days
2019-02-02 22:39:29 +00:00
Gleb Smirnoff
b252313f0b New pfil(9) KPI together with newborn pfil API and control utility.
The KPI have been reviewed and cleansed of features that were planned
back 20 years ago and never implemented.  The pfil(9) internals have
been made opaque to protocols with only returned types and function
declarations exposed. The KPI is made more strict, but at the same time
more extensible, as kernel uses same command structures that userland
ioctl uses.

In nutshell [KA]PI is about declaring filtering points, declaring
filters and linking and unlinking them together.

New [KA]PI makes it possible to reconfigure pfil(9) configuration:
change order of hooks, rehook filter from one filtering point to a
different one, disconnect a hook on output leaving it on input only,
prepend/append a filter to existing list of filters.

Now it possible for a single packet filter to provide multiple rulesets
that may be linked to different points. Think of per-interface ACLs in
Cisco or Juniper. None of existing packet filters yet support that,
however limited usage is already possible, e.g. default ruleset can
be moved to single interface, as soon as interface would pride their
filtering points.

Another future feature is possiblity to create pfil heads, that provide
not an mbuf pointer but just a memory pointer with length. That would
allow filtering at very early stages of a packet lifecycle, e.g. when
packet has just been received by a NIC and no mbuf was yet allocated.

Differential Revision:	https://reviews.freebsd.org/D18951
2019-01-31 23:01:03 +00:00
John Baldwin
829c56fc08 Don't set IFCAP_TXRTLMT during lagg_clone_create().
lagg_capabilities() will set the capability once interfaces supporting
the feature are added to the lagg.  Setting it on a lagg without any
interfaces is pointless as the if_snd_tag_alloc call will always fail
in that case.

Reviewed by:	hselasky, gallatin
MFC after:	2 weeks
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D19040
2019-01-31 21:35:37 +00:00
Gleb Smirnoff
f712b16127 Revert r316461: Remove "IPFW static rules" rmlock, and use pfil's global lock.
The pfil(9) system is about to be converted to epoch(9) synchronization, so
we need [temporarily] go back with ipfw internal locking.

Discussed with:	ae
2019-01-31 21:04:50 +00:00
Marius Strobl
b97de13ae0 - Stop iflib(4) from leaking MSI messages on detachment by calling
bus_teardown_intr(9) before pci_release_msi(9).
- Ensure that iflib(4) and associated drivers pass correct RIDs to
  bus_release_resource(9) by obtaining the RIDs via rman_get_rid(9)
  on the corresponding resources instead of using the RIDs initially
  passed to bus_alloc_resource_any(9) as the latter function may
  change those RIDs. Solely em(4) for the ioport resource (but not
  others) and bnxt(4) were using the correct RIDs by caching the ones
  returned by bus_alloc_resource_any(9).
- Change the logic of iflib_msix_init() around to only map the MSI-X
  BAR if MSI-X is actually supported, i. e. pci_msix_count(9) returns
  > 0. Otherwise the "Unable to map MSIX table " message triggers for
  devices that simply don't support MSI-X and the user may think that
  something is wrong while in fact everything works as expected.
- Put some (mostly redundant) debug messages emitted by iflib(4)
  and em(4) during attachment under bootverbose. The non-verbose
  output of em(4) seen during attachment now is close to the one
  prior to the conversion to iflib(4).
- Replace various variants of spelling "MSI-X" (several in messages)
  with "MSI-X" as used in the PCI specifications.
- Remove some trailing whitespace from messages emitted by iflib(4)
  and change them to consistently start with uppercase.
- Remove some obsolete comments about releasing interrupts from
  drivers and correct a few others.

Reviewed by:	erj, Jacob Keller, shurd
Differential Revision:	https://reviews.freebsd.org/D18980
2019-01-30 13:21:26 +00:00
Marius Strobl
3db348b54a - In _iflib_fl_refill(), don't mark an RX buffer as available in the
corresponding bitmap before adding an mbuf has actually succeeded.
  Previously, m_gethdr(M_NOWAIT, ...) failing caused a "hole" in the
  RX ring but not in its bitmap. One implication of such a hole was
  that in a subsequent call to _iflib_fl_refill() with the RX buffer
  accounting still indicating another reclaimable buffer, bit_ffc(3)
  nevertheless returned -1 in frag_idx which in turn caused havoc
  when used as an index. Thus, additionally assert that frag_idx is
  0 or greater.
  Another possible consequence of a hole in the RX ring was a NULL-
  dereference when trying to use the unallocated mbuf, for example
  in iflib_rxd_pkt_get().

  While at it, make the variable declarations in _iflib_fl_refill()
  conform to style(9) and remove redundant checks already performed
  by bit_ffc{,_at}(3).

- In iflib_queues_alloc(), don't pass redundant M_ZERO to bit_alloc(3).

Reported and tested by: pho
2019-01-26 21:35:51 +00:00
Andrew Gallatin
77102fd6a2 Fix an iflib driver unload panic introduced in r343085
The new loop to sync and unload descriptors was indexed
by "i", rather than "j".   The panic was caused by "i"
being advanced rather than "j", and eventually becoming
out of bounds.

Reviewed by:	kib
MFC after:	3 days
Sponsored by:	Netflix
2019-01-25 15:02:18 +00:00
Vincenzo Maffione
f79ba6d75b netmap: improvements to the netmap kloop (CSB mode)
Changelist:
    - Add the proper memory barriers in the kloop ring processing
      functions.
    - Fix memory barriers usage in the user helpers (nm_sync_kloop_appl_write,
      nm_sync_kloop_appl_read).
    - Fix nm_kr_txempty() helper to look at rhead rather than rcur. This
      is important since the kloop can read a value of rcur which is ahead
      of the value of rhead (see explanation in nm_sync_kloop_appl_write)
    - Remove obsolete ptnetmap_guest_write_kring_csb() and
      ptnet_guest_read_kring_csb(), and update if_ptnet(4) to use those.
    - Prepare in advance the arguments for netmap_sync_kloop_[tr]x_ring(),
      to make the kloop faster.
    - Provide kernel and user implementation for nm_ldld_barrier() and
      nm_ldst_barrier()

MFC after:	2 weeks
2019-01-23 14:51:36 +00:00
Brooks Davis
bc6f170e30 Rework CASE_IOC_IFGROUPREQ() to require a case before the macro.
This is more compatible with formatting tools and looks more normal.

Reported by:	jhb (on a different review)
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D18442
2019-01-22 17:39:26 +00:00
Patrick Kelsey
8f82136aec onvert vmx(4) to being an iflib driver.
Also, expose IFLIB_MAX_RX_SEGS to iflib drivers and add
iflib_dma_alloc_align() to the iflib API.

Performance is generally better with the tunable/sysctl
dev.vmx.<index>.iflib.tx_abdicate=1.

Reviewed by:	shurd
MFC after:	1 week
Relnotes:	yes
Sponsored by:	RG Nets
Differential Revision:	https://reviews.freebsd.org/D18761
2019-01-22 01:11:17 +00:00
Patrick Kelsey
7f3eb9dab3 Fix various resource leaks that can occur in the error paths of
iflib_device_register() and iflib_pseudo_register().

Reviewed by:	shurd
MFC after:	1 week
Sponsored by:	RG Nets
Differential Revision:	https://reviews.freebsd.org/D18760
2019-01-22 00:56:44 +00:00
Konstantin Belousov
8a04b53dce Improve iflib busdma(9) KPI use.
- Specify BUS_DMA_NOWAIT for bus_dmamap_load() on rx refill, since
  callbacks are not supposed to be used.
- Match tso/non-tso tags to corresponding tx map operations.  Create
  separate tso maps for tx descriptors.  In particular, do not use
  non-tso tag to load, unload, or destroy a map created with tso tag.
- Add missed bus_dmamap_sync() calls.
  Submitted by: marius.

Reported and tested by:	pho
Reviewed by:	marius
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-01-16 05:44:14 +00:00
Gleb Smirnoff
cc31f9821c Remove recursive NET_EPOCH_ENTER() from sysctl_ifmalist(), missed in r342872. 2019-01-11 00:45:22 +00:00
Gleb Smirnoff
7b7f772fa0 Bring the comment up to date. 2019-01-10 00:37:14 +00:00
Mark Johnston
72755d285f Stop setting if_linkmib in vlan(4) ifnets.
There are several reasons:
- The structure being exported via IFDATA_LINKSPECIFIC doesn't appear
  to be a standard MIB.
- The structure being exported is private to the kernel and always
  has been.
- No other drivers in common use set the if_linkmib field.
- Because IFDATA_LINKSPECIFIC can be used to overwrite the linkmib
  structure, a privileged user could use it to corrupt internal
  vlan(4) state. [1]

PR:		219472
Reported by:	CTurt <ecturt@gmail.com> [1]
Reviewed by:	kp (previous version)
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18779
2019-01-09 16:47:16 +00:00
Gleb Smirnoff
a68cc38879 Mechanical cleanup of epoch(9) usage in network stack.
- Remove macros that covertly create epoch_tracker on thread stack. Such
  macros a quite unsafe, e.g. will produce a buggy code if same macro is
  used in embedded scopes. Explicitly declare epoch_tracker always.

- Unmask interface list IFNET_RLOCK_NOSLEEP(), interface address list
  IF_ADDR_RLOCK() and interface AF specific data IF_AFDATA_RLOCK() read
  locking macros to what they actually are - the net_epoch.
  Keeping them as is is very misleading. They all are named FOO_RLOCK(),
  while they no longer have lock semantics. Now they allow recursion and
  what's more important they now no longer guarantee protection against
  their companion WLOCK macros.
  Note: INP_HASH_RLOCK() has same problems, but not touched by this commit.

This is non functional mechanical change. The only functionally changed
functions are ni6_addrs() and ni6_store_addrs(), where we no longer enter
epoch recursively.

Discussed with:	jtl, gallatin
2019-01-09 01:11:19 +00:00
Gleb Smirnoff
21ee61a6cf Remove part of comment that doesn't match reality. 2019-01-09 00:38:16 +00:00
Stephen Hurd
cd28ea929a Use iflib_if_init_locked() during resume instead of iflib_init_locked().
iflib_init_locked() assumes that iflib_stop() has been called, however,
it is not called for suspend.  iflib_if_init_locked() calls stop then init,
so fixes the problem.

This was causing errors after a resume from suspend.

PR:		224059
Reported by:	zeising
MFC after:	1 week
Sponsored by:	Limelight Networks
2019-01-07 23:46:54 +00:00
Matt Macy
5881181d8c mp_ring: avoid items offset difference between iflib and mp_ring
on architectures without 64-bit atomics

Reported by:	Augustin Cavalier <waddlesplash@gmail.com>
2019-01-03 23:06:05 +00:00
Konstantin Belousov
85f3b801e9 Fix typo, use boolean operator instead of bit-wise.
Reviewed by:	marius, shurd
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
2019-01-03 01:01:03 +00:00
Mateusz Guzik
cc426dd319 Remove unused argument to priv_check_cred.
Patch mostly generated with cocinnelle:

@@
expression E1,E2;
@@

- priv_check_cred(E1,E2,0)
+ priv_check_cred(E1,E2)

Sponsored by:	The FreeBSD Foundation
2018-12-11 19:32:16 +00:00
Stephen Hurd
7124b5ba04 Fix !tx_abdicate error from r336560
r336560 was supposed to restore pre-r323954 behaviour when tx_abdicate is
not set (the default case). However, it appears that rather than the drainage
check being made conditional on tx_abdicate being set, it was duplicated
so it occured twice if tx_abdicate was set and once if it was not.

Now when !tx_abdicate, drainage is only checked if the doorbell isn't
pending.

Reported by:    lev
MFC after:      1 week
Sponsored by:   Limelight Networks
2018-12-11 17:46:01 +00:00
Vincenzo Maffione
1d9ec3ee50 netmap.h: include stdatomic.h
The stdatomic.h header exports atomic_thread_fence(), that
can be used to implement the nm_stst_barrier() macro needed
by netmap.

MFC after:	3 days
2018-12-05 15:38:52 +00:00
Vincenzo Maffione
b6e66be22b netmap: align codebase to the current upstream (760279cfb2730a585)
Changelist:
  - Replace netmap passthrough host support with a more general
    mechanism to call TXSYNC/RXSYNC from an in-kernel event-loop.
    No kernel threads are used to use this feature: the application
    is required to spawn a thread (or a process) and issue a
    SYNC_KLOOP_START (NIOCCTRL) command in the thread body. The
    kernel loop is executed by the ioctl implementation, which returns
    to userspace only when a different thread calls SYNC_KLOOP_STOP
    or the netmap file descriptor is closed.
  - Update the if_ptnet driver to cope with the new data structures,
    and prune all the obsolete ptnetmap code.
  - Add support for "null" netmap ports, useful to allocate netmap_if,
    netmap_ring and netmap buffers to be used by specialized applications
    (e.g. hypervisors). TXSYNC/RXSYNC on these ports have no effect.
  - Various fixes and code refactoring.

Sponsored by:	Sunny Valley Networks
Differential Revision:	https://reviews.freebsd.org/D18015
2018-12-05 11:57:16 +00:00
Eric van Gyzen
9dae3b521e altq: manual cleanup after r341507
Remove a file that became practically empty.
Fix indentation.

Like r341507, I do not plan to MFC, but anyone else can.
2018-12-04 23:53:42 +00:00
Eric van Gyzen
325fab802e altq: remove ALTQ3_COMPAT code
This code has apparently never compiled on FreeBSD since its
introduction in 2004 (r130365).  It has certainly not compiled
since 2006, when r164033 added #elsif [sic] preprocessor directives.
The code was left in the tree to reduce the diff from upstream (KAME).
Since that upstream is no longer relevant, remove the long-dead code.

This commit is the direct result of:

    unifdef -m -UALTQ3_COMPAT sys/net/altq/*

A later commit will do some manual cleanup.

I do not plan to MFC this.  If that would help you, go for it.
2018-12-04 23:46:43 +00:00
Andrey V. Elsukov
851073551d Adapt the fix in r341008 to correctly work with EBR.
IFNET_RLOCK_NOSLEEP() is epoch_enter_preempt() in FreeBSD 12+. Holding
it in sysctl_rtsock() doesn't protect us from ifnet unlinking, because
unlinking occurs with IFNET_WLOCK(), that is rw_wlock+sx_xlock, and it
doesn check that concurrent code is running in epoch section. But while
we are in epoch section, we should be able to do access to ifnet's
fields, even it was unlinked. Thus do not change if_addr and if_hw_addr
fields in ifnet_detach_internal() to NULL, since rtsock code can do
access to these fields and this is allowed while it is running in epoch
section.

This should fix the race, when ifnet_detach_internal() unlinks ifnet
after we checked it for IFF_DYING in sysctl_dumpentry.

Move free(ifp->if_hw_addr) into ifnet_free_internal(). Also remove the
NULL check for ifp->if_description, since free(9) can correctly handle
NULL pointer.

MFC after:	1 week
2018-11-30 10:36:14 +00:00
Andrew Gallatin
fbec776de0 Use busdma unconditionally in iflib
- Remove the complex mechanism to choose between using busdma
and raw pmap_kextract at runtime.   The reduced complexity makes
the code easier to read and maintain.

- Fix a bug in the small packet receive path where clusters were
repeatedly mapped but never unmapped. We now store the cluster's
bus address and avoid re-mapping the cluster each time a small
packet is received.

This patch fixes bugs I've seen where ixl(4) will not even
respond to ping without seeing DMAR faults.

I see a small improvement (14%) on packet forwarding tests using
a Haswell based Xeon E5-2697 v3.  Olivier sees a small
regression (-3% to -6%) with lower end hardware.

Reviewed by:	mmacy
Not objected to by:	sbruno
MFC after:	8 weeks
Sponsored by:	Netflix, Inc
Differential Revision:		https://reviews.freebsd.org/D17901
2018-11-27 20:01:05 +00:00
Andrey V. Elsukov
a716ad4a35 Fix possible panic during ifnet detach in rtsock.
The panic can happen, when some application does dump of routing table
using sysctl interface. To prevent this, set IFF_DYING flag in
if_detach_internal() function, when ifnet under lock is removed from
the chain. In sysctl_rtsock() take IFNET_RLOCK_NOSLEEP() to prevent
ifnet detach during routes enumeration. In case, if some interface was
detached in the time before we take the lock, add the check, that ifnet
is not DYING. This prevents access to memory that could be freed after
ifnet is unlinked.

PR:		227720, 230498, 233306
Reviewed by:	bz, eugen
MFC after:	1 week
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D18338
2018-11-27 09:04:06 +00:00
Mark Johnston
d25f8522be Plug routing sysctl leaks.
Various structures exported by sysctl_rtsock() contain padding fields
which were not being zeroed.

Reported by:	Thomas Barabosch, Fraunhofer FKIE
Reviewed by:	ae
MFC after:	3 days
Security:	kernel memory disclosure
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18333
2018-11-26 13:42:18 +00:00
Oleg Bulyzhin
cac302483e Unbreak kernel build with VLAN_ARRAY defined.
MFC after:	1 week
2018-11-21 13:34:21 +00:00
Andrey V. Elsukov
ad43bf348b Allow configuration of several ipsec interfaces with the same tunnel
endpoints.

This can be used to configure several IPsec tunnels between two hosts
with different security associations.

Obtained from:	Yandex LLC
MFC after:	2 weeks
Sponsored by:	Yandex LLC
2018-11-16 14:21:57 +00:00
Mike Karels
456e896d6d Fix flags collision causing inability to enable CBQ in ALTQ
The CBQ BORROW flag conflicts with the RMCF_CODEL flag; the
two sets of definitions actually define the same things. The symptom
is that a kernel with CBQ support and not CODEL fails to load a QoS
policy with the obscure error "pfctl: DIOCADDALTQ: Cannot allocate memory."
If ALTQ_DEBUG is enabled, the error becomes a little clearer:
"rmc_newclass: CODEL not configured for CBQ!" is printed by the kernel.
There really shouldn't be two sets of macros that have to be defined
consistently, but the include structure isn't right for exporting
CBQ flags to altq_rmclass.h. Re-align the definitions, and add
CTASSERTs in the kernel to ensure that the definitions are consistent.

PR:		215716
Reviewed by:	pkelsey
MFC after:	2 weeks
Sponsored by:	Forcepoint LLC
Differential Revision:	https://reviews.freebsd.org/D17758
2018-11-16 03:42:29 +00:00
Stephen Hurd
0efb1a464f Clear RX completion queue state veriables in iflib_stop()
iflib_stop() was not resetting the rxq completion queue state variables.
This meant that for any driver that has receive completion queues, after a
reinit, iflib would start asking what's available on the rx side starting at
whatever the completion queue index was prior to the stop, instead of at 0.

Submitted by:	pkelsey
Reported by:	pkelsey
MFC after:	3 days
Sponsored by:	Limelight Networks
2018-11-14 20:36:18 +00:00
Stephen Hurd
8d4ceb9cc4 Prevent POLA violation with TSO/CSUM offload
Ensure that any time CSUM_IP_TSO or CSUM_IP6_TSO is set that the corresponding
CSUM_IP6?_TCP / CSUM_IP flags are also set.

Rather than requireing drivers to bake-in an understanding that TSO implies
checksum offloads, make it explicit.

This change requires us to move the IFLIB_NEED_ZERO_CSUM implementation to
ensure it's zeroed for TSO.

Reported by:	Jacob Keller <jacob.e.keller@intel.com>
MFC after:	1 week
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D17801
2018-11-14 15:23:39 +00:00
Stephen Hurd
4d261ce278 Fix leaks caused by ifc_nhwtxqs never being initialized
r333502 removed initialization of ifc_nhwtxqs, and it's not clear
there's a need to copy it into the struct iflib_ctx at all. Use
ctx->ifc_sctx->isc_ntxqs instead.

Further, iflib_stop() did not clear the last ring in the case where
isc_nfl != isc_nrxqs (such as when IFLIB_HAS_RXCQ is set). Use
ctx->ifc_sctx->isc_nrxqs here instead of isc_nfl.

Reported by:	pkelsey
Reviewed by:	pkelsey
MFC after:	3 days
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D17979
2018-11-14 15:16:45 +00:00
Gleb Smirnoff
b79aa45e0e For compatibility KPI functions like if_addr_rlock() that used to have
mutexes but now are converted to epoch(9) use thread-private epoch_tracker.
Embedding tracker into ifnet(9) or ifnet derived structures creates a non
reentrable function, that will fail miserably if called simultaneously from
two different contexts.
A thread private tracker will provide a single tracker that would allow to
call these functions safely. It doesn't allow nested call, but this is not
expected from compatibility KPIs.

Reviewed by:	markj
2018-11-13 22:58:38 +00:00
Stephen Hurd
a42546df88 Fix rxcsum issue introduced in r338838
r338838 attempted to fix issues with rxcsum and rxcsum6.
However, the rxcsum bits were set as though if_setcapenablebit() was
being called, not if_togglecapenable() which is in use. As a result,
it was not possible to disable rxcsum when rxcsum6 was supported.

PR:		233004
Reported by:	lev
Reviewed by:	lev
MFC after:	3 days
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D17881
2018-11-07 19:31:48 +00:00
Kristof Provost
fbbf436d56 pfsync: Handle syncdev going away
If the syncdev is removed we no longer need to clean up the multicast
entry we've got set up for that device.

Pass the ifnet detach event through pf to pfsync, and remove our
multicast handle, and mark us as no longer having a syncdev.

Note that this callback is always installed, even if the pfsync
interface is disabled (and thus it's not a per-vnet callback pointer).

MFC after:	2 weeks
Sponsored by:	Orange Business Services
Differential Revision:	https://reviews.freebsd.org/D17502
2018-11-02 16:57:23 +00:00
Kristof Provost
25c6ab1b78 Notify that the ifnet will go away, even on vnet shutdown
pf subscribes to ifnet_departure_event events, so it can clean up the
ifg_pf_kif and if_pf_kif pointers in the ifnet.
During vnet shutdown interfaces could go away without sending the event,
so pf ends up cleaning these up as part of its shutdown sequence, which
happens after the ifnet has already been freed.

Send the ifnet_departure_event during vnet shutdown, allowing pf to
clean up correctly.

MFC after:	2 weeks
Sponsored by:	Orange Business Services
Differential Revision:	https://reviews.freebsd.org/D17500
2018-11-02 16:50:17 +00:00
Kristof Provost
5f6cf24e2d pfsync: Make pfsync callbacks per-vnet
The callbacks are installed and removed depending on the state of the
pfsync device, which is per-vnet. The callbacks must also be per-vnet.

MFC after:	2 weeks
Sponsored by:	Orange Business Services
Differential Revision:	https://reviews.freebsd.org/D17499
2018-11-02 16:47:07 +00:00
Kristof Provost
790194cd47 pf: Limit the fragment entry queue length to 64 per bucket.
So we have a global limit of 1024 fragments, but it is fine grained to
the region of the packet.  Smaller packets may have less fragments.
This costs another 16 bytes of memory per reassembly and devides the
worst case for searching by 8.

Obtained from:	OpenBSD
Differential Revision:	https://reviews.freebsd.org/D17734
2018-11-02 15:32:04 +00:00
Kristof Provost
fd2ea405e6 pf: Split the fragment reassembly queue into smaller parts
Remember 16 entry points based on the fragment offset.  Instead of
a worst case of 8196 list traversals we now check a maximum of 512
list entries or 16 array elements.

Obtained from:	OpenBSD
Differential Revision:	https://reviews.freebsd.org/D17733
2018-11-02 15:26:51 +00:00
Bjoern A. Zeeb
c955c6cd08 With more excessive use of modules, more kernel parts working with
VIMAGE, and feature richness and global state increasing the 8k of
vnet module space are no longer sufficient for people and loading
multiple modules, e.g., pf(4) and ipl(4) or ipsec(4) will fail on
the second module.

Increase the module space to 8 * PAGE_SIZE which should be enough
to hold multiple firewalls, ipsec, multicast (as in the old days was
a problem), epair, carp, and any kind of other vnet enabled modules.

Sadly this is a global byte array part of the vnet_set, so we cannot
dynamically change its size;  otherwise a TUNABLE would have been
a better solution.

PR:			228854
Reported by:		Ernie Luzar, Marek Zarychta
Discussed with:		rgrimes on current
MFC after:		3 days
2018-10-30 20:45:15 +00:00
Bjoern A. Zeeb
201100c58b Initial implementation of draft-ietf-6man-ipv6only-flag.
This change defines the RA "6" (IPv6-Only) flag which routers
may advertise, kernel logic to check if all routers on a link
have the flag set and accordingly update a per-interface flag.

If all routers agree that it is an IPv6-only link, ether_output_frame(),
based on the interface flag, will filter out all ETHERTYPE_IP/ARP
frames, drop them, and return EAFNOSUPPORT to upper layers.

The change also updates ndp to show the "6" flag, ifconfig to
display the IPV6_ONLY nd6 flag if set, and rtadvd to allow
announcing the flag.

Further changes to tcpdump (contrib code) are availble and will
be upstreamed.

Tested the code (slightly earlier version) with 2 FreeBSD
IPv6 routers, a FreeBSD laptop on ethernet as well as wifi,
and with Win10 and OSX clients (which did not fall over with
the "6" flag set but not understood).

We may also want to (a) implement and RX filter, and (b) over
time enahnce user space to, say, stop dhclient from running
when the interface flag is set.  Also we might want to start
IPv6 before IPv4 in the future.

All the code is hidden under the EXPERIMENTAL option and not
compiled by default as the draft is a work-in-progress and
we cannot rely on the fact that IANA will assign the bits
as requested by the draft and hence they may change.

Dear 6man, you have running code.

Discussed with:	Bob Hinden, Brian E Carpenter
2018-10-30 20:08:48 +00:00
Marcelo Araujo
fbd8c33022 Allow changing lagg(4) MTU.
Previously, changing the MTU would require destroying the lagg and
creating a new one. Now it is allowed to change the MTU of
the lagg interface and the MTU of the ports will be set to match.

If any port cannot set the new MTU, all ports are reverted to the original
MTU of the lagg. Additionally, when adding ports, the MTU of a port will be
automatically set to the MTU of the lagg. As always, the MTU of the lagg is
initially determined by the MTU of the first port added. If adding an
interface as a port for some reason fails, that interface is reverted to its
original MTU.

Submitted by:	Ryan Moeller <ryan@freqlabs.com>
Reviewed by:	mav
Relnotes:	Yes
Sponsored by:	iXsystems Inc.
Differential Revision:	https://reviews.freebsd.org/D17576
2018-10-30 09:53:57 +00:00
Eugene Grosbein
232485a17e Prevent stf(4) from panicing due to unprotected access to INADDR_HASH.
PR:			220078
MFC after:		1 month
Differential Revision:	https://reviews.freebsd.org/D12457
Tested-by:		Cassiano Peixoto and others
2018-10-27 04:45:28 +00:00
Eric Joyner
46fa0c2552 Revert r339634.
That commit is causing kernel panics in em(4), so this will be reverted
until those are fixed.

Reported by:	ae@, pho@, et al
Sponsored by:	Intel Corporation
2018-10-23 17:06:36 +00:00
Andrey V. Elsukov
8796e291f8 Add the check that current VNET is ready and access to srchash is allowed.
This change is similar to r339646. The callback that checks for appearing
and disappearing of tunnel ingress address can be called during VNET
teardown. To prevent access to already freed memory, add check to the
callback and epoch_wait() call to be sure that callback has finished its
work.

MFC after:	20 days
2018-10-23 13:11:45 +00:00
Andrey V. Elsukov
221022e190 Add the check that current VNET is ready and access to srchash is
allowed.

ipsec_srcaddr() callback can be called during VNET teardown, since
ingress address checking subsystem isn't VNET specific. And thus
callback can make access to already freed memory. To prevent this,
use V_ipsec_idhtbl pointer as indicator of VNET readiness. And make
epoch_wait() after resetting it to NULL in vnet_ipsec_uninit() to
be sure that ipsec_srcaddr() is finished its work.

Reported by:	kp
MFC after:	20 days
2018-10-23 13:03:03 +00:00
Andrey V. Elsukov
e6b383b2f5 Remove softc from idhash when interface is destroyed.
MFC after:	20 days
2018-10-23 12:50:28 +00:00
Vincenzo Maffione
2a7db7a63d netmap: align codebase to the current upstream (sha 8374e1a7e6941)
Changelist:
    - Move large parts of VALE code to a new file and header netmap_bdg.[ch].
      This is useful to reuse the code within upcoming projects.
    - Improvements and bug fixes to pipes and monitors.
    - Introduce nm_os_onattach(), nm_os_onenter() and nm_os_onexit() to
      handle differences between FreeBSD and Linux.
    - Introduce some new helper functions to handle more host rings and fake
      rings (netmap_all_rings(), netmap_real_rings(), ...)
    - Added new sysctl to enable/disable hw checksum in emulated netmap mode.
    - nm_inject: add support for NS_MOREFRAG

Approved by:	gnn (mentor)
Differential Revision:	https://reviews.freebsd.org/D17364
2018-10-23 08:55:16 +00:00
Eric Joyner
940f62d616 iflib: drain enqueued tasks before detaching from taskqgroup
The taskqgroup_detach function does not check if task is already enqueued when
detaching it. This may lead to kernel panic if enqueued task starts after
context state lock is destroyed. Ensure that the already enqueued admin tasks
are executed before detaching them.

The issue was discovered during validation of D16429. Unloading of if_ixlv
followed by immediate removal of VFs with iovctl -D may lead to panic on
NODEBUG kernel.

As well, check if iflib is in detach before enqueueing new admin or iov
tasks, to prevent new tasks from executing while the taskqgroup tasks
are being drained.

Submitted by:	Krzysztof Galazka <krzysztof.galazka@intel.com>
Reviewed by:	shurd@, erj@
Sponsored by:	Intel Corporation
Differential Revision:	https://reviews.freebsd.org/D17404
2018-10-23 04:37:29 +00:00
Hans Petter Selasky
88d57e9eac Resolve deadlock between epoch(9) and various network interface
SX-locks, during if_purgeaddrs(), by not allowing to hold the epoch
read lock over typical network IOCTL code paths. This is a regression
issue after r334305.

Reviewed by:		ae (network)
Differential revision:	https://reviews.freebsd.org/D17647
MFC after:		1 week
Sponsored by:		Mellanox Technologies
2018-10-22 13:25:26 +00:00
Andrey V. Elsukov
cc958ed28c Follow the fix in r339532 (by glebius):
Fix exiting an epoch(9) we never entered. May happen only with MAC.

MFC after:	1 month
2018-10-21 18:30:27 +00:00
Andrey V. Elsukov
2c87fdf059 Rework if_ipsec(4) to use epoch(9) instead of rmlock.
* use CK_LIST and FNV hash to keep chains of softc;
* read access to softc is protected by epoch();
* write access is protected by ipsec_ioctl_sx. Changing of softc fields
  is allowed only when softc is unlinked from CK_LIST chains.
* linking/unlinking of softc is allowed only when ipsec_ioctl_sx is
  exclusive locked.
* the plain LIST of all softc is replaced by hash table that uses ingress
  address of tunnels as a key.
* added support for appearing/disappearing of ingress address handling.
  Now it is allowed configure non-local ingress IP address, and thus the
  problem with if_ipsec(4) configuration that happens on boot, when
  ingress address is not yet configured, is solved.

MFC after:	1 month
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D17190
2018-10-21 18:24:20 +00:00
Andrey V. Elsukov
df49ca9fd3 Add handling for appearing/disappearing of ingress addresses to if_me(4).
* register handler for ingress address appearing/disappearing;
* add new srcaddr hash table for fast softc lookup by srcaddr;
* when srcaddr disappears, clear IFF_DRV_RUNNING flag from interface,
  and set it otherwise;

MFC after:	1 month
Sponsored by:	Yandex LLC
2018-10-21 18:18:37 +00:00
Andrey V. Elsukov
19873f4780 Add handling for appearing/disappearing of ingress addresses to if_gre(4).
* register handler for ingress address appearing/disappearing;
* add new srcaddr hash table for fast softc lookup by srcaddr;
* when srcaddr disappears, clear IFF_DRV_RUNNING flag from interface,
  and set it otherwise;

MFC after:	1 month
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D17214
2018-10-21 18:13:45 +00:00
Andrey V. Elsukov
009d82ee0f Add handling for appearing/disappearing of ingress addresses to if_gif(4).
* register handler for ingress address appearing/disappearing;
* add new srcaddr hash table for fast softc lookup by srcaddr;
* when srcaddr disappears, clear IFF_DRV_RUNNING flag from interface,
  and set it otherwise;
* remove the note about ingress address from BUGS section.

MFC after:	1 month
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D17134
2018-10-21 18:06:15 +00:00
Andrey V. Elsukov
8251c68d5c Add KPI that can be used by tunneling interfaces to handle IP addresses
appearing and disappearing on the host system.

Such handling is need, because tunneling interfaces must use addresses,
that are configured on the host as ingress addresses for tunnels.
Otherwise the system can send spoofed packets with source address, that
belongs to foreign host.

The KPI uses ifaddr_event_ext event to implement addresses tracking.
Tunneling interfaces register event handlers and then they are
notified by the kernel, when an address disappears or appears.

ifaddr_event_compat() handler from if.c replaced by srcaddr_change_event()
in the ip_encap.c

MFC after:	1 month
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D17134
2018-10-21 17:55:26 +00:00
Kristof Provost
5191a3aea6 vlan: Fix panic with lagg and vlan
vlan_lladdr_fn() is called from taskqueue, which means there's no vnet context
set. We can end up trying to send ARP messages (through the iflladdr_event
event), which requires a vnet context.

PR:		227654
MFC after:	3 days
2018-10-21 16:51:35 +00:00
Andrey V. Elsukov
64d63b1e03 Add ifaddr_event_ext event. It is similar to ifaddr_event, but the
handler receives the type of event IFADDR_EVENT_ADD/IFADDR_EVENT_DEL,
and the pointer to ifaddr. Also ifaddr_event now is implemented using
ifaddr_event_ext handler.

MFC after:	3 weeks
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D17100
2018-10-21 15:02:06 +00:00
Gleb Smirnoff
0a27163f8d Fix exiting an epoch(9) we never entered. May happen only with MAC. 2018-10-21 12:39:00 +00:00
Hans Petter Selasky
c32a9d66fb Fix deadlock when destroying VLANs.
Synchronizing the epoch before freeing the multicast addresses while holding
the VLAN_XLOCK() might lead to a deadlock. Use deferred freeing of the VLAN
multicast addresses to resolve deadlock. Backtrace:

Thread1:
epoch_block_handler_preempt()
ck_epoch_synchronize_wait()
epoch_wait_preempt()
vlan_setmulti()
vlan_ioctl()
in6m_release_task()
gtaskqueue_run_locked()
gtaskqueue_thread_loop()
fork_exit()
fork_trampoline()

Thread2:
sleepq_switch()
sleepq_wait()
_sx_xlock_hard()
_sx_xlock()
in6_leavegroup()
in6_purgeaddr()
if_purgeaddrs()
if_detach_internal()
if_detach()
vlan_clone_destroy()
if_clone_destroyif()
if_clone_destroy()
ifioctl()
kern_ioctl()
sys_ioctl()
amd64_syscall()
fast_syscall_common()
syscall()

Differential revision:	https://reviews.freebsd.org/D17496
Reviewed by:		slavash, mmacy
Approved by:		re (kib)
Sponsored by:		Mellanox Technologies
2018-10-15 10:29:29 +00:00
Eric Joyner
77c1fcec91 ixl/iavf(4): Change ixlv to iavf and update it to use iflib(9)
Finishes the conversion of the 40Gb Intel Ethernet drivers to iflib(9) for
FreeBSD 12.0, and fixes numerous bugs in both ixl(4) and iavf(4).

This commit also re-adds the VF driver to GENERIC since it now compiles and
functions.

The VF driver name was changed from ixlv(4) to iavf(4) because the VF driver is
now intended to be used with future products, not just with Fortville/Fort Park
VFs.

A man page update that documents these drivers is forthcoming in a separate
commit.

Reviewed by:    sbruno@, kbowling@
Tested by:      jeffrey.e.pieper@intel.com
Approved by:	re (gjb@)
Relnotes:       yes
Sponsored by:   Intel Corporation
Differential Revision: https://reviews.freebsd.org/D16429
2018-10-12 22:40:54 +00:00
Jonathan T. Looney
13c6ba6d94 There are three places where we return from a function which entered an
epoch section without exiting that epoch section. This is bad for two
reasons: the epoch section won't exit, and we will leave the epoch tracker
from the stack on the epoch list.

Fix the epoch leak by making sure we exit epoch sections before returning.

Reviewed by:	ae, gallatin, mmacy
Approved by:	re (gjb, kib)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D17450
2018-10-09 13:26:06 +00:00
Michael Tuexen
580e30a33e Use strlcpy() instead of strncpy().
Approved by:            re (kib@)
CID:			1395980, 1395981
X-MFC with:		r339012
MFC after:              1 week
2018-10-03 07:35:16 +00:00
Michael Tuexen
1687b1ab24 For changing the MTU on tun/tap devices, it should not matter whether it
is done via using ifconfig, which uses a SIOCSIFMTU ioctl() command, or
doing it using a TUNSIFINFO/TAPSIFINFO ioctl() command.
Without this patch, for IPv6 the new MTU is not used when creating routes.
Especially, when initiating TCP connections after increasing the MTU,
the old MTU is still used to compute the MSS.
Thanks to ae@ and bz@ for helping to improve the patch.

Reviewed by:		ae@, bz@
Approved by:		re (kib@)
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D17180
2018-09-29 13:01:23 +00:00
Warner Losh
329e817fcc Reapply, with minor tweaks, r338025, from the original commit:
Remove unused and easy to misuse PNP macro parameter

Inspired by r338025, just remove the element size parameter to the
MODULE_PNP_INFO macro entirely.  The 'table' parameter is now required to
have correct pointer (or array) type.  Since all invocations of the macro
already had this property and the emitted PNP data continues to include the
element size, there is no functional change.

Mostly done with the coccinelle 'spatch' tool:

  $ cat modpnpsize0.cocci
    @normaltables@
    identifier b,c;
    expression a,d,e;
    declarer MODULE_PNP_INFO;
    @@
     MODULE_PNP_INFO(a,b,c,d,
    -sizeof(d[0]),
     e);

    @singletons@
    identifier b,c,d;
    expression a;
    declarer MODULE_PNP_INFO;
    @@
     MODULE_PNP_INFO(a,b,c,&d,
    -sizeof(d),
     1);

  $ rg -l MODULE_PNP_INFO -- sys | \
    xargs spatch --in-place --sp-file modpnpsize0.cocci

(Note that coccinelle invokes diff(1) via a PATH search and expects diff to
tolerate the -B flag, which BSD diff does not.  So I had to link gdiff into
PATH as diff to use spatch.)

Tinderbox'd (-DMAKE_JUST_KERNELS).
Approved by: re (glen)
2018-09-26 17:12:14 +00:00
Matt Macy
b08d611de8 fix vlan locking to permit sx acquisition in ioctl calls
- update vlan(9) to handle changes earlier this year in multicast locking

Tested by: np@, darkfiberu at gmail.com

PR:	230510
Reviewed by:	mjoras@, shurd@, sbruno@
Approved by:	re (gjb@)
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D16808
2018-09-21 01:37:08 +00:00
Stephen Hurd
0c919c2370 Fix capabilities handling for iflib drivers
Various capabilities were not being handled correctly in the
SIOCSIFCAP handler. Specifically:

IFCAP_RXCSUM and IFCAP_RXCSUM_IPV6 could be set even if not supported

It was impossible to disable IFCAP_RXCSUM and/or IFCAP_RXCSUM_IPV6 via
ifconfig since it does ioctl() per command-line flag rather than combine
them into a single call.

IFCAP_VLAN_HWCSUM could not be modified via the ioctl()

Setting any combination of the three IFCAP_WOL flags would set only
IFCAP_WOL_MCAST | IFCAP_WOL_MAGIC. For example, setting only
IFCAP_WOL_UCAST would result in both IFCAP_WOL_MCAST and IFCAP_WOL_MAGIC
being enabled, but IFCAP_WOL_UCAST would not be enabled.

Because if_vlancap() was called before if_togglecapenable(), vlan flags
were sometimes not applied correctly.

Interfaces were being unnecessarily stopped and restarted for WoL

PR:		231151
Submitted by:	Kaho Toshikazu <kaho@elam.kais.kyoto-u.ac.jp>
Reported by:	Shirkdog <mshirk@daemon-security.com>
Reviewed by:	galladin
Approved by:	re (gjb)
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D17158
2018-09-20 19:35:35 +00:00
Andrey V. Elsukov
c6851ad04c Restore outbound packets capturing for if_gre(4). It was missed in r335048.
Also clear M_MCAST and M_BCAST flags for encapsulated datagram, since it
will have new IP header.

Approved by:	re (kib)
2018-09-17 10:10:14 +00:00
Ruslan Bukin
86c5937532 Don't mark module data as static on RISC-V.
Similar to arm64, riscv compiler uses PC-relative loads/stores,
and with static data compiler does not emit relocations.
In result, kernel module linker has nothing to fix and data accessed
from the wrong location.

Approved by:	re (gjb)
Sponsored by:	DARPA, AFRL
2018-09-12 08:05:33 +00:00
Stephen Hurd
64e6fc1379 Clean up iflib sysctls
Remove sysctls:
txq_drain_encapfail - now a duplicate of encap_txd_encap_fail
intr_link - was never incremented
intr_msix - was never incremented
rx_zero_len - was never incremented

The following were not incremented in all code-paths that apply:
m_pullups, mbuf_defrag, rxd_flush, tx_encap, rx_intr_enables, tx_frees,
encap_txd_encap_fail.

Fixes:
Replace the broken collapse_pkthdr() implementation with an MPASS().
fl_refills and fl_refills_large were not incremented when using netmap.

Reviewed by:	gallatin
Approved by:	re (marius)
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D16733
2018-09-06 18:51:52 +00:00
Bjoern A. Zeeb
d4672c61d3 Rather than duplicating the functionality of a macro after r322866
use the already existing one.  No functional changes.

Reviewed by:	karels, ae
Approved by:	re (rgrimes)
Differential Revision:	https://reviews.freebsd.org/D17004
2018-09-03 22:10:49 +00:00
Stephen Hurd
bc0e855bd9 Fix compile error due to missing parenthesis in r338372
Approved by:	re (gjb)
2018-08-29 16:21:34 +00:00
Stephen Hurd
a520f8b6fe Fix potential data corruption in iflib
The MP ring may have txq pointers enqueued.  Previously, these were
passed to m_free() when IFC_QFLUSH was set.  This patch checks for
the value and doesn't call m_free().

Reviewed by:	gallatin
Approved by:	re (gjb)
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D16882
2018-08-29 15:55:25 +00:00
Mark Murray
19fa89e938 Remove the Yarrow PRNG algorithm option in accordance with due notice
given in random(4).

This includes updating of the relevant man pages, and no-longer-used
harvesting parameters.

Ensure that the pseudo-unit-test still does something useful, now also
with the "other" algorithm instead of Yarrow.

PR:		230870
Reviewed by:	cem
Approved by:	so(delphij,gtetlow)
Approved by:	re(marius)
Differential Revision:	https://reviews.freebsd.org/D16898
2018-08-26 12:51:46 +00:00
Navdeep Parhar
47c39432b1 Unbreak VLANs after r337943.
ether_set_pcp should not be called from ether_output_frame for VLAN
interfaces -- the vid + pcp will be inserted during vlan_transmit in
that case. r337943 sets the VLAN's ifnet's if_pcp to a proper PCP value
and this led to double encapsulation (once with vid 0 and second time
with vid+pcp).

PR: 230794
Reviewed by:	kib@
Approved by:	re@ (gjb@)
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D16887
2018-08-24 21:48:13 +00:00
Patrick Kelsey
249cc75fd1 Extended pf(4) ioctl interface and pfctl(8) to allow bandwidths of
2^32 bps or greater to be used.  Prior to this, bandwidth parameters
would simply wrap at the 2^32 boundary.  The computations in the HFSC
scheduler and token bucket regulator have been modified to operate
correctly up to at least 100 Gbps.  No other algorithms have been
examined or modified for correct operation above 2^32 bps (some may
have existing computation resolution or overflow issues at rates below
that threshold).  pfctl(8) will now limit non-HFSC bandwidth
parameters to 2^32 - 1 before passing them to the kernel.

The extensions to the pf(4) ioctl interface have been made in a
backwards-compatible way by versioning affected data structures,
supporting all versions in the kernel, and implementing macros that
will cause existing code that consumes that interface to use version 0
without source modifications.  If version 0 consumers of the interface
are used against a new kernel that has had bandwidth parameters of
2^32 or greater configured by updated tools, such bandwidth parameters
will be reported as 2^32 - 1 bps by those old consumers.

All in-tree consumers of the pf(4) interface have been updated.  To
update out-of-tree consumers to the latest version of the interface,
define PFIOC_USE_LATEST ahead of any includes and use the code of
pfctl(8) as a guide for the ioctls of interest.

PR:	211730
Reviewed by:	jmallett, kp, loos
MFC after:	2 weeks
Relnotes:	yes
Sponsored by:	RG Nets
Differential Revision:	https://reviews.freebsd.org/D16782
2018-08-22 19:38:48 +00:00
Eric Joyner
fe2bf351fe if_media: Add new 2.5G/5G/25G/40G/50G/100G/200G/400G media types
Upcoming Ethernet hardware will support new media types that aren't in the kernel
yet, so they are added here. These mostly include new 25G/50G/100G media types;
and this commit introduces new 200G/400G speeds and media.

Reviewed by:	hselasky@, jhb@
MFC after:	1 week
Sponsored by:	Intel Corporation
Differential Revision:	https://reviews.freebsd.org/D16731
2018-08-22 18:19:56 +00:00
Matt Macy
77ad07b6a3 fix copy/paste error when clearing ifma flag
CID: 1395119
Reported by:	vangyzen
2018-08-21 22:59:22 +00:00
Conrad Meyer
b8e771e97a Back out r338035 until Warner is finished churning GSoC PNP patches
I was not aware Warner was making or planning to make forward progress in
this area and have since been informed of that.

It's easy to apply/reapply when churn dies down.
2018-08-19 00:46:22 +00:00
Conrad Meyer
faa319436f Remove unused and easy to misuse PNP macro parameter
Inspired by r338025, just remove the element size parameter to the
MODULE_PNP_INFO macro entirely.  The 'table' parameter is now required to
have correct pointer (or array) type.  Since all invocations of the macro
already had this property and the emitted PNP data continues to include the
element size, there is no functional change.

Mostly done with the coccinelle 'spatch' tool:

  $ cat modpnpsize0.cocci
    @normaltables@
    identifier b,c;
    expression a,d,e;
    declarer MODULE_PNP_INFO;
    @@
     MODULE_PNP_INFO(a,b,c,d,
    -sizeof(d[0]),
     e);

    @singletons@
    identifier b,c,d;
    expression a;
    declarer MODULE_PNP_INFO;
    @@
     MODULE_PNP_INFO(a,b,c,&d,
    -sizeof(d),
     1);

  $ rg -l MODULE_PNP_INFO -- sys | \
    xargs spatch --in-place --sp-file modpnpsize0.cocci

(Note that coccinelle invokes diff(1) via a PATH search and expects diff to
tolerate the -B flag, which BSD diff does not.  So I had to link gdiff into
PATH as diff to use spatch.)

Tinderbox'd (-DMAKE_JUST_KERNELS).
2018-08-19 00:22:21 +00:00
Navdeep Parhar
32a52e9e39 if_vlan(4): A VLAN always has a PCP and its ifnet's if_pcp should be set
to the PCP value in use instead of IFNET_PCP_NONE.

MFC after:	1 week
Sponsored by:	Chelsio Communications
2018-08-17 01:03:23 +00:00
Navdeep Parhar
32d2623ae2 Add the ability to look up the 3b PCP of a VLAN interface. Use it in
toe_l2_resolve to fill up the complete vtag and not just the vid.

Reviewed by:	kib@
MFC after:	1 week
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D16752
2018-08-16 23:46:38 +00:00
Matt Macy
f9be038601 Fix in6_multi double free
This is actually several different bugs:
- The code is not designed to handle inpcb deletion after interface deletion
  - add reference for inpcb membership
- The multicast address has to be removed from interface lists when the refcount
  goes to zero OR when the interface goes away
  - decouple list disconnect from refcount (v6 only for now)
- ifmultiaddr can exist past being on interface lists
  - add flag for tracking whether or not it's enqueued
- deferring freeing moptions makes the incpb cleanup code simpler but opens the
  door wider still to races
  - call inp_gcmoptions synchronously after dropping the the inpcb lock

Fundamentally multicast needs a rewrite - but keep applying band-aids for now.

Tested by: kp
Reported by: novel, kp, lwhsu
2018-08-15 20:23:08 +00:00
Andrew Gallatin
5ccac9f972 lagg: allow lacp to manage the link state
Lacp needs to manage the link state itself. Unlike other
lagg protocols, the ability of lacp to pass traffic
depends not only on the lagg members having link, but also
on the lacp protocol converging to a distributing state with the
link partner.

If we prematurely mark the link as up, then we will send a
gratuitous arp (via arp_handle_ifllchange()) before the lacp
interface is capable of passing traffic. When this happens,
the gratuitous arp is lost, and our link partner may cache
a stale mac address (eg, when the base mac address for the
lagg bundle changes, due to a BIOS change re-ordering NIC
unit numbers)

Reviewed by: jtl, hselasky
Sponsored by: Netflix
2018-08-13 14:13:25 +00:00
Kristof Provost
91e0f2d200 pf: Increase default hash table size
Now that we (by default) limit the number of states to 100.000 it makse sense
to also adjust the default size of the hash table.

Based on the benchmarking results in
https://github.com/ocochard/netbenches/blob/master/Atom_C2758_8Cores-Chelsio_T540-CR/pf-states_hashsize/results/fbsd12-head.r332390/README.md
128K entries offers a good compromise between performance and memory use.

Users may still overrule this setting with the net.pf.states_hashsize and
net.pf.source_nodes_hashsize loader(8) tunables.
2018-08-05 13:54:37 +00:00
Patrick Kelsey
8f410865b8 Mark the send queue ready so ALTQ is available. 2018-08-04 01:45:17 +00:00
Andrew Turner
b6ea4c5a2a As with DPCPU_DEFINE_STATIC make VNET_DEFINE_STATIC non-static on arm64 in
modules. It also fails in the same way, we are unable to relocate static
variables as the compiler uses PC-relative loads with nothing for the
kernel linker to relocate.

Sponsored by:	DARPA, AFRL
2018-07-30 15:05:07 +00:00
Andrew Turner
cd2106eaea Ensure the DPCPU and VNET module spaces are aligned to hold a pointer.
Previously they may have been aligned to a char, leading to misaligned
DPCPU and VNET variables.

Sponsored by:	DARPA, AFRL
2018-07-30 14:25:17 +00:00
Andrew Turner
bc61d94997 As with DPCPU_DEFINE make it a compile error to use static with VNET_DEFINE.
There is the VNET_DEFINE_STATIC macro for that.
2018-07-30 12:44:44 +00:00
Patrick Kelsey
b8ca475604 ALTQ support for iflib.
Reviewed by:	jmallett, mmacy
Differential Revision:	https://reviews.freebsd.org/D16433
2018-07-25 22:46:36 +00:00
Marius Strobl
c9a49a4fd8 Since r336611, n is only used for INET in iflib_parse_header().
Reported by:	rpokala
2018-07-24 23:40:27 +00:00
Andrew Turner
5f901c92a8 Use the new VNET_DEFINE_STATIC macro when we are defining static VNET
variables.

Reviewed by:	bz
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D16147
2018-07-24 16:35:52 +00:00
Andrew Turner
fceba23f93 As with DPCPU create VNET_DEFINE_STATIC for when a variable needs to be
declaired static. This will allow us to change the definition on arm64
as it has the same issues described in r336349.

Reviewed by:	bz
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D16147
2018-07-24 16:31:16 +00:00
Eugene Grosbein
804771f553 epair(4): make sure we do not duplicate MAC addresses
in case of reused if_index.

PR:		229957
Tested by:	O. Hartmann <ohartmann@walstatt.org>
Approved by:	avg (mentor)
2018-07-23 07:11:58 +00:00
Marius Strobl
7474544bac Use the maximum of isc_tx_{nsegments,tso_segments_max} for MAX_TX_DESC.
Since r336313, TSO support for LEM-class devices is removed again as it
was before the conversion of {l,}em(4) to iflib(4) in r311849 and as a
result, isc_tx_tso_segments_max is 0 for LEM-class devices now. Thus,
inappropriate watermarks were used for this class.

This is really only a band-aid, though, because so far iflib(9) doesn't
fully take into account that DMA engines can support different maxima
of segments for transfers of TSO and non-TSO packets. For example, the
DESC_RECLAIMABLE macro is based on isc_tx_nsegments while MAX_TX_DESC
used isc_tx_tso_segments_max only. For most in-tree consumers that
doesn't make a difference as the maxima are the same for both kinds of
transfers (that is, apart from the fact that TSO may require up to 2
sentinel descriptors but also not with every MAC supported). However,
isc_tx_nsegments is 8 but isc_tx_tso_segments_max is 85 by default
with ixl(4).
2018-07-22 17:51:11 +00:00
Marius Strobl
8b8d90931d - Given that the controlling expression of the receive loop in iflib_rxeof()
tests for avail > 0, avail can never be 0 within that loop. Thus, move
  decrementing avail and budget_left into the loop and before the code which
  checks for additional descriptors having become available in case all the
  previous ones have been processed but there still is budget left so the
  latter code works as expected. [1]
- In iflib_{busdma_load_mbuf_sg,parse_header}(), remove dead stores to m
  and n respectively. [2, 3]
- In collapse_pkthdr(), ensure that m_next isn't NULL before dereferencing
  it. [4]
- Remove a duplicate assignment of segs in iflib_encap().

Reported by:	Coverity
CID:		1356027 [1], 1356047 [2], 1368205 [3], 1356028 [4]
2018-07-22 17:45:44 +00:00
Stephen Hurd
fe51d4cdfe Add knob to control tx ring abdication.
r323954 changed the mp ring behaviour when 64-bit atomics were
available to abdicate the TX ring rather than having one become a
consumer thereby running to completion on TX. The consumer of the mp
ring was then triggered in the tx task rather than blocking the TX call.
While this significantly lowered the number of RX drops in small-packet
forwarding, it also negatively impacts TX performance.

With this change, the default behaviour is reverted, causing one TX ring
to become a consumer during the enqueue call. A new sysctl,
dev.X.Y.iflib.tx_abdicate is added to control this behaviour.

Reviewed by:	gallatin
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D16302
2018-07-20 17:45:26 +00:00
Stephen Hurd
dd7fbcf193 Improve netmap TX handling when TX IRQs are not used/supported
Use the timer to poll for TX completions when there are
outstanding TX slots. Track when the last driver timer was called
to prevent overcalling it. Also clean up some kring vs NIC ring
usage.

Reviewed by:	marius, Johannes Lundberg <johalun0@gmail.com>
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D16300
2018-07-20 17:24:45 +00:00
Andrey V. Elsukov
acf673edf0 Move invoking of callout_stop(&lle->lle_timer) into llentry_free().
This deduplicates the code a bit, and also implicitly adds missing
callout_stop() to in[6]_lltable_delete_entry() functions.

PR:		209682, 225927
Submitted by:	hselasky (previous version)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D4605
2018-07-17 11:33:23 +00:00
Marius Strobl
7f87c0406d Assorted TSO fixes for em(4)/iflib(9) and dead code removal:
- Ever since the workaround for the silicon bug of TSO4 causing MAC hangs
  was committed in r295133, CSUM_TSO always got disabled unconditionally
  by em(4) on the first invocation of em_init_locked(). However, even with
  that problem fixed, it turned out that for at least e. g. 82579 not all
  necessary TSO workarounds are in place, still causing MAC hangs even at
  Gigabit speed. Thus, for stable/11, TSO usage was deliberately disabled
  in r323292 (r323293 for stable/10) for the EM-class by default, allowing
  users to turn it on if it happens to work with their particular EM MAC
  in a Gigabit-only environment.
  In head, the TSO workaround for speeds other than Gigabit was lost with
  the conversion to iflib(9) in r311849 (possibly along with another one
  or two TSO workarounds). Yet at the same time, for EM-class MACs TSO4
  got enabled by default again, causing device hangs. Therefore, change the
  default for this hardware class back to have TSO4 off, allowing users
  to turn it on manually if it happens to work in their environment as
  we do in stable/{10,11}. An alternative would be to add a whitelist of
  EM-class devices where TSO4 actually is reliable with the workarounds in
  place, but given that the advantage of TSO at Gigabit speed is rather
  limited - especially with the overhead of these workarounds -, that's
  really not worth it. [1]
  This change includes the addition of an isc_capabilities to struct
  if_softc_ctx so iflib(9) can also handle interface capabilities that
  shouldn't be enabled by default which is used to handle the default-off
  capabilities of e1000 as suggested by shurd@ and moving their handling
  from em_setup_interface() to em_if_attach_pre() accordingly.
- Although 82543 support TSO4 in theory, the former lem(4) didn't have
  support for TSO4, presumably because TSO4 is even more broken in the
  LEM-class of MACs than the later EM ones. Still, TSO4 for LEM-class
  devices was enabled as part of the conversion to iflib(9) in r311849,
  causing device hangs. So revert back to the pre-r311849 behavior of
  not supporting TSO4 for LEM-class at all, which includes not creating
  a TSO DMA tag in iflib(9) for devices not having IFCAP_TSO4 set. [2]
- In fact, the FreeBSD TCP stack can handle a TSO size of IP_MAXPACKET
  (65535) rather than FREEBSD_TSO_SIZE_MAX (65518). However, the TSO
  DMA must have a maxsize of the maximum TSO size plus the size of a
  VLAN header for software VLAN tagging. The iflib(9) converted em(4),
  thus, first correctly sets scctx->isc_tx_tso_size_max to EM_TSO_SIZE
  in em_if_attach_pre(), but later on overrides it with IP_MAXPACKET
  in em_setup_interface() (apparently, left-over from pre-iflib(9)
  times). So remove the later and correct iflib(9) to correctly cap
  the maximum TSO size reported to the stack at IP_MAXPACKET. While at
  it, let iflib(9) use if_sethwtsomax*().
  This change includes the addition of isc_tso_max{seg,}size DMA engine
  constraints for the TSO DMA tag to struct if_shared_ctx and letting
  iflib_txsd_alloc() automatically adjust the maxsize of that tag in case
  IFCAP_VLAN_MTU is supported as requested by shurd@.
- Move the if_setifheaderlen(9) call for adjusting the maximum Ethernet
  header length from {ixgbe,ixl,ixlv,ixv,em}_setup_interface() to iflib(9)
  so adjustment is automatically done in case IFCAP_VLAN_MTU is supported.
  As a consequence, this adjustment now is also done in case of bnxt(4)
  which missed it previously.
- Move the reduction of the maximum TSO segment count reported to the
  stack by the number of m_pullup(9) calls (which in the worst case,
  can add another mbuf and, thus, the requirement for another DMA
  segment each) in the transmit path for performance reasons from
  em_setup_interface() to iflib_txsd_alloc() as these pull-ups are now
  done in iflib_parse_header() rather than in the no longer existing
  em_xmit(). Moreover, this optimization applies to all drivers using
  iflib(9) and not just em(4); all in-tree iflib(9) consumers still
  have enough room to handle full size TSO packets. Also, reduce the
  adjustment to the maximum number of m_pullup(9)'s now performed in
  iflib_parse_header().
- Prior to the conversion of em(4)/igb(4)/lem(4) and ixl(4) to iflib(9)
  in r311849 and r335338 respectively, these drivers didn't enable
  IFCAP_VLAN_HWFILTER by default due to VLAN events not being passed
  through by lagg(4). With iflib(9), IFCAP_VLAN_HWFILTER was turned on
  by default but also lagg(4) was fixed in that regard in r203548. So
  just remove the now redundant and defunct IFCAP_VLAN_HWFILTER handling
  in {em,ixl,ixlv}_setup_interface().
- Nuke other redundant IFCAP_* setting in {em,ixl,ixlv}_setup_interface()
  which is (more completely) already done in {em,ixl,ixlv}_if_attach_pre()
  now.
- Remove some redundant/dead setting of scctx->isc_tx_csum_flags in
  em_if_attach_pre().
- Remove some IFCAP_* duplicated either directly or indirectly (e. g.
  via IFCAP_HWCSUM) in {EM,IGB,IXL}_CAPS.
- Don't bother to fiddle with IFCAP_HWSTATS in ixgbe(4)/ixgbev(4) as
  iflib(9) adds that capability unconditionally.
- Remove some unused macros from em(4).
- Bump __FreeBSD_version as some of the above changes require the modules
  of drivers using iflib(9) to be recompiled.

Okayed by:	sbruno@ at 201806 DevSummit Transport Working Group [1]
Reviewed by:	sbruno (earlier version), erj
PR:	219428 (part of; comment #10) [1], 220997 (part of; comment #3) [2]
Differential Revision:	https://reviews.freebsd.org/D15720
2018-07-15 19:04:23 +00:00
Kristof Provost
b895839daa pf: Fix typo in r336221
Reported by:	olivier@
2018-07-12 18:07:28 +00:00
Kristof Provost
f4cdb8aedd pf: Increate default state table size
The typical system now has a lot more memory than when pf was new, and is also
expected to handle more connections. Increase the default size of the state
table.
Note that users can overrule this using 'set limit states' in pf.conf.

From OpenBSD:
    The year is 2018.
    Mercury, Bowie, Cash, Motorola and DEC all left us.
    Just pf still has a default state table limit of 10000.
    Had! Now it's a tiny little bit more, 100k.
    lead guitar: me
    ok chorus: phessler theo claudio benno
    background school girl laughing: bob

Obtained from:	OpenBSD
2018-07-12 16:35:35 +00:00
Andrey V. Elsukov
98a8fdf6da Deduplicate the code.
Add generic function if_tunnel_check_nesting() that does check for
allowed nesting level for tunneling interfaces and also does loop
detection. Use it in gif(4), gre(4) and me(4) interfaces.

Differential Revision:	https://reviews.freebsd.org/D16162
2018-07-09 11:03:28 +00:00
Sean Bruno
2da1967762 struct ifmediareq *ifmrp is only used in the COMPAT_FREEBSD32 parts of
ifioctl().  Move it inside the proper #ifdef.  This was throwing a valid
"Assigned but unused" warning with gcc.

Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D16063
2018-07-07 13:35:06 +00:00
Will Andrews
cc535c95ca Revert r335833.
Several third-parties use at least some of these ioctls.  While it would be
better for regression testing if they were used in base (or at least in the
test suite), it's currently not worth the trouble to push through removal.

Submitted by:	antoine, markj
2018-07-04 03:36:46 +00:00
Matt Macy
6573d7580b epoch(9): allow preemptible epochs to compose
- Add tracker argument to preemptible epochs
- Inline epoch read path in kernel and tied modules
- Change in_epoch to take an epoch as argument
- Simplify tfb_tcp_do_segment to not take a ti_locked argument,
  there's no longer any benefit to dropping the pcbinfo lock
  and trying to do so just adds an error prone branchfest to
  these functions
- Remove cases of same function recursion on the epoch as
  recursing is no longer free.
- Remove the the TAILQ_ENTRY and epoch_section from struct
  thread as the tracker field is now stack or heap allocated
  as appropriate.

Tested by: pho and Limelight Networks
Reviewed by: kbowling at llnw dot com
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16066
2018-07-04 02:47:16 +00:00
Will Andrews
c1887e9f09 pf: remove unused ioctls.
Several ioctls are unused in pf, in the sense that no base utility
references them.  Additionally, a cursory review of pf-based ports
indicates they're not used elsewhere either.  Some of them have been
unused since the original import.  As far as I can tell, they're also
unused in OpenBSD.  Finally, removing this code removes the need for
future pf work to take them into account.

Reviewed by:		kp
Differential Revision:	https://reviews.freebsd.org/D16076
2018-07-01 01:16:03 +00:00
Andrey V. Elsukov
6e081509db Add NULL pointer check.
encap_lookup_t method can be invoked by IP encap subsytem even if none
of gif/gre/me interfaces are exist. Hash tables are allocated on demand,
when first interface is created. So, make NULL pointer check before
doing access to hash table.

PR:		229378
2018-06-28 11:39:27 +00:00
Andrey V. Elsukov
ca3cd72b17 Move BPFIF_* macro definitions into .c file, where struct bpf_if is
declared.

They are only used in this file and there is no need to export them via
bpfdesc.h.
2018-06-19 10:34:45 +00:00
Eric Joyner
dfae03b5b5 iflib: Style fixes
MFC after:	1 week
2018-06-18 17:27:43 +00:00
Marius Strobl
e4defe55a8 Assorted fixes to MSI-X/MSI/INTx setup in iflib(9):
- In iflib_msix_init(), VMMs with broken MSI-X activation are trying
  to be worked around by manually enabling PCIM_MSIXCTRL_MSIX_ENABLE
  before calling pci_alloc_msix(9). Apart from constituting a layering
  violation, this has the problem of leaving PCIM_MSIXCTRL_MSIX_ENABLE
  enabled when falling back to MSI or INTx when e. g. MSI-X is black-
  listed and initially also when disabled via hw.pci.enable_msix. The
  later in turn was incorrectly worked around in r325166.
  Since r310806, pci(4) itself has code to deal with broken MSI-X
  handling of VMMs, so all of these workarounds in iflib(9) can go,
  fixing non-working interrupts when falling back to MSI/INTx. In
  any case, possibly further adjustments to broken MSI-X activation
  of VMMs like enabling r310806 by default in VM environments need to
  be placed into pci(4), not iflib(9). [1]
- Also remove the pci_enable_busmaster(9) call from iflib_msix_init(),
  which is already more properly invoked from iflib_device_attach().
- When falling back to MSI/INTx, release the MSI-X BAR resource again.
- When falling back to INTx, ensure scctx->isc_vectors is set to 1 and
  not to something higher from a device with more than one MSI message
  supported.
- Make the nearby ring_state(s) stuff (static) const.

Discussed with:	jhb at BSDCan 2018 [1]
Reviewed by:	imp, jhb
Differential Revision:	https://reviews.freebsd.org/D15729
2018-06-17 20:33:02 +00:00
Andrey V. Elsukov
ae11b3829c Fix typo.
Reported by:	rpokala
2018-06-16 19:21:09 +00:00
Andrey V. Elsukov
20efcfc602 Switch RIB and RADIX_NODE_HEAD lock from rwlock(9) to rmlock(9).
Using of rwlock with multiqueue NICs for IP forwarding on high pps
produces high lock contention and inefficient. Rmlock fits better for
such workloads.

Reviewed by:	melifaro, olivier
Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D15789
2018-06-16 08:26:23 +00:00
Andrey V. Elsukov
9597ff83b8 Add missing BPF_MTAP2() for outbound packets. 2018-06-14 15:04:30 +00:00
Andrey V. Elsukov
2addcba7d5 Convert if_me(4) driver to use encap_lookup_t method and be lockless on
data path.
2018-06-14 14:53:24 +00:00
Jonathan T. Looney
0766f278d8 Make UMA and malloc(9) return non-executable memory in most cases.
Most kernel memory that is allocated after boot does not need to be
executable.  There are a few exceptions.  For example, kernel modules
do need executable memory, but they don't use UMA or malloc(9).  The
BPF JIT compiler also needs executable memory and did use malloc(9)
until r317072.

(Note that a side effect of r316767 was that the "small allocation"
path in UMA on amd64 already returned non-executable memory.  This
meant that some calls to malloc(9) or the UMA zone(9) allocator could
return executable memory, while others could return non-executable
memory.  This change makes the behavior consistent.)

This change makes malloc(9) return non-executable memory unless the new
M_EXEC flag is specified.  After this change, the UMA zone(9) allocator
will always return non-executable memory, and a KASSERT will catch
attempts to use the M_EXEC flag to allocate executable memory using
uma_zalloc() or its variants.

Allocations that do need executable memory have various choices.  They
may use the M_EXEC flag to malloc(9), or they may use a different VM
interfact to obtain executable pages.

Now that malloc(9) again allows executable allocations, this change also
reverts most of r317072.

PR:		228927
Reviewed by:	alc, kib, markj, jhb (previous version)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D15691
2018-06-13 17:04:41 +00:00
Andrey V. Elsukov
a5185adeb6 Rework if_gre(4) to use encap_lookup_t method to speedup lookup
of needed interface when many gre interfaces are present.

Remove rmlock from gre_softc, use epoch(9) and CK_LIST instead.
Move more AF-related code into AF-related locations. Use hash table to
speedup lookup of needed softc.
2018-06-13 11:11:33 +00:00
Jonathan T. Looney
16a227c7c9 Fix a memory leak for the BIOCSETWF ioctl on kernels with the BPF_JITTER
option.

The BPF code was creating a compiled filter in the common filter-creation
path.  However, BPF only uses compiled filters in the read direction.
When creating a write filter, the common filter-creation code was
creating an unneeded write filter and leaking the memory used for that.

MFC after:	2 weeks
Sponsored by:	Netflix
2018-06-11 23:32:06 +00:00
Andrey V. Elsukov
44bcc06816 Explicitly change the link state when we assingn an address.
Since we are setting IFF_UP flag on SIOCSIFADDR, it is possible, that
after this link state information still not initialized properly.
This leads to problems with routing, since now interface has
IFCAP_LINKSTATE capability and a route is considered as working only
when interface's link state is in LINK_STATE_UP (see RT_LINK_IS_UP()
macro).

Reported by:	Marek Zarychta
MFC after:	3 days
2018-06-09 09:57:14 +00:00
Stephen Hurd
3ab4a96085 Remove tx task spinning added in r333686
This caused issues with PASTE.  Just remove the reschedule since the DELAY()
should be enough for use cases such as pkt-gen which were failing before the
change.

Reported by:	Michio Honda
Sponsored by:	Limelight Networks
2018-06-08 21:49:19 +00:00
Mateusz Guzik
b8af2820f6 uma: fix up r334824
Turns out there is code which ends up passing M_ZERO to counters.
Since counters zero unconditionally on their own, just ignore drop the
flag in that place.
2018-06-08 05:40:36 +00:00
Matt Macy
58378a8971 rtentry_zinit: don't blindly pass through M_ZERO to counter alloc 2018-06-08 05:17:06 +00:00
Eric Joyner
a06424ddd3 iflib: Record TCP checksum info in iflib when TCP checksum is requested
ixl(4) (when it switches over to using iflib) devices need the TCP header
length in order to do TCP checksum offload.

Reviewed by:	gallatin@, shurd@
MFC after:	1 week
Sponsored by:	Intel Corporation
Differential Revision:	https://reviews.freebsd.org/D15558
2018-06-07 13:03:07 +00:00
Andrey V. Elsukov
b941bc1d6e Rework if_gif(4) to use new encap_lookup_t method to speedup lookup
of needed interface when many gif interfaces are present.

Remove rmlock from gif_softc, use epoch(9) and CK_LIST instead.
Move more AF-related code into AF-related locations.
Use hash table to speedup lookup of needed softc. Interfaces
with GIF_IGNORE_SOURCE flag are stored in plain CK_LIST.
Sysctl net.link.gif.parallel_tunnels is removed. The removal was planed
16 years ago, and actually it could work only for outbound direction.
Each protocol, that can be handled by if_gif(4) interface is registered
by separate encap handler, this helps avoid invoking the handler
for unrelated protocols (GRE, PIM, etc.).

This change allows dramatically improve performance when many gif(4)
interfaces are used.

Sponsored by:	Yandex LLC
2018-06-05 21:24:59 +00:00
Andrey V. Elsukov
6d8fdfa9d5 Rework IP encapsulation handling code.
Currently it has several disadvantages:
- it uses single mutex to protect internal structures. It is used by
  data- and control- path, thus there are no parallelism at all.
- it uses single list to keep encap handlers for both INET and INET6
  families.
- struct encaptab keeps unneeded information (src, dst, masks, protosw),
  that isn't used by code in the source tree.
- matches are prioritized and when many tunneling interfaces are
  registered, encapcheck handler of each interface is invoked for each
  packet. The search takes O(n) for n interfaces. All this work is done
  with exclusive lock held.

What this patch includes:
- the datapath is converted to be lockless using epoch(9) KPI.
- struct encaptab now linked using CK_LIST.
- all unused fields removed from struct encaptab. Several new fields
  addedr: min_length is the minimum packet length, that encapsulation
  handler expects to see; exact_match is maximum number of bits, that
  can return an encapsulation handler, when it wants to consume a packet.
- IPv6 and IPv4 handlers are stored in separate lists;
- added new "encap_lookup_t" method, that will be used later. It is
  targeted to speedup lookup of needed interface, when gif(4)/gre(4) have
  many interfaces.
- the need to use protosw structure is eliminated. The only pr_input
  method was used from this structure, so I don't see the need to keep
  using it.
- encap_input_t method changed to avoid using mbuf tags to store softc
  pointer. Now it is passed directly trough encap_input_t method.
  encap_getarg() funtions is removed.
- all sockaddr structures and code that uses them removed. We don't have
  any code in the tree that uses them. All consumers use encap_attach_func()
  method, that relies on invoking of encapcheck() to determine the needed
  handler.
- introduced struct encap_config, it contains parameters of encap handler
  that is going to be registered by encap_attach() function.
- encap handlers are stored in lists ordered by exact_match value, thus
  handlers that need more bits to match will be checked first, and if
  encapcheck method returns exact_match value, the search will be stopped.
- all current consumers changed to use new KPI.

Reviewed by:	mmacy
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D15617
2018-06-05 20:51:01 +00:00
Matt Macy
a6bc59f203 Reduce overhead of entropy collection
- move harvest mask check inline
- move harvest mask to frequently_read out of actively
  modified cache line
- disable ether_input collection and describe its limitations
  in NOTES

Typically entropy collection in ether_input was stirring zero
in to the entropy pool while at the same time greatly reducing
max pps. This indicates that perhaps we should more closely
scrutinize how much entropy we're getting from a given source
as well as what our actual entropy collection needs are for
seeding Yarrow.

Reviewed by: cem, gallatin, delphij
Approved by: secteam
Differential Revision: https://reviews.freebsd.org/D15526
2018-05-31 21:53:07 +00:00
Hans Petter Selasky
5fd1ea0810 Re-apply r190640.
- Restore local change to include <net/bpf.h> inside pcap.h.
This fixes ports build problems.
- Update local copy of dlt.h with new DLT types.
- Revert no longer needed <net/bpf.h> includes which were added
as part of r334277.

Suggested by:	antoine@, delphij@, np@
MFC after:	3 weeks
Sponsored by:	Mellanox Technologies
2018-05-31 09:11:21 +00:00
Matt Macy
91d6c9b93e if_setlladdr: don't call ioctl in epoch context
PR: 228612
Reported by: markj
2018-05-30 21:46:10 +00:00
Kristof Provost
d25c25dc52 pf: Add missing include statement
rmlocks require <sys/lock.h> as well as <sys/rmlock.h>.
Unbreak mips build.
2018-05-30 12:40:37 +00:00
Kristof Provost
455969d305 pf: Replace rwlock on PF_RULES_LOCK with rmlock
Given that PF_RULES_LOCK is a mostly read lock, replace the rwlock with rmlock.
This change improves packet processing rate in high pps environments.
Benchmarking by olivier@ shows a 65% improvement in pps.

While here, also eliminate all appearances of "sys/rwlock.h" includes since it
is not used anymore.

Submitted by:	farrokhi@
Differential Revision:	https://reviews.freebsd.org/D15502
2018-05-30 07:11:33 +00:00
Stephen Hurd
3e0e6330b5 iflib: mark irq allocation name parameter as constant
The *name parameter passed to iflib_irq_alloc_generic and
iflib_softirq_alloc_generic is never modified. Many places in code pass
string literals and thus should not be modified.

Mark the *name parameter as a const char * instead, so that we enforce
that the name is not modified before passing to bus_describe_intr()

Submitted by:	Jacob Keller <jacob.e.keller@intel.com>
Reviewed by:	kmacy
Sponsored by:	Intel Corporation
Differential Revision:	https://reviews.freebsd.org/D15343
2018-05-29 21:56:39 +00:00
Matt Macy
6c3c319414 iflib: hold context lock across detach for drivers that need it 2018-05-29 18:03:43 +00:00
Matt Macy
134804c89a rt_getifa_fib: don't use ifa but info->rti_ifa
Reported by:	kp
2018-05-29 07:14:57 +00:00
Matt Macy
1ebec5faf4 route: fix missed ref adds
- ensure that we bump the ifa ref whenever we add a reference
 - defer freeing epoch protected references until after the if_purgaddrs
   loop
2018-05-29 00:53:53 +00:00
Eric Joyner
1d7ef1867a iflib: Add new shared flag: IFLIB_ADMIN_ALWAYS_RUN
ixl(4)'s nvmupdate utility expects the nvmupdate process to run
while the interface is down; these nvm update commands use the
admin queue, so the admin queue needs to be able to generate
interrupts and be processed while the interface is down.

So add a flag that ixl(4) sets that lets the entire admin task
run even when the interface is marked down/IFF_DRV_RUNNING isn't set.

With this change, nvmupdate should function like it did pre-iflib.

Reviewed by:	gallatin@, sbruno@
MFC after:	1 week
Sponsored by:	Intel Corporation
Differential Revision:	https://reviews.freebsd.org/D15575
2018-05-26 00:46:08 +00:00
Matt Macy
9379029a92 rtrequest1_fib: we need to always bump the ifaddr refcount when we take a reference from
an rtentry. r334118 introduced a case when this was not done.

While we're here make the intent more obvious by moving the refcount
bump down to when we know we'll actually need it.

Reported by:	markj
2018-05-25 19:48:26 +00:00
Matt Macy
0f8d79d977 CK: update consumers to use CK macros across the board
r334189 changed the fields to have names distinct from those in queue.h
in order to expose the oversights as compile time errors
2018-05-24 23:21:23 +00:00
Matt Macy
5328b11c95 if_delgroups: add missed unlock introduced by r334118 2018-05-24 17:54:08 +00:00
Matt Macy
4f6c66cc9c UDP: further performance improvements on tx
Cumulative throughput while running 64
  netperf -H $DUT -t UDP_STREAM -- -m 1
on a 2x8x2 SKL went from 1.1Mpps to 2.5Mpps

Single stream throughput increases from 910kpps to 1.18Mpps

Baseline:
https://people.freebsd.org/~mmacy/2018.05.11/udpsender2.svg

- Protect read access to global ifnet list with epoch
https://people.freebsd.org/~mmacy/2018.05.11/udpsender3.svg

- Protect short lived ifaddr references with epoch
https://people.freebsd.org/~mmacy/2018.05.11/udpsender4.svg

- Convert if_afdata read lock path to epoch
https://people.freebsd.org/~mmacy/2018.05.11/udpsender5.svg

A fix for the inpcbhash contention is pending sufficient time
on a canary at LLNW.

Reviewed by:	gallatin
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D15409
2018-05-23 21:02:14 +00:00
Luca Pizzamiglio
11d416666d Improve MAC address uniqueness on if_epair(4).
As reported in PR184149, it can happen that epair devices can have the same
MAC address.
This solution is based on a 32-bit hash, obtained combining the if_index of
the a interface and the hostid.
If the hostid is zero, a random number is used.

PR:		184149
Reviewed by:	wollman, eugen
Approved by:	cognet
Differential Revision:	https://reviews.freebsd.org/D15329
2018-05-23 13:10:57 +00:00
Mark Johnston
db5a36bddf Simplify lagg_input().
No functional change intended.

MFC after:	2 weeks
2018-05-22 15:35:38 +00:00
Matt Macy
fd04260d3f ck: simplify interface with libkvm consumers by defining ck_queue types
as their queue.h equivalents if !_KERNEL
2018-05-21 01:53:23 +00:00
Matt Macy
f6cb0dea4c net: fix uninitialized variable warning 2018-05-19 19:00:04 +00:00
Matt Macy
e335651e1e mp_ring: fix i386
Even though 64-bit atomics are supported on i386 there are panics
indicating that the code does not work correctly there. Switch
to mutex based variant (and fix that while we're here).

Reported by:	pho, kib
2018-05-19 16:44:12 +00:00
Matt Macy
46d0f824be net: fix set but not used 2018-05-19 05:27:49 +00:00
Matt Macy
d7c5a620e2 ifnet: Replace if_addr_lock rwlock with epoch + mutex
Run on LLNW canaries and tested by pho@

gallatin:
Using a 14-core, 28-HTT single socket E5-2697 v3 with a 40GbE MLX5
based ConnectX 4-LX NIC, I see an almost 12% improvement in received
packet rate, and a larger improvement in bytes delivered all the way
to userspace.

When the host receiving 64 streams of netperf -H $DUT -t UDP_STREAM -- -m 1,
I see, using nstat -I mce0 1 before the patch:

InMpps OMpps  InGbs  OGbs err TCP Est %CPU syscalls csw     irq GBfree
4.98   0.00   4.42   0.00 4235592     33   83.80 4720653 2149771   1235 247.32
4.73   0.00   4.20   0.00 4025260     33   82.99 4724900 2139833   1204 247.32
4.72   0.00   4.20   0.00 4035252     33   82.14 4719162 2132023   1264 247.32
4.71   0.00   4.21   0.00 4073206     33   83.68 4744973 2123317   1347 247.32
4.72   0.00   4.21   0.00 4061118     33   80.82 4713615 2188091   1490 247.32
4.72   0.00   4.21   0.00 4051675     33   85.29 4727399 2109011   1205 247.32
4.73   0.00   4.21   0.00 4039056     33   84.65 4724735 2102603   1053 247.32

After the patch

InMpps OMpps  InGbs  OGbs err TCP Est %CPU syscalls csw     irq GBfree
5.43   0.00   4.20   0.00 3313143     33   84.96 5434214 1900162   2656 245.51
5.43   0.00   4.20   0.00 3308527     33   85.24 5439695 1809382   2521 245.51
5.42   0.00   4.19   0.00 3316778     33   87.54 5416028 1805835   2256 245.51
5.42   0.00   4.19   0.00 3317673     33   90.44 5426044 1763056   2332 245.51
5.42   0.00   4.19   0.00 3314839     33   88.11 5435732 1792218   2499 245.52
5.44   0.00   4.19   0.00 3293228     33   91.84 5426301 1668597   2121 245.52

Similarly, netperf reports 230Mb/s before the patch, and 270Mb/s after the patch

Reviewed by:	gallatin
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D15366
2018-05-18 20:13:34 +00:00
Matt Macy
f2d19f98c1 epoch(9): allocate net epochs earlier in boot 2018-05-18 18:48:00 +00:00
Matt Macy
d71e30de40 epoch: move epoch variables to read mostly section 2018-05-18 17:58:15 +00:00
Ed Maste
891cf3ed44 Use NULL for SYSINIT's last arg, which is a pointer type
Sponsored by:	The FreeBSD Foundation
2018-05-18 17:58:09 +00:00
Matt Macy
70398c2f86 epoch(9): Make epochs non-preemptible by default
There are risks associated with waiting on a preemptible epoch section.
Change the name to make them not be the default and document the issue
under CAVEATS.

Reported by:	markj
2018-05-18 17:29:43 +00:00
Matt Macy
5e68a3dfe3 epoch: add non-preemptible "critical" variant
adds:
- epoch_enter_critical() - can be called inside a different epoch,
  starts a section that will acquire any MTX_DEF mutexes or do
  anything that might sleep.
- epoch_exit_critical() - corresponding exit call
- epoch_wait_critical() - wait variant that is guaranteed that any
  threads in a section are running.
- epoch_global_critical - an epoch_wait_critical safe epoch instance

Requested by:   markj
Approved by:	sbruno
2018-05-18 01:52:51 +00:00
Matt Macy
2aa6f526de Fix !netmap build post r333686
Approved by:	sbruno
2018-05-16 22:25:47 +00:00
Stephen Hurd
5ee36c68e1 Work around lack of TX IRQs in iflib for netmap
When poll() is called via netmap, txsync is initially called,
and if there are no available buffers to reclaim, it waits for the driver
to notify of new buffers. Since the TX IRQ is generally not used in iflib
drivers, this ends up causing a timeout.

Work around this by having the reclaim DELAY(1) if it's initially unable
to reclaim anything, then schedule the tx task, which will spin by
continuously rescheduling itself until some buffers are reclaimed. In
general, the delay is enough to allow some buffers to be reclaimed, so
spinning is minimized.

Reported by:	Johannes Lundberg <johalun0@gmail.com>
Reviewed by:	sbruno
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D15455
2018-05-16 21:03:22 +00:00
Stephen Hurd
99031b8f7d Replace rmlock with epoch in lagg
Use the new epoch based reclamation API. Now the hot paths will not
block at all, and the sx lock is used for the softc data.  This fixes LORs
reported where the rwlock was obtained when the sxlock was held.

Submitted by:	mmacy
Reported by:	Harry Schmalzbauer <freebsd@omnilan.de>
Reviewed by:	sbruno
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D15355
2018-05-14 20:06:49 +00:00
Matt Macy
09f6ff4f1a iflib(9): Add support for cloning pseudo interfaces
Part 3 of many ...
The VPC framework relies heavily on cloning pseudo interfaces
(vmnics, vpc switch, vcpswitch port, hostif, vxlan if, etc).

This pulls in that piece. Some ancillary changes get pulled
in as a side effect.

Reviewed by:	shurd@
Approved by:	sbruno@
Sponsored by:	Joyent, Inc.
Differential Revision:	https://reviews.freebsd.org/D15347
2018-05-11 20:08:28 +00:00
Andrey V. Elsukov
e287c474be Apply the change from r272770 to if_ipsec(4) interface.
It is guaranteed that if_ipsec(4) interface is used only for tunnel
mode IPsec, i.e. decrypted and decapsultaed packet has its own IP header.
Thus we can consider it as new packet and clear the protocols flags.
This allows ICMP/ICMPv6 properly handle errors that may cause this packet.

PR:		228108
MFC after:	1 week
2018-05-11 16:50:25 +00:00
Matt Macy
5c30b378f0 Allow different bridge types to coexist
if_bridge has a lot of limitations that make it scale poorly to higher data
rates. In my projects/VPC branch I leverage the bridge interface between
layers for my high speed soft switch as well as for purposes of stacking
in general.

Reviewed by:	sbruno@
Approved by:	sbruno@
Differential Revision:	https://reviews.freebsd.org/D15344
2018-05-11 05:00:40 +00:00
Dag-Erling Smørgrav
20f8d7bc7e Slight cleanup of interface event logging.
Make if_printf() use vlog() instead of vprintf().  This means it can no
longer return the number of characters printed, as it used to, but every
single call to if_printf() in the entire kernel ignores the return value
anyway; just return 0 so we don't have to change the prototype.

Consistently use if_printf() throughout sys/net/if.c, instead of a
mixture of if_printf() and log().

In ifa_maintain_loopback_route(), don't needlessly log an error if we
either failed to add a route because it already existed or failed to
remove one because it did not.  We still return an error code, though.

MFC after:	1 week
2018-05-11 00:19:49 +00:00
Matt Macy
7bf272a612 Allocate epoch for networking at startup
Additionally add CK to include paths for modules

Approved by:	sbruno@
2018-05-10 19:13:00 +00:00