Commit Graph

4375 Commits

Author SHA1 Message Date
Gleb Smirnoff
0921628ddc Introduce flag IFF_NEEDSEPOCH that marks Ethernet interfaces that
supposedly may call into ether_input() without network epoch.

They all need to be reviewed before 13.0-RELEASE.  Some may need
be fixed.  The flag is not planned to be used in the kernel for
a long time.
2020-01-23 01:41:09 +00:00
Gleb Smirnoff
af614b8e04 tap(4) calls ether_input() in context of write(2). Enter network
epoch here.

The tun(4) side doesn't need this, as netisr code will take care.
2020-01-23 01:38:51 +00:00
Gleb Smirnoff
0b8df657a4 Enter network epoch in iflib rxeof task.
In upcoming changes ether_input() is going to be changed not
to enter the network epoch.  It is going to be responsibility
of network interrupt.  In case of iflib - its taskqueue.
2020-01-23 01:27:58 +00:00
Gleb Smirnoff
6ed3e18711 Mark swi_net() as INTR_TYPE_NET and stop entering epoch there. 2020-01-23 01:25:32 +00:00
Alexander Motin
84becee1ac Update route MTUs for bridge, lagg and vlan interfaces.
Those interfaces may implicitly change their MTU on addition of parent
interface in addition to normal SIOCSIFMTU ioctl path, where the route
MTUs are updated normally.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2020-01-22 20:36:45 +00:00
Alexander V. Chernikov
34a5582c47 Bring back redirect route expiration.
Redirect (and temporal) route expiration was broken a while ago.
This change brings route expiration back, with unified IPv4/IPv6 handling code.

It introduces net.inet.icmp.redirtimeout sysctl, allowing to set
 an expiration time for redirected routes. It defaults to 10 minutes,
 analogues with net.inet6.icmp6.redirtimeout.

Implementation uses separate file, route_temporal.c, as route.c is already
 bloated with tons of different functions.
Internally, expiration is implemented as an per-rnh callout scheduled when
 route with non-zero rt_expire time is added or rt_expire is changed.
 It does not add any overhead when no temporal routes are present.

Callout traverses entire routing tree under wlock, scheduling expired routes
 for deletion and calculating the next time it needs to be run. The rationale
 for such implemention is the following: typically workloads requiring large
 amount of routes have redirects turned off already, while the systems with
 small amount of routes will not inhibit large overhead during tree traversal.

This changes also fixes netstat -rn display of route expiration time, which
 has been broken since the conversion from kread() to sysctl.

Reviewed by:	bz
MFC after:	3 weeks
Differential Revision:	https://reviews.freebsd.org/D23075
2020-01-22 13:53:18 +00:00
Alexander V. Chernikov
16c2f24169 Document requirements for the 'struct route' variations.
MFC after:	2 weeks
2020-01-21 12:00:34 +00:00
Eugene Grosbein
2888eb4091 ifa_maintain_loopback_route: adjust debugging output
Correction after r333476:

- write this as LOG_DEBUG again instead of LOG_INFO;
- get back function name into the message;
- error may be ESRCH if an address is removed in process (by carp f.e.),
not only ENOENT;
- expression complexity grows, so try making it more readable.

MFC after:	1 week
2020-01-18 04:48:05 +00:00
Gleb Smirnoff
66c6c556b6 Change argument order of epoch_call() to more natural, first function,
then its argument.

Reviewed by:	imp, cem, jhb
2020-01-17 06:10:24 +00:00
Gleb Smirnoff
9758a507e9 gif_transmit() must always be called in the network epoch. 2020-01-15 06:18:32 +00:00
Gleb Smirnoff
2a4bd982d0 Introduce NET_EPOCH_CALL() macro and use it everywhere where we free
data based on the network epoch.   The macro reverses the argument
order of epoch_call(9) - first function, then its argument. NFC
2020-01-15 06:05:20 +00:00
Gleb Smirnoff
97168be809 Mechanically substitute assertion of in_epoch(net_epoch_preempt) to
NET_EPOCH_ASSERT(). NFC
2020-01-15 05:45:27 +00:00
Gleb Smirnoff
3264dcadc9 - Move global network epoch definition to epoch.h, as more different
subsystems tend to need to know about it, and including if_var.h is
  huge header pollution for them.  Polluting possible non-network
  users with single symbol seems much lesser evil.
- Remove non-preemptible network epoch.  Not used yet, and unlikely
  to get used in close future.
2020-01-15 03:34:21 +00:00
Vincenzo Maffione
2ec213aba4 netmap: disable passthrough with no hypervisor support
The netmap passthrough subsystem requires proper support in the
hypervisor. In particular, two PCI device ids (from the Red Hat
PCI vendor id 0x1b36) need to be assigned to the two netmap
virtual devices. We then disable these devices until the ids have
not been assigned, in order to avoid conflicts with other
virtual devices emulated by upstream QEMU.

PR:	241774
MFC after:	3 days
2020-01-13 21:47:23 +00:00
Alexander V. Chernikov
ead85fe415 Add fibnum, family and vnet pointer to each rib head.
Having metadata such as fibnum or vnet in the struct rib_head
 is handy as it eases building functionality in the routing space.
This change is required to properly bring back route redirect support.

Reviewed by:	bz
MFC after:	3 weeks
Differential Revision:	https://reviews.freebsd.org/D23047
2020-01-09 17:21:00 +00:00
Mark Johnston
c23df8eafa lagg: Further cleanup of the rr_limit option.
Add an option flag so that arbitrary updates to a lagg's configuration
do not clear sc_stride.  Preseve compatibility for old ifconfig
binaries.  Update ifconfig to use the new flag and improve the casting
used when parsing the option parameter.

Modify the RR transmit function to avoid locklessly reading sc_stride
twice.  Ensure that sc_stride is always 1 or greater.

Reviewed by:	hselasky
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D23092
2020-01-09 14:58:41 +00:00
Kyle Evans
c7bab2a7ca if_vmove: return proper error status
if_vmove can fail if it lost a race and the vnet's already been moved. The
callers (and their callers) can generally cope with this, but right now
success is assumed. Plumb out the ENOENT from if_detach_internal if it
happens so that the error's properly reported to userland.

Reviewed by:	bz, kp
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D22780
2020-01-09 03:52:50 +00:00
Alexander V. Chernikov
e02d3fe70c Fix rtsock route message generation for interface addresses.
Reviewed by:	olivier
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D22974
2020-01-07 21:16:30 +00:00
Eric Joyner
f6afed726b iflib: Prevent watchdog from resetting idle queues
While changing link state in iflib_link_state_change(), queues are
marked as IFLIB_QUEUE_IDLE to disable watchdog. Currently, iflib_timer()
watchdog does not check for previous queue status before marking it as
IFLIB_QUEUE_HUNG.

This patch adds check of queue status before marking it as hung.

Signed-off-by: Piotr Pietruszewski <piotr.pietruszewski@intel.com>

PR:		239240
Submitted by:	Piotr Pietruszewski <piotr.pietruszewski@intel.com>
Reported by:	ultima@
Reviewed by:	gallatin@, erj@
MFC after:	3 days
Sponsored by:	Intel Corporation
Differential Revision:	https://reviews.freebsd.org/D21712
2020-01-02 23:35:06 +00:00
Alexander V. Chernikov
5fcb2832e3 Plug loopback idaddr refcount leak.
Reviewed by:	markj
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D22980
2020-01-02 09:08:45 +00:00
Gleb Smirnoff
8d5c56dab1 In r343631 error code for a packet blocked by a firewall was
changed from EACCES to EPERM.  This change was not intentional,
so fix that.  Return EACCESS if a firewall forbids sending.

Noticed by:	ae
2020-01-01 17:31:43 +00:00
Alexander V. Chernikov
d930203192 Fix NOINET6 build broken by r356236.
MFC after:	2 weeks
2019-12-31 17:57:12 +00:00
Alexander V. Chernikov
c83dda362e Split gigantic rtsock route_output() into smaller functions.
Amount of changes to the original code has been intentionally minimised
to ease diffing.
The changes are mostly mechanical, with the following exceptions:

* lltable handler is now called directly based of RTF_LLINFO flag presense.
* "report" logic for updating rtm in RTM_GET/RTM_DELETE has been simplified,
  fixing several potential use-after-free cases in rt_addrinfo.
* llable asserts has been replaced with error-returning, preventing kernel
  crashes when lltable gw af family is invalid (root required).

MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D22864
2019-12-31 17:26:53 +00:00
Mark Johnston
6b5d8e30f1 Plug some ifaddr refcount leaks.
- Only take an ifaddr ref in in rt_exportinfo() if the caller explicitly
  requests it.  Take care to release it in this case.
- Don't unconditionally take a ref in rtrequest1_fib().  rt_getifa_fib()
  will acquire a reference, in which case we would previously acquire
  two references.
- Stop taking a reference in rtinit1() before calling rtrequest1_fib().
  rtrequest1_fib() will acquire a reference for the RTM_ADD case.

PR:		242746
Reviewed by:	melifaro (previous version)
Tested by:	ghuckriede@blackberry.com
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22912
2019-12-27 01:12:54 +00:00
Mark Johnston
c104c2990d lagg: Clean up handling of the rr_limit option.
- Don't allow an unprivileged user to set the stride. [1]
- Only set the stride under the softc lock.
- Rename the internal fields to accurately reflect their use.  Keep
  ro_bkt to avoid changing the user API.
- Simplify the implementation.  The port index is just sc_seq / stride.
- Document rr_limit in ifconfig.8.

Reported by:	Ilja Van Sprundel <ivansprundel@ioactive.com> [1]
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22857
2019-12-22 21:56:47 +00:00
Cy Schubert
57e22627f9 MFV r353141 (by phillip):
Update libpcap from 1.9.0 to 1.9.1.

MFC after:	2 weeks
2019-12-21 21:01:03 +00:00
Mark Johnston
3f197b134c Deduplicate code between if_delgroup() and if_delgroups().
Fix some style in if_addgroup().  No functional change intended.

Reviewed by:	hselasky
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22892
2019-12-20 20:15:34 +00:00
Mark Johnston
718ef55ec7 Fix a memory leak in if_delgroups() introduced in r334118.
PR:		242712
Submitted by:	ghuckriede@blackberry.com
MFC after:	3 days
2019-12-20 17:21:57 +00:00
Gleb Smirnoff
185c3d2b93 Convert routing statistics to VNET_PCPUSTAT.
Submitted by:	ocochard
Reviewed by:	melifaro, glebius
Differential Revision:	https://reviews.freebsd.org/D22834
2019-12-17 02:02:26 +00:00
John Baldwin
65d2f9c12b Use a void * argument to callout handlers instead of timeout_t casts.
Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D22684
2019-12-05 18:47:29 +00:00
Bjoern A. Zeeb
3232273f42 Allow kernel to compile without BPF.
r297816 added some bpf magic for VIMAGE unconditionally which no longer
allows kernels to compile without bpf (but with other networking).
Add the missing ifdef checks and allow a kernel to compile without bpf
again.

PR:		242136
Reported by:	dave mischler.com
MFC after:	2 weeks
2019-11-24 23:21:47 +00:00
Conrad Meyer
7993a104a1 Add explicit SI_SUB_EPOCH
Add explicit SI_SUB_EPOCH, after SI_SUB_TASKQ and before SI_SUB_SMP
(EARLY_AP_STARTUP).  Rename existing "SI_SUB_TASKQ + 1" to SI_SUB_EPOCH.

epoch(9) consumers cannot epoch_alloc() before SI_SUB_EPOCH:SI_ORDER_SECOND,
but likely should allocate before SI_SUB_SMP.  Prior to this change,
consumers (well, epoch itself, and net/if.c) just open-coded the
SI_SUB_TASKQ + 1 order to match epoch.c, but this was fragile.

Reviewed by:	mmacy
Differential Revision:	https://reviews.freebsd.org/D22503
2019-11-22 23:23:40 +00:00
Vincenzo Maffione
a56e6334d1 netmap: check if we already ran mmap before we attempt it
Submitted by:	neel@neelc.org
Reviewed by:	vmaffione
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D22390
2019-11-19 21:29:49 +00:00
Bjoern A. Zeeb
d9a61c960c if_llatbl: change htable_unlink_entry() to early exist if no work to do
Adjust the logic in htable_unlink_entry() to the one in
htable_link_entry() saving a block indent and making it more clear
in which case we do not do any work.

No functional change.

MFC after:	3 weeks
Sponsored by:	Netflix
2019-11-15 23:12:19 +00:00
Bjoern A. Zeeb
9744f55686 if_llatbl: cleanup
Remove function prototypes which are not needed (no use before function
definition for these file static functions).

MFC after:	3 weeks
Sponsored by:	Netflix
2019-11-15 11:00:03 +00:00
Gleb Smirnoff
9352fab6ab In if_siocaddmulti() enter VNET.
Reported & tested by:	garga
2019-11-13 16:28:53 +00:00
Bjoern A. Zeeb
8b5f9bb755 lltabl: remove dead code
Remove the long (8? years ago) #if 0 marked function lltable_drain() and
while here also remove the unused function llentry_alloc() which has call
paths tools keep finding and are never used.

Sponsored by:	Netflix
2019-11-13 11:21:02 +00:00
Gleb Smirnoff
8a9a28c455 sysctl_rtsock() has all necessary locking and doesn't need Giant to run.
While here add description.
2019-11-07 19:06:18 +00:00
Andrey V. Elsukov
a961401ee0 Enqueue lladdr_task to update link level address of vlan, when its parent
interface has changed.

During vlan reconfiguration without destroying interface, it is possible,
that parent interface will be changed. This usually means, that link
layer address of vlan will be different. Therefore we need to update all
associated with vlan's addresses permanent llentries - NDP for IPv6
addresses, and ARP for IPv4 addresses. This is done via lladdr_task
execution. To avoid extra work, before execution do the check, that L2
address is different.

No objection from:	#network
Obtained from:	Yandex LLC
MFC after:	1 week
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D22243
2019-11-07 15:00:37 +00:00
Eric Joyner
74954211d6 net: add ETHER_IS_ZERO macro similar to ETHER_IS_BROADCAST
Some places in network code may need to verify that an ethernet address
is not the 'zero' address. Provide a standard macro ETHER_IS_ZERO for
this purpose, similar to the ETHER_IS_BROADCAST macro already available.

This patch also removes previous ETHER_IS_ZERO definitions in several
USB ethernet drivers, in favor of this centrally-located macro.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>

Submitted by:	Jacob Keller <jacob.e.keller@intel.com>
Reviewed by:	erj@
Sponsored by:	Intel Corporation
Differential Revision:	https://reviews.freebsd.org/D21240
2019-11-05 00:12:21 +00:00
Eric Joyner
db8e8f1ede iflib: properly release memory allocated for DMA
DMA memory allocations using the bus_dma.h interface are not properly
released in all cases for both Tx and Rx. This causes ~448 bytes of
M_DEVBUF allocations to be leaked.

First, the DMA maps for Rx are not properly destroyed. A slight attempt
is made in iflib_fl_bufs_free to destroy the maps if we're detaching.
However, this function may not be reliably called during detach. Indeed,
there is a comment "asking" if this should be moved out.

Fix this by moving the bus_dmamap_destroy call into iflib_rx_sds_free,
where we already sync and unload the DMA.

Second, the DMA tag associated with the ifr_ifdi descriptor DMA is not
released properly anywhere. Add a call to iflib_dma_free in
iflib_rx_structures_free.

Third, use of NULL as a canary value on the map pointer returned by
bus_dmamap_create is not valid. On some platforms, notably x86, this
value may be NULL. In this case, we fail to properly release the related
resources.

Remove the NULL checks on map values in both iflib_fl_bufs_free and
iflib_txsd_destroy.

With all of these fixes applied, the leaks to M_DEVBUF are squelched,
and iflib drivers now seem to properly cleanup when detaching.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>

Submitted by:	Jacob Keller <jacob.e.keller@intel.com>
Reviewed by:	erj@, gallatin@
MFC after:	1 week
Sponsored by:	Intel Corporation
Differential Revision:	https://reviews.freebsd.org/D22203
2019-11-04 23:06:57 +00:00
Vincenzo Maffione
9dd7582c12 netmap: fix build issue in netmap_user.h
The issue was a comparison of integers of different signs
on 32 bit architectures.

Reported by:	jenkins
MFC after:	1 week
2019-10-31 22:16:20 +00:00
Vincenzo Maffione
c7c7805531 add valectl to the system commands
The valectl(4) program is used to manage vale(4) switches.
Add it to the system commands so that it can be used right away.
This program was previously called vale-ctl, and stored in
tools/tools/netmap

Reviewed by:	hrs, bcr, lwhsu, kevans
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D22146
2019-10-31 21:01:34 +00:00
Eric Joyner
244e7cffa5 iflib: cleanup memory leaks on driver detach
From Jake:
The iflib stack failed to release all of the memory allocated under
M_IFLIB during device detach.

Specifically, the ifmp_ring, the ift_ifdi Tx DMA info, and the ifr_ifdi Rx
DMA info were not being released.

Release this memory so that iflib won't leak memory when a device
detaches.

Since we're freeing the ift_ifdi pointer during iflib_txq_destroy we
need to call this only after iflib_dma_free in iflib_tx_structures_free.

Additionally, also ensure that we destroy the callout mutex associated
with each Tx queue when we free it.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>

Submitted by:	Jacob Keller <jacob.e.keller@intel.com>
Reviewed by:	erj@, gallatin@
MFC after:	1 week
Sponsored by:	Intel Corporation
Differential Revision:	https://reviews.freebsd.org/D22157
2019-10-30 20:45:12 +00:00
Gleb Smirnoff
0839aa5c04 There is a long standing problem with multicast programming for NICs
and IPv6.  With IPv6 we may call if_addmulti() in context of processing
of an incoming packet.  Usually this is interrupt context.  While most
of the NIC drivers are able to reprogram multicast filters without
sleeping, some of them can't.  An example is e1000 family of drivers.
With iflib conversion the problem was somewhat hidden.  Iflib processes
packets in private taskqueue, so going to sleep doesn't trigger an
assertion.  However, the sleep would block operation of the driver and
following incoming packets would fill the ring and eventually would
start being dropped.  Enabling epoch for the full time of a packet
processing again started to trigger assertions for e1000.

Fix this problem once and for all using a general taskqueue to call
if_ioctl() method in all cases when if_addmulti() is called in a
non sleeping context.  Note that nobody cares about returned value.

Reviewed by:	hselasky, kib
Differential Revision:	  https://reviews.freebsd.org/D22154
2019-10-29 17:36:06 +00:00
Eric Joyner
1558015e3e iflib: call ether_ifdetach and netmap_detach before stop
From Jake:
Calling ether_ifdetach after iflib_stop leads to a potential race where
a stale ifp pointer can remain in the route entry list for IPv6 traffic.
This will potentially cause a page fault or other system instability if
the ifp pointer is accessed.

Move both iflib_netmap_detach and ether_ifdetach to be called prior to
iflib_stop. This avoids the race above, and helps ensure that other ifp
references are removed before stopping the interface.

Submitted by:	Jacob Keller <jacob.e.keller@intel.com>
Reviewed by:	erj@, gallatin@, jhb@
MFC after:	3 days
Sponsored by:	Intel Corporation
Differential Revision:	https://reviews.freebsd.org/D22071
2019-10-23 23:20:49 +00:00
Conrad Meyer
65366903c3 Prevent a panic when a driver provides bogus debugnet parameters
This is just a bandaid; we should fix the driver(s) too.  Introduced in
r353685.

PR:		241403
X-MFC-With:	r353685
Reported by:	np and others
2019-10-23 16:48:22 +00:00
Kyle Evans
f7810883d4 tuntap(4): Fix NOINET build after r353741
Shuffle headers around to more appropriate #ifdef OPTION blocks (INET vs.
INET6) -- double checked LINT-{NOINET,NOINET6,NOIP}, all seem good.

Reported by:	cem
2019-10-23 02:15:15 +00:00
Kyle Evans
200abb43c0 tuntap(4): properly declare if_tun and if_tap modules
Simply adding MODULE_VERSION does not do the trick, because the modules
haven't been declared. This should actually fix modfind/kldstat, which
r351229 aimed and failed to do.

This should make vm-bhyve do the right thing again when using the ports
version, rather than the latest version not in ports.

MFC after:	3 days
2019-10-22 00:18:16 +00:00
Gleb Smirnoff
19e09f447f Remove obsoleted KPIs that were used to access interface address lists. 2019-10-21 18:17:03 +00:00
Kyle Evans
3d5013337a tuntap(4): restrict scope of net.link.tap.user_open slightly
net.link.tap.user_open has historically allowed non-root users to do devfs
cloning and open /dev/tap* nodes based on permissions. Loosen this up to
make it only allow users to do devfs cloning -- we no longer check it in
tunopen.

This allows tap devices to be created that can actually be opened by a user,
rather than swiftly restricting them to root because the magic sysctl has
not been set.

The sysctl has not yet been completely deprecated, because more thought is
needed for how to handle the devfs cloning case. There is not an easy
suitable replacement for the sysctl there, and more care needs to be placed
in determining whether that's OK or not.

PR:		200185
2019-10-21 14:38:11 +00:00
Kyle Evans
6025077704 tuntap(4): use cdevpriv w/ dtor for last close instead of d_close
cdevpriv dtors will be called when the reference count on the associated
struct file drops to 0, while d_close can be unreliable for cleaning up
state at "last close" for a number of reasons. As far as tunclose/tundtor is
concerned the difference is minimal, so make the switch.
2019-10-20 22:55:47 +00:00
Kyle Evans
6869d530c7 tuntap(4): Use make_dev_s to avoid si_drv1 race
This allows us to avoid some dance in tunopen for dealing with the
possibility of dev->si_drv1 being NULL as it's set prior to the devfs node
being created in all cases.

There's still the possibility that the tun device hasn't been fully
initialized, since that's done after the devfs node was created. Alleviate
this by returning ENXIO if we're not to that point of tuncreate yet.

This work is what sparked r353128, full initialization of cloned devices
w/ specified make_dev_args.
2019-10-20 22:39:40 +00:00
Kyle Evans
486c0b2269 tuntap(4): break out after setting TUN_DSTADDR
This is now the only flag we set in this loop, terminate early.
2019-10-20 21:06:25 +00:00
Kyle Evans
6041d76e0c tuntap(4): Drop TUN_IASET
This flag appears to have been effectively unused since introduction to
if_tun(4) -- drop it now.
2019-10-20 21:03:48 +00:00
Michael Tuexen
4ad8cb6813 Fix compile issues when building a kernel without the VIMAGE option.
Thanks to cem@ for discussing the issue which resulted in this patch.

Reviewed by:		cem@
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D22089
2019-10-19 20:48:53 +00:00
Conrad Meyer
0634308df2 Fix debugnet(4) link/build fallout on some configurations
Introduced in r353685 (sys/conf/files), r353694 (debugnet.c db_printf).

Submitted by:	kevans
Reported by:	cy
X-MFC-With:	r353685, r353694
2019-10-18 22:03:36 +00:00
Vincenzo Maffione
f8bc74e2f4 tap: add support for virtio-net offloads
This patch is part of an effort to make bhyve networking (in particular TCP)
faster. The key strategy to enhance TCP throughput is to let the whole packet
datapath work with TSO/LRO packets (up to 64KB each), so that the per-packet
overhead is amortized over a large number of bytes.
This capability is supported in the guest by means of the vtnet(4) driver,
which is able to handle TSO/LRO packets leveraging the virtio-net header
(see struct virtio_net_hdr and struct virtio_net_hdr_mrg_rxbuf).
A bhyve VM exchanges packets with the host through a network backend,
which can be vale(4) or if_tap(4).
While vale(4) supports TSO/LRO packets, if_tap(4) does not.
This patch extends if_tap(4) with the ability to understand the virtio-net
header, so that a tapX interface can process TSO/LRO packets.
A couple of ioctl commands have been added to configure and probe the
virtio-net header. Once the virtio-net header is set, the tapX interface
acquires all the IFCAP capabilities necessary for TSO/LRO.

Reviewed by:	kevans
Differential Revision:	https://reviews.freebsd.org/D21263
2019-10-18 21:53:27 +00:00
Gleb Smirnoff
fda454099b Make rt_getifa_fib() static. 2019-10-18 15:20:24 +00:00
Conrad Meyer
dda17b3672 Implement NetGDB(4)
NetGDB(4) is a component of a system using a panic-time network stack to
remotely debug crashed FreeBSD kernels over the network, instead of
traditional serial interfaces.

There are three pieces in the complete NetGDB system.

First, a dedicated proxy server must be running to accept connections from
both NetGDB and gdb(1), and pass bidirectional traffic between the two
protocols.

Second, the NetGDB client is activated much like ordinary 'gdb' and
similarly to 'netdump' in ddb(4) after a panic.  Like other debugnet(4)
clients (netdump(4)), the network interface on the route to the proxy server
must be online and support debugnet(4).

Finally, the remote (k)gdb(1) uses 'target remote <proxy>:<port>' (like any
other TCP remote) to connect to the proxy server.

The NetGDB v1 protocol speaks the literal GDB remote serial protocol, and
uses a 1:1 relationship between GDB packets and sequences of debugnet
packets (fragmented by MTU).  There is no encryption utilized to keep
debugging sessions private, so this is only appropriate for local
segments or trusted networks.

Submitted by:	John Reimer <john.reimer AT emc.com> (earlier version)
Discussed some with:	emaste, markj
Relnotes:	sure
Differential Revision:	https://reviews.freebsd.org/D21568
2019-10-17 21:33:01 +00:00
Conrad Meyer
e9c6962599 debugnet(4): Add optional full-duplex mode
It remains unattached to any client protocol.  Netdump is unaffected
(remaining half-duplex).  The intended consumer is NetGDB.

Submitted by:	John Reimer <john.reimer AT emc.com> (earlier version)
Discussed with:	markj
Differential Revision:	https://reviews.freebsd.org/D21541
2019-10-17 20:25:15 +00:00
Gleb Smirnoff
b2807792f1 Revert two parts of r353292 that enter epoch when processing vlan capabilities.
It could be that entering epoch isn't necessary here, but better take a
conservative approach.

Submitted by:	kp
2019-10-17 20:18:07 +00:00
Conrad Meyer
fde2cf65ce debugnet(4): Infer non-server connection parameters
Loosen requirements for connecting to debugnet-type servers.  Only require a
destination address; the rest can theoretically be inferred from the routing
table.

Relax corresponding constraints in netdump(4) and move ifp validation to
debugnet connection time.

Submitted by:	John Reimer <john.reimer AT emc.com> (earlier version)
Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D21482
2019-10-17 20:10:32 +00:00
Conrad Meyer
8270d35eca Add ddb(4) 'netdump' command to netdump a core without preconfiguration
Add a 'X -s <server> -c <client> [-g <gateway>] -i <interface>' subroutine
to the generic debugnet code.  The imagined use is both netdump, shown here,
and NetGDB (vaporware).  It uses the ddb(4) lexer, with some new extensions,
to parse out IPv4 addresses.

'Netdump' uses the generic debugnet routine to load a configuration and
start a dump, without any netdump configuration prior to panic.

Loosely derived from work by:	John Reimer <john.reimer AT emc.com>
Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D21460
2019-10-17 19:49:20 +00:00
Conrad Meyer
6d567ec2da debugnet: Respond to broadcast ARP requests
The in-tree netdump code has always ignored non-directed ARP requests, and
that seems to work most of the time for netdump.

In my work and testing on NetGDB, it seems like sometimes the remote FreeBSD
conversant (the non-panic system) will send broadcast-destination ARP
requests to the debugnet kernel; without this change, those are dropped and
the remote will see EHOSTDOWN "Host is down" errors from the userspace
interface of the network stack.

Discussed with:	markj
2019-10-17 17:48:32 +00:00
Conrad Meyer
d39756c142 debugnet(4): Check hardware-validated UDP checksums
Similar to INET checksums, lazily validate UDP checksums when the driver has
already performed the check for us.  Like debugnet(4) INET checksums,
validation in software is left as future work.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D21745
2019-10-17 17:19:16 +00:00
Conrad Meyer
7790c8c199 Split out a more generic debugnet(4) from netdump(4)
Debugnet is a simplistic and specialized panic- or debug-time reliable
datagram transport.  It can drive a single connection at a time and is
currently unidirectional (debug/panic machine transmit to remote server
only).

It is mostly a verbatim code lift from netdump(4).  Netdump(4) remains
the only consumer (until the rest of this patch series lands).

The INET-specific logic has been extracted somewhat more thoroughly than
previously in netdump(4), into debugnet_inet.c.  UDP-layer logic and up, as
much as possible as is protocol-independent, remains in debugnet.c.  The
separation is not perfect and future improvement is welcome.  Supporting
INET6 is a long-term goal.

Much of the diff is "gratuitous" renaming from 'netdump_' or 'nd_' to
'debugnet_' or 'dn_' -- sorry.  I thought keeping the netdump name on the
generic module would be more confusing than the refactoring.

The only functional change here is the mbuf allocation / tracking.  Instead
of initiating solely on netdump-configured interface(s) at dumpon(8)
configuration time, we watch for any debugnet-enabled NIC for link
activation and query it for mbuf parameters at that time.  If they exceed
the existing high-water mark allocation, we re-allocate and track the new
high-water mark.  Otherwise, we leave the pre-panic mbuf allocation alone.
In a future patch in this series, this will allow initiating netdump from
panic ddb(4) without pre-panic configuration.

No other functional change intended.

Reviewed by:	markj (earlier version)
Some discussion with:	emaste, jhb
Objection from:	marius
Differential Revision:	https://reviews.freebsd.org/D21421
2019-10-17 16:23:03 +00:00
Philip Paeps
579b70db89 ether: add older ethertype definitions for QinQ
Older network equipment used the ethertypes 0x9100, 0x9200, and 0x9300 for
outer VLANs, before standardisation introduced 0x88a8.

Submitted by:	 Lutz Donnerhacke <lutz_donnerhacke.de>
Differential Revision:	https://reviews.freebsd.org/D21846
2019-10-17 00:34:53 +00:00
Gleb Smirnoff
b46d70fd88 do_link_state_change() is executed in taskqueue context and in
general is allowed to sleep.  Don't enter the epoch for the
whole duration.  If some event handlers need the epoch, they
should handle that theirselves.

Discussed with:	hselasky
2019-10-16 16:32:58 +00:00
Hans Petter Selasky
270b83b9d1 The two functions ifnet_byindex() and ifnet_byindex_locked() are exactly the
same after the network stack was epochified. Merge the two into one function
and cleanup all uses of ifnet_byindex_locked().

While at it:
- Add branch prediction macros.
- Make sure the ifnet pointer is only deferred once,
  also when code optimisation is disabled.

Sponsored by:	Mellanox Technologies
2019-10-15 12:08:09 +00:00
Hans Petter Selasky
93cfeb0ed9 Exclude the network link eventhandler from epochification after r353292.
This fixes the following assert when "options RATELIMIT" is used:
panic()
malloc()
sysctl_add_oid()
tcp_rl_ifnet_link()
do_link_state_change()
taskqueue_run_locked()

Sponsored by:	Mellanox Technologies
2019-10-15 11:20:16 +00:00
Gleb Smirnoff
416a1d1e70 if_delmulti() is never called without ifp argument, assert this instead
of doing a useless search through interfaces.
2019-10-14 21:18:37 +00:00
Michael Tuexen
69104ebe0f Add missing include which breaks builds without VIMAGE.
The bug was introduced by me in r353480.

Reported by:		Michael Butler
MFC after:		3 days
2019-10-13 19:58:37 +00:00
Michael Tuexen
d6e23cf0cf Use an event handler to notify the SCTP about IP address changes
instead of calling an SCTP specific function from the IP code.
This is a requirement of supporting SCTP as a kernel loadable module.
This patch was developed by markj@, I tweaked a bit the SCTP related
code.

Submitted by:		markj@
MFC after:		3 days
2019-10-13 18:17:08 +00:00
Gleb Smirnoff
6dcec895d9 vlan_config() isn't always called in epoch context.
Reported by:	kp
2019-10-13 15:15:09 +00:00
Gleb Smirnoff
45c1d51c39 Don't use if_maddr_rlock() in sppp(4), use epoch(9) directly instead. 2019-10-10 23:54:37 +00:00
Gleb Smirnoff
73c96bbeac Don't use if_maddr_rlock() in tuntap(4), use epoch(9) directly instead. 2019-10-10 23:51:14 +00:00
Gleb Smirnoff
4b24e5b1ef Interface output method must be executed in network epoch, so if_addr_rlock()
isn't needed here.
2019-10-10 23:50:32 +00:00
Gleb Smirnoff
fb3fc771f6 Add two extra functions that basically give count of addresses
on interface.  Such function could been implemented on top of
the if_foreach_llm?addr(), but several drivers need counting,
so avoid copy-n-paste inside the drivers.
2019-10-10 23:44:56 +00:00
Gleb Smirnoff
826857c833 Provide new KPI for network drivers to access lists of interface
addresses.  The KPI doesn't reveal neither how addresses are stored,
how the access to them is synchronized, neither reveal struct ifaddr
and struct ifmaddr.

Reviewed by:	gallatin, erj, hselasky, philip, stevek
Differential Revision:	https://reviews.freebsd.org/D21943
2019-10-10 23:42:55 +00:00
Gleb Smirnoff
caeeeaa7c5 ifnet_byindex_ref() requires network epoch. 2019-10-09 16:21:50 +00:00
Gleb Smirnoff
1e80e4f26c Remove epoch assertion from if_setlladdr(). Originally this function was
protected by IF_ADDR_LOCK(), which was a mutex, so that two simultaneous
if_setlladdr() can't execute. Later it was switched to IF_ADDR_RLOCK(),
likely by a mistake. Later it was switched to NET_EPOCH_ENTER(). Then I
incorrectly added NET_EPOCH_ASSERT() here.

In reality ifp->if_addr never goes away and never changes its length. So,
doing bcopy() in it is always "safe", meaning it won't dereference a wrong
pointer or write into someone's else memory. Of course doing two bcopy() in
parallel would result in a mess of two addresses, but net epoch doesn't
protect against that, neither IF_ADDR_RLOCK() did.

So for now, just remove the assertion and leave for later a proper fix.

Reported by:	markj
2019-10-08 17:55:45 +00:00
Gleb Smirnoff
e9dc46cc30 In DIAGNOSTIC block of if_delmulti_ifma_flags() enter the network epoch.
This quickly plugs the regression from r353292. The locking of multicast
definitely needs a broader review today...

Reported by:	pho, dhw
2019-10-08 16:45:56 +00:00
Hans Petter Selasky
a362cf527e Fix regression issue after r353274:
Make sure the vnet_shutdown field is not set until after all
VNET_SYSUNINIT()'s in the SI_SUB_VNET_DONE subsystem have been
executed. Especially the vnet_if_return() functions requires that
if_move() is still operational.

Reported by:	lwhsu@
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2019-10-08 11:06:24 +00:00
Gleb Smirnoff
b8a6e03fac Widen NET_EPOCH coverage.
When epoch(9) was introduced to network stack, it was basically
dropped in place of existing locking, which was mutexes and
rwlocks. For the sake of performance mutex covered areas were
as small as possible, so became epoch covered areas.

However, epoch doesn't introduce any contention, it just delays
memory reclaim. So, there is no point to minimise epoch covered
areas in sense of performance. Meanwhile entering/exiting epoch
also has non-zero CPU usage, so doing this less often is a win.

Not the least is also code maintainability. In the new paradigm
we can assume that at any stage of processing a packet, we are
inside network epoch. This makes coding both input and output
path way easier.

On output path we already enter epoch quite early - in the
ip_output(), in the ip6_output().

This patch does the same for the input path. All ISR processing,
network related callouts, other ways of packet injection to the
network stack shall be performed in net_epoch. Any leaf function
that walks network configuration now asserts epoch.

Tricky part is configuration code paths - ioctls, sysctls. They
also call into leaf functions, so some need to be changed.

This patch would introduce more epoch recursions (see EPOCH_TRACE)
than we had before. They will be cleaned up separately, as several
of them aren't trivial. Note, that unlike a lock recursion the
epoch recursion is safe and just wastes a bit of resources.

Reviewed by:	gallatin, hselasky, cy, adrian, kristof
Differential Revision:	https://reviews.freebsd.org/D19111
2019-10-07 22:40:05 +00:00
Hans Petter Selasky
4715738b12 Compile time assert a valid subsystem for all VNET init and uninit functions.
Using VNET init and uninit functions outside the given range has undefined
behaviour.

MFC after:	1 week
Sponsored by:	Mellanox Technologies
2019-10-07 14:24:59 +00:00
Hans Petter Selasky
204e2f30d9 Factor out VNET shutdown check into an own vnet structure field.
Remove the now obsolete vnet_state field. This greatly simplifies the
detection of VNET shutdown and avoids code duplication.

Discussed with:	bz@
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2019-10-07 14:15:41 +00:00
Kyle Evans
291287667c tuntap(4): loosen up tunclose restrictions
Realistically, this cannot work. We don't allow the tun to be opened twice,
so it must be done via fd passing, fork, dup, some mechanism like these.
Applications demonstrably do not enforce strict ordering when they're
handing off tun devices, so the parent closing before the child will easily
leave the tun/tap device in a bad state where it can't be destroyed and a
confused user because they did nothing wrong.

Concede that we can't leave the tun/tap device in this kind of state because
of software not playing the TUNSIFPID game, but it is still good to find and
fix this kind of thing to keep ifconfig(8) up-to-date and help ensure good
discipline in tun handling.

MFC after:	 3 days
2019-10-04 13:43:07 +00:00
Kyle Evans
59997c3c46 if_tuntap: create /dev aliases when a tuntap device gets renamed
Currently, if you do:

$ ifconfig tun0 create
$ ifconfig tun0 name wg0
$ ls -l /dev | egrep 'wg|tun'

You will see tun0, but no wg0. In fact, it's slightly more annoying to make
the association between the new name and the old name in order to open the
device (if it hadn't been opened during the rename).

Register an eventhandler for ifnet_arrival_events and catch interface
renames. We can determine if the ifnet is a tun easily enough from the
if_dname, which matches the cevsw.d_name from the associated tuntap_driver.

Some locking dance is required because renames don't require the device to
be opened, so it could go away in the middle of handling the ioctl, but as
soon as we've verified this isn't the case we can attempt to busy the tun
and either bail out if the tun device is dying, or we can proceed with the
rename.

We only create these aliases on a best-effort basis. Renaming a tun device
to "usbctl", which doesn't exist as an ifnet but does as a /dev, is clearly
not that disastrous, but we can't and won't create a /dev for that.
2019-10-03 17:54:00 +00:00
Kyle Evans
c4cad1549e if_tuntap: add a busy/unbusy mechanism, replace destroy OPEN check
A future commit will create device aliases when a tuntap device is renamed
so that it's still easily found in /dev after the rename.  Said mechanism
will want to keep the tun alive long enough to either realize that it's
about to go away or complete the alias creation, even if the alias is about
to get destroyed.

While we're introducing it, using it to prevent open devices from going away
makes plenty of sense and keeps the logic on waking up tun_destroy clean, so
we don't have multiple places trying to cv_broadcast unless it's still in
use elsewhere.
2019-10-03 17:46:27 +00:00
Mark Johnston
4166913371 Add IFLIB_SINGLE_IRQ_RX_ONLY.
As of r347221 the iflib legacy interrupt mode setup assumes that drivers
perform both receive and transmit processing from the interrupt handler.
This assumption is invalid in the vmxnet3 driver, so introduce the
IFLIB_SINGLE_IRQ_RX_ONLY flag to make iflib avoid tx processing in the
interrupt handler.

PR:		239118
Reported and tested by:	Juraj Lutter <otis@sk.freebsd.org>
Obtained from:	marius
Reviewed by:	gallatin
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D21831
2019-09-30 15:59:07 +00:00
Andrew Gallatin
6554362c66 kTLS support for TLS 1.3
TLS 1.3 requires a few changes because 1.3 pretends to be 1.2
with a record type of application data. The "real" record type is
then included at the end of the user-supplied plaintext
data. This required adding a field to the mbuf_ext_pgs struct to
save the record type, and passing the real record type to the
sw_encrypt() ktls backend functions.

Reviewed by:	jhb, hselasky
Sponsored by:	Netflix
Differential Revision:	D21801
2019-09-27 19:17:40 +00:00
Gleb Smirnoff
bf7700e44f style(9): remove extraneous empty lines 2019-09-25 20:46:09 +00:00
Gleb Smirnoff
dd902d015a Add debugging facility EPOCH_TRACE that checks that epochs entered are
properly nested and warns about recursive entrances.  Unlike with locks,
there is nothing fundamentally wrong with such use, the intent of tracer
is to help to review complex epoch-protected code paths, and we mean the
network stack here.

Reviewed by:	hselasky
Sponsored by:	Netflix
Pull Request:	https://reviews.freebsd.org/D21610
2019-09-25 18:26:31 +00:00
Eric Joyner
53b5b9b049 iflib: Remove redundant VLAN events deregistration
From Piotr:
r351152 introduced iflib_deregister() function calling
EVENTHANDLER_DEREGISTER() to unregister VLAN events. This patch removes
duplicate of EVENTHANDLER_DEREGISTER() calls placed in
iflib_device_deregister() as this function is now calling
iflib_deregister(). This is to avoid deregistering same event twice.

This patch also adds check in iflib_vlan_register() to prevent
registering VLAN while being in detach.

Patch co-authored by Krzysztof Galazka <krzysztof.galazka@intel.com>,
erj <erj@FreeBSD.org> and Jacob Keller <jacob.e.keller@intel.com>.

Signed-off-by: Piotr Pietruszewski <piotr.pietruszewski@intel.com>

Submitted by:	Piotr Pietruszewski <piotr.pietruszewski@intel.com>
Reviewed by:	gallatin@, erj@
MFC after:	3 days
Sponsored by:	Intel Corporation
Differential Revision:	https://reviews.freebsd.org/D21711
2019-09-24 17:03:31 +00:00
Konstantin Belousov
247cf5664e Add SIOCGIFDOWNREASON.
The ioctl(2) is intended to provide more details about the cause of
the down for the link.

Eventually we might define a comprehensive list of codes for the
situations.  But interface also allows the driver to provide free-form
null-terminated ASCII string to provide arbitrary non-formalized
information.  Sample implementation exists for mlx5(4), where the
string is fetched from firmware controlling the port.

Reviewed by:	hselasky, rrs
Sponsored by:	Mellanox Technologies
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D21527
2019-09-17 18:49:13 +00:00
Kyle Evans
40b1c921bd SIOCSIFNAME: Do nothing if we're not actually changing
Instead of throwing EEXIST, just succeed if the name isn't actually
changing. We don't need to trigger departure or any of that because there's
no change from consumers' perspective.

PR:		240539
Reviewed by:	brooks
MFC after:	5 days
Differential Revision:	https://reviews.freebsd.org/D21618
2019-09-12 15:36:48 +00:00
Hans Petter Selasky
180fecd5b6 Callout drain does not have to be followed by a callout stop call.
Fix bogus code.

MFC after:	1 week
Sponsored by:	Mellanox Technologies
2019-09-10 14:33:07 +00:00
Li-Wen Hsu
4835262b68 Fix build for the platforms where db_expr_t is not long
Sponsored by:	The FreeBSD Foundation
2019-09-10 08:51:11 +00:00
Conrad Meyer
0b12ab8111 Appease Clang false-positive Werrors in r352112
Reported by:	bcran
2019-09-10 01:56:47 +00:00
Conrad Meyer
8b6acd2b51 ddb(4): Add 'show route <dest>' and 'show routetable [<af>]'
These commands show the route resolved for a specified destination, or
print out the entire routing table for a given address family (or all
families, if none is explicitly provided).

Discussed with:	emaste
Differential Revision:	https://reviews.freebsd.org/D21510
2019-09-09 22:54:27 +00:00
Mark Johnston
fee2a2fa39 Change synchonization rules for vm_page reference counting.
There are several mechanisms by which a vm_page reference is held,
preventing the page from being freed back to the page allocator.  In
particular, holding the page's object lock is sufficient to prevent the
page from being freed; holding the busy lock or a wiring is sufficent as
well.  These references are protected by the page lock, which must
therefore be acquired for many per-page operations.  This results in
false sharing since the page locks are external to the vm_page
structures themselves and each lock protects multiple structures.

Transition to using an atomically updated per-page reference counter.
The object's reference is counted using a flag bit in the counter.  A
second flag bit is used to atomically block new references via
pmap_extract_and_hold() while removing managed mappings of a page.
Thus, the reference count of a page is guaranteed not to increase if the
page is unbusied, unmapped, and the object's write lock is held.  As
a consequence of this, the page lock no longer protects a page's
identity; operations which move pages between objects are now
synchronized solely by the objects' locks.

The vm_page_wire() and vm_page_unwire() KPIs are changed.  The former
requires that either the object lock or the busy lock is held.  The
latter no longer has a return value and may free the page if it releases
the last reference to that page.  vm_page_unwire_noq() behaves the same
as before; the caller is responsible for checking its return value and
freeing or enqueuing the page as appropriate.  vm_page_wire_mapped() is
introduced for use in pmap_extract_and_hold().  It fails if the page is
concurrently being unmapped, typically triggering a fallback to the
fault handler.  vm_page_wire() no longer requires the page lock and
vm_page_unwire() now internally acquires the page lock when releasing
the last wiring of a page (since the page lock still protects a page's
queue state).  In particular, synchronization details are no longer
leaked into the caller.

The change excises the page lock from several frequently executed code
paths.  In particular, vm_object_terminate() no longer bounces between
page locks as it releases an object's pages, and direct I/O and
sendfile(SF_NOCACHE) completions no longer require the page lock.  In
these latter cases we now get linear scalability in the common scenario
where different threads are operating on different files.

__FreeBSD_version is bumped.  The DRM ports have been updated to
accomodate the KPI changes.

Reviewed by:	jeff (earlier version)
Tested by:	gallatin (earlier version), pho
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20486
2019-09-09 21:32:42 +00:00
Vincenzo Maffione
253b2ec199 netmap: import changes from upstream (SHA 137f537eae513)
- Rework option processing.
 - Use larger integers for memory size values in the
   memory management code.

MFC after:	2 weeks
2019-09-01 14:47:41 +00:00
Matt Joras
16cf6bdbb6 Wrap a vlan's parent's if_output in a separate function.
When a vlan interface is created, its if_output is set directly to the
parent interface's if_output. This is fine in the normal case but has an
unfortunate consequence if you end up with a certain combination of vlan
and lagg interfaces.

Consider you have a lagg interface with a single laggport member. When
an interface is added to a lagg its if_output is set to
lagg_port_output, which blackholes traffic from the normal networking
stack but not certain frames from BPF (pseudo_AF_HDRCMPLT). If you now
create a vlan with the laggport member (not the lagg interface) as its
parent, its if_output is set to lagg_port_output as well. While this is
confusing conceptually and likely represents a misconfigured system, it
is not itself a problem. The problem arises when you then remove the
lagg interface. Doing this resets the if_output of the laggport member
back to its original state, but the vlan's if_output is left pointing to
lagg_port_output. This gives rise to the possibility that the system
will panic when e.g. bpf is used to send any frames on the vlan
interface.

Fix this by creating a new function, vlan_output, which simply wraps the
parent's current if_output. That way when the parent's if_output is
restored there is no stale usage of lagg_port_output.

Reviewed by:	rstone
Differential Revision:	D21209
2019-08-30 20:19:43 +00:00
John Baldwin
b2e60773c6 Add kernel-side support for in-kernel TLS.
KTLS adds support for in-kernel framing and encryption of Transport
Layer Security (1.0-1.2) data on TCP sockets.  KTLS only supports
offload of TLS for transmitted data.  Key negotation must still be
performed in userland.  Once completed, transmit session keys for a
connection are provided to the kernel via a new TCP_TXTLS_ENABLE
socket option.  All subsequent data transmitted on the socket is
placed into TLS frames and encrypted using the supplied keys.

Any data written to a KTLS-enabled socket via write(2), aio_write(2),
or sendfile(2) is assumed to be application data and is encoded in TLS
frames with an application data type.  Individual records can be sent
with a custom type (e.g. handshake messages) via sendmsg(2) with a new
control message (TLS_SET_RECORD_TYPE) specifying the record type.

At present, rekeying is not supported though the in-kernel framework
should support rekeying.

KTLS makes use of the recently added unmapped mbufs to store TLS
frames in the socket buffer.  Each TLS frame is described by a single
ext_pgs mbuf.  The ext_pgs structure contains the header of the TLS
record (and trailer for encrypted records) as well as references to
the associated TLS session.

KTLS supports two primary methods of encrypting TLS frames: software
TLS and ifnet TLS.

Software TLS marks mbufs holding socket data as not ready via
M_NOTREADY similar to sendfile(2) when TLS framing information is
added to an unmapped mbuf in ktls_frame().  ktls_enqueue() is then
called to schedule TLS frames for encryption.  In the case of
sendfile_iodone() calls ktls_enqueue() instead of pru_ready() leaving
the mbufs marked M_NOTREADY until encryption is completed.  For other
writes (vn_sendfile when pages are available, write(2), etc.), the
PRUS_NOTREADY is set when invoking pru_send() along with invoking
ktls_enqueue().

A pool of worker threads (the "KTLS" kernel process) encrypts TLS
frames queued via ktls_enqueue().  Each TLS frame is temporarily
mapped using the direct map and passed to a software encryption
backend to perform the actual encryption.

(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if
someone wished to make this work on architectures without a direct
map.)

KTLS supports pluggable software encryption backends.  Internally,
Netflix uses proprietary pure-software backends.  This commit includes
a simple backend in a new ktls_ocf.ko module that uses the kernel's
OpenCrypto framework to provide AES-GCM encryption of TLS frames.  As
a result, software TLS is now a bit of a misnomer as it can make use
of hardware crypto accelerators.

Once software encryption has finished, the TLS frame mbufs are marked
ready via pru_ready().  At this point, the encrypted data appears as
regular payload to the TCP stack stored in unmapped mbufs.

ifnet TLS permits a NIC to offload the TLS encryption and TCP
segmentation.  In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS)
is allocated on the interface a socket is routed over and associated
with a TLS session.  TLS records for a TLS session using ifnet TLS are
not marked M_NOTREADY but are passed down the stack unencrypted.  The
ip_output_send() and ip6_output_send() helper functions that apply
send tags to outbound IP packets verify that the send tag of the TLS
record matches the outbound interface.  If so, the packet is tagged
with the TLS send tag and sent to the interface.  The NIC device
driver must recognize packets with the TLS send tag and schedule them
for TLS encryption and TCP segmentation.  If the the outbound
interface does not match the interface in the TLS send tag, the packet
is dropped.  In addition, a task is scheduled to refresh the TLS send
tag for the TLS session.  If a new TLS send tag cannot be allocated,
the connection is dropped.  If a new TLS send tag is allocated,
however, subsequent packets will be tagged with the correct TLS send
tag.  (This latter case has been tested by configuring both ports of a
Chelsio T6 in a lagg and failing over from one port to another.  As
the connections migrated to the new port, new TLS send tags were
allocated for the new port and connections resumed without being
dropped.)

ifnet TLS can be enabled and disabled on supported network interfaces
via new '[-]txtls[46]' options to ifconfig(8).  ifnet TLS is supported
across both vlan devices and lagg interfaces using failover, lacp with
flowid enabled, or lacp with flowid enabled.

Applications may request the current KTLS mode of a connection via a
new TCP_TXTLS_MODE socket option.  They can also use this socket
option to toggle between software and ifnet TLS modes.

In addition, a testing tool is available in tools/tools/switch_tls.
This is modeled on tcpdrop and uses similar syntax.  However, instead
of dropping connections, -s is used to force KTLS connections to
switch to software TLS and -i is used to switch to ifnet TLS.

Various sysctls and counters are available under the kern.ipc.tls
sysctl node.  The kern.ipc.tls.enable node must be set to true to
enable KTLS (it is off by default).  The use of unmapped mbufs must
also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS.

KTLS is enabled via the KERN_TLS kernel option.

This patch is the culmination of years of work by several folks
including Scott Long and Randall Stewart for the original design and
implementation; Drew Gallatin for several optimizations including the
use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records
awaiting software encryption, and pluggable software crypto backends;
and John Baldwin for modifications to support hardware TLS offload.

Reviewed by:	gallatin, hselasky, rrs
Obtained from:	Netflix
Sponsored by:	Netflix, Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D21277
2019-08-27 00:01:56 +00:00
Kyle Evans
5c4eed8601 tuntap: belatedly add MODULE_VERSION for if_tun and if_tap
When tun/tap were merged, appropriate MODULE_VERSION should have been added
for things like modfind(2) to continue to do the right thing with the old
names.

Reported by:	jhb
2019-08-19 19:01:59 +00:00
Vincenzo Maffione
b5b83671ea if_tuntap: minor improvements
Rewrite a loop to avoid duplicating the exit condition.
Simplify mask processing in tunpoll().
Fix minor typos.

Reviewed by:	kevans, markj
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D21302
2019-08-19 17:23:22 +00:00
Eric Joyner
f4aa9b67eb net: Update SFF-8024 definitions and strings with values from rev 4.6
This will let ifconfig -v's SFF eeprom read functionality recognize more
module types.

Signed-off-by: Eric Joyner <erj@freebsd.org>

Reviewed by:	gallatin@
MFC after:	1 week
Sponsored by:	Intel Corporation
Differential Revision:	https://reviews.freebsd.org/D21041
2019-08-17 00:10:56 +00:00
Eric Joyner
566144142e iflib: add iflib_deregister to help cleanup on exit
Commit message by Jake:
The iflib_register function exists to allocate and setup some common
structures used by both iflib_device_register and iflib_pseudo_register.

There is no associated cleanup function used to undo the steps taken in
this function.

Both iflib_device_deregister and iflib_pseudo_deregister have some of
the necessary steps scattered in their flow. However, most of the
necessary cleanup is not done during the error path of
iflib_device_register and iflib_pseudo_register.

Some examples of missed cleanup include:

the ifp pointer is not free'd during error cleanup
the STATE and CTX locks are not destroyed during error cleanup
the vlan event handlers are not removed during error cleanup
media added to the ifmedia structure is not removed
the kobject reference is never deleted
Additionally, when initializing the kobject class reference counter is
increased even though kobj_init already increases it. This results in
the class never being free'd again because the reference count would
never hit zero even after all driver instances are unloaded.

To aid in proper cleanup, implement an iflib_deregister function that
goes through the reverse steps taken by iflib_register.

Call this function during the error cleanup for iflib_device_register
and iflib_pseudo_register. Additionally call the function in the
iflib_device_deregister and iflib_pseudo_deregister functions near the
end of their flow. This helps reduce code duplication and ensures that
proper steps are taken to cleanup allocations and references in both the
regular and error cleanup flows.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>

Submitted by:	Jacob Keller <jacob.e.keller@intel.com>
Reviewed by:	shurd@, erj@
MFC after:	3 days
Sponsored by:	Intel Corporation
Differential Revision:	https://reviews.freebsd.org/D21005
2019-08-16 23:33:44 +00:00
George V. Neville-Neil
d8dc4e350f Properly validte arguments for route deletion
Reported by: Liang Zhuo brightiup.zhuo@gmail.com
MFC after:	1 week
2019-08-03 14:42:07 +00:00
Eric Joyner
197c679824 iflib: Prevent kernel panic caused by loading driver with a specific interrupt configuration
If a device has only 1 MSI-X interrupt available and does not support either
MSI or legacy interrupts, iflib_device_register() will fail, leak memory and
MSI resources, and the driver will not load. Worse, if another iflib-using
driver tries to unload afterwards, a kernel panic will occur because the
previous failed iflib driver loead did not properly call "taskqgroup_detach()"
during it's cleanup.

This patch is band-aid for this situation -- don't try allocating MSI or legacy
interrupts if a single MSI-X interrupt was allocated, but fail to load instead.
As well, during the cleanup, properly call taskqgroup_detach() on the admin
task to prevent panics when other iflib drivers unload.

This whole interrupt allocation process actually needs re-doing to properly
support devices with only a single MSI-X interrupt, devices that only support
MSI-X, non-PCI devices, and multiple non-MSIX interrupts, as well.

Signed-off-by: Eric Joyner <erj@freebsd.org>

Reviewed by:	marius@
MFC after:	1 week
Sponsored by:	Intel Corporation
Differential Revision:	https://reviews.freebsd.org/D20747
2019-08-01 17:37:25 +00:00
Eric Joyner
6a3f243b04 iflib: remove kobject class reference increment
Commit message from Jake:
In iflib_register, the context is initialized as a kobject using the
device driver's "driver" kobject class. As part of this, the function
mistakenly increments the ref counter.

The ref counter is incremented twice, once in the code directly, and
once again by kobj_class_compile. However, there is no associated
decrement in the detach path. Because of this, the ref counter will
never go back down to zero, and thus the kobject method table will never
be released.

Remove this unnecessary reference count increment.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>

Submitted by:	Jacob Keller <jacob.e.keller@intel.com>
Reviewed by:	jhb@, erj@
MFC after:	3 days
Sponsored by:	Intel Corporation
Differential Revision:	https://reviews.freebsd.org/D21125
2019-08-01 17:28:36 +00:00
Randall Stewart
20abea6663 This adds the third step in getting BBR into the tree. BBR and
an updated rack depend on having access to the new
ratelimit api in this commit.

Sponsored by:	Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D20953
2019-08-01 14:17:31 +00:00
Ed Maste
1082be6554 ppp: correct echo-req magic number on big endian archs
The magic number is a 32-bit quantity; use uint32_t to match hton's
return type and avoid sending zeros (upper 32 bits) on big-endian
architectures.

PR:		184141
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2019-08-01 13:42:58 +00:00
Kyle Evans
0dbac71f19 if_tuntap(4): Add TUNGIFNAME
This effectively just moves TAPGIFNAME into common ioctl territory.

MFC after:	3 days
2019-07-25 22:23:34 +00:00
Eric Joyner
7f3f6aad3e iflib: fix dangling device softc pointer
Commit text by Jake:
If a driver's IFDI_ATTACH_PRE function fails, the iflib_device_register
function will free the ctx pointer. However, it does not reset the
device softc pointer to NULL.

This will result in memory corruption as a future access to the now
invalid pointer will corrupt memory that is later allocated on top of
the same memory location.

The iflib_device_deregister function correctly resets the softc pointer
by using device_set_softc().

This clears up the invalid dangling pointer and prevents memory
corruption that could lead to a panic or undefined behavior if the
device's driver failed to attach.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>

Submitted by:	Jacob Keller <jacob.e.keller@intel.com>
Reviewed by:	erj@, gallatin@
MFC after:	1 week
Sponsored by:	Intel Corporation
Differential Revision:	https://reviews.freebsd.org/D21003
2019-07-24 21:43:41 +00:00
Kirill Ponomarev
b7592822d5 Allow set MTU more than 1500 bytes.
Submitted by:	Alexandr Fedorov <aleksandr.fedorov_itglobal_dot_com>
Approved by:	jhb, rgrimes
Sponsored by:	ITGlobal.com
Differential Revision:	https://reviews.freebsd.org/D19422
2019-07-24 16:10:20 +00:00
Chuck Tuffli
94c15665a5 Fix a typo in r349969
OUI_FRREBSD_NVME_HIGH should have been OUI_FREEBSD_NVME_HIGH

Caught by:	Gary Jennejohn
2019-07-14 03:49:48 +00:00
Chuck Tuffli
409a80e5a4 bhyve: Create EUI64 for NVMe namespaces
Accept an IEEE Extended Unique Identifier (EUI-64) from the command
line for each NVMe namespace. If one isn't provided, it will create one
based on the CRC16 of:
 - the FreeBSD IEEE OUI
 - PCI bus, device/slot, function values
 - Namespace ID

Reviewed by:	imp, araujo, jhb, rgrimes
Approved by:	imp (mentor), jhb (maintainer)
MFC after:	2 weeks
Differential Revision: https://reviews.freebsd.org/D19905
2019-07-13 12:48:28 +00:00
Mark Johnston
eeacb3b02f Merge the vm_page hold and wire mechanisms.
The hold_count and wire_count fields of struct vm_page are separate
reference counters with similar semantics.  The remaining essential
differences are that holds are not counted as a reference with respect
to LRU, and holds have an implicit free-on-last unhold semantic whereas
vm_page_unwire() callers must explicitly determine whether to free the
page once the last reference to the page is released.

This change removes the KPIs which directly manipulate hold_count.
Functions such as vm_fault_quick_hold_pages() now return wired pages
instead.  Since r328977 the overhead of maintaining LRU for wired pages
is lower, and in many cases vm_fault_quick_hold_pages() callers would
swap holds for wirings on the returned pages anyway, so with this change
we remove a number of page lock acquisitions.

No functional change is intended.  __FreeBSD_version is bumped.

Reviewed by:	alc, kib
Discussed with:	jeff
Discussed with:	jhb, np (cxgbe)
Tested by:	pho (previous version)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D19247
2019-07-08 19:46:20 +00:00
John Baldwin
66d0c056be Support IFCAP_NOMAP in vlan(4).
Enable IFCAP_NOMAP for a vlan interface if it is supported by the
underlying trunk device.

Reviewed by:	gallatin, hselasky, rrs
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20616
2019-06-29 00:51:38 +00:00
John Baldwin
82334850ea Add an external mbuf buffer type that holds multiple unmapped pages.
Unmapped mbufs allow sendfile to carry multiple pages of data in a
single mbuf, without mapping those pages.  It is a requirement for
Netflix's in-kernel TLS, and provides a 5-10% CPU savings on heavy web
serving workloads when used by sendfile, due to effectively
compressing socket buffers by an order of magnitude, and hence
reducing cache misses.

For this new external mbuf buffer type (EXT_PGS), the ext_buf pointer
now points to a struct mbuf_ext_pgs structure instead of a data
buffer.  This structure contains an array of physical addresses (this
reduces cache misses compared to an earlier version that stored an
array of vm_page_t pointers).  It also stores additional fields needed
for in-kernel TLS such as the TLS header and trailer data that are
currently unused.  To more easily detect these mbufs, the M_NOMAP flag
is set in m_flags in addition to M_EXT.

Various functions like m_copydata() have been updated to safely access
packet contents (using uiomove_fromphys()), to make things like BPF
safe.

NIC drivers advertise support for unmapped mbufs on transmit via a new
IFCAP_NOMAP capability.  This capability can be toggled via the new
'nomap' and '-nomap' ifconfig(8) commands.  For NIC drivers that only
transmit packet contents via DMA and use bus_dma, adding the
capability to if_capabilities and if_capenable should be all that is
required.

If a NIC does not support unmapped mbufs, they are converted to a
chain of mapped mbufs (using sf_bufs to provide the mapping) in
ip_output or ip6_output.  If an unmapped mbuf requires software
checksums, it is also converted to a chain of mapped mbufs before
computing the checksum.

Submitted by:	gallatin (earlier version)
Reviewed by:	gallatin, hselasky, rrs
Discussed with:	ae, kp (firewalls)
Relnotes:	yes
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20616
2019-06-29 00:48:33 +00:00
Hans Petter Selasky
0dbdf04125 Need to wait for epoch callbacks to complete before detaching a
network interface.

This particularly manifests itself when an INP has multicast options
attached during a network interface detach. Then the IPv4 and IPv6
leave group call which results from freeing the multicast address, may
access a freed ifnet structure. These are the steps to reproduce:

service mdnsd onestart # installed from ports

ifconfig epair create
ifconfig epair0a 0/24 up
ifconfig epair0a destroy

Tested by:	pho @
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2019-06-28 10:49:04 +00:00
Marius Strobl
c2c5d1e787 o In iflib_txq_drain():
- Remove desc_used, which is only ever written to.
  - Remove a dead store to reclaimed.
  - Don't recycle avail.
  - Sort variables according to style(9).
  These changes will make a subsequent commit easier to read.
o In iflib_tx_credits_update(), don't bother checking whether the
  ift_txd_credits_update method pointer is NULL; _iflib_pre_assert()
  asserts upfront that this method has been assigned and functions
  like iflib_{fast_intr_rxtx,netmap_timer_adjust,txq_can_drain}()
  and _task_fn_tx() were already unconditionally relying on the
  method being callable.
2019-06-26 15:28:21 +00:00
Leandro Lupori
e2edff4167 [PowerPC64] Don't mark module data as static
Fixes panic when loading ipfw.ko and if_epair.ko built with modern compiler.

Similar to arm64 and riscv, when using a modern compiler (!gcc4.2), code
generated tries to access data in the wrong location, causing kernel panic
(data storage interrupt trap) when loading if_epair and ipfw.

Issue was reproduced with kernel/module compiled using gcc8 and clang8. It
affects both ELFv1 and ELFv2 ABI environments.

PR:		232387
Submitted by:	alfredo.junior_eldorado.org.br
Reported by:	Mark Millard
Reviewed by:	jhibbits
Differential Revision:	https://reviews.freebsd.org/D20461
2019-06-25 17:15:44 +00:00
Hans Petter Selasky
59854ecf55 Convert all IPv4 and IPv6 multicast memberships into using a STAILQ
instead of a linear array.

The multicast memberships for the inpcb structure are protected by a
non-sleepable lock, INP_WLOCK(), which needs to be dropped when
calling the underlying possibly sleeping if_ioctl() method. When using
a linear array to keep track of multicast memberships, the computed
memory location of the multicast filter may suddenly change, due to
concurrent insertion or removal of elements in the linear array. This
in turn leads to various invalid memory access issues and kernel
panics.

To avoid this problem, put all multicast memberships on a STAILQ based
list. Then the memory location of the IPv4 and IPv6 multicast filters
become fixed during their lifetime and use after free and memory leak
issues are easier to track, for example by: vmstat -m | grep multi

All list manipulation has been factored into inline functions
including some macros, to easily allow for a future hash-list
implementation, if needed.

This patch has been tested by pho@ .

Differential Revision: https://reviews.freebsd.org/D20080
Reviewed by:	markj @
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2019-06-25 11:54:41 +00:00
Marko Zec
188adcb7e4 V_ip6_forwarding and V_ipforwarding have been defined in ip6_var.h /
ip_var.h since at least 2008, so make use of those definitions here.

MFC after:	3 days
2019-06-19 08:49:24 +00:00
Marko Zec
6aee0bfa85 Evaluating htons() at compile time is more efficient than doing ntohs()
at runtime.  This change removes a dependency on a barrel shifter pass
before branch resolution, while reducing the instruction stream size
by 9 bytes on amd64.

MFC after:	3 days
2019-06-19 08:39:19 +00:00
Marius Strobl
d49e83eac3 - Replace unused and only ever written to members of public iflib(9)
structs with placeholders (in the latter case, IFLIB_MAX_TX_BYTES
  etc. are also only ever used for these write-only members if at all,
  so both these macros and members can just go). Using these spares
  may render it possible to merge certain iflib(9) fixes to stable/12.
  Otherwise, changes extending struct if_irq or struct if_shared_ctx
  in any way would break KBI as instances of these are allocated by
  the driver front-ends (by contrast, struct if_pkt_info as well as
  struct if_softc_ctx instances are provided by iflib(9) and, thus,
  may grow at least at the end without breaking KBI).
- Make the pvi_name in struct pci_vendor_info const char * as device
  identifiers in hardware lookup tables aren't to be expected to ever
  change at runtime.
- Similarly, make the pci_vendor_info_t of struct if_shared_ctx which
  is used to point to the struct pci_vendor_info arrays provided by
  the driver front-ends const.
- Remove the ETH_ADDR_LEN macro from iflib.h; this was duplicating
  ETHER_ADDR_LEN of <net/ethernet.h> with iflib(9) actually only
  consuming the latter macro.
- Make the name argument of iflib_io_tqg_attach(9) const, matching
  the taskqgroup_attach_cpu(9) this function wraps as well as e. g.
  iflib_config_gtask_init(9).
- Remove the orphaned iflib_qset_lock_get() prototype.
- Remove some extraneous empty lines.
2019-06-15 11:07:41 +00:00
Mark Johnston
ce7fb386d8 Restore the comment removed in r348745.
LAGG_RLOCK() enters an epoch section, so the comment wasn't stale.

Reported by:	jhb
MFC with:	r348745
2019-06-06 17:20:35 +00:00
Mark Johnston
9995dfd364 Conditionalize an in_epoch() call on INVARIANTS.
Its result is only used to determine whether to perform further
INVARIANTS-only checks.  Remove a stale comment while here.

Submitted by:	Sebastian Huber <sebastian.huber@embedded-brains.de>
MFC after:	1 week
2019-06-06 16:22:29 +00:00
Eric Joyner
668d6dbb4c iflib: provide probe wrapper for vendor drivers
From Jake:
Vendor drivers that exist out-of-tree generally should return
BUS_PROBE_VENDOR from their device probe functions. This helps ensure
that a vendor replacement driver will supersede the in-kernel driver for
a given device.

Currently, if a vendor wants to implement a driver based on iflib, it
will always report BUS_PROBE_DEFAULT.

Add a wrapper function, iflib_device_probe_vendor() which can be used in
place of iflib_device_probe(). This function will just return
BUS_PROBE_VENDOR whenever iflib_device_probe() would return
BUS_PROBE_DEFAULT.

While vendor drivers can already implement such a wrapper themselves,
providing it in the iflib.h header makes it easier for the vendor driver
to do the right thing.

Submitted by:	Jacob Keller <jacob.e.keller@intel.com>
Reviewed by:	erj@, gallatin@, marius@
MFC after:	1 week
Sponsored by:	Intel Corporation
Differential Revision:	https://reviews.freebsd.org/D20221
2019-05-29 22:24:10 +00:00
Kyle Evans
d8b985430c if_bridge(4): Complete bpf auditing of local traffic over the bridge
There were two remaining "gaps" in auditing local bridge traffic with
bpf(4):

Locally originated outbound traffic from a member interface is invisible to
the bridge's bpf(4) interface. Inbound traffic locally destined to a member
interface is invisible to the member's bpf(4) interface -- this traffic has
no chance after bridge_input to otherwise pass it over, and it wasn't
originally received on this interface.

I call these "gaps" because they don't affect conventional bridge setups.
Alas, being able to establish an audit trail of all locally destined traffic
for setups that can function like this is useful in some scenarios.

Reviewed by:	kp
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D19757
2019-05-29 01:08:30 +00:00
Andrey V. Elsukov
de25327313 Rework r348303 to reduce the time of holding global BPF lock.
It appeared that using NET_EPOCH_WAIT() while holding global BPF lock
can lead to another panic:

spin lock 0xfffff800183c9840 (turnstile lock) held by 0xfffff80018e2c5a0 (tid 100325) too long
panic: spin lock held too long
...
#0  sched_switch (td=0xfffff80018e2c5a0, newtd=0xfffff8000389e000, flags=<optimized out>) at /usr/src/sys/kern/sched_ule.c:2133
#1  0xffffffff80bf9912 in mi_switch (flags=256, newtd=0x0) at /usr/src/sys/kern/kern_synch.c:439
#2  0xffffffff80c21db7 in sched_bind (td=<optimized out>, cpu=<optimized out>) at /usr/src/sys/kern/sched_ule.c:2704
#3  0xffffffff80c34c33 in epoch_block_handler_preempt (global=<optimized out>, cr=0xfffffe00005a1a00, arg=<optimized out>)
    at /usr/src/sys/kern/subr_epoch.c:394
#4  0xffffffff803c741b in epoch_block (global=<optimized out>, cr=<optimized out>, cb=<optimized out>, ct=<optimized out>)
    at /usr/src/sys/contrib/ck/src/ck_epoch.c:416
#5  ck_epoch_synchronize_wait (global=0xfffff8000380cd80, cb=<optimized out>, ct=<optimized out>) at /usr/src/sys/contrib/ck/src/ck_epoch.c:465
#6  0xffffffff80c3475e in epoch_wait_preempt (epoch=0xfffff8000380cd80) at /usr/src/sys/kern/subr_epoch.c:513
#7  0xffffffff80ce970b in bpf_detachd_locked (d=0xfffff801d309cc00, detached_ifp=<optimized out>) at /usr/src/sys/net/bpf.c:856
#8  0xffffffff80ced166 in bpf_detachd (d=<optimized out>) at /usr/src/sys/net/bpf.c:836
#9  bpf_dtor (data=0xfffff801d309cc00) at /usr/src/sys/net/bpf.c:914

To fix this add the check to the catchpacket() that BPF descriptor was
not detached just before we acquired BPFD_LOCK().

Reported by:	slavash
Tested by:	slavash
MFC after:	1 week
2019-05-28 11:45:00 +00:00
Andrey V. Elsukov
44a514745c Fix possible NULL pointer dereference.
bpf_mtap() can invoke catchpacket() for already detached descriptor.
And this can lead to NULL pointer dereference, since bd_bif pointer
was reset to NULL in bpf_detachd_locked(). To avoid this, use
NET_EPOCH_WAIT() when descriptor is removed from interface's descriptors
list. After the wait it is safe to modify descriptor's content.

Submitted by:	kib
Reported by:	slavash
MFC after:	1 week
2019-05-27 12:41:41 +00:00
John Baldwin
fb3bc59600 Restructure mbuf send tags to provide stronger guarantees.
- Perform ifp mismatch checks (to determine if a send tag is allocated
  for a different ifp than the one the packet is being output on), in
  ip_output() and ip6_output().  This avoids sending packets with send
  tags to ifnet drivers that don't support send tags.

  Since we are now checking for ifp mismatches before invoking
  if_output, we can now try to allocate a new tag before invoking
  if_output sending the original packet on the new tag if allocation
  succeeds.

  To avoid code duplication for the fragment and unfragmented cases,
  add ip_output_send() and ip6_output_send() as wrappers around
  if_output and nd6_output_ifp, respectively.  All of the logic for
  setting send tags and dealing with send tag-related errors is done
  in these wrapper functions.

  For pseudo interfaces that wrap other network interfaces (vlan and
  lagg), wrapper send tags are now allocated so that ip*_output see
  the wrapper ifp as the ifp in the send tag.  The if_transmit
  routines rewrite the send tags after performing an ifp mismatch
  check.  If an ifp mismatch is detected, the transmit routines fail
  with EAGAIN.

- To provide clearer life cycle management of send tags, especially
  in the presence of vlan and lagg wrapper tags, add a reference count
  to send tags managed via m_snd_tag_ref() and m_snd_tag_rele().
  Provide a helper function (m_snd_tag_init()) for use by drivers
  supporting send tags.  m_snd_tag_init() takes care of the if_ref
  on the ifp meaning that code alloating send tags via if_snd_tag_alloc
  no longer has to manage that manually.  Similarly, m_snd_tag_rele
  drops the refcount on the ifp after invoking if_snd_tag_free when
  the last reference to a send tag is dropped.

  This also closes use after free races if there are pending packets in
  driver tx rings after the socket is closed (e.g. from tcpdrop).

  In order for m_free to work reliably, add a new CSUM_SND_TAG flag in
  csum_flags to indicate 'snd_tag' is set (rather than 'rcvif').
  Drivers now also check this flag instead of checking snd_tag against
  NULL.  This avoids false positive matches when a forwarded packet
  has a non-NULL rcvif that was treated as a send tag.

- cxgbe was relying on snd_tag_free being called when the inp was
  detached so that it could kick the firmware to flush any pending
  work on the flow.  This is because the driver doesn't require ACK
  messages from the firmware for every request, but instead does a
  kind of manual interrupt coalescing by only setting a flag to
  request a completion on a subset of requests.  If all of the
  in-flight requests don't have the flag when the tag is detached from
  the inp, the flow might never return the credits.  The current
  snd_tag_free command issues a flush command to force the credits to
  return.  However, the credit return is what also frees the mbufs,
  and since those mbufs now hold references on the tag, this meant
  that snd_tag_free would never be called.

  To fix, explicitly drop the mbuf's reference on the snd tag when the
  mbuf is queued in the firmware work queue.  This means that once the
  inp's reference on the tag goes away and all in-flight mbufs have
  been queued to the firmware, tag's refcount will drop to zero and
  snd_tag_free will kick in and send the flush request.  Note that we
  need to avoid doing this in the middle of ethofld_tx(), so the
  driver grabs a temporary reference on the tag around that loop to
  defer the free to the end of the function in case it sends the last
  mbuf to the queue after the inp has dropped its reference on the
  tag.

- mlx5 preallocates send tags and was using the ifp pointer even when
  the send tag wasn't in use.  Explicitly use the ifp from other data
  structures instead.

- Sprinkle some assertions in various places to assert that received
  packets don't have a send tag, and that other places that overwrite
  rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer.

Reviewed by:	gallatin, hselasky, rgrimes, ae
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20117
2019-05-24 22:30:40 +00:00
Alexander V. Chernikov
563ab4e400 Fix gateway setup for the interface routes.
Currently rinit1() and its IPv6 counterpart
  nd6_prefix_onlink_rtrequest() uses dummy null_sdl gateway address
  during route insertion and change it afterwards. This behaviour
  brings complications to the routing stack and the users of its
  upcoming notification system.

This change fixes both rinit1() and nd6_prefix_onlink_rtrequest()
  by filling in proper gateway in the beginning. It does not change any
  of the userland notifications as in both cases, they happen after
  the insertion and fixup process (rt_newaddrmsg_fib() and nd6_rtmsg()).

MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D20328
2019-05-22 21:20:15 +00:00
Conrad Meyer
e2e050c8ef Extract eventfilter declarations to sys/_eventfilter.h
This allows replacing "sys/eventfilter.h" includes with "sys/_eventfilter.h"
in other header files (e.g., sys/{bus,conf,cpu}.h) and reduces header
pollution substantially.

EVENTHANDLER_DECLARE and EVENTHANDLER_LIST_DECLAREs were moved out of .c
files into appropriate headers (e.g., sys/proc.h, powernv/opal.h).

As a side effect of reduced header pollution, many .c files and headers no
longer contain needed definitions.  The remainder of the patch addresses
adding appropriate includes to fix those files.

LOCK_DEBUG and LOCK_FILE_LINE_ARG are moved to sys/_lock.h, as required by
sys/mutex.h since r326106 (but silently protected by header pollution prior
to this change).

No functional change (intended).  Of course, any out of tree modules that
relied on header pollution for sys/eventhandler.h, sys/lock.h, or
sys/mutex.h inclusion need to be fixed.  __FreeBSD_version has been bumped.
2019-05-20 00:38:23 +00:00
Alexander V. Chernikov
2ad7ed6e4a Fix rt_ifa selection during loopback route insertion process.
Currently such routes are added with a link-level IFA, which is
  plain wrong. Only after the insertion they get fixed by the special
  link_rtrequest() ifa handler. This behaviour complicates routing code
  and makes ifa selection more complex.
Streamline this process by explicitly moving link_rtrequest() logic
  to the pre-insertion rt_getifa_fib() ifa selector. Avoid calling all
  this logic in the loopback route case by explicitly specifying
  proper rt_ifa inside the ifa_maintain_loopback_route().§

MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D20076
2019-05-19 21:49:56 +00:00
Kyle Evans
db226f0d8e tuntap: Defer clearing if_softc until after if_detach
r346670 added an sx to close a race between the ifioctl handler and
interface destruction. Unfortunately, it clears if_softc immediately after
the interface is closed, but before if_detach has been invoked.

Any time before detachment, an interface that's part of a bridge may still
receive traffic that's pushed through tunstart/tunstart_l2 and promptly
lead to a panic because if_softc is now NULL.

Fix it by deferring the clearing of if_softc until after the interface has
detached and thus been removed from the bridge. if_softc still gets cleared
in case another thread has already entered the ioctl handler before it's
replaced with ifdead_ioctl.

Reported by:	markj
MFC after:	3 days
2019-05-14 20:32:29 +00:00
Andrey V. Elsukov
82d7bf6b1b Avoid possible recursion on BPF_LOCK() in bpfwrite().
Release BPF_LOCK() before invoking if_output() and if_input().
Also enter epoch section before releasing lock, this should prevent
access to ifnet that may be freed on interface detach.

Reported by:	markj
2019-05-13 20:17:55 +00:00
Andrey V. Elsukov
af1f58df99 Do not leak memory used for binary filter. 2019-05-13 14:07:02 +00:00
Andrey V. Elsukov
699281b545 Rework locking in BPF code to remove rwlock from fast path.
On high packets rate the contention on rwlock in bpf_*tap*() functions
can lead to packets dropping. To avoid this, migrate this code to use
epoch(9) KPI and ConcurrencyKit's lists.

* all lists changed to use CK_LIST;
* reference counting added to bpf_if and bpf_d;
* now bpf_if references ifnet and releases this reference on destroy;
* each bpf_d descriptor references bpf_if when it is attached;
* new struct bpf_program_buffer introduced to keep BPF filter programs;
* bpf_program_buffer, bpf_d and bpf_if structures are freed by
  epoch_call();
* bpf_freelist and ifnet_departure event are no longer needed, thus
  both are removed;

Reviewed by:	melifaro
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D20224
2019-05-13 13:45:28 +00:00
Kyle Evans
81b3b91e6b tuntap: Improve style
No functional change.

tun_flags of the tuntap_driver was renamed to ident_flags to reflect the
fact that it's a subset of the tun_flags that identifies a tuntap device.
This maps more easily (visually) to the TUN_DRIVER_IDENT_MASK that masks off
the bits of tun_flags that are applicable to tuntap driver ident. This is a
purely cosmetic change.
2019-05-11 04:18:06 +00:00
Eric Joyner
afb7737237 iflib: use default ntxd and nrxd when user value is not power of 2
From Jake:
A user may set a sysctl to override the default number of Tx or Rx
descriptors. However, certain calculations in the iflib core expect the
number of descriptors to be a power of 2.

Update _iflib_assert to verify that all of the shared context parameters
for the number of descriptors are powers of 2.

Modify iflib_reset_qvalues to check that the provided isc_nrxd value is
a power of 2. If it's not, print a warning message and then use the
default value.

An alternative might be to try rounding the number down instead.
However, this creates problems in case the rounded down value is below
the minimum value that the driver would support.

Submitted by:	Jacob Keller <jacob.e.keller@intel.com>
Reviewed by:	marius@
MFC after:	1 week
Sponsored by:	Intel Corporation
Differential Revision:	https://reviews.freebsd.org/D19880
2019-05-10 00:41:42 +00:00
Kyle Evans
16760d8e28 tuntap: Don't down tap interfaces if LINK0 is set 2019-05-09 18:54:29 +00:00
Kyle Evans
a6fa049545 tuntap: Properly detach tap ifp 2019-05-09 14:06:24 +00:00
Marius Strobl
14e0010729 - Merge r338254 from cxgbe(4):
Use fcmpset instead of cmpset when appropriate.
- Revert r277226 of cxgbe(4), obsolete since r334320.
2019-05-09 11:34:46 +00:00
Gleb Smirnoff
6ca363eb7b Existense of PCB route caching doesn't allow us to use new fast route
lookup KPI in ip_output() like it is already used in ip_forward().
However, when there is no PCB provided we can use fast KPI, gaining
performance advantage.

Typical case when ip_output() is called without a PCB pointer is a
sendto(2) on a not connected UDP socket. In practice DNS servers do
this.

Reviewed by:	melifaro
Differential Revision:	https://reviews.freebsd.org/D19804
2019-05-08 23:39:24 +00:00
Marius Strobl
007b804fc7 Allow to build without INET and INET6 again after r347221.
Submitted by:	cam
2019-05-08 09:03:43 +00:00