4077 Commits

Author SHA1 Message Date
Marius Strobl
a6611c938b Fix the build with ALTQ after r344060. 2019-02-12 22:33:17 +00:00
Marius Strobl
f855ec814d Make taskqgroup_attach{,_cpu}(9) work across architectures
So far, intr_{g,s}etaffinity(9) take a single int for identifying
a device interrupt. This approach doesn't work on all architectures
supported, as a single int isn't sufficient to globally specify a
device interrupt. In particular, with multiple interrupt controllers
in one system as found on e. g. arm and arm64 machines, an interrupt
number as returned by rman_get_start(9) may be only unique relative
to the bus and, thus, interrupt controller, a certain device hangs
off from.
In turn, this makes taskqgroup_attach{,_cpu}(9) and - internal to
the gtaskqueue implementation - taskqgroup_attach_deferred{,_cpu}()
not work across architectures. Yet in turn, iflib(4) as gtaskqueue
consumer so far doesn't fit architectures where interrupt numbers
aren't globally unique.
However, at least for intr_setaffinity(..., CPU_WHICH_IRQ, ...) as
employed by the gtaskqueue implementation to bind an interrupt to a
particular CPU, using bus_bind_intr(9) instead is equivalent from
a functional point of view, with bus_bind_intr(9) taking the device
and interrupt resource arguments required for uniquely specifying a
device interrupt.
Thus, change the gtaskqueue implementation to employ bus_bind_intr(9)
instead and intr_{g,s}etaffinity(9) to take the device and interrupt
resource arguments required respectively. This change also moves
struct grouptask from <sys/_task.h> to <sys/gtaskqueue.h> and wraps
struct gtask along with the gtask_fn_t typedef into #ifdef _KERNEL
as userland likes to include <sys/_task.h> or indirectly drags it
in - for better or worse also with _KERNEL defined -, which with
device_t and struct resource dependencies otherwise is no longer
as easily possible now.
The userland inclusion problem probably can be improved a bit by
introducing a _WANT_TASK (as well as a _WANT_MOUNT) akin to the
existing _WANT_PRISON etc., which is orthogonal to this change,
though, and likely needs an exp-run.

While at it:
- Change the gt_cpu member in the grouptask structure to be of type
  int as used elswhere for specifying CPUs (an int16_t may be too
  narrow sooner or later),
- move the gtaskqueue_enqueue_fn typedef from <sys/gtaskqueue.h> to
  the gtaskqueue implementation as it's only used and needed there,
- change the GTASK_INIT macro to use "gtask" rather than "task" as
  argument given that it actually operates on a struct gtask rather
  than a struct task, and
- let subr_gtaskqueue.c consistently use __func__ to print functions
  names.

Reported by:	mmel
Reviewed by:	mmel
Differential Revision:	https://reviews.freebsd.org/D19139
2019-02-12 21:23:59 +00:00
Marius Strobl
95dcf343b7 Further correct and optimize the bus_dma(9) usage of iflib(4):
o Correct the obvious bugs in the netmap(4) parts:
  - No longer check for the existence of DMA maps as bus_dma(9)
    is used unconditionally in iflib(4) since r341095.
  - Supply the correct DMA tag and map pairs to bus_dma(9)
    functions (see also the commit message of r343753).
  - In iflib_netmap_timer_adjust(), add synchronization of the
    TX descriptors before calling the ift_txd_credits_update
    method as the latter evaluates the TX descriptors possibly
    updated by the MAC.
  - In _task_fn_tx(), wrap the netmap(4)-specific bits in
    #ifdef DEV_NETMAP just as done in _task_fn_admin() and
    _task_fn_rx() respectively.
o In iflib_fast_intr_rxtx(), synchronize the TX rather than
  the RX descriptors before calling the ift_txd_credits_update
  method (see also above).
o There's no need to synchronize an RX buffer that is going to
  be recycled in iflib_rxd_pkt_get(), yet; it's sufficient to
  do that as late as passing RX buffers to the MAC via the
  ift_rxd_refill method. Hence, combine that synchronization
  with the synchronization of new buffers into a common spot
  in _iflib_fl_refill().
o There's no need to synchronize the RX descriptors of a free
  list in preparation of the MAC updating their statuses with
  every invocation of rxd_frag_to_sd(); it's enough to do this
  once before handing control over to the MAC, i. e. before
  calling ift_rxd_flush method in _iflib_fl_refill(), which
  already performs the necessary synchronization.
o Given that the ift_rxd_available method evaluates the RX
  descriptors which possibly have been altered by the MAC,
  synchronize as appropriate beforehand. Most notably this
  is now done in iflib_rxd_avail(), which in turn means that
  we don't need to issue the same synchronization yet again
  before calling the ift_rxd_pkt_get method in iflib_rxeof().
o In iflib_txd_db_check(), synchronize the TX descriptors
  before handing them over to the MAC for transmission via
  the ift_txd_flush method.
o In iflib_encap(), move the TX buffer synchronization after
  the invocation of the ift_txd_encap() method. If the MAC
  driver fails to encapsulate the packet and we retry with
  a defragmented mbuf chain or finally fail, the cycles for
  TX buffer synchronization have been wasted. Synchronizing
  afterwards matches what non-iflib(4) drivers typically do
  and is sufficient as the MAC will not actually start with
  the transmission before - in this case - the ift_txd_flush
  method is called.
  Moreover, for the latter reason the synchronization of the
  TX descriptors in iflib_encap() can go as it's enough to
  synchronize them before passing control over to the MAC by
  issuing the ift_txd_flush() method (see above).
o In iflib_txq_can_drain(), only synchronize TX descriptors
  if the ift_txd_credits_update method accessing these is
  actually called.

Differential Revision:	https://reviews.freebsd.org/D19081
2019-02-12 21:08:44 +00:00
Patrick Kelsey
8f2ac65690 Reduce the time it takes the kernel to install a new PF config containing a large number of queues
In general, the time savings come from separating the active and
inactive queues lists into separate interface and non-interface queue
lists, and changing the rule and queue tag management from list-based
to hash-bashed.

In HFSC, a linear scan of the class table during each queue destroy
was also eliminated.

There are now two new tunables to control the hash size used for each
tag set (default for each is 128):

net.pf.queue_tag_hashsize
net.pf.rule_tag_hashsize

Reviewed by:	kp
MFC after:	1 week
Sponsored by:	RG Nets
Differential Revision:	https://reviews.freebsd.org/D19131
2019-02-11 05:17:31 +00:00
Marius Strobl
bfce461ee9 o As illustrated by e. g. figure 7-14 of the Intel 82599 10 GbE
controller datasheet revision 3.3, in the context of Ethernet
  MACs the control data describing the packet buffers typically
  are named "descriptors". Each of these descriptors references
  one buffer, multiple of which a packet can be composed of.
  By contrast, in comments, messages and the names of structure
  members, iflib(4) refers to DMA resources employed for RX and
  TX buffers (rather than control data) as "desc(riptors)".
  This odd naming convention of iflib(4) made reviewing r343085
  and identifying wrong and missing bus_dmamap_sync(9) calls in
  particular way harder than it already is. This convention may
  also explain why the netmap(4) part of iflib(4) pairs the DMA
  tags for control data with DMA maps of buffers and vice versa
  in calls to bus_dma(9) functions.
  Therefore, change iflib(4) to refer to buf(fers) when buffers
  and not the usual understanding of descriptors is meant. This
  change does not include corrections to the DMA resources used
  in the netmap(4) parts. However, it revises error messages to
  state which kind of allocation/creation failed. Specifically,
  the "Unable to allocate tx_buffer (map) memory" copy & pasted
  inappropriately on several occasions was replaced with proper
  messages.
o Enhance some other error messages to indicate which half - RX
  or TX - they apply to instead of using identical text in both
  cases and generally canonicalize them.
o Correct the descriptions of iflib_{r,t}xsd_alloc() to reflect
  reality; current code doesn't use {r,t}x_buffer structures.
o In iflib_queues_alloc():
  - Remove redundant BUS_DMA_NOWAIT of iflib_dma_alloc() calls,
  - change the M_WAITOK from malloc(9) calls into M_NOWAIT. The
    return values are already checked, deferred DMA allocations
    not being an option at this point, BUS_DMA_NOWAIT has to be
    used anyway and prior malloc(9) calls in this function also
    specify M_NOWAIT.

Reviewed by:	shurd
Differential Revision:	https://reviews.freebsd.org/D19067
2019-02-04 20:46:57 +00:00
Gleb Smirnoff
3ca1c423aa Teach pfil_ioctl() about VIMAGE.
Submitted by:	gallatin
2019-02-03 08:28:02 +00:00
Vincenzo Maffione
5faab77822 netmap: upgrade sync-kloop support
Add SYNC_KLOOP_MODE option, and add support for direct mode, where application
executes the TXSYNC and RXSYNC in the context of the ioeventfd wake up callback.

MFC after:	5 days
2019-02-02 22:39:29 +00:00
Gleb Smirnoff
b252313f0b New pfil(9) KPI together with newborn pfil API and control utility.
The KPI have been reviewed and cleansed of features that were planned
back 20 years ago and never implemented.  The pfil(9) internals have
been made opaque to protocols with only returned types and function
declarations exposed. The KPI is made more strict, but at the same time
more extensible, as kernel uses same command structures that userland
ioctl uses.

In nutshell [KA]PI is about declaring filtering points, declaring
filters and linking and unlinking them together.

New [KA]PI makes it possible to reconfigure pfil(9) configuration:
change order of hooks, rehook filter from one filtering point to a
different one, disconnect a hook on output leaving it on input only,
prepend/append a filter to existing list of filters.

Now it possible for a single packet filter to provide multiple rulesets
that may be linked to different points. Think of per-interface ACLs in
Cisco or Juniper. None of existing packet filters yet support that,
however limited usage is already possible, e.g. default ruleset can
be moved to single interface, as soon as interface would pride their
filtering points.

Another future feature is possiblity to create pfil heads, that provide
not an mbuf pointer but just a memory pointer with length. That would
allow filtering at very early stages of a packet lifecycle, e.g. when
packet has just been received by a NIC and no mbuf was yet allocated.

Differential Revision:	https://reviews.freebsd.org/D18951
2019-01-31 23:01:03 +00:00
John Baldwin
829c56fc08 Don't set IFCAP_TXRTLMT during lagg_clone_create().
lagg_capabilities() will set the capability once interfaces supporting
the feature are added to the lagg.  Setting it on a lagg without any
interfaces is pointless as the if_snd_tag_alloc call will always fail
in that case.

Reviewed by:	hselasky, gallatin
MFC after:	2 weeks
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D19040
2019-01-31 21:35:37 +00:00
Gleb Smirnoff
f712b16127 Revert r316461: Remove "IPFW static rules" rmlock, and use pfil's global lock.
The pfil(9) system is about to be converted to epoch(9) synchronization, so
we need [temporarily] go back with ipfw internal locking.

Discussed with:	ae
2019-01-31 21:04:50 +00:00
Marius Strobl
b97de13ae0 - Stop iflib(4) from leaking MSI messages on detachment by calling
bus_teardown_intr(9) before pci_release_msi(9).
- Ensure that iflib(4) and associated drivers pass correct RIDs to
  bus_release_resource(9) by obtaining the RIDs via rman_get_rid(9)
  on the corresponding resources instead of using the RIDs initially
  passed to bus_alloc_resource_any(9) as the latter function may
  change those RIDs. Solely em(4) for the ioport resource (but not
  others) and bnxt(4) were using the correct RIDs by caching the ones
  returned by bus_alloc_resource_any(9).
- Change the logic of iflib_msix_init() around to only map the MSI-X
  BAR if MSI-X is actually supported, i. e. pci_msix_count(9) returns
  > 0. Otherwise the "Unable to map MSIX table " message triggers for
  devices that simply don't support MSI-X and the user may think that
  something is wrong while in fact everything works as expected.
- Put some (mostly redundant) debug messages emitted by iflib(4)
  and em(4) during attachment under bootverbose. The non-verbose
  output of em(4) seen during attachment now is close to the one
  prior to the conversion to iflib(4).
- Replace various variants of spelling "MSI-X" (several in messages)
  with "MSI-X" as used in the PCI specifications.
- Remove some trailing whitespace from messages emitted by iflib(4)
  and change them to consistently start with uppercase.
- Remove some obsolete comments about releasing interrupts from
  drivers and correct a few others.

Reviewed by:	erj, Jacob Keller, shurd
Differential Revision:	https://reviews.freebsd.org/D18980
2019-01-30 13:21:26 +00:00
Marius Strobl
3db348b54a - In _iflib_fl_refill(), don't mark an RX buffer as available in the
corresponding bitmap before adding an mbuf has actually succeeded.
  Previously, m_gethdr(M_NOWAIT, ...) failing caused a "hole" in the
  RX ring but not in its bitmap. One implication of such a hole was
  that in a subsequent call to _iflib_fl_refill() with the RX buffer
  accounting still indicating another reclaimable buffer, bit_ffc(3)
  nevertheless returned -1 in frag_idx which in turn caused havoc
  when used as an index. Thus, additionally assert that frag_idx is
  0 or greater.
  Another possible consequence of a hole in the RX ring was a NULL-
  dereference when trying to use the unallocated mbuf, for example
  in iflib_rxd_pkt_get().

  While at it, make the variable declarations in _iflib_fl_refill()
  conform to style(9) and remove redundant checks already performed
  by bit_ffc{,_at}(3).

- In iflib_queues_alloc(), don't pass redundant M_ZERO to bit_alloc(3).

Reported and tested by: pho
2019-01-26 21:35:51 +00:00
Andrew Gallatin
77102fd6a2 Fix an iflib driver unload panic introduced in r343085
The new loop to sync and unload descriptors was indexed
by "i", rather than "j".   The panic was caused by "i"
being advanced rather than "j", and eventually becoming
out of bounds.

Reviewed by:	kib
MFC after:	3 days
Sponsored by:	Netflix
2019-01-25 15:02:18 +00:00
Vincenzo Maffione
f79ba6d75b netmap: improvements to the netmap kloop (CSB mode)
Changelist:
    - Add the proper memory barriers in the kloop ring processing
      functions.
    - Fix memory barriers usage in the user helpers (nm_sync_kloop_appl_write,
      nm_sync_kloop_appl_read).
    - Fix nm_kr_txempty() helper to look at rhead rather than rcur. This
      is important since the kloop can read a value of rcur which is ahead
      of the value of rhead (see explanation in nm_sync_kloop_appl_write)
    - Remove obsolete ptnetmap_guest_write_kring_csb() and
      ptnet_guest_read_kring_csb(), and update if_ptnet(4) to use those.
    - Prepare in advance the arguments for netmap_sync_kloop_[tr]x_ring(),
      to make the kloop faster.
    - Provide kernel and user implementation for nm_ldld_barrier() and
      nm_ldst_barrier()

MFC after:	2 weeks
2019-01-23 14:51:36 +00:00
Brooks Davis
bc6f170e30 Rework CASE_IOC_IFGROUPREQ() to require a case before the macro.
This is more compatible with formatting tools and looks more normal.

Reported by:	jhb (on a different review)
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D18442
2019-01-22 17:39:26 +00:00
Patrick Kelsey
8f82136aec onvert vmx(4) to being an iflib driver.
Also, expose IFLIB_MAX_RX_SEGS to iflib drivers and add
iflib_dma_alloc_align() to the iflib API.

Performance is generally better with the tunable/sysctl
dev.vmx.<index>.iflib.tx_abdicate=1.

Reviewed by:	shurd
MFC after:	1 week
Relnotes:	yes
Sponsored by:	RG Nets
Differential Revision:	https://reviews.freebsd.org/D18761
2019-01-22 01:11:17 +00:00
Patrick Kelsey
7f3eb9dab3 Fix various resource leaks that can occur in the error paths of
iflib_device_register() and iflib_pseudo_register().

Reviewed by:	shurd
MFC after:	1 week
Sponsored by:	RG Nets
Differential Revision:	https://reviews.freebsd.org/D18760
2019-01-22 00:56:44 +00:00
Konstantin Belousov
8a04b53dce Improve iflib busdma(9) KPI use.
- Specify BUS_DMA_NOWAIT for bus_dmamap_load() on rx refill, since
  callbacks are not supposed to be used.
- Match tso/non-tso tags to corresponding tx map operations.  Create
  separate tso maps for tx descriptors.  In particular, do not use
  non-tso tag to load, unload, or destroy a map created with tso tag.
- Add missed bus_dmamap_sync() calls.
  Submitted by: marius.

Reported and tested by:	pho
Reviewed by:	marius
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-01-16 05:44:14 +00:00
Gleb Smirnoff
cc31f9821c Remove recursive NET_EPOCH_ENTER() from sysctl_ifmalist(), missed in r342872. 2019-01-11 00:45:22 +00:00
Gleb Smirnoff
7b7f772fa0 Bring the comment up to date. 2019-01-10 00:37:14 +00:00
Mark Johnston
72755d285f Stop setting if_linkmib in vlan(4) ifnets.
There are several reasons:
- The structure being exported via IFDATA_LINKSPECIFIC doesn't appear
  to be a standard MIB.
- The structure being exported is private to the kernel and always
  has been.
- No other drivers in common use set the if_linkmib field.
- Because IFDATA_LINKSPECIFIC can be used to overwrite the linkmib
  structure, a privileged user could use it to corrupt internal
  vlan(4) state. [1]

PR:		219472
Reported by:	CTurt <ecturt@gmail.com> [1]
Reviewed by:	kp (previous version)
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18779
2019-01-09 16:47:16 +00:00
Gleb Smirnoff
a68cc38879 Mechanical cleanup of epoch(9) usage in network stack.
- Remove macros that covertly create epoch_tracker on thread stack. Such
  macros a quite unsafe, e.g. will produce a buggy code if same macro is
  used in embedded scopes. Explicitly declare epoch_tracker always.

- Unmask interface list IFNET_RLOCK_NOSLEEP(), interface address list
  IF_ADDR_RLOCK() and interface AF specific data IF_AFDATA_RLOCK() read
  locking macros to what they actually are - the net_epoch.
  Keeping them as is is very misleading. They all are named FOO_RLOCK(),
  while they no longer have lock semantics. Now they allow recursion and
  what's more important they now no longer guarantee protection against
  their companion WLOCK macros.
  Note: INP_HASH_RLOCK() has same problems, but not touched by this commit.

This is non functional mechanical change. The only functionally changed
functions are ni6_addrs() and ni6_store_addrs(), where we no longer enter
epoch recursively.

Discussed with:	jtl, gallatin
2019-01-09 01:11:19 +00:00
Gleb Smirnoff
21ee61a6cf Remove part of comment that doesn't match reality. 2019-01-09 00:38:16 +00:00
Stephen Hurd
cd28ea929a Use iflib_if_init_locked() during resume instead of iflib_init_locked().
iflib_init_locked() assumes that iflib_stop() has been called, however,
it is not called for suspend.  iflib_if_init_locked() calls stop then init,
so fixes the problem.

This was causing errors after a resume from suspend.

PR:		224059
Reported by:	zeising
MFC after:	1 week
Sponsored by:	Limelight Networks
2019-01-07 23:46:54 +00:00
Matt Macy
5881181d8c mp_ring: avoid items offset difference between iflib and mp_ring
on architectures without 64-bit atomics

Reported by:	Augustin Cavalier <waddlesplash@gmail.com>
2019-01-03 23:06:05 +00:00
Konstantin Belousov
85f3b801e9 Fix typo, use boolean operator instead of bit-wise.
Reviewed by:	marius, shurd
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
2019-01-03 01:01:03 +00:00
Mateusz Guzik
cc426dd319 Remove unused argument to priv_check_cred.
Patch mostly generated with cocinnelle:

@@
expression E1,E2;
@@

- priv_check_cred(E1,E2,0)
+ priv_check_cred(E1,E2)

Sponsored by:	The FreeBSD Foundation
2018-12-11 19:32:16 +00:00
Stephen Hurd
7124b5ba04 Fix !tx_abdicate error from r336560
r336560 was supposed to restore pre-r323954 behaviour when tx_abdicate is
not set (the default case). However, it appears that rather than the drainage
check being made conditional on tx_abdicate being set, it was duplicated
so it occured twice if tx_abdicate was set and once if it was not.

Now when !tx_abdicate, drainage is only checked if the doorbell isn't
pending.

Reported by:    lev
MFC after:      1 week
Sponsored by:   Limelight Networks
2018-12-11 17:46:01 +00:00
Vincenzo Maffione
1d9ec3ee50 netmap.h: include stdatomic.h
The stdatomic.h header exports atomic_thread_fence(), that
can be used to implement the nm_stst_barrier() macro needed
by netmap.

MFC after:	3 days
2018-12-05 15:38:52 +00:00
Vincenzo Maffione
b6e66be22b netmap: align codebase to the current upstream (760279cfb2730a585)
Changelist:
  - Replace netmap passthrough host support with a more general
    mechanism to call TXSYNC/RXSYNC from an in-kernel event-loop.
    No kernel threads are used to use this feature: the application
    is required to spawn a thread (or a process) and issue a
    SYNC_KLOOP_START (NIOCCTRL) command in the thread body. The
    kernel loop is executed by the ioctl implementation, which returns
    to userspace only when a different thread calls SYNC_KLOOP_STOP
    or the netmap file descriptor is closed.
  - Update the if_ptnet driver to cope with the new data structures,
    and prune all the obsolete ptnetmap code.
  - Add support for "null" netmap ports, useful to allocate netmap_if,
    netmap_ring and netmap buffers to be used by specialized applications
    (e.g. hypervisors). TXSYNC/RXSYNC on these ports have no effect.
  - Various fixes and code refactoring.

Sponsored by:	Sunny Valley Networks
Differential Revision:	https://reviews.freebsd.org/D18015
2018-12-05 11:57:16 +00:00
Eric van Gyzen
9dae3b521e altq: manual cleanup after r341507
Remove a file that became practically empty.
Fix indentation.

Like r341507, I do not plan to MFC, but anyone else can.
2018-12-04 23:53:42 +00:00
Eric van Gyzen
325fab802e altq: remove ALTQ3_COMPAT code
This code has apparently never compiled on FreeBSD since its
introduction in 2004 (r130365).  It has certainly not compiled
since 2006, when r164033 added #elsif [sic] preprocessor directives.
The code was left in the tree to reduce the diff from upstream (KAME).
Since that upstream is no longer relevant, remove the long-dead code.

This commit is the direct result of:

    unifdef -m -UALTQ3_COMPAT sys/net/altq/*

A later commit will do some manual cleanup.

I do not plan to MFC this.  If that would help you, go for it.
2018-12-04 23:46:43 +00:00
Andrey V. Elsukov
851073551d Adapt the fix in r341008 to correctly work with EBR.
IFNET_RLOCK_NOSLEEP() is epoch_enter_preempt() in FreeBSD 12+. Holding
it in sysctl_rtsock() doesn't protect us from ifnet unlinking, because
unlinking occurs with IFNET_WLOCK(), that is rw_wlock+sx_xlock, and it
doesn check that concurrent code is running in epoch section. But while
we are in epoch section, we should be able to do access to ifnet's
fields, even it was unlinked. Thus do not change if_addr and if_hw_addr
fields in ifnet_detach_internal() to NULL, since rtsock code can do
access to these fields and this is allowed while it is running in epoch
section.

This should fix the race, when ifnet_detach_internal() unlinks ifnet
after we checked it for IFF_DYING in sysctl_dumpentry.

Move free(ifp->if_hw_addr) into ifnet_free_internal(). Also remove the
NULL check for ifp->if_description, since free(9) can correctly handle
NULL pointer.

MFC after:	1 week
2018-11-30 10:36:14 +00:00
Andrew Gallatin
fbec776de0 Use busdma unconditionally in iflib
- Remove the complex mechanism to choose between using busdma
and raw pmap_kextract at runtime.   The reduced complexity makes
the code easier to read and maintain.

- Fix a bug in the small packet receive path where clusters were
repeatedly mapped but never unmapped. We now store the cluster's
bus address and avoid re-mapping the cluster each time a small
packet is received.

This patch fixes bugs I've seen where ixl(4) will not even
respond to ping without seeing DMAR faults.

I see a small improvement (14%) on packet forwarding tests using
a Haswell based Xeon E5-2697 v3.  Olivier sees a small
regression (-3% to -6%) with lower end hardware.

Reviewed by:	mmacy
Not objected to by:	sbruno
MFC after:	8 weeks
Sponsored by:	Netflix, Inc
Differential Revision:		https://reviews.freebsd.org/D17901
2018-11-27 20:01:05 +00:00
Andrey V. Elsukov
a716ad4a35 Fix possible panic during ifnet detach in rtsock.
The panic can happen, when some application does dump of routing table
using sysctl interface. To prevent this, set IFF_DYING flag in
if_detach_internal() function, when ifnet under lock is removed from
the chain. In sysctl_rtsock() take IFNET_RLOCK_NOSLEEP() to prevent
ifnet detach during routes enumeration. In case, if some interface was
detached in the time before we take the lock, add the check, that ifnet
is not DYING. This prevents access to memory that could be freed after
ifnet is unlinked.

PR:		227720, 230498, 233306
Reviewed by:	bz, eugen
MFC after:	1 week
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D18338
2018-11-27 09:04:06 +00:00
Mark Johnston
d25f8522be Plug routing sysctl leaks.
Various structures exported by sysctl_rtsock() contain padding fields
which were not being zeroed.

Reported by:	Thomas Barabosch, Fraunhofer FKIE
Reviewed by:	ae
MFC after:	3 days
Security:	kernel memory disclosure
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18333
2018-11-26 13:42:18 +00:00
Oleg Bulyzhin
cac302483e Unbreak kernel build with VLAN_ARRAY defined.
MFC after:	1 week
2018-11-21 13:34:21 +00:00
Andrey V. Elsukov
ad43bf348b Allow configuration of several ipsec interfaces with the same tunnel
endpoints.

This can be used to configure several IPsec tunnels between two hosts
with different security associations.

Obtained from:	Yandex LLC
MFC after:	2 weeks
Sponsored by:	Yandex LLC
2018-11-16 14:21:57 +00:00
Mike Karels
456e896d6d Fix flags collision causing inability to enable CBQ in ALTQ
The CBQ BORROW flag conflicts with the RMCF_CODEL flag; the
two sets of definitions actually define the same things. The symptom
is that a kernel with CBQ support and not CODEL fails to load a QoS
policy with the obscure error "pfctl: DIOCADDALTQ: Cannot allocate memory."
If ALTQ_DEBUG is enabled, the error becomes a little clearer:
"rmc_newclass: CODEL not configured for CBQ!" is printed by the kernel.
There really shouldn't be two sets of macros that have to be defined
consistently, but the include structure isn't right for exporting
CBQ flags to altq_rmclass.h. Re-align the definitions, and add
CTASSERTs in the kernel to ensure that the definitions are consistent.

PR:		215716
Reviewed by:	pkelsey
MFC after:	2 weeks
Sponsored by:	Forcepoint LLC
Differential Revision:	https://reviews.freebsd.org/D17758
2018-11-16 03:42:29 +00:00
Stephen Hurd
0efb1a464f Clear RX completion queue state veriables in iflib_stop()
iflib_stop() was not resetting the rxq completion queue state variables.
This meant that for any driver that has receive completion queues, after a
reinit, iflib would start asking what's available on the rx side starting at
whatever the completion queue index was prior to the stop, instead of at 0.

Submitted by:	pkelsey
Reported by:	pkelsey
MFC after:	3 days
Sponsored by:	Limelight Networks
2018-11-14 20:36:18 +00:00
Stephen Hurd
8d4ceb9cc4 Prevent POLA violation with TSO/CSUM offload
Ensure that any time CSUM_IP_TSO or CSUM_IP6_TSO is set that the corresponding
CSUM_IP6?_TCP / CSUM_IP flags are also set.

Rather than requireing drivers to bake-in an understanding that TSO implies
checksum offloads, make it explicit.

This change requires us to move the IFLIB_NEED_ZERO_CSUM implementation to
ensure it's zeroed for TSO.

Reported by:	Jacob Keller <jacob.e.keller@intel.com>
MFC after:	1 week
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D17801
2018-11-14 15:23:39 +00:00
Stephen Hurd
4d261ce278 Fix leaks caused by ifc_nhwtxqs never being initialized
r333502 removed initialization of ifc_nhwtxqs, and it's not clear
there's a need to copy it into the struct iflib_ctx at all. Use
ctx->ifc_sctx->isc_ntxqs instead.

Further, iflib_stop() did not clear the last ring in the case where
isc_nfl != isc_nrxqs (such as when IFLIB_HAS_RXCQ is set). Use
ctx->ifc_sctx->isc_nrxqs here instead of isc_nfl.

Reported by:	pkelsey
Reviewed by:	pkelsey
MFC after:	3 days
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D17979
2018-11-14 15:16:45 +00:00
Gleb Smirnoff
b79aa45e0e For compatibility KPI functions like if_addr_rlock() that used to have
mutexes but now are converted to epoch(9) use thread-private epoch_tracker.
Embedding tracker into ifnet(9) or ifnet derived structures creates a non
reentrable function, that will fail miserably if called simultaneously from
two different contexts.
A thread private tracker will provide a single tracker that would allow to
call these functions safely. It doesn't allow nested call, but this is not
expected from compatibility KPIs.

Reviewed by:	markj
2018-11-13 22:58:38 +00:00
Stephen Hurd
a42546df88 Fix rxcsum issue introduced in r338838
r338838 attempted to fix issues with rxcsum and rxcsum6.
However, the rxcsum bits were set as though if_setcapenablebit() was
being called, not if_togglecapenable() which is in use. As a result,
it was not possible to disable rxcsum when rxcsum6 was supported.

PR:		233004
Reported by:	lev
Reviewed by:	lev
MFC after:	3 days
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D17881
2018-11-07 19:31:48 +00:00
Kristof Provost
fbbf436d56 pfsync: Handle syncdev going away
If the syncdev is removed we no longer need to clean up the multicast
entry we've got set up for that device.

Pass the ifnet detach event through pf to pfsync, and remove our
multicast handle, and mark us as no longer having a syncdev.

Note that this callback is always installed, even if the pfsync
interface is disabled (and thus it's not a per-vnet callback pointer).

MFC after:	2 weeks
Sponsored by:	Orange Business Services
Differential Revision:	https://reviews.freebsd.org/D17502
2018-11-02 16:57:23 +00:00
Kristof Provost
25c6ab1b78 Notify that the ifnet will go away, even on vnet shutdown
pf subscribes to ifnet_departure_event events, so it can clean up the
ifg_pf_kif and if_pf_kif pointers in the ifnet.
During vnet shutdown interfaces could go away without sending the event,
so pf ends up cleaning these up as part of its shutdown sequence, which
happens after the ifnet has already been freed.

Send the ifnet_departure_event during vnet shutdown, allowing pf to
clean up correctly.

MFC after:	2 weeks
Sponsored by:	Orange Business Services
Differential Revision:	https://reviews.freebsd.org/D17500
2018-11-02 16:50:17 +00:00
Kristof Provost
5f6cf24e2d pfsync: Make pfsync callbacks per-vnet
The callbacks are installed and removed depending on the state of the
pfsync device, which is per-vnet. The callbacks must also be per-vnet.

MFC after:	2 weeks
Sponsored by:	Orange Business Services
Differential Revision:	https://reviews.freebsd.org/D17499
2018-11-02 16:47:07 +00:00
Kristof Provost
790194cd47 pf: Limit the fragment entry queue length to 64 per bucket.
So we have a global limit of 1024 fragments, but it is fine grained to
the region of the packet.  Smaller packets may have less fragments.
This costs another 16 bytes of memory per reassembly and devides the
worst case for searching by 8.

Obtained from:	OpenBSD
Differential Revision:	https://reviews.freebsd.org/D17734
2018-11-02 15:32:04 +00:00
Kristof Provost
fd2ea405e6 pf: Split the fragment reassembly queue into smaller parts
Remember 16 entry points based on the fragment offset.  Instead of
a worst case of 8196 list traversals we now check a maximum of 512
list entries or 16 array elements.

Obtained from:	OpenBSD
Differential Revision:	https://reviews.freebsd.org/D17733
2018-11-02 15:26:51 +00:00
Bjoern A. Zeeb
c955c6cd08 With more excessive use of modules, more kernel parts working with
VIMAGE, and feature richness and global state increasing the 8k of
vnet module space are no longer sufficient for people and loading
multiple modules, e.g., pf(4) and ipl(4) or ipsec(4) will fail on
the second module.

Increase the module space to 8 * PAGE_SIZE which should be enough
to hold multiple firewalls, ipsec, multicast (as in the old days was
a problem), epair, carp, and any kind of other vnet enabled modules.

Sadly this is a global byte array part of the vnet_set, so we cannot
dynamically change its size;  otherwise a TUNABLE would have been
a better solution.

PR:			228854
Reported by:		Ernie Luzar, Marek Zarychta
Discussed with:		rgrimes on current
MFC after:		3 days
2018-10-30 20:45:15 +00:00