Commit Graph

183 Commits

Author SHA1 Message Date
Mike Karels
2510235150 Allow TCP to reuse local port with different destinations
Previously, tcp_connect() would bind a local port before connecting,
forcing the local port to be unique across all outgoing TCP connections
for the address family. Instead, choose a local port after selecting
the destination and the local address, requiring only that the tuple
is unique and does not match a wildcard binding.

Reviewed by:	tuexen (rscheff, rrs previous version)
MFC after:	1 month
Sponsored by:	Forcepoint LLC
Differential Revision:	https://reviews.freebsd.org/D24781
2020-05-18 22:53:12 +00:00
Alexander V. Chernikov
983066f05b Convert route caching to nexthop caching.
This change is build on top of nexthop objects introduced in r359823.

Nexthops are separate datastructures, containing all necessary information
 to perform packet forwarding such as gateway interface and mtu. Nexthops
 are shared among the routes, providing more pre-computed cache-efficient
 data while requiring less memory. Splitting the LPM code and the attached
 data solves multiple long-standing problems in the routing layer,
 drastically reduces the coupling with outher parts of the stack and allows
 to transparently introduce faster lookup algorithms.

Route caching was (re)introduced to minimise (slow) routing lookups, allowing
 for notably better performance for large TCP senders. Caching works by
 acquiring rtentry reference, which is protected by per-rtentry mutex.
 If the routing table is changed (checked by comparing the rtable generation id)
 or link goes down, cache record gets withdrawn.

Nexthops have the same reference counting interface, backed by refcount(9).
This change merely replaces rtentry with the actual forwarding nextop as a
 cached object, which is mostly mechanical. Other moving parts like cache
 cleanup on rtable change remains the same.

Differential Revision:	https://reviews.freebsd.org/D24340
2020-04-25 09:06:11 +00:00
Michael Tuexen
fe1274ee39 Fix race when accepting TCP connections.
When expanding a SYN-cache entry to a socket/inp a two step approach was
taken:
1) The local address was filled in, then the inp was added to the hash
   table.
2) The remote address was filled in and the inp was relocated in the
   hash table.
Before the epoch changes, a write lock was held when this happens and
the code looking up entries was holding a corresponding read lock.
Since the read lock is gone away after the introduction of the
epochs, the half populated inp was found during lookup.
This resulted in processing TCP segments in the context of the wrong
TCP connection.
This patch changes the above procedure in a way that the inp is fully
populated before inserted into the hash table.

Thanks to Paul <devgs@ukr.net> for reporting the issue on the net@
mailing list and for testing the patch!

Reviewed by:		rrs@
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D22971
2020-01-12 17:52:32 +00:00
Gleb Smirnoff
c17cd08f53 It is unclear why in6_pcblookup_local() would require write access
to the PCB hash.  The function doesn't modify the hash. It always
asserted write lock historically, but with epoch conversion this
fails in some special cases.

Reviewed by:	rwatson, bz
Reported-by:	syzbot+0b0488ca537e20cb2429@syzkaller.appspotmail.com
2019-11-11 06:28:25 +00:00
Gleb Smirnoff
d797164a86 Since r353292 on input path we are always in network epoch, when
we lookup PCBs.  Thus, do not enter epoch recursively in
in_pcblookup_hash() and in6_pcblookup_hash().  Same applies to
tcp_ctlinput() and tcp6_ctlinput().

This leaves several sysctl(9) handlers that return PCB credentials
unprotected.  Add epoch enter/exit to all of them.

Differential Revision:	https://reviews.freebsd.org/D22197
2019-11-07 20:49:56 +00:00
Bjoern A. Zeeb
0ecd976e80 IPv6 cleanup: kernel
Finish what was started a few years ago and harmonize IPv6 and IPv4
kernel names.  We are down to very few places now that it is feasible
to do the change for everything remaining with causing too much disturbance.

Remove "aliases" for IPv6 names which confusingly could indicate
that we are talking about a different data structure or field or
have two fields, one for each address family.
Try to follow common conventions used in FreeBSD.

* Rename sin6p to sin6 as that is how it is spelt in most places.
* Remove "aliases" (#defines) for:
  - in6pcb which really is an inpcb and nothing separate
  - sotoin6pcb which is sotoinpcb (as per above)
  - in6p_sp which is inp_sp
  - in6p_flowinfo which is inp_flow
* Try to use ia6 for in6_addr rather than in6p.
* With all these gone  also rename the in6p variables to inp as
  that is what we call it in most of the network stack including
  parts of netinet6.

The reasons behind this cleanup are that we try to further
unify netinet and netinet6 code where possible and that people
will less ignore one or the other protocol family when doing
code changes as they may not have spotted places due to different
names for the same thing.

No functional changes.

Discussed with:		tuexen (SCTP changes)
MFC after:		3 months
Sponsored by:		Netflix
2019-08-02 07:41:36 +00:00
Hans Petter Selasky
59854ecf55 Convert all IPv4 and IPv6 multicast memberships into using a STAILQ
instead of a linear array.

The multicast memberships for the inpcb structure are protected by a
non-sleepable lock, INP_WLOCK(), which needs to be dropped when
calling the underlying possibly sleeping if_ioctl() method. When using
a linear array to keep track of multicast memberships, the computed
memory location of the multicast filter may suddenly change, due to
concurrent insertion or removal of elements in the linear array. This
in turn leads to various invalid memory access issues and kernel
panics.

To avoid this problem, put all multicast memberships on a STAILQ based
list. Then the memory location of the IPv4 and IPv6 multicast filters
become fixed during their lifetime and use after free and memory leak
issues are easier to track, for example by: vmstat -m | grep multi

All list manipulation has been factored into inline functions
including some macros, to easily allow for a future hash-list
implementation, if needed.

This patch has been tested by pho@ .

Differential Revision: https://reviews.freebsd.org/D20080
Reviewed by:	markj @
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2019-06-25 11:54:41 +00:00
Gleb Smirnoff
a68cc38879 Mechanical cleanup of epoch(9) usage in network stack.
- Remove macros that covertly create epoch_tracker on thread stack. Such
  macros a quite unsafe, e.g. will produce a buggy code if same macro is
  used in embedded scopes. Explicitly declare epoch_tracker always.

- Unmask interface list IFNET_RLOCK_NOSLEEP(), interface address list
  IF_ADDR_RLOCK() and interface AF specific data IF_AFDATA_RLOCK() read
  locking macros to what they actually are - the net_epoch.
  Keeping them as is is very misleading. They all are named FOO_RLOCK(),
  while they no longer have lock semantics. Now they allow recursion and
  what's more important they now no longer guarantee protection against
  their companion WLOCK macros.
  Note: INP_HASH_RLOCK() has same problems, but not touched by this commit.

This is non functional mechanical change. The only functionally changed
functions are ni6_addrs() and ni6_store_addrs(), where we no longer enter
epoch recursively.

Discussed with:	jtl, gallatin
2019-01-09 01:11:19 +00:00
Mateusz Guzik
cc426dd319 Remove unused argument to priv_check_cred.
Patch mostly generated with cocinnelle:

@@
expression E1,E2;
@@

- priv_check_cred(E1,E2,0)
+ priv_check_cred(E1,E2)

Sponsored by:	The FreeBSD Foundation
2018-12-11 19:32:16 +00:00
Mark Johnston
9d2877fc3d Clamp the INPCB port hash tables to IPPORT_MAX + 1 chains.
Memory beyond that limit was previously unused, wasting roughly 1MB per
8GB of RAM.  Also retire INP_PCBLBGROUP_PORTHASH, which was identical to
INP_PCBPORTHASH.

Reviewed by:	glebius
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D17803
2018-12-05 17:06:00 +00:00
Mark Johnston
d9ff5789be Remove redundant checks for a NULL lbgroup table.
No functional change intended.

MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17108
2018-11-01 15:52:49 +00:00
Mark Johnston
d3a4b0dabc Fix style bugs in in6_pcblookup_lbgroup().
This should have been a part of r338470.  No functional changes
intended.

Reported by:	gallatin
Reviewed by:	gallatin, Johannes Lundberg <johalun0@gmail.com>
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17109
2018-10-22 16:09:01 +00:00
Bjoern A. Zeeb
e15e0e3e4d In in6_pcbpurgeif0() called, e.g., from if_clone_destroy(),
once we have a lock, make sure the inp is not marked freed.
This can happen since the list traversal and locking was
converted to epoch(9).  If the inp is marked "freed", skip it.

This prevents a NULL pointer deref panic later on.

Reported by:	slavash (Mellanox)
Tested by:	slavash (Mellanox)
Reviewed by:	markj (no formal review but caught my unlock mistake)
Approved by:	re (kib)
2018-09-27 15:32:37 +00:00
Mark Johnston
54af3d0dac Fix synchronization of LB group access.
Lookups are protected by an epoch section, so the LB group linkage must
be a CK_LIST rather than a plain LIST.  Furthermore, we were not
deferring LB group frees, so in_pcbremlbgrouphash() could race with
readers and cause a use-after-free.

Reviewed by:	sbruno, Johannes Lundberg <johalun0@gmail.com>
Tested by:	gallatin
Approved by:	re (gjb)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17031
2018-09-10 19:00:29 +00:00
Bjoern A. Zeeb
ec86402ecd Replicate r328271 from legacy IP to IPv6 using a single macro
to clear L2 and L3 route caches.
Also mark one function argument as __unused.

Reviewed by:	karels, ae
Approved by:	re (rgrimes)
Differential Revision:	https://reviews.freebsd.org/D17007
2018-09-03 22:27:27 +00:00
Bjoern A. Zeeb
f6aeb1eee5 Replicate r307234 from legacy IP to IPv6 code, using the RO_RTFREE()
macro rather than hand crafted code.
No functional changes.

Reviewed by:	karels
Approved by:	re (rgrimes)
Differential Revision:	https://reviews.freebsd.org/D17006
2018-09-03 22:14:37 +00:00
Bjoern A. Zeeb
bc11a8829e As discussed in D6262 post-commit review, change inp_route to
inp_route6 for IPv6 code after r301217.
This was most likely a c&p error from the legacy IP code, which
did not matter as it is a union and both structures have the same
layout at the beginning.
No functional changes.

Reviewed by:	karels, ae
Approved by:	re (rgrimes)
Differential Revision:	https://reviews.freebsd.org/D17005
2018-09-03 22:12:48 +00:00
Andrew Gallatin
4b82a7b62f Reject IPv4 SO_REUSEPORT_LB groups when looking up an IPv6 listening socket
Similar to how the IPv4 code will reject an IPv6 LB group,
we must ignore IPv4 LB groups when looking up an IPv6
listening socket.   If this is not done, a port only match
may return an IPv4 socket, which causes problems (like
sending IPv6 packets with a hopcount of 0, making them unrouteable).

Thanks to rrs for all the work to diagnose this.

Approved by:	re (rgrimes)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D16899
2018-08-27 18:13:20 +00:00
Matt Macy
f25b23cf89 in6_pcblookup_hash: validate inp for liveness 2018-07-01 01:01:59 +00:00
Matt Macy
feeef8509b Fix PCBGROUPS build post CK conversion of pcbinfo 2018-06-13 23:19:54 +00:00
Matt Macy
b872626dbe mechanical CK macro conversion of inpcbinfo lists
This is a dependency for converting the inpcbinfo hash and info rlocks
to epoch.
2018-06-12 22:18:20 +00:00
Sean Bruno
1a43cff92a Load balance sockets with new SO_REUSEPORT_LB option.
This patch adds a new socket option, SO_REUSEPORT_LB, which allow multiple
programs or threads to bind to the same port and incoming connections will be
load balanced using a hash function.

Most of the code was copied from a similar patch for DragonflyBSD.

However, in DragonflyBSD, load balancing is a global on/off setting and can not
be set per socket. This patch allows for simultaneous use of both the current
SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system.

Required changes to structures:
Globally change so_options from 16 to 32 bit value to allow for more options.
Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets.

Limitations:
As DragonflyBSD, a load balance group is limited to 256 pcbs (256 programs or
threads sharing the same socket).

This is a substantially different contribution as compared to its original
incarnation at svn r332894 and reverted at svn r332967.  Thanks to rwatson@
for the substantive feedback that is included in this commit.

Submitted by:	Johannes Lundberg <johalun0@gmail.com>
Obtained from:	DragonflyBSD
Relnotes:	Yes
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D11003
2018-06-06 15:45:57 +00:00
Matt Macy
4f6c66cc9c UDP: further performance improvements on tx
Cumulative throughput while running 64
  netperf -H $DUT -t UDP_STREAM -- -m 1
on a 2x8x2 SKL went from 1.1Mpps to 2.5Mpps

Single stream throughput increases from 910kpps to 1.18Mpps

Baseline:
https://people.freebsd.org/~mmacy/2018.05.11/udpsender2.svg

- Protect read access to global ifnet list with epoch
https://people.freebsd.org/~mmacy/2018.05.11/udpsender3.svg

- Protect short lived ifaddr references with epoch
https://people.freebsd.org/~mmacy/2018.05.11/udpsender4.svg

- Convert if_afdata read lock path to epoch
https://people.freebsd.org/~mmacy/2018.05.11/udpsender5.svg

A fix for the inpcbhash contention is pending sufficient time
on a canary at LLNW.

Reviewed by:	gallatin
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D15409
2018-05-23 21:02:14 +00:00
Matt Macy
d7c5a620e2 ifnet: Replace if_addr_lock rwlock with epoch + mutex
Run on LLNW canaries and tested by pho@

gallatin:
Using a 14-core, 28-HTT single socket E5-2697 v3 with a 40GbE MLX5
based ConnectX 4-LX NIC, I see an almost 12% improvement in received
packet rate, and a larger improvement in bytes delivered all the way
to userspace.

When the host receiving 64 streams of netperf -H $DUT -t UDP_STREAM -- -m 1,
I see, using nstat -I mce0 1 before the patch:

InMpps OMpps  InGbs  OGbs err TCP Est %CPU syscalls csw     irq GBfree
4.98   0.00   4.42   0.00 4235592     33   83.80 4720653 2149771   1235 247.32
4.73   0.00   4.20   0.00 4025260     33   82.99 4724900 2139833   1204 247.32
4.72   0.00   4.20   0.00 4035252     33   82.14 4719162 2132023   1264 247.32
4.71   0.00   4.21   0.00 4073206     33   83.68 4744973 2123317   1347 247.32
4.72   0.00   4.21   0.00 4061118     33   80.82 4713615 2188091   1490 247.32
4.72   0.00   4.21   0.00 4051675     33   85.29 4727399 2109011   1205 247.32
4.73   0.00   4.21   0.00 4039056     33   84.65 4724735 2102603   1053 247.32

After the patch

InMpps OMpps  InGbs  OGbs err TCP Est %CPU syscalls csw     irq GBfree
5.43   0.00   4.20   0.00 3313143     33   84.96 5434214 1900162   2656 245.51
5.43   0.00   4.20   0.00 3308527     33   85.24 5439695 1809382   2521 245.51
5.42   0.00   4.19   0.00 3316778     33   87.54 5416028 1805835   2256 245.51
5.42   0.00   4.19   0.00 3317673     33   90.44 5426044 1763056   2332 245.51
5.42   0.00   4.19   0.00 3314839     33   88.11 5435732 1792218   2499 245.52
5.44   0.00   4.19   0.00 3293228     33   91.84 5426301 1668597   2121 245.52

Similarly, netperf reports 230Mb/s before the patch, and 270Mb/s after the patch

Reviewed by:	gallatin
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D15366
2018-05-18 20:13:34 +00:00
Stephen Hurd
f3e1324b41 Separate list manipulation locking from state change in multicast
Multicast incorrectly calls in to drivers with a mutex held causing drivers
to have to go through all manner of contortions to use a non sleepable lock.
Serialize multicast updates instead.

Submitted by:	mmacy <mmacy@mattmacy.io>
Reviewed by:	shurd, sbruno
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D14969
2018-05-02 19:36:29 +00:00
Sean Bruno
7875017ca9 Revert r332894 at the request of the submitter.
Submitted by:	Johannes Lundberg <johalun0_gmail.com>
Sponsored by:	Limelight Networks
2018-04-24 19:55:12 +00:00
Sean Bruno
7b7796eea5 Load balance sockets with new SO_REUSEPORT_LB option
This patch adds a new socket option, SO_REUSEPORT_LB, which allow multiple
programs or threads to bind to the same port and incoming connections will be
load balanced using a hash function.

Most of the code was copied from a similar patch for DragonflyBSD.

However, in DragonflyBSD, load balancing is a global on/off setting and can not
be set per socket. This patch allows for simultaneous use of both the current
SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system.

Required changes to structures
Globally change so_options from 16 to 32 bit value to allow for more options.
Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets.

Limitations
As DragonflyBSD, a load balance group is limited to 256 pcbs
(256 programs or threads sharing the same socket).

Submitted by:	Johannes Lundberg <johanlun0@gmail.com>
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D11003
2018-04-23 19:51:00 +00:00
Jonathan T. Looney
7fb2986ff6 If the INP lock is uncontested, avoid taking a reference and jumping
through the lock-switching hoops.

A few of the INP lookup operations that lock INPs after the lookup do
so using this mechanism (to maintain lock ordering):

1. Lock lookup structure.
2. Find INP.
3. Acquire reference on INP.
4. Drop lock on lookup structure.
5. Acquire INP lock.
6. Drop reference on INP.

This change provides a slightly shorter path for cases where the INP
lock is uncontested:

1. Lock lookup structure.
2. Find INP.
3. Try to acquire the INP lock.
4. If successful, drop lock on lookup structure.

Of course, if the INP lock is contested, the functions will need to
revert to the previous way of switching locks safely.

This saves a few atomic operations when the INP lock is uncontested.

Discussed with:	gallatin, rrs, rwatson
MFC after:	2 weeks
Sponsored by:	Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D12911
2018-03-21 15:54:46 +00:00
Pedro F. Giffuni
51369649b0 sys: further adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
2017-11-20 19:43:44 +00:00
Ed Maste
3e85b721d6 Remove register keyword from sys/ and ANSIfy prototypes
A long long time ago the register keyword told the compiler to store
the corresponding variable in a CPU register, but it is not relevant
for any compiler used in the FreeBSD world today.

ANSIfy related prototypes while here.

Reviewed by:	cem, jhb
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D10193
2017-05-17 00:34:34 +00:00
Ermal Luçi
dce33a45c9 The patch provides the same socket option as Linux IP_ORIGDSTADDR.
Unfortunately they will have different integer value due to Linux value being already assigned in FreeBSD.

The patch is similar to IP_RECVDSTADDR but also provides the destination port value to the application.

This allows/improves implementation of transparent proxies on UDP sockets due to having the whole information on forwarded packets.

Reviewed by:	adrian, aw
Approved by:	ae (mentor)
Sponsored by:	rsync.net
Differential Revision:	D9235
2017-03-06 04:01:58 +00:00
Warner Losh
fbbd9655e5 Renumber copyright clause 4
Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.

Submitted by:	Jan Schaumann <jschauma@stevens.edu>
Pull Request:	https://github.com/freebsd/freebsd/pull/96
2017-02-28 23:42:47 +00:00
Ermal Luçi
c10c5b1eba Committed without approval from mentor.
Reported by:	gnn
2017-02-12 06:56:33 +00:00
Ermal Luçi
4616026faf Revert r313527
Heh svn is not git
2017-02-10 05:58:16 +00:00
Ermal Luçi
c0fadfdbbf Correct missed variable name.
Reported-by: ohartmann@walstatt.org
2017-02-10 05:51:39 +00:00
Ermal Luçi
ed55edceef The patch provides the same socket option as Linux IP_ORIGDSTADDR.
Unfortunately they will have different integer value due to Linux value being already assigned in FreeBSD.

The patch is similar to IP_RECVDSTADDR but also provides the destination port value to the application.

This allows/improves implementation of transparent proxies on UDP sockets due to having the whole information on forwarded packets.

Sponsored-by: rsync.net
Differential Revision: D9235
Reviewed-by: adrian
2017-02-10 05:16:14 +00:00
George V. Neville-Neil
6d76822688 This change re-adds L2 caching for TCP and UDP, as originally added in D4306
but removed due to other changes in the system. Restore the llentry pointer
to the "struct route", and use it to cache the L2 lookup (ARP or ND6) as
appropriate.

Submitted by:	Mike Karels
Differential Revision:	https://reviews.freebsd.org/D6262
2016-06-02 17:51:29 +00:00
George V. Neville-Neil
84cc0778d0 FreeBSD previously provided route caching for TCP (and UDP). Re-add
route caching for TCP, with some improvements. In particular, invalidate
the route cache if a new route is added, which might be a better match.
The cache is automatically invalidated if the old route is deleted.

Submitted by:	Mike Karels
Reviewed by:	gnn
Differential Revision:	https://reviews.freebsd.org/D4306
2016-03-24 07:54:56 +00:00
Alexander V. Chernikov
601c0b8bcc Split in6_selectsrc() into in6_selectsrc_addr() and in6_selectsrc_socket().
in6_selectsrc() has 2 class of users: socket-based one (raw/udp/pcb/etc) and
  socket-less (ND code). The main reason for that change is inability to
  specify non-default FIB for callers w/o socket since (internally) inpcb
  is used to determine fib.

As as result, add 2 wrappers for in6_selectsrc() (making in6_selectsrc()
  static):
1) in6_selectsrc_socket() for the former class. Embed scope_ambiguous check
  along with returning hop limit when needed.
2) in6_selectsrc_addr() for the latter case. Add 'fibnum' argument and
  pass IPv6 address  w/ explicitly specified scope as separate argument.

Reviewed by:	ae (previous version)
2016-01-10 13:40:29 +00:00
Alexander V. Chernikov
357ce739b9 Remove 'struct route_int6' argument from in6_selectsrc() and
in6_selectif().

The main task of in6_selectsrc() is to return IPv6 SAS (along with
  output interface used for scope checks). No data-path code uses
  route argument for caching. The only users are icmp6 (reflect code),
  ND6 ns/na generation code. All this fucntions are control-plane, so
  there is no reason to try to 'optimize' something by passing cached
  route into to ip6_output(). Given that, simplify code by eliminating
  in6_selectsrc() 'struct route_in6' argument. Since in6_selectif() is
  used only by in6_selectsrc(), eliminate its 'struct route_in6' argument,
  too. While here, reshape rte-related code inside in6_selectif() to
  free lookup result immediately after saving all the needed fields.
2016-01-03 10:43:23 +00:00
Julien Charbon
ff9b006d61 Decompose TCP INP_INFO lock to increase short-lived TCP connections scalability:
- The existing TCP INP_INFO lock continues to protect the global inpcb list
  stability during full list traversal (e.g. tcp_pcblist()).

- A new INP_LIST lock protects inpcb list actual modifications (inp allocation
  and free) and inpcb global counters.

It allows to use TCP INP_INFO_RLOCK lock in critical paths (e.g. tcp_input())
and INP_INFO_WLOCK only in occasional operations that walk all connections.

PR:			183659
Differential Revision:	https://reviews.freebsd.org/D2599
Reviewed by:		jhb, adrian
Tested by:		adrian, nitroboost-gmail.com
Sponsored by:		Verisign, Inc.
2015-08-03 12:13:54 +00:00
Andrey V. Elsukov
fd8dd3a6d7 tcp6_ctlinput() doesn't pass MTU value to in6_pcbnotify().
Check cmdarg isn't NULL before dereference, this check was in the
ip6_notify_pmtu() before r279588.

Reported by:	Florian Smeets
MFC after:	1 week
2015-03-06 05:50:39 +00:00
Andrey V. Elsukov
8f1beb889e Fix deadlock in IPv6 PCB code.
When several threads are trying to send datagram to the same destination,
but fragmentation is disabled and datagram size exceeds link MTU,
ip6_output() calls pfctlinput2(PRC_MSGSIZE). It does notify all
sockets wanted to know MTU to this destination. And since all threads
hold PCB lock while sending, taking the lock for each PCB in the
in6_pcbnotify() leads to deadlock.

RFC 3542 p.11.3 suggests notify all application wanted to receive
IPV6_PATHMTU ancillary data for each ICMPv6 packet too big message.
But it doesn't require this, when we don't receive ICMPv6 message.

Change ip6_notify_pmtu() function to be able use it directly from
ip6_output() to notify only one socket, and to notify all sockets
when ICMPv6 packet too big message received.

PR:		197059
Differential Revision:	https://reviews.freebsd.org/D1949
Reviewed by:	no objection from #network
Obtained from:	Yandex LLC
MFC after:	1 week
Sponsored by:	Yandex LLC
2015-03-04 11:20:01 +00:00
Hans Petter Selasky
c25290420e Start process of removing the use of the deprecated "M_FLOWID" flag
from the FreeBSD network code. The flag is still kept around in the
"sys/mbuf.h" header file, but does no longer have any users. Instead
the "m_pkthdr.rsstype" field in the mbuf structure is now used to
decide the meaning of the "m_pkthdr.flowid" field. To modify the
"m_pkthdr.rsstype" field please use the existing "M_HASHTYPE_XXX"
macros as defined in the "sys/mbuf.h" header file.

This patch introduces new behaviour in the transmit direction.
Previously network drivers checked if "M_FLOWID" was set in "m_flags"
before using the "m_pkthdr.flowid" field. This check has now now been
replaced by checking if "M_HASHTYPE_GET(m)" is different from
"M_HASHTYPE_NONE". In the future more hashtypes will be added, for
example hashtypes for hardware dedicated flows.

"M_HASHTYPE_OPAQUE" indicates that the "m_pkthdr.flowid" value is
valid and has no particular type. This change removes the need for an
"if" statement in TCP transmit code checking for the presence of a
valid flowid value. The "if" statement mentioned above is now a direct
variable assignment which is then later checked by the respective
network drivers like before.

Additional notes:
- The SCTP code changes will be committed as a separate patch.
- Removal of the "M_FLOWID" flag will also be done separately.
- The FreeBSD version has been bumped.

MFC after:	1 month
Sponsored by:	Mellanox Technologies
2014-12-01 11:45:24 +00:00
Alexander V. Chernikov
603eaf792b Renove faith(4) and faithd(8) from base. It looks like industry
have chosen different (and more traditional) stateless/statuful
NAT64 as translation mechanism. Last non-trivial commits to both
faith(4) and faithd(8) happened more than 12 years ago, so I assume
it is time to drop RFC3142 in FreeBSD.

No objections from:	net@
2014-11-09 21:33:01 +00:00
Andrey V. Elsukov
a7e201bbac Make in6_pcblookup_hash_locked and in6_pcbladdr static.
Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2014-09-10 13:17:35 +00:00
Andrey V. Elsukov
1b44e5ffe3 Introduce INP6_PCBHASHKEY macro. Replace usage of hardcoded part of
IPv6 address as hash key in all places.

Obtained from:	Yandex LLC
2014-09-10 12:35:42 +00:00
Adrian Chadd
c7c0d94874 Add IPv6 flowid, bindmulti and RSS awareness. 2014-07-12 05:46:33 +00:00
Robert Watson
7527624efa Several years after initial development, merge prototype support for
linking NIC Receive Side Scaling (RSS) to the network stack's
connection-group implementation.  This prototype (and derived patches)
are in use at Juniper and several other FreeBSD-using companies, so
despite some reservations about its maturity, merge the patch to the
base tree so that it can be iteratively refined in collaboration rather
than maintained as a set of gradually diverging patch sets.

(1) Merge a software implementation of the Toeplitz hash specified in
    RSS implemented by David Malone.  This is used to allow suitable
    pcbgroup placement of connections before the first packet is
    received from the NIC.  Software hashing is generally avoided,
    however, due to high cost of the hash on general-purpose CPUs.

(2) In in_rss.c, maintain authoritative versions of RSS state intended
    to be pushed to each NIC, including keying material, hash
    algorithm/ configuration, and buckets.  Provide software-facing
    interfaces to hash 2- and 4-tuples for IPv4 and IPv6 using both
    the RSS standardised Toeplitz and a 'naive' variation with a hash
    efficient in software but with poor distribution properties.
    Implement rss_m2cpuid()to be used by netisr and other load
    balancing code to look up the CPU on which an mbuf should be
    processed.

(3) In the Ethernet link layer, allow netisr distribution using RSS as
    a source of policy as an alternative to source ordering; continue
    to default to direct dispatch (i.e., don't try and requeue packets
    for processing on the 'right' CPU if they arrive in a directly
    dispatchable context).

(4) Allow RSS to control tuning of connection groups in order to align
    groups with RSS buckets.  If a packet arrives on a protocol using
    connection groups, and contains a suitable hardware-generated
    hash, use that hash value to select the connection group for pcb
    lookup for both IPv4 and IPv6.  If no hardware-generated Toeplitz
    hash is available, we fall back on regular PCB lookup risking
    contention rather than pay the cost of Toeplitz in software --
    this is a less scalable but, at my last measurement, faster
    approach.  As core counts go up, we may want to revise this
    strategy despite CPU overhead.

Where device drivers suitably configure NICs, and connection groups /
RSS are enabled, this should avoid both lock and line contention during
connection lookup for TCP.  This commit does not modify any device
drivers to tune device RSS configuration to the global RSS
configuration; patches are in circulation to do this for at least
Chelsio T3 and Intel 1G/10G drivers.  Currently, the KPI for device
drivers is not particularly robust, nor aware of more advanced features
such as runtime reconfiguration/rebalancing.  This will hopefully prove
a useful starting point for refinement.

No MFC is scheduled as we will first want to nail down a more mature
and maintainable KPI/KBI for device drivers.

Sponsored by:   Juniper Networks (original work)
Sponsored by:   EMC/Isilon (patch update and merge)
2014-03-15 00:57:50 +00:00
Gleb Smirnoff
76039bc84f The r48589 promised to remove implicit inclusion of if_var.h soon. Prepare
to this event, adding if_var.h to files that do need it. Also, include
all includes that now are included due to implicit pollution via if_var.h

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-26 17:58:36 +00:00