Commit Graph

228 Commits

Author SHA1 Message Date
Ermal Luçi
c0fadfdbbf Correct missed variable name.
Reported-by: ohartmann@walstatt.org
2017-02-10 05:51:39 +00:00
Ermal Luçi
ed55edceef The patch provides the same socket option as Linux IP_ORIGDSTADDR.
Unfortunately they will have different integer value due to Linux value being already assigned in FreeBSD.

The patch is similar to IP_RECVDSTADDR but also provides the destination port value to the application.

This allows/improves implementation of transparent proxies on UDP sockets due to having the whole information on forwarded packets.

Sponsored-by: rsync.net
Differential Revision: D9235
Reviewed-by: adrian
2017-02-10 05:16:14 +00:00
Hans Petter Selasky
f3e7afe2d7 Implement kernel support for hardware rate limited sockets.
- Add RATELIMIT kernel configuration keyword which must be set to
enable the new functionality.

- Add support for hardware driven, Receive Side Scaling, RSS aware, rate
limited sendqueues and expose the functionality through the already
established SO_MAX_PACING_RATE setsockopt(). The API support rates in
the range from 1 to 4Gbytes/s which are suitable for regular TCP and
UDP streams. The setsockopt(2) manual page has been updated.

- Add rate limit function callback API to "struct ifnet" which supports
the following operations: if_snd_tag_alloc(), if_snd_tag_modify(),
if_snd_tag_query() and if_snd_tag_free().

- Add support to ifconfig to view, set and clear the IFCAP_TXRTLMT
flag, which tells if a network driver supports rate limiting or not.

- This patch also adds support for rate limiting through VLAN and LAGG
intermediate network devices.

- How rate limiting works:

1) The userspace application calls setsockopt() after accepting or
making a new connection to set the rate which is then stored in the
socket structure in the kernel. Later on when packets are transmitted
a check is made in the transmit path for rate changes. A rate change
implies a non-blocking ifp->if_snd_tag_alloc() call will be made to the
destination network interface, which then sets up a custom sendqueue
with the given rate limitation parameter. A "struct m_snd_tag" pointer is
returned which serves as a "snd_tag" hint in the m_pkthdr for the
subsequently transmitted mbufs.

2) When the network driver sees the "m->m_pkthdr.snd_tag" different
from NULL, it will move the packets into a designated rate limited sendqueue
given by the snd_tag pointer. It is up to the individual drivers how the rate
limited traffic will be rate limited.

3) Route changes are detected by the NIC drivers in the ifp->if_transmit()
routine when the ifnet pointer in the incoming snd_tag mismatches the
one of the network interface. The network adapter frees the mbuf and
returns EAGAIN which causes the ip_output() to release and clear the send
tag. Upon next ip_output() a new "snd_tag" will be tried allocated.

4) When the PCB is detached the custom sendqueue will be released by a
non-blocking ifp->if_snd_tag_free() call to the currently bound network
interface.

Reviewed by:		wblock (manpages), adrian, gallatin, scottl (network)
Differential Revision:	https://reviews.freebsd.org/D3687
Sponsored by:		Mellanox Technologies
MFC after:		3 months
2017-01-18 13:31:17 +00:00
Navdeep Parhar
f22bfc72f8 Add spares to struct ifnet and socket for packet pacing and/or general
use.  Update comments regarding the spare fields in struct inpcb.

Bump __FreeBSD_version for the changes to the size of the structures.

Reviewed by:	gnn@
Approved by:	re@ (gjb@)
Sponsored by:	Chelsio Communications
2016-06-23 21:07:15 +00:00
Pedro F. Giffuni
a4641f4eaa sys/net*: minor spelling fixes.
No functional change.
2016-05-03 18:05:43 +00:00
George V. Neville-Neil
84cc0778d0 FreeBSD previously provided route caching for TCP (and UDP). Re-add
route caching for TCP, with some improvements. In particular, invalidate
the route cache if a new route is added, which might be a better match.
The cache is automatically invalidated if the old route is deleted.

Submitted by:	Mike Karels
Reviewed by:	gnn
Differential Revision:	https://reviews.freebsd.org/D4306
2016-03-24 07:54:56 +00:00
Gleb Smirnoff
388909a12a Use Jenkins hash for TCP syncache.
o Unlike xor, in Jenkins hash every bit of input affects virtually
  every bit of output, thus salting the hash actually works. With
  xor salting only provides a false sense of security, since if
  hash(x) collides with hash(y), then of course, hash(x) ^ salt
  would also collide with hash(y) ^ salt. [1]
o Jenkins provides much better distribution than xor, very close to
  ideal.

TCP connection setup/teardown benchmark has shown a 10% increase
with default hash size, and with bigger hashes that still provide
possibility for collisions. With enormous hash size, when dataset is
by an order of magnitude smaller than hash size, the benchmark has
shown 4% decrease in performance decrease, which is expected and
acceptable.

Noticed by:	Jeffrey Knockel <jeffk cs.unm.edu> [1]
Benchmarks by:	jch
Reviewed by:	jch, pkelsey, delphij
Security:	strengthens protection against hash collision DoS
Sponsored by:	Nginx, Inc.
2015-09-05 10:15:19 +00:00
Julien Charbon
079672cb07 Fix a kernel assertion issue introduced with r286227:
Avoid too strict INP_INFO_RLOCK_ASSERT checks due to
tcp_notify() being called from in6_pcbnotify().

Reported by:	Larry Rosenman <ler@lerctr.org>
Submitted by:	markj, jch
2015-08-08 08:40:36 +00:00
Julien Charbon
ff9b006d61 Decompose TCP INP_INFO lock to increase short-lived TCP connections scalability:
- The existing TCP INP_INFO lock continues to protect the global inpcb list
  stability during full list traversal (e.g. tcp_pcblist()).

- A new INP_LIST lock protects inpcb list actual modifications (inp allocation
  and free) and inpcb global counters.

It allows to use TCP INP_INFO_RLOCK lock in critical paths (e.g. tcp_input())
and INP_INFO_WLOCK only in occasional operations that walk all connections.

PR:			183659
Differential Revision:	https://reviews.freebsd.org/D2599
Reviewed by:		jhb, adrian
Tested by:		adrian, nitroboost-gmail.com
Sponsored by:		Verisign, Inc.
2015-08-03 12:13:54 +00:00
Hans Petter Selasky
c25290420e Start process of removing the use of the deprecated "M_FLOWID" flag
from the FreeBSD network code. The flag is still kept around in the
"sys/mbuf.h" header file, but does no longer have any users. Instead
the "m_pkthdr.rsstype" field in the mbuf structure is now used to
decide the meaning of the "m_pkthdr.flowid" field. To modify the
"m_pkthdr.rsstype" field please use the existing "M_HASHTYPE_XXX"
macros as defined in the "sys/mbuf.h" header file.

This patch introduces new behaviour in the transmit direction.
Previously network drivers checked if "M_FLOWID" was set in "m_flags"
before using the "m_pkthdr.flowid" field. This check has now now been
replaced by checking if "M_HASHTYPE_GET(m)" is different from
"M_HASHTYPE_NONE". In the future more hashtypes will be added, for
example hashtypes for hardware dedicated flows.

"M_HASHTYPE_OPAQUE" indicates that the "m_pkthdr.flowid" value is
valid and has no particular type. This change removes the need for an
"if" statement in TCP transmit code checking for the presence of a
valid flowid value. The "if" statement mentioned above is now a direct
variable assignment which is then later checked by the respective
network drivers like before.

Additional notes:
- The SCTP code changes will be committed as a separate patch.
- Removal of the "M_FLOWID" flag will also be done separately.
- The FreeBSD version has been bumped.

MFC after:	1 month
Sponsored by:	Mellanox Technologies
2014-12-01 11:45:24 +00:00
Alexander V. Chernikov
603eaf792b Renove faith(4) and faithd(8) from base. It looks like industry
have chosen different (and more traditional) stateless/statuful
NAT64 as translation mechanism. Last non-trivial commits to both
faith(4) and faithd(8) happened more than 12 years ago, so I assume
it is time to drop RFC3142 in FreeBSD.

No objections from:	net@
2014-11-09 21:33:01 +00:00
Andrey V. Elsukov
028bdf289d Add scope zone id to the in_endpoints and hc_metrics structures.
A non-global IPv6 address can be used in more than one zone of the same
scope. This zone index is used to identify to which zone a non-global
address belongs.

Also we can have many foreign hosts with equal non-global addresses,
but from different zones. So, they can have different metrics in the
host cache.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2014-09-10 16:26:18 +00:00
Andrey V. Elsukov
1b44e5ffe3 Introduce INP6_PCBHASHKEY macro. Replace usage of hardcoded part of
IPv6 address as hash key in all places.

Obtained from:	Yandex LLC
2014-09-10 12:35:42 +00:00
Adrian Chadd
9d3ddf4384 Add support for receiving and setting flowtype, flowid and RSS bucket
information as part of recvmsg().

This is primarily used for debugging/verification of the various
processing paths in the IP, PCB and driver layers.

Unfortunately the current implementation of the control message path
results in a ~10% or so drop in UDP frame throughput when it's used.

Differential Revision:	https://reviews.freebsd.org/D527
Reviewed by:	grehan
2014-09-09 01:45:39 +00:00
Adrian Chadd
d5bb8bd315 Expose in_pcbbind_check_bindmulti() so the upcoming IPv6 RSS changes
can be made to use it.
2014-07-12 05:40:13 +00:00
Adrian Chadd
0a100a6f1e Implement the first stage of multi-bind listen sockets and RSS socket
awareness.

* Introduce IP_BINDMULTI - indicating that it's okay to bind multiple
  sockets on the same bind details.

  Although the PCB code has been taught about this (see below) this patch
  doesn't introduce the rest of the PCB changes necessary to distribute
  lookups among multiple PCB entries in the global wildcard table.

* Introduce IP_RSS_LISTEN_BUCKET - placing an listen socket into the
  given RSS bucket (and thus a single PCBGROUP hash.)

* Modify the PCB add path to be aware of IP_BINDMULTI:
  + Only allow further PCB entries to be added if the owner credentials
    and IP_BINDMULTI has been specified.  Ie, only allow further
    IP_BINDMULTI sockets to appear if the first bind() was IP_BINDMULTI.

* Teach the PCBGROUP code about IP_RSS_LISTE_BUCKET marked PCB entries.
  Instead of using the wildcard logic and hashing, these sockets are
  simply placed into the PCBGROUP and _not_ in the wildcard hash.

* When doing a PCBGROUP lookup, also do a wildcard match as well.
  This allows for an RSS bucket PCB entry to appear in a PCBGROUP
  rather than having to exist in the wildcard list.

Tested:

* TCP IPv4 server testing with igb(4)
* TCP IPv4 server testing with ix(4)

TODO:

* The pcbgroup lookup code duplicated the wildcard and wildcard-PCB
  logic.  This could be refactored into a single function.

* This doesn't yet work for IPv6 (The PCBGROUP code in netinet6/ doesn't
  yet know about this); nor does it yet fully work for UDP.
2014-07-10 03:10:56 +00:00
Adrian Chadd
34e3dcedec Add the flowtype to the inpcb.
The flowid isn't enough to use as part of any RSS related CPU affinity
lookups - the RSS code would like to know what kind of hash it is.
2014-05-18 22:30:12 +00:00
Steven Hartland
ae19083248 Fix jailed raw sockets not setting the correct source address by
calling in_pcbladdr instead of prison_get_ip4

MFC after:	1 month
2014-04-24 12:52:31 +00:00
Mikolaj Golub
efdf104bca In r227207, to fix the issue with possible NULL inp_socket pointer
dereferencing, when checking for SO_REUSEPORT option (and SO_REUSEADDR
for multicast), INP_REUSEPORT flag was introduced to cache the socket
option.  It was decided then that one flag would be enough to cache
both SO_REUSEPORT and SO_REUSEADDR: when processing SO_REUSEADDR
setsockopt(2), it was checked if it was called for a multicast address
and INP_REUSEPORT was set accordingly.

Unfortunately that approach does not work when setsockopt(2) is called
before binding to a multicast address: the multicast check fails and
INP_REUSEPORT is not set.

Fix this by adding INP_REUSEADDR flag to unconditionally cache
SO_REUSEADDR.

PR:		179901
Submitted by:	Michael Gmelin freebsd grem.de (initial version)
Reviewed by:	rwatson
MFC after:	1 week
2013-07-04 18:38:00 +00:00
Andre Oppermann
f89d4c3acf Back out r249318, r249320 and r249327 due to a heisenbug most
likely related to a race condition in the ipi_hash_lock with
the exact cause currently unknown but under investigation.
2013-05-06 16:42:18 +00:00
Andre Oppermann
e8b3186b6a Change certain heavily used network related mutexes and rwlocks to
reside on their own cache line to prevent false sharing with other
nearby structures, especially for those in the .bss segment.

NB: Those mutexes and rwlocks with variables next to them that get
changed on every invocation do not benefit from their own cache line.
Actually it may be net negative because two cache misses would be
incurred in those cases.
2013-04-09 21:02:20 +00:00
Gleb Smirnoff
df4e91d386 There is a complex race in in_pcblookup_hash() and in_pcblookup_group().
Both functions need to obtain lock on the found PCB, and they can't do
classic inter-lock with the PCB hash lock, due to lock order reversal.
To keep the PCB stable, these functions put a reference on it and after PCB
lock is acquired drop it. If the reference was the last one, this means
we've raced with in_pcbfree() and the PCB is no longer valid.

  This approach works okay only if we are acquiring writer-lock on the PCB.
In case of reader-lock, the following scenario can happen:

  - 2 threads locate pcb, and do in_pcbref() on it.
  - These 2 threads drop the inp hash lock.
  - Another thread comes to delete pcb via in_pcbfree(), it obtains hash lock,
    does in_pcbremlists(), drops hash lock, and runs in_pcbrele_wlocked(), which
    doesn't free the pcb due to two references on it. Then it unlocks the pcb.
  - 2 aforementioned threads acquire reader lock on the pcb and run
    in_pcbrele_rlocked(). One gets 1 from in_pcbrele_rlocked() and continues,
    second gets 0 and considers pcb freed, returns.
  - The thread that got 1 continutes working with detached pcb, which later
    leads to panic in the underlying protocol level.

  To plumb that problem an additional INPCB flag introduced - INP_FREED. We
check for that flag in the in_pcbrele_rlocked() and if it is set, we pretend
that that was the last reference.

Discussed with:		rwatson, jhb
Reported by:		Vladimir Medvedkin <medved rambler-co.ru>
2012-10-02 12:03:02 +00:00
Michael Tuexen
3cca425b29 Add a IP_RECVTOS socket option to receive for received UDP/IPv4
packets a cmsg of type IP_RECVTOS which contains the TOS byte.
Much like IP_RECVTTL does for TTL. This allows to implement a
protocol on top of UDP and implementing ECN.

MFC after: 3 days
2012-06-12 14:02:38 +00:00
Robert Millan
75410f02d2 Hide a few declarations from userland (including `struct inpcbgroup'). This
removes the dependency on <machine/param.h> which was introduced with SVN
rev 222748 (due to CACHE_LINE_SIZE).

Reviewed by:	bde
MFC after:	10 days
2012-03-17 21:51:39 +00:00
Mikolaj Golub
fc06cd427e Cache SO_REUSEPORT socket option in inpcb-layer in order to avoid
inp_socket->so_options dereference when we may not acquire the lock on
the inpcb.

This fixes the crash due to NULL pointer dereference in
in_pcbbind_setup() when inp_socket->so_options in a pcb returned by
in_pcblookup_local() was checked.

Reported by:	dave jones <s.dave.jones@gmail.com>, Arnaud Lacombe <lacombar@gmail.com>
Suggested by:	rwatson
Glanced by:	rwatson
Tested by:	dave jones <s.dave.jones@gmail.com>
2011-11-06 10:47:20 +00:00
Bjoern A. Zeeb
d9a362862c Add spares to the network stack for FreeBSD-9:
- TCP keep* timers
- TCP UTO (adjust from what was there already)
- netmap
- route caching
- user cookie (temporary to allow for the real fix)

Slightly re-shuffle struct ifnet moving fields out of the middle
of spares and to better align.

Discussed with:	rwatson (slightly earlier version)
2011-07-17 21:15:20 +00:00
Bjoern A. Zeeb
1417604e70 Unbreak kernels with non-default PCBGROUP included but no WITNESS.
Rather than including lock.h in in_pcbgroup.c in right order, fix it
for all consumers of in_pcb.h by further header file pollution under
#ifdef KERNEL.

Reported by:	Pan Tsu (inyaoo gmail.com)
2011-06-06 21:45:32 +00:00
Robert Watson
52cd27cb58 Implement a CPU-affine TCP and UDP connection lookup data structure,
struct inpcbgroup.  pcbgroups, or "connection groups", supplement the
existing inpcbinfo connection hash table, which when pcbgroups are
enabled, might now be thought of more usefully as a per-protocol
4-tuple reservation table.

Connections are assigned to connection groups base on a hash of their
4-tuple; wildcard sockets require special handling, and are members
of all connection groups.  During a connection lookup, a
per-connection group lock is employed rather than the global pcbinfo
lock.  By aligning connection groups with input path processing,
connection groups take on an effective CPU affinity, especially when
aligned with RSS work placement (see a forthcoming commit for
details).  This eliminates cache line migration associated with
global, protocol-layer data structures in steady state TCP and UDP
processing (with the exception of protocol-layer statistics; further
commit to follow).

Elements of this approach were inspired by Willman, Rixner, and Cox's
2006 USENIX paper, "An Evaluation of Network Stack Parallelization
Strategies in Modern Operating Systems".  However, there are also
significant differences: we maintain the inpcb lock, rather than using
the connection group lock for per-connection state.

Likewise, the focus of this implementation is alignment with NIC
packet distribution strategies such as RSS, rather than pure software
strategies.  Despite that focus, software distribution is supported
through the parallel netisr implementation, and works well in
configurations where the number of hardware threads is greater than
the number of NIC input queues, such as in the RMI XLR threaded MIPS
architecture.

Another important difference is the continued maintenance of existing
hash tables as "reservation tables" -- these are useful both to
distinguish the resource allocation aspect of protocol name management
and the more common-case lookup aspect.  In configurations where
connection tables are aligned with hardware hashes, it is desirable to
use the traditional lookup tables for loopback or encapsulated traffic
rather than take the expense of hardware hashes that are hard to
implement efficiently in software (such as RSS Toeplitz).

Connection group support is enabled by compiling "options PCBGROUP"
into your kernel configuration; for the time being, this is an
experimental feature, and hence is not enabled by default.

Subject to the limited MFCability of change dependencies in inpcb,
and its change to the inpcbinfo init function signature, this change
in principle could be merged to FreeBSD 8.x.

Reviewed by:    bz
Sponsored by:   Juniper Networks, Inc.
2011-06-06 12:55:02 +00:00
Robert Watson
d3c1f00350 Add _mbuf() variants of various inpcb-related interfaces, including lookup,
hash install, etc.  For now, these are arguments are unused, but as we add
RSS support, we will want to use hashes extracted from mbufs, rather than
manually calculated hashes of header fields, due to the expensive of the
software version of Toeplitz (and similar hashes).

Add notes that it would be nice to be able to pass mbufs into lookup
routines in pf(4), optimising firewall lookup in the same way, but the
code structure there doesn't facilitate that currently.

(In principle there is no reason this couldn't be MFCed -- the change
extends rather than modifies the KBI.  However, it won't be useful without
other previous possibly less MFCable changes.)

Reviewed by:    bz
Sponsored by:   Juniper Networks, Inc.
2011-06-04 16:33:06 +00:00
Robert Watson
fa046d8774 Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
  inpcb counter.  This lock is now relegated to a small number of
  allocation and free operations, and occasional operations that walk
  all connections (including, awkwardly, certain UDP multicast receive
  operations -- something to revisit).

- A new ipi_hash_lock protects the two inpcbinfo hash tables for
  looking up connections and bound sockets, manipulated using new
  INP_HASH_*() macros.  This lock, combined with inpcb locks, protects
  the 4-tuple address space.

Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required.  As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.

A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb.  Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed.  In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup.  New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:

  INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
  INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb

Callers must pass exactly one of these flags (for the time being).

Some notes:

- All protocols are updated to work within the new regime; especially,
  TCP, UDPv4, and UDPv6.  pcbinfo ipi_lock acquisitions are largely
  eliminated, and global hash lock hold times are dramatically reduced
  compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
  may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
  is no longer available -- hash lookup locks are now held only very
  briefly during inpcb lookup, rather than for potentially extended
  periods.  However, the pcbinfo ipi_lock will still be acquired if a
  connection state might change such that a connection is added or
  removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
  due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
  callers to acquire hash locks and perform one or more lookups atomically
  with 4-tuple allocation: this is required only for TCPv6, as there is no
  in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
  locking, which relates to source address selection.  This needs
  attention, as it likely significantly reduces parallelism in this code
  for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
  somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which
  is no longer sufficient.  A second check once the inpcb lock is held
  should do the trick, keeping the general case from requiring the inpcb
  lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
  which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
  undesirable, and probably another argument is required to take care of
  this (or a char array name field in the pcbinfo?).

This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics.  It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.

Reviewed by:    bz
Sponsored by:   Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
Robert Watson
79bdc6e5d3 Continue to refine inpcb reference counting and locking, in preparation for
reworking of inpcbinfo locking:

(1) Convert inpcb reference counting from manually manipulated integers to
    the refcount(9) KPI.  This allows the refcount to be managed atomically
    with an inpcb read lock rather than write lock, or even with no inpcb
    lock at all.  As a result, in_pcbref() also no longer requires an inpcb
    lock, so can be performed solely using the lock used to look up an
    inpcb.

(2) Shift more inpcb freeing activity from the in_pcbrele() context (via
    in_pcbfree_internal) to the explicit in_pcbfree() context.  This means
    that the inpcb refcount is increasingly used only to maintain memory
    stability, not actually defer the clean up of inpcb protocol parts.
    This is desirable as many of those protocol parts required the pcbinfo
    lock, which we'd like not to acquire in in_pcbrele() contexts.  Document
    this in comments better.

(3) Introduce new read-locked and write-locked in_pcbrele() variations,
    in_pcbrele_rlocked() and in_pcbrele_wlocked(), which allow the inpcb to
    be properly unlocked as needed.  in_pcbrele() is a wrapper around the
    latter, and should probably go away at some point.  This makes it
    easier to use this weak reference model when holding only a read lock,
    as will happen in the future.

This may well be safe to MFC, but some more KBI analysis is required.

Reviewed by:    bz
MFC after:      3 weeks
Sponsored by:   Juniper Networks, Inc.
2011-05-23 19:32:02 +00:00
Robert Watson
82a5be494a A number of quite incremental refinements to struct inpcbinfo's definition:
(1) Add a locking guide for inpcbinfo.
(2) Annotate inpcbinfo fields with synchronisation information; not all
    annotations are 100% satisfactory.
(3) Reorder inpcbinfo fields so that the lock is at the head of the
    structure, and close to fields it protects.
(4) Sort fields that will eventually be hashlock/pcbgroup-related together
    even though they remain locked by ipi_lock for now.

Reviewed by:	bz
Sponsored by:	Juniper Networks
X-MFC after:	KBI analysis required
2011-05-23 13:51:57 +00:00
Bjoern A. Zeeb
aae49dd304 MFp4 CH=191470:
Move the ipport_tick_callout and related functions from ip_input.c
to in_pcb.c.  The random source port allocation code has been merged
and is now local to in_pcb.c only.
Use a SYSINIT to get the callout started and no longer depend on
initialization from the inet code, which would not work in an IPv6
only setup.

Reviewed by:	gnn
Sponsored by:	The FreeBSD Foundation
Sponsored by:	iXsystems
MFC after:	4 days
2011-04-20 08:00:29 +00:00
Bjoern A. Zeeb
efc76f729a Merge the two identical implementations for local port selections from
in_pcbbind_setup() and in6_pcbsetport() in a single in_pcb_lport().

MFC after:	2 weeks
2011-03-12 21:46:37 +00:00
Robert Watson
9bcd427b89 Abstract out initialization of most aspects of struct inpcbinfo from
their calling contexts in {IP divert, raw IP sockets, TCP, UDP} and
create new helper functions: in_pcbinfo_init() and in_pcbinfo_destroy()
to do this work in a central spot.  As inpcbinfo becomes more complex
due to ongoing work to add connection groups, this will reduce code
duplication.

MFC after:      1 month
Reviewed by:    bz
Sponsored by:   Juniper Networks
2010-03-14 18:59:11 +00:00
Robert Watson
66f80e90ef Wrap use of rw_try_upgrade() on pcbinfo with macro INP_INFO_TRY_UPGRADE()
to match other pcbinfo locking macros.

MFC after:	1 week
2010-03-06 21:24:11 +00:00
Robert Watson
9d2eb78bcb Add padding to struct inpcb, missed during our padding sweep earlier in
the release cycle.

Approved by:	re (kensmith)
2009-08-02 22:47:08 +00:00
Robert Watson
1e77c1056a Remove unused VNET_SET() and related macros; only VNET_GET() is
ever actually used.  Rename VNET_GET() to VNET() to shorten
variable references.

Discussed with:	bz, julian
Reviewed by:	bz
Approved by:	re (kensmith, kib)
2009-07-16 21:13:04 +00:00
Robert Watson
eddfbb763d Build on Jeff Roberson's linker-set based dynamic per-CPU allocator
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator.  Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...).  This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.

Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack.  Virtualized global variables are
tagged at compile-time, placing the in a special linker set, which is
loaded into a contiguous region of kernel memory.  Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of a the kernel linker.

Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy.  Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address.  When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.

This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.

Bump __FreeBSD_version and update UPDATING.

Portions submitted by:  bz
Reviewed by:            bz, zec
Discussed with:         gnn, jamie, jeff, jhb, julian, sam
Suggested by:           peter
Approved by:            re (kensmith)
2009-07-14 22:48:30 +00:00
Bjoern A. Zeeb
5736e6fb9d After cleaning up rt_tables from vnet.h and cleaning up opt_route.h
a lot of files no longer need route.h either. Garbage collect them.
While here remove now unneeded vnet.h #includes as well.
2009-06-23 17:03:45 +00:00
Pawel Jakub Dawidek
f44270e764 - Rename IP_NONLOCALOK IP socket option to IP_BINDANY, to be more consistent
with OpenBSD (and BSD/OS originally). We can't easly do it SOL_SOCKET option
  as there is no more space for more SOL_SOCKET options, but this option also
  fits better as an IP socket option, it seems.
- Implement this functionality also for IPv6 and RAW IP sockets.
- Always compile it in (don't use additional kernel options).
- Remove sysctl to turn this functionality on and off.
- Introduce new privilege - PRIV_NETINET_BINDANY, which allows to use this
  functionality (currently only unjail root can use it).

Discussed with:	julian, adrian, jhb, rwatson, kmacy
2009-06-01 10:30:00 +00:00
Robert Watson
6d888973c8 Staticize two functions not used outside of in_pcb.c: in_pcbremlists() and
db_print_inpcb().

MFC after:	1 month
2009-05-14 20:59:36 +00:00
Marko Zec
f6dfe47a14 Permit buiding kernels with options VIMAGE, restricted to only a single
active network stack instance.  Turning on options VIMAGE at compile
time yields the following changes relative to default kernel build:

1) V_ accessor macros for virtualized variables resolve to structure
fields via base pointers, instead of being resolved as fields in global
structs or plain global variables.  As an example, V_ifnet becomes:

    options VIMAGE:          ((struct vnet_net *) vnet_net)->_ifnet
    default build:           vnet_net_0._ifnet
    options VIMAGE_GLOBALS:  ifnet

2) INIT_VNET_* macros will declare and set up base pointers to be used
by V_ accessor macros, instead of resolving to whitespace:

    INIT_VNET_NET(ifp->if_vnet); becomes

    struct vnet_net *vnet_net = (ifp->if_vnet)->mod_data[VNET_MOD_NET];

3) Memory for vnet modules registered via vnet_mod_register() is now
allocated at run time in sys/kern/kern_vimage.c, instead of per vnet
module structs being declared as globals.  If required, vnet modules
can now request the framework to provide them with allocated bzeroed
memory by filling in the vmi_size field in their vmi_modinfo structures.

4) structs socket, ifnet, inpcbinfo, tcpcb and syncache_head are
extended to hold a pointer to the parent vnet.  options VIMAGE builds
will fill in those fields as required.

5) curvnet is introduced as a new global variable in options VIMAGE
builds, always pointing to the default and only struct vnet.

6) struct sysctl_oid has been extended with additional two fields to
store major and minor virtualization module identifiers, oid_v_subs and
oid_v_mod.  SYSCTL_V_* family of macros will fill in those fields
accordingly, and store the offset in the appropriate vnet container
struct in oid_arg1.
In sysctl handlers dealing with virtualized sysctls, the
SYSCTL_RESOLVE_V_ARG1() macro will compute the address of the target
variable and make it available in arg1 variable for further processing.

Unused fields in structs vnet_inet, vnet_inet6 and vnet_ipfw have
been deleted.

Reviewed by:	bz, rwatson
Approved by:	julian (mentor)
2009-04-30 13:36:26 +00:00
Kip Macy
ab25fa3558 s/void/void */ 2009-04-16 23:02:56 +00:00
Kip Macy
114f15c686 restore spare pointers for MFCing 2009-04-16 22:47:43 +00:00
Kip Macy
8b12a7c2a6 - convert pspare pointers in inpcb to an llentry and rtentry cache
- add flags to indicate their validity
2009-04-15 22:22:00 +00:00
Kip Macy
773b573a96 - add second flags field to to inpcb
- update comments in vflag
2009-04-15 22:09:42 +00:00
Kip Macy
82c33e73f2 provide additional convenience macros for inpcb locking (upgrade, downgrade, exclusive) 2009-04-15 21:39:56 +00:00
Kip Macy
80cb9f211a Import "flowid" support for serializing flows across transmit queues
Reviewed by:	rwatson and jeli
2009-04-10 06:16:14 +00:00
Robert Watson
ad71fe3c35 Correct a number of evolved problems with inp_vflag and inp_flags:
certain flags that should have been in inp_flags ended up in inp_vflag,
meaning that they were inconsistently locked, and in one case,
interpreted.  Move the following flags from inp_vflag to gaps in the
inp_flags space (and clean up the inp_flags constants to make gaps
more obvious to future takers):

  INP_TIMEWAIT
  INP_SOCKREF
  INP_ONESBCAST
  INP_DROPPED

Some aspects of this change have no effect on kernel ABI at all, as these
are UDP/TCP/IP-internal uses; however, netstat and sockstat detect
INP_TIMEWAIT when listing TCP sockets, so any MFC will need to take this
into account.

MFC after:      1 week (or after dependencies are MFC'd)
Reviewed by:    bz
2009-03-15 09:58:31 +00:00
Robert Watson
111d57a69c Add INP_INHASHLIST flag for inpcb->inp_flags to indicate whether
or not the inpcb is currenty on various hash lookup lists, rather
than using (lport != 0) to detect this.  This means that the full
4-tuple of a connection can be retained after close, which should
lead to more sensible netstat output in the window between TCP
close and socket close.

MFC after:	2 weeks
2009-03-11 00:29:22 +00:00
Robert Watson
4cf172fd65 Remove unused v6 macro aliases for inpcb fields:
in6p_ip6_nxt
        in6p_vflag
        in6p_flags
        in6p_socket
        in6p_lport
        in6p_fport
        in6p_ppcb

Remove unused v6 macro aliases for inpcb flags:

        IN6P_HIGHPORT
        IN6P_LOWPORT
        IN6P_ANONPORT
        IN6P_RECVIF
        IN6P_MTUDISC
        IN6P_FAITH
        IN6P_CONTROLOPTS

References to in6p_lport and in6_fport in sockstat are also replaced with
normal inp_lport and inp_fport references.

MFC after:	3 days
Reviewed by:	bz
2009-03-10 17:57:41 +00:00
Robert Watson
b9bbb597b1 Remove now-unused INP_UNMAPPABLEOPTS.
MFC after:	3 days
Discussed with:	bz
2009-03-10 11:04:19 +00:00
Adrian Chadd
be9347e3fe Implement a new IP option (not compiled/enabled by default) to allow
applications to specify a non-local IP address when bind()'ing a socket
to a local endpoint.

This allows applications to spoof the client IP address of connections
if (obviously!) they somehow are able to receive the traffic normally
destined to said clients.

This patch doesn't include any changes to ipfw or the bridging code to
redirect the client traffic through the PCB checks so TCP gets a shot
at it. The normal behaviour is that packets with a non-local destination
IP address are not handled locally. This can be dealth with some IPFW hackery;
modifications to IPFW to make this less hacky will occur in subsequent
commmits.

Thanks to Julian Elischer and others at Ironport. This work was approved
and donated before Cisco acquired them.

Obtained from:	Julian Elischer and others
MFC after:	2 weeks
2009-01-09 16:02:19 +00:00
Bjoern A. Zeeb
97590249ad Another step assimilating IPv[46] PCB code:
normalize IN6P_* compat flags usage to their equialent
INP_* counterpart.

Discussed with:	rwatson
Reviewed by:	rwatson
MFC after:	4 weeks
2008-12-17 13:00:18 +00:00
Bjoern A. Zeeb
dcdb4371ca Use inc_flags instead of the inc_isipv6 alias which so far
had been the only flag with random usage patterns.
Switch inc_flags to be used as a real bit field by using
INC_ISIPV6 with bitops to check for the 'isipv6' condition.

While here fix a place or two where in case of v4 inc_flags
were not properly initialized before.[1]

Found by:	rwatson during review [1]
Discussed with:	rwatson
Reviewed by:	rwatson
MFC after:	4 weeks
2008-12-17 12:52:34 +00:00
Bjoern A. Zeeb
86413abf5f Put a global variables, which were virtualized but formerly
missed under VIMAGE_GLOBAL.

Start putting the extern declarations of the  virtualized globals
under VIMAGE_GLOBAL as the globals themsevles are already.
This will help by the time when we are going to remove the globals
entirely.

While there garbage collect a few dead externs from ip6_var.h.

Sponsored by:	The FreeBSD Foundation
2008-12-11 16:26:38 +00:00
Robert Watson
0ca989b376 Update comment on INP_TIMEWAIT to say what it's about, as we caution
regarding the misplacement of flags in inp_vflag in an earlier comment.

MFC after:	pretty soon
2008-12-09 23:57:09 +00:00
Robert Watson
a5654bb2ae Move macros defining flags and shortcus to nested structure fields in
inpcbinfo below the structure definition in order to make inpcbinfo
fit on a single printed page; related style tweaks.

MFC after:	pretty soon
2008-12-09 10:21:38 +00:00
Robert Watson
28696211d6 Add a reference count to struct inpcb, which may be explicitly
incremented using in_pcbref(), and decremented using in_pcbfree()
or inpcbrele().  Protocols using only current in_pcballoc() and
in_pcbfree() calls will see the same semantics, but it is now
possible for TCP to call in_pcbref() and in_pcbrele() to prevent
an inpcb from being freed when both tcbinfo and per-inpcb locks
are released.  This makes it possible to safely transition from
holding only the inpcb lock to both tcbinfo and inpcb lock
without re-looking up a connection in the input path, timer
path, etc.

Notice that in_pcbrele() does not unlock the connection after
decrementing the refcount, if the connection remains, so that
the caller can continue to use it; in_pcbrele() returns a flag
indicating whether or not the inpcb pointer is still valid, and
in_pcbfee() is now a simple wrapper around in_pcbrele().

MFC after:	1 month
Discussed with:	bz, kmacy
Reviewed by:	bz, gnn, kmacy
Tested by:	kmacy
2008-12-08 20:18:50 +00:00
Marko Zec
44e33a0758 Change the initialization methodology for global variables scheduled
for virtualization.

Instead of initializing the affected global variables at instatiation,
assign initial values to them in initializer functions.  As a rule,
initialization at instatiation for such variables should never be
introduced again from now on.  Furthermore, enclose all instantiations
of such global variables in #ifdef VIMAGE_GLOBALS blocks.

Essentialy, this change should have zero functional impact.  In the next
phase of merging network stack virtualization infrastructure from
p4/vimage branch, the new initialization methology will allow us to
switch between using global variables and their counterparts residing in
virtualization containers with minimum code churn, and in the long run
allow us to intialize multiple instances of such container structures.

Discussed at:	devsummit Strassburg
Reviewed by:	bz, julian
Approved by:	julian (mentor)
Obtained from:	//depot/projects/vimage-commit2/...
X-MFC after:	never
Sponsored by:	NLnet Foundation, The FreeBSD Foundation
2008-11-19 09:39:34 +00:00
Bjoern A. Zeeb
dc3c09c89f Bring over the change switching from using sequential to random
ephemeral port allocation as implemented in netinet/in_pcb.c rev. 1.143
(initially from OpenBSD) and follow-up commits during the last four and
a half years including rev. 1.157, 1.162 and 1.199.
This now is relying on the same infrastructure as has been implemented
in in_pcb.c since rev. 1.199.

Reviewed by:	silby, rpaulo, mlaier
MFC after:	2 months
2008-10-20 18:43:59 +00:00
Bjoern A. Zeeb
86d02c5c63 Cache so_cred as inp_cred in the inpcb.
This means that inp_cred is always there, even after the socket
has gone away. It also means that it is constant for the lifetime
of the inp.
Both facts lead to simpler code and possibly less locking.

Suggested by:	rwatson
Reviewed by:	rwatson
MFC after:	6 weeks
X-MFC Note:	use a inp_pspare for inp_cred
2008-10-04 15:06:34 +00:00
Robert Watson
fc18af966f Fix typo in comment.
MFC after:	3 days
2008-09-29 13:48:48 +00:00
Robert Watson
5cb2685a59 Minor white space tweaks.
MFC after:	1 week
2008-08-07 09:06:04 +00:00
Tai-hwa Liang
df9cf830d1 Trying to fix compilation bustage:
- removing 'const' qualifier from an input parameter to conform to the type
  required by rw_assert();
- using in_addr->s_addr to retrive 32 bits address value.

Observed by:	tinderbox
2008-07-22 04:23:57 +00:00
Kip Macy
9d29c635da make new accessor functions consistent with existing style 2008-07-21 22:11:39 +00:00
Kip Macy
9378e4377f add inpcb accessor functions for fields needed by TOE devices 2008-07-21 00:08:34 +00:00
Robert Watson
43cc0bc1df Merge last of a series of rwlock conversion changes to UDP, which
completes the move to a fully parallel UDP transmit path by using
global read, rather than write, locking of inpcbinfo in further
semi-connected cases:

- Add macros to allow try-locking of inpcb and inpcbinfo.
- Always acquire an incpcb read lock in udp_output(), which stablizes the
  local inpcb address and port bindings in order to determine what further
  locking is required:
  - If the inpcb is currently not bound (at all) and are implicitly
    connecting, we require inpcbinfo and inpcb write locks, so drop the
    read lock and re-acquire.
  - If the inpcb is bound for at least one of the port or address, but an
    explicit source or destination is requested, trylock the inpcbinfo
    lock, and if that fails, drop the inpcb lock, lock the global lock,
    and relock the inpcb lock.
  - Otherwise, no further locking is required (common case).
- Update comments.

In practice, this means that the vast majority of consumers of UDP sockets
will not acquire any exclusive locks at the socket or UDP levels of the
network stack.  This leads to a marked performance improvement in several
important workloads, including BIND, nsd, and memcached over UDP, as well
as significant improvements in pps microbenchmarks.

The plan is to MFC all of the rwlock changes to RELENG_7 once they have
settled for a weeks in the tree.

Tested by:	ps, kris (older revision), bde
MFC after:	3 weeks
2008-07-15 15:38:47 +00:00
Bjoern A. Zeeb
078b704233 Pass the ucred along into in{,6}_pcblookup_local for upcoming
prison checks.

Reviewed by:	rwatson
2008-07-10 13:31:11 +00:00
Bjoern A. Zeeb
cdcb11b92c For consistency take lport as u_short in in{,6}_pcblookup_local.
All callers either pass in an u_short or u_int16_t.

Reviewed by:	rwatson
2008-07-10 13:23:22 +00:00
Robert Watson
7b709f8ad4 Provide some initial chicken-scratching annotations of locking for
struct inpcb.

Prodded by:	bz
MFC after:	3 days
2008-07-08 17:22:59 +00:00
Julian Elischer
8b07e49a00 Add code to allow the system to handle multiple routing tables.
This particular implementation is designed to be fully backwards compatible
and to be MFC-able to 7.x (and 6.x)

Currently the only protocol that can make use of the multiple tables is IPv4
Similar functionality exists in OpenBSD and Linux.

From my notes:

-----

  One thing where FreeBSD has been falling behind, and which by chance I
  have some time to work on is "policy based routing", which allows
  different
  packet streams to be routed by more than just the destination address.

  Constraints:
  ------------

  I want to make some form of this available in the 6.x tree
  (and by extension 7.x) , but FreeBSD in general needs it so I might as
  well do it in -current and back port the portions I need.

  One of the ways that this can be done is to have the ability to
  instantiate multiple kernel routing tables (which I will now
  refer to as "Forwarding Information Bases" or "FIBs" for political
  correctness reasons). Which FIB a particular packet uses to make
  the next hop decision can be decided by a number of mechanisms.
  The policies these mechanisms implement are the "Policies" referred
  to in "Policy based routing".

  One of the constraints I have if I try to back port this work to
  6.x is that it must be implemented as a EXTENSION to the existing
  ABIs in 6.x so that third party applications do not need to be
  recompiled in timespan of the branch.

  This first version will not have some of the bells and whistles that
  will come with later versions. It will, for example, be limited to 16
  tables in the first commit.
  Implementation method, Compatible version. (part 1)
  -------------------------------
  For this reason I have implemented a "sufficient subset" of a
  multiple routing table solution in Perforce, and back-ported it
  to 6.x. (also in Perforce though not  always caught up with what I
  have done in -current/P4). The subset allows a number of FIBs
  to be defined at compile time (8 is sufficient for my purposes in 6.x)
  and implements the changes needed to allow IPV4 to use them. I have not
  done the changes for ipv6 simply because I do not need it, and I do not
  have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it.

  Other protocol families are left untouched and should there be
  users with proprietary protocol families, they should continue to work
  and be oblivious to the existence of the extra FIBs.

  To understand how this is done, one must know that the current FIB
  code starts everything off with a single dimensional array of
  pointers to FIB head structures (One per protocol family), each of
  which in turn points to the trie of routes available to that family.

  The basic change in the ABI compatible version of the change is to
  extent that array to be a 2 dimensional array, so that
  instead of protocol family X looking at rt_tables[X] for the
  table it needs, it looks at rt_tables[Y][X] when for all
  protocol families except ipv4 Y is always 0.
  Code that is unaware of the change always just sees the first row
  of the table, which of course looks just like the one dimensional
  array that existed before.

  The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
  are all maintained, but refer only to the first row of the array,
  so that existing callers in proprietary protocols can continue to
  do the "right thing".
  Some new entry points are added, for the exclusive use of ipv4 code
  called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(),
  which have an extra argument which refers the code to the correct row.

  In addition, there are some new entry points (currently called
  rtalloc_fib() and friends) that check the Address family being
  looked up and call either rtalloc() (and friends) if the protocol
  is not IPv4 forcing the action to row 0 or to the appropriate row
  if it IS IPv4 (and that info is available). These are for calling
  from code that is not specific to any particular protocol. The way
  these are implemented would change in the non ABI preserving code
  to be added later.

  One feature of the first version of the code is that for ipv4,
  the interface routes show up automatically on all the FIBs, so
  that no matter what FIB you select you always have the basic
  direct attached hosts available to you. (rtinit() does this
  automatically).

  You CAN delete an interface route from one FIB should you want
  to but by default it's there. ARP information is also available
  in each FIB. It's assumed that the same machine would have the
  same MAC address, regardless of which FIB you are using to get
  to it.

  This brings us as to how the correct FIB is selected for an outgoing
  IPV4 packet.

  Firstly, all packets have a FIB associated with them. if nothing
  has been done to change it, it will be FIB 0. The FIB is changed
  in the following ways.

  Packets fall into one of a number of classes.

  1/ locally generated packets, coming from a socket/PCB.
     Such packets select a FIB from a number associated with the
     socket/PCB. This in turn is inherited from the process,
     but can be changed by a socket option. The process in turn
     inherits it on fork. I have written a utility call setfib
     that acts a bit like nice..

         setfib -3 ping target.example.com # will use fib 3 for ping.

     It is an obvious extension to make it a property of a jail
     but I have not done so. It can be achieved by combining the setfib and
     jail commands.

  2/ packets received on an interface for forwarding.
     By default these packets would use table 0,
     (or possibly a number settable in a sysctl(not yet)).
     but prior to routing the firewall can inspect them (see below).
     (possibly in the future you may be able to associate a FIB
     with packets received on an interface..  An ifconfig arg, but not yet.)

  3/ packets inspected by a packet classifier, which can arbitrarily
     associate a fib with it on a packet by packet basis.
     A fib assigned to a packet by a packet classifier
     (such as ipfw) would over-ride a fib associated by
     a more default source. (such as cases 1 or 2).

  4/ a tcp listen socket associated with a fib will generate
     accept sockets that are associated with that same fib.

  5/ Packets generated in response to some other packet (e.g. reset
     or icmp packets). These should use the FIB associated with the
     packet being reponded to.

  6/ Packets generated during encapsulation.
     gif, tun and other tunnel interfaces will encapsulate using the FIB
     that was in effect withthe proces that set up the tunnel.
     thus setfib 1 ifconfig gif0 [tunnel instructions]
     will set the fib for the tunnel to use to be fib 1.

  Routing messages would be associated with their
  process, and thus select one FIB or another.
  messages from the kernel would be associated with the fib they
  refer to and would only be received by a routing socket associated
  with that fib. (not yet implemented)

  In addition Netstat has been edited to be able to cope with the
  fact that the array is now 2 dimensional. (It looks in system
  memory using libkvm (!)). Old versions of netstat see only the first FIB.

  In addition two sysctls are added to give:
  a) the number of FIBs compiled in (active)
  b) the default FIB of the calling process.

  Early testing experience:
  -------------------------

  Basically our (IronPort's) appliance does this functionality already
  using ipfw fwd but that method has some drawbacks.

  For example,
  It can't fully simulate a routing table because it can't influence the
  socket's choice of local address when a connect() is done.

  Testing during the generating of these changes has been
  remarkably smooth so far. Multiple tables have co-existed
  with no notable side effects, and packets have been routes
  accordingly.

  ipfw has grown 2 new keywords:

  setfib N ip from anay to any
  count ip from any to any fib N

  In pf there seems to be a requirement to be able to give symbolic names to the
  fibs but I do not have that capacity. I am not sure if it is required.

  SCTP has interestingly enough built in support for this, called VRFs
  in Cisco parlance. it will be interesting to see how that handles it
  when it suddenly actually does something.

  Where to next:
  --------------------

  After committing the ABI compatible version and MFCing it, I'd
  like to proceed in a forward direction in -current. this will
  result in some roto-tilling in the routing code.

  Firstly: the current code's idea of having a separate tree per
  protocol family, all of the same format, and pointed to by the
  1 dimensional array is a bit silly. Especially when one considers that
  there is code that makes assumptions about every protocol having the
  same internal structures there. Some protocols don't WANT that
  sort of structure. (for example the whole idea of a netmask is foreign
  to appletalk). This needs to be made opaque to the external code.

  My suggested first change is to add routing method pointers to the
  'domain' structure, along with information pointing the data.
  instead of having an array of pointers to uniform structures,
  there would be an array pointing to the 'domain' structures
  for each protocol address domain (protocol family),
  and the methods this reached would be called. The methods would have
  an argument that gives FIB number, but the protocol would be free
  to ignore it.

  When the ABI can be changed it raises the possibilty of the
  addition of a fib entry into the "struct route". Currently,
  the structure contains the sockaddr of the desination, and the resulting
  fib entry. To make this work fully, one could add a fib number
  so that given an address and a fib, one can find the third element, the
  fib entry.

  Interaction with the ARP layer/ LL layer would need to be
  revisited as well. Qing Li has been working on this already.

  This work was sponsored by Ironport Systems/Cisco

Reviewed by:    several including rwatson, bz and mlair (parts each)
Obtained from:  Ironport systems/Cisco
2008-05-09 23:03:00 +00:00
Robert Watson
8501a69cc9 Convert pcbinfo and inpcb mutexes to rwlocks, and modify macros to
explicitly select write locking for all use of the inpcb mutex.
Update some pcbinfo lock assertions to assert locked rather than
write-locked, although in practice almost all uses of the pcbinfo
rwlock main exclusive, and all instances of inpcb lock acquisition
are exclusive.

This change should introduce (ideally) little functional change.
However, it lays the groundwork for significantly increased
parallelism in the TCP/IP code.

MFC after:	3 months
Tested by:	kris (superset of committered patch)
2008-04-17 21:38:18 +00:00
Kip Macy
e79dd20dd5 change inp_wlock_assert to inp_lock_assert 2008-03-24 20:24:04 +00:00
Kip Macy
8815ab518a Label inp as unused in the non-INVARIANTS case 2008-03-24 00:29:01 +00:00
Kip Macy
3d5853271e Insulate inpcb consumers outside the stack from the lock type and offset within the pcb by adding accessor functions.
Reviewed by: rwatson
MFC after: 3 weeks
2008-03-23 22:34:16 +00:00
Kip Macy
2de2af32a0 Add padding for anticipated functionality
- vimage
 - TOE
 - multiq
 - host rtentry caching

Rename spare used by 80211 to if_llsoftc

Reviewed by: rwatson, gnn
MFC after: 1 day
2007-12-07 01:46:13 +00:00
Robert Watson
0bf686c125 Remove the now-unused NET_{LOCK,UNLOCK,ASSERT}_GIANT() macros, which
previously conditionally acquired Giant based on debug.mpsafenet.  As that
has now been removed, they are no longer required.  Removing them
significantly simplifies error-handling in the socket layer, eliminated
quite a bit of unwinding of locking in error cases.

While here clean up the now unneeded opt_net.h, which previously was used
for the NET_WITH_GIANT kernel option.  Clean up some related gotos for
consistency.

Reviewed by:	bz, csjp
Tested by:	kris
Approved by:	re (kensmith)
2007-08-06 14:26:03 +00:00
George V. Neville-Neil
2cb64cb272 Commit IPv6 support for FAST_IPSEC to the tree.
This commit includes only the kernel files, the rest of the files
will follow in a second commit.

Reviewed by:    bz
Approved by:    re
Supported by:   Secure Computing
2007-07-01 11:41:27 +00:00
Robert Watson
54d642bbe5 Reduce network stack oddness: implement .pru_sockaddr and .pru_peeraddr
protocol entry points using functions named proto_getsockaddr and
proto_getpeeraddr rather than proto_setsockaddr and proto_setpeeraddr.
While it's true that sockaddrs are allocated and set, the net effect is
to retrieve (get) the socket address or peer address from a socket, not
set it, so align names to that intent.
2007-05-11 10:20:51 +00:00
Robert Watson
84ca8aa609 Remove unused pcbinfo arguments to in_setsockaddr() and
in_setpeeraddr().
2007-05-01 16:31:02 +00:00
Robert Watson
712fc218a0 Rename some fields of struct inpcbinfo to have the ipi_ prefix,
consistent with the naming of other structure field members, and
reducing improper grep matches.  Clean up and comment structure
fields in structure definition.
2007-04-30 23:12:05 +00:00
Andre Oppermann
995a77176f Add INP_INFO_UNLOCK_ASSERT() and use it in tcp_input(). Also add some
further INP_INFO_WLOCK_ASSERT() while there.
2007-04-04 18:30:16 +00:00
Andre Oppermann
beaa515e95 Some local and style(9) cleanups. 2007-04-04 15:30:31 +00:00
Robert Watson
77c78838f0 Remove stale comment about not enabling inpcb and inpcbinfo lock assertions
when IPv6 is enabled.

MFC after:	3 days
2007-03-28 00:50:20 +00:00
Robert Watson
497057eeea Add "show inpcb", "show tcpcb" DDB commands, which should come in handy
for debugging sblock and other network panics.
2007-02-17 21:02:38 +00:00
Robert Watson
8ca5b13f2f Remove unused inp6_ifindex field from inpcb, as well as unused macro
shortcut for it.
2007-02-16 14:09:24 +00:00
Robert Watson
1f9b46facf Remove unused in6p_ip6_hlim macro shortcut for non-present
inp_depend6.inp6_hlim field in the inpcb.
2007-02-16 13:56:06 +00:00
Stephan Uphoff
d915b28015 Fix race conditions on enumerating pcb lists by moving the initialization
( and where appropriate the destruction) of the pcb mutex to the init/finit
functions of the pcb zones.
This allows locking of the pcb entries and race condition free comparison
of the generation count.
Rearrange locking a bit to avoid extra locking operation to update the generation
count in in_pcballoc(). (in_pcballoc now returns the pcb locked)

I am planning to convert pcb list handling from a type safe to a reference count
model soon. ( As this allows really freeing the PCBs)

Reviewed by:	rwatson@, mohans@
MFC after:	1 week
2006-07-18 22:34:27 +00:00
Robert Watson
10702a2840 Abstract inpcb drop logic, previously just setting of INP_DROPPED in TCP,
into in_pcbdrop().  Expand logic to detach the inpcb from its bound
address/port so that dropping a TCP connection releases the inpcb resource
reservation, which since the introduction of socket/pcb reference count
updates, has been persisting until the socket closed rather than being
released implicitly due to prior freeing of the inpcb on TCP drop.

MFC after:	3 months
2006-04-25 11:17:35 +00:00
Robert Watson
afa39e25c4 Change inp_ppcb from caddr_t to void *, fix/remove associated related
casts.

Consistently use intotw() to cast inp_ppcb pointers to struct tcptw *
pointers.

Consistently use intotcpcb() to cast inp_ppcb pointers to struct tcpcb *
pointers.

Don't assign tp to the results to intotcpcb() during variable declation
at the top of functions, as that is before the asserts relating to
locking have been performed.  Do this later in the function after
appropriate assertions have run to allow that operation to be conisdered
safe.

MFC after:	3 months
2006-04-03 13:33:55 +00:00
Robert Watson
4c7c478d0f Break out in_pcbdetach() into two functions:
- in_pcbdetach(), which removes the link between an inpcb and its
  socket.

- in_pcbfree(), which frees a detached pcb.

Unlike the previous in_pcbdetach(), neither of these functions will
attempt to conditionally free the socket, as they are responsible only
for managing in_pcb memory.  Mirror these changes into in6_pcbdetach()
by breaking it into in6_pcbdetach() and in6_pcbfree().

While here, eliminate undesired checks for NULL inpcb pointers in
sockets, as we will now have as an invariant that sockets will always
have valid so_pcb pointers.

MFC after:	3 months
2006-04-01 16:04:42 +00:00
Robert Watson
6882aa2c03 Define two new inpcb flags in the inp_vflag field, which for whatever
reason, seems to be where new flags are getting defined:

INP_DROPPED - The protocol has terminated this connection and the socket
              is not reusable: when the socket code enters the protocol,
              an error is immediately returned.  This will substitute for
              NULLing the so_pcb socket field, helping to implement the
              invariant that all valid sockets have valid pcb's in TCP.

INP_SOCKREF - The protocol has become the owner of the socket reference,
              and will need to free it when freeing the pcb, which will
              be used when a TCP socket is closed but still has queued
              data.

MFC after:	1 month
2006-03-26 11:30:31 +00:00
Robert Watson
a07b8fd178 Minor style tweak: tab after #define, not space.
MFC after:	1 month
2006-03-26 11:26:12 +00:00
David Malone
fcd1001c63 Make net.inet.ip.portrange.reservedhigh and
net.inet.ip.portrange.reservedlow apply to IPv6 aswell as IPv4.

We could have made new sysctls for IPv6, but that potentially makes
things complicated for mapped addresses. This seems like the least
confusing option and least likely to cause obscure problems in the
future.

This change makes the mac_portacl module useful with IPv6 apps.

Reviewed by:	ume
MFC after:	1 month
2006-03-19 11:48:48 +00:00
Andre Oppermann
b2828ad291 Implement IP_DONTFRAG IP socket option enabling the Don't Fragment
flag on IP packets.  Currently this option is only repected on udp
and raw ip sockets.  On tcp sockets the DF flag is controlled by the
path MTU discovery option.

Sending a packet larger than the MTU size of the egress interface
returns an EMSGSIZE error.

Discussed with:	rwatson
Sponsored by:	TCP/IP Optimization Fundraise 2005
2005-09-26 20:25:16 +00:00
Andre Oppermann
936cd18dad Add socketoption IP_MINTTL. May be used to set the minimum acceptable
TTL a packet must have when received on a socket.  All packets with a
lower TTL are silently dropped.  Works on already connected/connecting
and listening sockets for RAW/UDP/TCP.

This option is only really useful when set to 255 preventing packets
from outside the directly connected networks reaching local listeners
on sockets.

Allows userland implementation of 'The Generalized TTL Security Mechanism
(GTSM)' according to RFC3682.  Examples of such use include the Cisco IOS
BGP implementation command "neighbor ttl-security".

MFC after:	2 weeks
Sponsored by:	TCP/IP Optimization Fundraise 2005
2005-08-22 16:13:08 +00:00
Warner Losh
c398230b64 /* -> /*- for license, minor formatting changes 2005-01-07 01:45:51 +00:00
Mike Silbersack
5f311da2cc Port randomization leads to extremely fast port reuse at high
connection rates, which is causing problems for some users.

To retain the security advantage of random ports and ensure
correct operation for high connection rate users, disable
port randomization during periods of high connection rates.

Whenever the connection rate exceeds randomcps (10 by default),
randomization will be disabled for randomtime (45 by default)
seconds.  These thresholds may be tuned via sysctl.

Many thanks to Igor Sysoev, who proved the necessity of this
change and tested many preliminary versions of the patch.

MFC After:	20 seconds
2005-01-02 01:50:57 +00:00