Commit Graph

510 Commits

Author SHA1 Message Date
Andre Oppermann
79ce26a08c Simplify and enhance the window change/update acceptance logic,
especially in the presence of bi-directional data transfers.

snd_wl1 tracks the right edge, including data in the reassembly
queue, of valid incoming data.  This makes it like rcv_nxt plus
reassembly.  It never goes backwards to prevent older, possibly
reordered segments from updating the window.

snd_wl2 tracks the left edge of sent data.  This makes it a duplicate
of snd_una.  However joining them right now is difficult due to
separate update dependencies in different places in the code flow.

snd_wnd tracks the current advertized send window by the peer.  In
tcp_output() the effective window is calculated by subtracting the
already in-flight data, snd_nxt less snd_una, from it.

ACK's become the main clock of window updates and will always update
the window when the left edge of what we sent is advanced.  The ACK
clock is the primary signaling mechanism in ongoing data transfers.
This works reliably even in the presence of reordering, reassembly
and retransmitted segments.  The ACK clock is most important because
it determines how much data we are allowed to inject into the network.

Zero window updates get us out of persistence mode are crucial.  Here
a segment that neither moves ACK nor SEQ but enlarges WND is accepted.

When the ACK clock is not active (that is we're not or no longer
sending any data) any segment that moves the extended right SEQ edge,
including out-of-order segments, updates the window.  This gives us
updates especially during ping-pong transfers where the peer isn't
done consuming the already acknowledged data from the receive buffer
while responding with data.

The SSH protocol is a prime candidate to benefit from the improved
bi-directional window update logic as it has its own windowing
mechanism on top of TCP and is frequently sending back protocol ACK's.

Tcpdump provided by:	darrenr
Tested by:	darrenr
MFC after:	2 weeks
2012-10-28 19:16:22 +00:00
Andre Oppermann
4faaea5505 Allow arbitrary MSS sizes and don't mind about the cluster size anymore.
We've got more cluster sizes for quite some time now and the orginally
imposed limits and the previously codified thoughts on efficiency gains
are no longer true.

MFC after:	2 weeks
2012-10-28 18:33:52 +00:00
Andre Oppermann
cf8f04f4c0 When SYN or SYN/ACK had to be retransmitted RFC5681 requires us to
reduce the initial CWND to one segment.  This reduction got lost
some time ago due to a change in initialization ordering.

Additionally in tcp_timer_rexmt() avoid entering fast recovery when
we're still in TCPS_SYN_SENT state.

MFC after:	2 weeks
2012-10-28 17:25:08 +00:00
Andre Oppermann
22efabd40c Adjust the initial default CWND upon connection establishment to the
new and increased values specified by RFC5681 Section 3.1.

The even larger initial CWND per RFC3390, if enabled, is not affected.

MFC after:	2 weeks
2012-10-28 17:16:09 +00:00
Andrey V. Elsukov
c1de64a495 Remove the IPFIREWALL_FORWARD kernel option and make possible to turn
on the related functionality in the runtime via the sysctl variable
net.pfil.forward. It is turned off by default.

Sponsored by:	Yandex LLC
Discussed with:	net@
MFC after:	2 weeks
2012-10-25 09:39:14 +00:00
Gleb Smirnoff
8ad458a471 Do not reduce ip_len by size of IP header in the ip_input()
before passing a packet to protocol input routines.
  For several protocols this mean that now protocol needs to
do subtraction itself, and for another half this means that
we do not need to add header length back to the packet.

  Make ip_stripoptions() to adjust ip_len, since now we enter
this function with a packet header whose ip_len does represent
length of entire packet, not payload only.
2012-10-23 08:33:13 +00:00
Gleb Smirnoff
8f134647ca Switch the entire IPv4 stack to keep the IP packet header
in network byte order. Any host byte order processing is
done in local variables and host byte order values are
never[1] written to a packet.

  After this change a packet processed by the stack isn't
modified at all[2] except for TTL.

  After this change a network stack hacker doesn't need to
scratch his head trying to figure out what is the byte order
at the given place in the stack.

[1] One exception still remains. The raw sockets convert host
byte order before pass a packet to an application. Probably
this would remain for ages for compatibility.

[2] The ip_input() still subtructs header len from ip->ip_len,
but this is planned to be fixed soon.

Reviewed by:	luigi, Maxim Dounin <mdounin mdounin.ru>
Tested by:	ray, Olivier Cochard-Labbe <olivier cochard.me>
2012-10-22 21:09:03 +00:00
Gleb Smirnoff
105bd2113b In ip_stripoptions():
- Remove unused argument and incorrect comment.
  - Fixup ip_len after stripping.
2012-10-12 09:24:24 +00:00
Randall Stewart
ec03d5433f This small change takes care of a race condition
that can occur when both sides close at the same time.
If that occurs, without this fix the connection enters
FIN1 on both sides and they will forever send FIN|ACK at
each other until the connection times out. This is because
we stopped processing the FIN|ACK and thus did not advance
the sequence and so never ACK'd each others FIN. This
fix adjusts it so we *do* process the FIN properly and
the race goes away ;-)

MFC after:	1 month
2012-08-25 09:26:37 +00:00
Robert Watson
0989f56cff Update some stale comments regarding tcbinfo locking in the TCP input
path: read locks on tcbinfo are no longer used, so won't happen.  No
functional change.

MFC after:	3 days
2012-07-22 17:31:36 +00:00
Navdeep Parhar
09fe63205c - Updated TOE support in the kernel.
- Stateful TCP offload drivers for Terminator 3 and 4 (T3 and T4) ASICs.
  These are available as t3_tom and t4_tom modules that augment cxgb(4)
  and cxgbe(4) respectively.  The cxgb/cxgbe drivers continue to work as
  usual with or without these extra features.

- iWARP driver for Terminator 3 ASIC (kernel verbs).  T4 iWARP in the
  works and will follow soon.

Build-tested with make universe.

30s overview
============
What interfaces support TCP offload?  Look for TOE4 and/or TOE6 in the
capabilities of an interface:
# ifconfig -m | grep TOE

Enable/disable TCP offload on an interface (just like any other ifnet
capability):
# ifconfig cxgbe0 toe
# ifconfig cxgbe0 -toe

Which connections are offloaded?  Look for toe4 and/or toe6 in the
output of netstat and sockstat:
# netstat -np tcp | grep toe
# sockstat -46c | grep toe

Reviewed by:	bz, gnn
Sponsored by:	Chelsio communications.
MFC after:	~3 months (after 9.1, and after ensuring MFC is feasible)
2012-06-19 07:34:13 +00:00
Maksim Yevmenkin
77d396fd18 Plug more refcount leaks and possible NULL deref for interface
address list.

Submitted by:	scottl@
MFC after:	3 days
2012-06-04 18:43:51 +00:00
Bjoern A. Zeeb
356ab07e2d It turns out that too many drivers are not only parsing the L2/3/4
headers for TSO but also for generic checksum offloading.  Ideally we
would only have one common function shared amongst all drivers, and
perhaps when updating them for IPv6 we should introduce that.
Eventually we should provide the meta information along with mbufs to
avoid (re-)parsing entirely.

To not break IPv6 (checksums and offload) and to be able to MFC the
changes without risking to hurt 3rd party drivers, duplicate the v4
framework, as other OSes have done as well.

Introduce interface capability flags for TX/RX checksum offload with
IPv6, to allow independent toggling (where possible).  Add CSUM_*_IPV6
flags for UDP/TCP over IPv6, and reserve further for SCTP, and IPv6
fragmentation.  Define CSUM_DELAY_DATA_IPV6 as we do for legacy IP and
add an alias for CSUM_DATA_VALID_IPV6.

This pretty much brings IPv6 handling in line with IPv4.
TSO is still handled in a different way and not via if_hwassist.

Update ifconfig to allow (un)setting of the new capability flags.
Update loopback to announce the new capabilities and if_hwassist flags.

Individual driver updates will have to follow, as will SCTP.

Reported by:	gallatin, dim, ..
Reviewed by:	gallatin (glanced at?)
MFC after:	3 days
X-MFC with:	r235961,235959,235958
2012-05-28 09:30:13 +00:00
Bjoern A. Zeeb
45747ba53c MFp4 bz_ipv6_fast:
Add code to handle pre-checked TCP checksums as indicated by mbuf
  flags to save the entire computation for validation if not needed.

  In the IPv6 TCP output path only compute the pseudo-header checksum,
  set the checksum offset in the mbuf field along the appropriate flag
  as done in IPv4.

  In tcp_respond() just initialize the IPv6 payload length to 0 as
  ip6_output() will properly set it.

  Sponsored by:	The FreeBSD Foundation
  Sponsored by:	iXsystems

Reviewed by:	gnn (as part of the whole)
MFC After:	3 days
2012-05-25 02:23:26 +00:00
Bjoern A. Zeeb
3a9391defb MFp4 bz_ipv6_fast:
Factor out the tcp_hc_getmtu() call.  As the comments say it
  applies to both v4 and v6, so only write it once making it easier
  to read the protocol family specifc code.

  Sponsored by:	The FreeBSD Foundation
  Sponsored by:	iXsystems

Reviewed by:	gnn (as part of the whole)
MFC After:	3 days
2012-05-25 01:13:39 +00:00
Gleb Smirnoff
ef341ee1e3 When we receive an ICMP unreach need fragmentation datagram, we take
proposed MTU value from it and update the TCP host cache. Then
tcp_mss_update() is called on the corresponding tcpcb. It finds the
just allocated entry in the TCP host cache and updates MSS on the
tcpcb. And then we do a fast retransmit of what we have in the tcp
send buffer.

This sequence gets broken if the TCP host cache is exausted. In this
case allocation fails, and later called tcp_mss_update() finds nothing
in cache. The fast retransmit is done with not reduced MSS and is
immidiately replied by remote host with new ICMP datagrams and the
cycle repeats. This ping-pong can go up to wirespeed.

To fix this:
- tcp_mss_update() gets new parameter - mtuoffer, that is like
  offer, but needs to have min_protoh subtracted.
- tcp_mtudisc() as notification method renamed to tcp_mtudisc_notify().
- tcp_mtudisc() now accepts not a useless error argument, but proposed
  MTU value, that is passed to tcp_mss_update() as mtuoffer.

Reported by:	az
Reported by:	Andrey Zonov <andrey zonov.org>
Reviewed by:	andre (previous version of patch)
2012-04-16 13:49:03 +00:00
Bjoern A. Zeeb
d8951c8a2f Fix PAWS (Protect Against Wrapped Sequence numbers) in cases when
hz >> 1000 and thus getting outside the timestamp clock frequenceny of
1ms < x < 1s per tick as mandated by RFC1323, leading to connection
resets on idle connections.

Always use a granularity of 1ms using getmicrouptime() making all but
relevant callouts independent of hz.

Use getmicrouptime(), not getmicrotime() as the latter may make a jump
possibly breaking TCP nfsroot mounts having our timestamps move forward
for more than 24.8 days in a second without having been idle for that
long.

PR:		kern/61404
Reviewed by:	jhb, mav, rrs
Discussed with:	silby, lstewart
Sponsored by:	Sandvine Incorporated (originally in 2011)
MFC after:	6 weeks
2012-02-15 16:09:56 +00:00
Gleb Smirnoff
9077f38738 Add new socket options: TCP_KEEPINIT, TCP_KEEPIDLE, TCP_KEEPINTVL and
TCP_KEEPCNT, that allow to control initial timeout, idle time, idle
re-send interval and idle send count on a per-socket basis.

Reviewed by:	andre, bz, lstewart
2012-02-05 16:53:02 +00:00
John Baldwin
1e96ae8193 Remove the assertion from tcp_input() that rcv_nxt is always greater
than or equal to rcv_adv and fix tcp_twstart() to handle this case by
assuming the last window was zero rather than a negative value.

The code in tcp_input() already safely handled this case.  It can happen
due to delayed ACKs along with a remote sender that sends data beyond
the window we previously advertised.  If we have room in our socket buffer
for the extra data beyond the advertised window, we will accept it.
However, if the ACK for that segment is delayed, then we will not
effectively fixup rcv_adv to account for that extra data until the
next segment arrives and forces out an ACK.  When that next segment
arrives, rcv_nxt will be beyond rcv_adv.

Tested by:	pjd
MFC after:	1 week
2012-01-05 22:29:11 +00:00
Ed Schouten
6472ac3d8a Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs.
The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.
2011-11-07 15:43:11 +00:00
Sergey Kandaurov
ddd0c4a969 Restore sysctl names for tcp_sendspace/tcp_recvspace.
They seem to be changed unintentionally in r226437, and there were no
any mentions of renaming in commit log message.

Reported by:	Anton Yuzhaninov <citrin citrin ru>
2011-11-02 20:58:47 +00:00
Bjoern A. Zeeb
fba0cea143 Add syntactic sugar missed in r226437 and then not added either when moving
things around in r226448 but desperately needed to always make things
compile successfully.

MFC after:	1 week
2011-10-17 00:05:31 +00:00
Andre Oppermann
873789cb0f Move the tcp_sendspace and tcp_recvspace sysctl's from
the middle of tcp_usrreq.c to the top of tcp_output.c
and tcp_input.c respectively next to the socket buffer
autosizing controls.

MFC after:	1 week
2011-10-16 20:18:39 +00:00
Andre Oppermann
9ec4a4cca5 Remove the ss_fltsz and ss_fltsz_local sysctl's which have
long been superseded by the RFC3390 initial CWND sizing.

Also remove the remnants of TCP_METRICS_CWND which used the
TCP hostcache to set the initial CWND in a non-RFC compliant
way.

MFC after:	1 week
2011-10-16 20:06:44 +00:00
Andre Oppermann
e233e2acb3 VNET virtualize tcp_sendspace/tcp_recvspace and change the
type to INT.  A long is not necessary as the TCP window is
limited to 2**30.  A larger initial window isn't useful.

MFC after:	1 week
2011-10-16 15:08:43 +00:00
Attilio Rao
4af309c810 For the INP_TIMEWAIT case, there is no valid tcpcb object tied to the
inpcb object.
Skip the TCP_SIGNATURE check in that case as it is consistent with the
output path (no TCP_SIGNATURE for outcoming packets in TIMEWAIT state)
and also because for TIMEWAIT state the verify may be less effective.

Sponsored by:		Sandvine Incorporated
Reported by:		rwatson
No objections by:	rwatson
MFC after:		3 days
2011-10-06 14:29:38 +00:00
Bjoern A. Zeeb
b233773bb9 Increase the defaults for the maximum socket buffer limit,
and the maximum TCP send and receive buffer limits from 256kB
to 2MB.

For sb_max_adj we need to add the cast as already used in the sysctl
handler to not overflow the type doing the maths.

Note that this is just the defaults.  They will allow more memory
to be consumed per socket/connection if needed but not change the
default "idle" memory consumption.   All values are still tunable
by sysctls.

Suggested by:	gnn
Discussed on:	arch (Mar and Aug 2011)
MFC after:	3 weeks
Approved by:	re (kib)
2011-08-25 09:20:13 +00:00
Bjoern A. Zeeb
6f69742441 Fix compilation in case of defined(INET) && defined(IPFIREWALL_FORWARD)
but no INET6.

Reported by:	avg
Tested by:	avg
MFC after:	4 weeks
X-MFC with:	r225044
Approved by:	re (kib)
2011-08-20 18:45:38 +00:00
Bjoern A. Zeeb
8a006adb24 Add support for IPv6 to ipfw fwd:
Distinguish IPv4 and IPv6 addresses and optional port numbers in
user space to set the option for the correct protocol family.
Add support in the kernel for carrying the new IPv6 destination
address and port.
Add support to TCP and UDP for IPv6 and fix UDP IPv4 to not change
the address in the IP header.
Add support for IPv6 forwarding to a non-local destination.
Add a regession test uitilizing VIMAGE to check all 20 possible
combinations I could think of.

Obtained from:	David Dolson at Sandvine Incorporated
		(original version for ipfw fwd IPv6 support)
Sponsored by:	Sandvine Incorporated
PR:		bin/117214
MFC after:	4 weeks
Approved by:	re (kib)
2011-08-20 17:05:11 +00:00
Robert Watson
d3c1f00350 Add _mbuf() variants of various inpcb-related interfaces, including lookup,
hash install, etc.  For now, these are arguments are unused, but as we add
RSS support, we will want to use hashes extracted from mbufs, rather than
manually calculated hashes of header fields, due to the expensive of the
software version of Toeplitz (and similar hashes).

Add notes that it would be nice to be able to pass mbufs into lookup
routines in pf(4), optimising firewall lookup in the same way, but the
code structure there doesn't facilitate that currently.

(In principle there is no reason this couldn't be MFCed -- the change
extends rather than modifies the KBI.  However, it won't be useful without
other previous possibly less MFCable changes.)

Reviewed by:    bz
Sponsored by:   Juniper Networks, Inc.
2011-06-04 16:33:06 +00:00
Robert Watson
fa046d8774 Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
  inpcb counter.  This lock is now relegated to a small number of
  allocation and free operations, and occasional operations that walk
  all connections (including, awkwardly, certain UDP multicast receive
  operations -- something to revisit).

- A new ipi_hash_lock protects the two inpcbinfo hash tables for
  looking up connections and bound sockets, manipulated using new
  INP_HASH_*() macros.  This lock, combined with inpcb locks, protects
  the 4-tuple address space.

Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required.  As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.

A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb.  Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed.  In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup.  New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:

  INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
  INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb

Callers must pass exactly one of these flags (for the time being).

Some notes:

- All protocols are updated to work within the new regime; especially,
  TCP, UDPv4, and UDPv6.  pcbinfo ipi_lock acquisitions are largely
  eliminated, and global hash lock hold times are dramatically reduced
  compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
  may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
  is no longer available -- hash lookup locks are now held only very
  briefly during inpcb lookup, rather than for potentially extended
  periods.  However, the pcbinfo ipi_lock will still be acquired if a
  connection state might change such that a connection is added or
  removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
  due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
  callers to acquire hash locks and perform one or more lookups atomically
  with 4-tuple allocation: this is required only for TCPv6, as there is no
  in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
  locking, which relates to source address selection.  This needs
  attention, as it likely significantly reduces parallelism in this code
  for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
  somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which
  is no longer sufficient.  A second check once the inpcb lock is held
  should do the trick, keeping the general case from requiring the inpcb
  lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
  which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
  undesirable, and probably another argument is required to take care of
  this (or a char array name field in the pcbinfo?).

This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics.  It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.

Reviewed by:    bz
Sponsored by:   Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
John Baldwin
5891ebd6cd Oops, fix order of sequence numbers in KASSERT()'s to catch negative
receive windows to match the labels in the panic message.

Submitted by:	trociny
2011-05-14 14:41:40 +00:00
John Baldwin
f701e30d7f Handle a rare edge case with nearly full TCP receive buffers. If a TCP
buffer fills up causing the remote sender to enter into persist mode, but
there is still room available in the receive buffer when a window probe
arrives (either due to window scaling, or due to the local application
very slowing draining data from the receive buffer), then the single byte
of data in the window probe is accepted.  However, this can cause rcv_nxt
to be greater than rcv_adv.  This condition will only last until the next
ACK packet is pushed out via tcp_output(), and since the previous ACK
advertised a zero window, the ACK should be pushed out while the TCP
pcb is write-locked.

During the window while rcv_nxt is greather than rcv_adv, a few places
would compute the remaining receive window via rcv_adv - rcv_nxt.
However, this value was then (uint32_t)-1.  On a 64 bit machine this
could expand to a positive 2^32 - 1 when cast to a long.  In particular,
when calculating the receive window in tcp_output(), the result would be
that the receive window was computed as 2^32 - 1 resulting in advertising
a far larger window to the remote peer than actually existed.

Fix various places that compute the remaining receive window to either
assert that it is not negative (i.e. rcv_nxt <= rcv_adv), or treat the
window as full if rcv_nxt is greather than rcv_adv.

Reviewed by:	bz
MFC after:	1 month
2011-05-02 21:05:52 +00:00
Bjoern A. Zeeb
29bd2010d4 Fix a mismerge from p4 in that in_localaddr() is not available without INET.
Sponsored by:	The FreeBSD Foundation
Sponsored by:	iXsystems
MFC after:	4 days
2011-04-30 16:30:18 +00:00
Bjoern A. Zeeb
b287c6c70c Make the TCP code compile without INET. Sort #includes and add #ifdef INETs.
Add some comments at #endifs given more nestedness.  To make the compiler
happy, some default initializations were added in accordance with the style
on the files.

Reviewed by:	gnn
Sponsored by:	The FreeBSD Foundation
Sponsored by:	iXsystems
MFC after:	4 days
2011-04-30 11:21:29 +00:00
John Baldwin
672dc4aea2 TCP reuses t_rxtshift to determine the backoff timer used for both the
persist state and the retransmit timer.  However, the code that implements
"bad retransmit recovery" only checks t_rxtshift to see if an ACK has been
received in during the first retransmit timeout window.  As a result, if
ticks has wrapped over to a negative value and a socket is in the persist
state, it can incorrectly treat an ACK from the remote peer as a
"bad retransmit recovery" and restore saved values such as snd_ssthresh and
snd_cwnd.  However, if the socket has never had a retransmit timeout, then
these saved values will be zero, so snd_ssthresh and snd_cwnd will be set
to 0.

If the socket is in fast recovery (this can be caused by excessive
duplicate ACKs such as those fixed by 220794), then each ACK that arrives
triggers either NewReno or SACK partial ACK handling which clamps snd_cwnd
to be no larger than snd_ssthresh.  In effect, the socket's send window
is permamently stuck at 0 even though the remote peer is advertising a
much larger window and pending data is only sent via TCP window probes
(so one byte every few seconds).

Fix this by adding a new TCP pcb flag (TF_PREVVALID) that indicates that
the various snd_*_prev fields in the pcb are valid and only perform
"bad retransmit recovery" if this flag is set in the pcb.  The flag is set
on the first retransmit timeout that occurs and is cleared on subsequent
retransmit timeouts or when entering the persist state.

Reviewed by:	bz
MFC after:	2 weeks
2011-04-29 15:40:12 +00:00
Attilio Rao
2903309aca Add the possibility to verify MD5 hash of incoming TCP packets.
As long as this is a costy function, even when compiled in (along with
the option TCP_SIGNATURE), it can be disabled via the
net.inet.tcp.signature_verify_input sysctl.

Sponsored by:	Sandvine Incorporated
Reviewed by:	emaste, bz
MFC after:	2 weeks
2011-04-25 17:13:40 +00:00
Lawrence Stewart
891b8ed467 Use the full and proper company name for Swinburne University of Technology
throughout the source tree.

Requested by:	Grenville Armitage, Director of CAIA at Swinburne University of
			Technology
MFC after:	3 days
2011-04-12 08:13:18 +00:00
John Baldwin
766282cbe7 Clamp the initial advertised receive window when responding to a SYN/ACK
to the maximum allowed window.  Growing the window too large would cause
an underflow in the calculations in tcp_output() to decide if a window
update should be sent which would prevent the persist timer from being
started if data was pending and the other end of the connection advertised
an initial window size of 0.

PR:		kern/154006
Submitted by:	Stefan `Sec` Zehl  sec 42 org
Reviewed by:	bz
MFC after:	1 week
2011-03-30 12:35:39 +00:00
Lawrence Stewart
d64a46ea1a Reset the last_sack_ack SACK hint for TCP input processing to ensure that the
hint is 0 when no SACK data is received to update the hint with. This was
accidentally omitted from r216753.

Sponsored by:	FreeBSD Foundation
MFC after:	10 weeks
X-MFC with:	216753
2011-01-10 06:12:01 +00:00
John Baldwin
79e955ed63 Trim extra spaces before tabs. 2011-01-07 21:40:34 +00:00
Lawrence Stewart
39bc9de532 - Add some helper hook points to the TCP stack. The hooks allow Khelp modules to
access inbound/outbound events and associated data for established TCP
  connections. The hooks only run if at least one hook function is registered
  for the hook point, ensuring the impact on the stack is effectively nil when
  no TCP Khelp modules are loaded. struct tcp_hhook_data is passed as contextual
  data to any registered Khelp module hook functions.

- Add an OSD (Object Specific Data) pointer to struct tcpcb to allow Khelp
  modules to associate per-connection data with the TCP control block.

- Bump __FreeBSD_version and add a note to UPDATING regarding to ABI changes
  introduced by this commit and r216753.

In collaboration with:	David Hayes <dahayes at swin edu au> and
				Grenville Armitage <garmitage at swin edu au>
Sponsored by:	FreeBSD Foundation
Reviewed by:	bz, others along the way
MFC after:	3 months
2010-12-28 12:13:30 +00:00
Lawrence Stewart
6157935fa5 Set ssthresh appropriately on RTO. This change was accidentally not ported from
the pre modular CC stack.

Sponsored by:	FreeBSD Foundation
Submitted by:	David Hayes <dahayes at swin edu au>
MFC after:	9 weeks
X-MFC with:	r215166
2010-12-02 01:01:37 +00:00
Lawrence Stewart
dbc4240942 This commit marks the first formal contribution of the "Five New TCP Congestion
Control Algorithms for FreeBSD" FreeBSD Foundation funded project. More details
about the project are available at: http://caia.swin.edu.au/freebsd/5cc/

- Add a KPI and supporting infrastructure to allow modular congestion control
  algorithms to be used in the net stack. Algorithms can maintain per-connection
  state if required, and connections maintain their own algorithm pointer, which
  allows different connections to concurrently use different algorithms. The
  TCP_CONGESTION socket option can be used with getsockopt()/setsockopt() to
  programmatically query or change the congestion control algorithm respectively
  from within an application at runtime.

- Integrate the framework with the TCP stack in as least intrusive a manner as
  possible. Care was also taken to develop the framework in a way that should
  allow integration with other congestion aware transport protocols (e.g. SCTP)
  in the future. The hope is that we will one day be able to share a single set
  of congestion control algorithm modules between all congestion aware transport
  protocols.

- Introduce a new congestion recovery (TF_CONGRECOVERY) state into the TCP stack
  and use it to decouple the meaning of recovery from a congestion event and
  recovery from packet loss (TF_FASTRECOVERY) a la RFC2581. ECN and delay based
  congestion control protocols don't generally need to recover from packet loss
  and need a different way to note a congestion recovery episode within the
  stack.

- Remove the net.inet.tcp.newreno sysctl, which simplifies some portions of code
  and ensures the stack always uses the appropriate mechanisms for recovering
  from packet loss during a congestion recovery episode.

- Extract the NewReno congestion control algorithm from the TCP stack and
  massage it into module form. NewReno is always built into the kernel and will
  remain the default algorithm for the forseeable future. Implementations of
  additional different algorithms will become available in the near future.

- Bump __FreeBSD_version to 900025 and note in UPDATING that rebuilding code
  that relies on the size of "struct tcpcb" is required.

Many thanks go to the Cisco University Research Program Fund at Community
Foundation Silicon Valley and the FreeBSD Foundation. Their support of our work
at the Centre for Advanced Internet Architectures, Swinburne University of
Technology is greatly appreciated.

In collaboration with:	David Hayes <dahayes at swin edu au> and
			Grenville Armitage <garmitage at swin edu au>
Sponsored by:	Cisco URP, FreeBSD Foundation
Reviewed by:	rpaulo
Tested by:	David Hayes (and many others over the years)
MFC after:	3 months
2010-11-12 06:41:55 +00:00
Andre Oppermann
1c18314d17 Remove the TCP inflight bandwidth limiter as announced in r211315
to give way for the pluggable congestion control framework.  It is
the task of the congestion control algorithm to set the congestion
window and amount of inflight data without external interference.

In 'struct tcpcb' the variables previously used by the inflight
limiter are renamed to spares to keep the ABI intact and to have
some more space for future extensions.

In 'struct tcp_info' the variable 'tcpi_snd_bwnd' is not removed to
preserve the ABI.  It is always set to 0.

In siftr.c in 'struct pkt_node' the variable 'snd_bwnd' is not removed
to preserve the ABI.  It is always set to 0.

These unused variable in the various structures may be reused in the
future or garbage collected before the next release or at some other
point when an ABI change happens anyway for other reasons.

No MFC is planned.  The inflight bandwidth limiter stays disabled by
default in the other branches but remains available.
2010-09-16 21:06:45 +00:00
Andre Oppermann
8502ec25dc Use timestamp modulo comparison macro for automatic receive buffer
scaling to correctly handle wrapping of ticks value.

MFC after:	1 week
2010-08-27 12:34:53 +00:00
Andre Oppermann
b7d747ecec Untangle the net.inet.tcp.log_in_vain and net.inet.tcp.log_debug
sysctl's and remove any side effects.

Both sysctl's share the same backend infrastructure and due to the
way it was implemented enabling net.inet.tcp.log_in_vain would also
cause log_debug output to be generated.  This was surprising and
eventually annoying to the user.

The log output backend is kept the same but a little shim is inserted
to properly separate log_in_vain and log_debug and to remove any side
effects.

PR:		kern/137317
MFC after:	1 week
2010-08-18 17:39:47 +00:00
Bjoern A. Zeeb
82cea7e6f3 MFP4: @176978-176982, 176984, 176990-176994, 177441
"Whitspace" churn after the VIMAGE/VNET whirls.

Remove the need for some "init" functions within the network
stack, like pim6_init(), icmp_init() or significantly shorten
others like ip6_init() and nd6_init(), using static initialization
again where possible and formerly missed.

Move (most) variables back to the place they used to be before the
container structs and VIMAGE_GLOABLS (before r185088) and try to
reduce the diff to stable/7 and earlier as good as possible,
to help out-of-tree consumers to update from 6.x or 7.x to 8 or 9.

This also removes some header file pollution for putatively
static global variables.

Revert VIMAGE specific changes in ipfilter::ip_auth.c, that are
no longer needed.

Reviewed by:	jhb
Discussed with:	rwatson
Sponsored by:	The FreeBSD Foundation
Sponsored by:	CK Software GmbH
MFC after:	6 days
2010-04-29 11:52:42 +00:00
Rui Paulo
9c251892c0 Honor the CE bit even when the CWR bit is set.
PR:		145600
Submitted by:	Richard Scheffenegger <rs at netapp.com>
MFC after:	1 week
2010-04-10 12:47:06 +00:00
Robert Watson
66f80e90ef Wrap use of rw_try_upgrade() on pcbinfo with macro INP_INFO_TRY_UPGRADE()
to match other pcbinfo locking macros.

MFC after:	1 week
2010-03-06 21:24:11 +00:00
Robert Watson
f681a5fdd4 Remove tcp_input lock statistics; these are intended for debugging only
and are not intended to ship in 8.0 as they dirty additional cache
lines in a performance-critical per-packet path.

MFC after:	3 days
2009-10-06 20:35:41 +00:00
Robert Watson
883e9bc41d In tcp_input(), we acquire a global write lock at first only if a
segment is likely to trigger a TCP state change (i.e., FIN/RST/SYN).
If we later have to upgrade the lock, we acquire an inpcb reference
and drop both global/inpcb locks before reacquiring in-order.  In
that gap, the connection may transition into TIMEWAIT, so we need
to loop back and reevaluate the inpcb after relocking.

MFC after:	3 days
Reported by:	Kamigishi Rei <spambox at haruhiism.net>
Reviewed by:	bz
2009-10-05 22:24:13 +00:00
Robert Watson
315e3e38fa Many network stack subsystems use a single global data structure to hold
all pertinent statatistics for the subsystem.  These structures are
sometimes "borrowed" by kernel modules that require a place to store
statistics for similar events.

Add KPI accessor functions for statistics structures referenced by kernel
modules so that they no longer encode certain specifics of how the data
structures are named and stored.  This change is intended to make it
easier to move to per-CPU network stats following 8.0-RELEASE.

The following modules are affected by this change:

      if_bridge
      if_cxgb
      if_gif
      ip_mroute
      ipdivert
      pf

In practice, most of these statistics consumers should, in fact, maintain
their own statistics data structures rather than borrowing structures
from the base network stack.  However, that change is too agressive for
this point in the release cycle.

Reviewed by:	bz
Approved by:	re (kib)
2009-08-02 19:43:32 +00:00
Robert Watson
530c006014 Merge the remainder of kern_vimage.c and vimage.h into vnet.c and
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks.  Minor cleanups are done in the process,
and comments updated to reflect these changes.

Reviewed by:	bz
Approved by:	re (vimage blanket)
2009-08-01 19:26:27 +00:00
Julian Elischer
7973fba3a4 Somewhere along the line accept sockets stopped honoring the
FIB selected for them. Fix this.

Reviewed by:	ambrisko
Approved by:	re (kib)
MFC after:	3 days
2009-07-28 19:43:27 +00:00
Robert Watson
eddfbb763d Build on Jeff Roberson's linker-set based dynamic per-CPU allocator
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator.  Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...).  This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.

Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack.  Virtualized global variables are
tagged at compile-time, placing the in a special linker set, which is
loaded into a contiguous region of kernel memory.  Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of a the kernel linker.

Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy.  Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address.  When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.

This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.

Bump __FreeBSD_version and update UPDATING.

Portions submitted by:  bz
Reviewed by:            bz, zec
Discussed with:         gnn, jamie, jeff, jhb, julian, sam
Suggested by:           peter
Approved by:            re (kensmith)
2009-07-14 22:48:30 +00:00
Robert Watson
8c0fec805f Modify most routines returning 'struct ifaddr *' to return references
rather than pointers, requiring callers to properly dispose of those
references.  The following routines now return references:

  ifaddr_byindex
  ifa_ifwithaddr
  ifa_ifwithbroadaddr
  ifa_ifwithdstaddr
  ifa_ifwithnet
  ifaof_ifpforaddr
  ifa_ifwithroute
  ifa_ifwithroute_fib
  rt_getifa
  rt_getifa_fib
  IFP_TO_IA
  ip_rtaddr
  in6_ifawithifp
  in6ifa_ifpforlinklocal
  in6ifa_ifpwithaddr
  in6_ifadd
  carp_iamatch6
  ip6_getdstifaddr

Remove unused macro which didn't have required referencing:

  IFP_TO_IA6

This closes many small races in which changes to interface
or address lists while an ifaddr was in use could lead to use of freed
memory (etc).  In a few cases, add missing if_addr_list locking
required to safely acquire references.

Because of a lack of deep copying support, we accept a race in which
an in6_ifaddr pointed to by mbuf tags and extracted with
ip6_getdstifaddr() doesn't hold a reference while in transmit.  Once
we have mbuf tag deep copy support, this can be fixed.

Reviewed by:	bz
Obtained from:	Apple, Inc. (portions)
MFC after:	6 weeks (portions)
2009-06-23 20:19:09 +00:00
John Baldwin
6dfb8b316c Fix edge cases with ticks wrapping from INT_MAX to INT_MIN in the handling
of the per-tcpcb t_badtrxtwin.

Submitted by:	bde
2009-06-16 19:00:12 +00:00
John Baldwin
1a0e7cfc42 Trim extra ()'s.
Submitted by:	bde
2009-06-11 14:36:13 +00:00
John Baldwin
0e8cc7e748 Change a few members of tcpcb that store cached copies of ticks to be ints
instead of unsigned longs.  This fixes a few overflow edge cases on 64-bit
platforms.  Specifically, if an idle connection receives a packet shortly
before 2^31 clock ticks of uptime (about 25 days with hz=1000) and the keep
alive timer fires after 2^31 clock ticks, the keep alive timer will think
that the connection has been idle for a very long time and will immediately
drop the connection instead of sending a keep alive probe.

Reviewed by:	silby, gnn, lstewart
MFC after:	1 week
2009-06-10 18:27:15 +00:00
Robert Watson
bcf11e8d00 Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with:	pjd
2009-06-05 14:55:22 +00:00
Robert Watson
f93bfb23dc Add internal 'mac_policy_count' counter to the MAC Framework, which is a
count of the number of registered policies.

Rather than unconditionally locking sockets before passing them into MAC,
lock them in the MAC entry points only if mac_policy_count is non-zero.

This avoids locking overhead for a number of socket system calls when no
policies are registered, eliminating measurable overhead for the MAC
Framework for the socket subsystem when there are no active policies.

Possibly socket locks should be acquired by policies if they are required
for socket labels, which would further avoid locking overhead when there
are policies but they don't require labeling of sockets, or possibly
don't even implement socket controls.

Obtained from:	TrustedBSD Project
2009-06-02 18:26:17 +00:00
Zachary Loafman
81ad7eb017 Correct handling of SYN packets that are to the left of the current window of an ESTABLISHED connection.
Reviewed by:        net@, gnn
Approved by:        dfr (mentor)
2009-05-27 17:02:10 +00:00
Robert Watson
78b5071407 Update stats in struct tcpstat using two new macros, TCPSTAT_ADD() and
TCPSTAT_INC(), rather than directly manipulating the fields across the
kernel.  This will make it easier to change the implementation of
these statistics, such as using per-CPU versions of the data structures.

MFC after:	3 days
2009-04-11 22:07:19 +00:00
Kip Macy
80cb9f211a Import "flowid" support for serializing flows across transmit queues
Reviewed by:	rwatson and jeli
2009-04-10 06:16:14 +00:00
Robert Watson
ad71fe3c35 Correct a number of evolved problems with inp_vflag and inp_flags:
certain flags that should have been in inp_flags ended up in inp_vflag,
meaning that they were inconsistently locked, and in one case,
interpreted.  Move the following flags from inp_vflag to gaps in the
inp_flags space (and clean up the inp_flags constants to make gaps
more obvious to future takers):

  INP_TIMEWAIT
  INP_SOCKREF
  INP_ONESBCAST
  INP_DROPPED

Some aspects of this change have no effect on kernel ABI at all, as these
are UDP/TCP/IP-internal uses; however, netstat and sockstat detect
INP_TIMEWAIT when listing TCP sockets, so any MFC will need to take this
into account.

MFC after:      1 week (or after dependencies are MFC'd)
Reviewed by:    bz
2009-03-15 09:58:31 +00:00
Lawrence Stewart
24cb0f2232 Add TCP Appropriate Byte Counting (RFC 3465) support to kernel.
The new behaviour is on by default, and can be disabled by setting the
net.inet.tcp.rfc3465 sysctl to 0 to obtain previous behaviour.

The patch changes struct tcpcb in sys/netinet/tcp_var.h which breaks
the ABI. Bump __FreeBSD_version to 800061 accordingly. User space tools
that rely on the size of struct tcpcb (e.g. sockstat) need to be recompiled.

Reviewed by:	rpaulo, gnn
Approved by:	gnn, kmacy (mentors)
Sponsored by:	FreeBSD Foundation
2009-01-15 06:44:22 +00:00
Bjoern A. Zeeb
dcdb4371ca Use inc_flags instead of the inc_isipv6 alias which so far
had been the only flag with random usage patterns.
Switch inc_flags to be used as a real bit field by using
INC_ISIPV6 with bitops to check for the 'isipv6' condition.

While here fix a place or two where in case of v4 inc_flags
were not properly initialized before.[1]

Found by:	rwatson during review [1]
Discussed with:	rwatson
Reviewed by:	rwatson
MFC after:	4 weeks
2008-12-17 12:52:34 +00:00
Robert Watson
d15fb96522 Enhance one comment relating to recent TCP locking changes, and fix a
typo in another.

MFC after:	6 weeks
2008-12-09 15:49:02 +00:00
Robert Watson
252ca42863 Move from solely write-locking the global tcbinfo in tcp_input()
to read-locking in the TCP input path, allowing greater TCP
input parallelism where multiple ithreads or ithread and netisr
are able to run in parallel.  Previously, most TCP input paths
held a write lock on the global tcbinfo lock, effectively
serializing TCP input.

Before looking up the connection, acquire a write lock if a
potentially state-changing flag is set on the TCP segment header
(FIN, RST, SYN), and otherwise a read lock.  We may later have
to upgrade to a write lock in certain cases (ACKs received by the
syncache or during TIMEWAIT) in order to support global state
transitions, but this is never required for steady-state packets.

Upgrading from a write lock to a read lock must be done as a
trylock operation to avoid deadlocks, and actually violates the
lock order as the tcbinfo lock preceeds the inpcb lock held at
the time of upgrade.  If the trylock fails, we bump the refcount
on the inpcb, drop both locks, and re-acquire in-order.  If
another thread has freed the connection while the locks are
dropped, we free the inpcb and repeat the lookup (this should
hardly ever or never happen in practice).

For now, maintain a number of new counters measuring how many
times various cases execute, and in particular whether various
optimistic assumptions about when read locks can be used, whether
upgrades are done using the fast path, and whether connections
close in practice in the above-described race, actually occur.

MFC after:	6 weeks
Discussed with:	kmacy
Reviewed by:	bz, gnn, kmacy
Tested by:	kmacy
2008-12-08 20:27:00 +00:00
Bjoern A. Zeeb
4b79449e2f Rather than using hidden includes (with cicular dependencies),
directly include only the header files needed. This reduces the
unneeded spamming of various headers into lots of files.

For now, this leaves us with very few modules including vnet.h
and thus needing to depend on opt_route.h.

Reviewed by:	brooks, gnn, des, zec, imp
Sponsored by:	The FreeBSD Foundation
2008-12-02 21:37:28 +00:00
Marko Zec
97021c2464 Merge more of currently non-functional (i.e. resolving to
whitespace) macros from p4/vimage branch.

Do a better job at enclosing all instantiations of globals
scheduled for virtualization in #ifdef VIMAGE_GLOBALS blocks.

De-virtualize and mark as const saorder_state_alive and
saorder_state_any arrays from ipsec code, given that they are never
updated at runtime, so virtualizing them would be pointless.

Reviewed by:  bz, julian
Approved by:  julian (mentor)
Obtained from:        //depot/projects/vimage-commit2/...
X-MFC after:  never
Sponsored by: NLnet Foundation, The FreeBSD Foundation
2008-11-26 22:32:07 +00:00
Marko Zec
44e33a0758 Change the initialization methodology for global variables scheduled
for virtualization.

Instead of initializing the affected global variables at instatiation,
assign initial values to them in initializer functions.  As a rule,
initialization at instatiation for such variables should never be
introduced again from now on.  Furthermore, enclose all instantiations
of such global variables in #ifdef VIMAGE_GLOBALS blocks.

Essentialy, this change should have zero functional impact.  In the next
phase of merging network stack virtualization infrastructure from
p4/vimage branch, the new initialization methology will allow us to
switch between using global variables and their counterparts residing in
virtualization containers with minimum code churn, and in the long run
allow us to intialize multiple instances of such container structures.

Discussed at:	devsummit Strassburg
Reviewed by:	bz, julian
Approved by:	julian (mentor)
Obtained from:	//depot/projects/vimage-commit2/...
X-MFC after:	never
Sponsored by:	NLnet Foundation, The FreeBSD Foundation
2008-11-19 09:39:34 +00:00
Bjoern A. Zeeb
8e5c87f4b6 Fix typo and while here another one.
Reviewed by:	keramida
Reported by:	keramida
MFC after:	2 months (with r184720)
2008-11-06 16:30:20 +00:00
Bjoern A. Zeeb
91d6cfa6b1 Fix a bug introduced with r182851 splitting tcp_mss() into
tcp_mss() and tcp_mss_update() so that tcp_mtudisc() could
re-use the same code.

Move the TSO logic back to tcp_mss() and out of tcp_mss_update().
We tried to avoid that initially but if were are called from
tcp_output() with EMSGSIZE, we cleared the TSO flag on the tcpcb
there, called into tcp_mtudisc() and tcp_mss_update() which
then would reenable TSO on the tcpcb based on TSO capabilities
of the interface as learnt in tcp_maxmtu/6().
So if TSO was enabled on the (possibly new) outgoing interface
it was turned back on, which lead to an endless loop between
tcp_output() and tcp_mtudisc() until we overflew the stack.

Reported by:	kmacy
MFC after:	2 months (along with r182851)
2008-11-06 13:25:59 +00:00
Bjoern A. Zeeb
6f01cac68a Fix a bug introduced with r182851 splitting tcp_mss() into
tcp_mss() and tcp_mss_update() so that tcp_mtudisc() could
re-use the same code.

In case we return early and got a metricptr to pass the hostcache
info back to the caller we need to initialize the data to a defined
state (zero it) as tcp_hc_get() would do if there was no hit.
Without that the caller would check on random stack garbage which
could lead to undefined results.

This only affected tcp_mss() if there was no routing entry for the peer,
tcp_mtudisc() was not affected.

MFC after:	2 months (along with r182851)
2008-11-06 12:33:33 +00:00
Robert Watson
dd8ac7f990 In both dropwithreset paths in tcp_input.c, drop the tcbinfo lock
sooner to decomplicate locking and eliminate the need for a rather
chatty comment about why we have to handle the global lock in a
special way for the benefit of ipfw and pf cred rules.

MFC after:	3 days
2008-10-26 22:03:52 +00:00
Robert Watson
4c95fd23d6 Remove endearing but syntactically unnecessary "return;" statements
directly before the final closeing brackets of some TCP functions.

MFC after:	3 days
2008-10-26 19:33:22 +00:00
Robert Watson
6c8286e42d Don't pass curthread to sbreserve_locked() in tcp_do_segment(), as the
netisr or ithread's socket buffer size limit is not the right limit to
use.  Instead, pass NULL as the other two calls to sbreserve_locked()
in the TCP input path (tcp_mss()) do.

In practice, this is a no-op, as ithreads and the netisr run without a
process limit on socket buffer use, and a NULL thread pointer leads to
not using the process's limit, if any.  However, if tcp_input() is
called in other contexts that do have limits, this may prevent the
incorrect limit from being used.

MFC after:	3 days
2008-10-07 09:41:07 +00:00
Marko Zec
8b615593fc Step 1.5 of importing the network stack virtualization infrastructure
from the vimage project, as per plan established at devsummit 08/08:
http://wiki.freebsd.org/Image/Notes200808DevSummit

Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator
macros, and CURVNET_SET() context setting macros, all currently
resolving to NOPs.

Prepare for virtualization of selected SYSCTL objects by introducing a
family of SYSCTL_V_*() macros, currently resolving to their global
counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT().

Move selected #defines from sys/sys/vimage.h to newly introduced header
files specific to virtualized subsystems (sys/net/vnet.h,
sys/netinet/vinet.h etc.).

All the changes are verified to have zero functional impact at this
point in time by doing MD5 comparision between pre- and post-change
object files(*).

(*) netipsec/keysock.c did not validate depending on compile time options.

Implemented by:	julian, bz, brooks, zec
Reviewed by:	julian, bz, brooks, kris, rwatson, ...
Approved by:	julian (mentor)
Obtained from:	//depot/projects/vimage-commit2/...
X-MFC after:	never
Sponsored by:	NLnet Foundation, The FreeBSD Foundation
2008-10-02 15:37:58 +00:00
Robert Watson
014ea782b1 As a follow-on to r183323, correct another case where ip_output() was
called without an inpcb pointer despite holding the tcbinfo global
lock, which lead to a deadlock or panic when ipfw tried to further
acquire it recursively.

Reported by:    Stefan Ehmann <shoesoft at gmx dot net>
MFC after:      3 days
2008-09-25 17:26:54 +00:00
Robert Watson
a0ca087183 When dropping a packet and issuing a reset during TCP segment handling,
unconditionally drop the tcbinfo lock (after all, we assert it lines
before), but call tcp_dropwithreset() under both inpcb and inpcbinfo
locks only if we pass in an tcpcb.  Otherwise, if the pointer is NULL,
firewall code may later recurse the global tcbinfo lock trying to look
up an inpcb.

This is an instance where a layering violation leads not only
potentially to code reentrace and recursion, but also to lock
recursion, and was revealed by the conversion to rwlocks because
acquiring a read lock on an rwlock already held with a write lock is
forbidden.  When these locks were mutexes, they simply recursed.

Reported by:	Stefan Ehmann <shoesoft at gmx dot net>
MFC after:	3 days
2008-09-24 11:07:03 +00:00
Bjoern A. Zeeb
c10eb6d10a Work around an integer division resulting in 0 and thus the
congestion window not being incremented, if cwnd > maxseg^2.
As suggested in RFC2581 increment the cwnd by 1 in this case.

See http://caia.swin.edu.au/reports/080829A/CAIA-TR-080829A.pdf
for more details.

Submitted by:	Alana Huebner, Lawrence Stewart,
		Grenville Armitage (caia.swin.edu.au)
Reviewed by:	dwmalone, gnn, rpaulo
MFC After:	3 days
2008-09-09 07:35:21 +00:00
Bjoern A. Zeeb
3cee92e074 Split tcp_mss() in tcp_mss() and tcp_mss_update() where the former
calls the latter.

Merge tcp_mss_update() with code from tcp_mtudisc() basically
doing the same thing.

This gives us one central place where we calcuate and check mss values
to update t_maxopd (maximum mss + options length) instead of two slightly
different but almost equal implementations to maintain.

PR:		kern/118455
Reviewed by:	silby (back in March)
MFC after:	2 months
2008-09-07 18:50:25 +00:00
Julian Elischer
ac957cd271 A bunch of formatting fixes brough to light by, or created by the Vimage commit
a few days ago.
2008-08-20 01:05:56 +00:00
Bjoern A. Zeeb
603724d3ab Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).

This is the first in a series of commits over the course
of the next few weeks.

Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.

We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.

Obtained from:	//depot/projects/vimage-commit2/...
Reviewed by:	brooks, des, ed, mav, julian,
		jamie, kris, rwatson, zec, ...
		(various people I forgot, different versions)
		md5 (with a bit of help)
Sponsored by:	NLnet Foundation, The FreeBSD Foundation
X-MFC after:	never
V_Commit_Message_Reviewed_By:	more people than the patch
2008-08-17 23:27:27 +00:00
Rui Paulo
f2512ba12a MFp4 (//depot/projects/tcpecn/):
TCP ECN support. Merge of my GSoC 2006 work for NetBSD.
  TCP ECN is defined in RFC 3168.

Partly reviewed by:	dwmalone, silby
Obtained from:		NetBSD
2008-07-31 15:10:09 +00:00
Julian Elischer
8b07e49a00 Add code to allow the system to handle multiple routing tables.
This particular implementation is designed to be fully backwards compatible
and to be MFC-able to 7.x (and 6.x)

Currently the only protocol that can make use of the multiple tables is IPv4
Similar functionality exists in OpenBSD and Linux.

From my notes:

-----

  One thing where FreeBSD has been falling behind, and which by chance I
  have some time to work on is "policy based routing", which allows
  different
  packet streams to be routed by more than just the destination address.

  Constraints:
  ------------

  I want to make some form of this available in the 6.x tree
  (and by extension 7.x) , but FreeBSD in general needs it so I might as
  well do it in -current and back port the portions I need.

  One of the ways that this can be done is to have the ability to
  instantiate multiple kernel routing tables (which I will now
  refer to as "Forwarding Information Bases" or "FIBs" for political
  correctness reasons). Which FIB a particular packet uses to make
  the next hop decision can be decided by a number of mechanisms.
  The policies these mechanisms implement are the "Policies" referred
  to in "Policy based routing".

  One of the constraints I have if I try to back port this work to
  6.x is that it must be implemented as a EXTENSION to the existing
  ABIs in 6.x so that third party applications do not need to be
  recompiled in timespan of the branch.

  This first version will not have some of the bells and whistles that
  will come with later versions. It will, for example, be limited to 16
  tables in the first commit.
  Implementation method, Compatible version. (part 1)
  -------------------------------
  For this reason I have implemented a "sufficient subset" of a
  multiple routing table solution in Perforce, and back-ported it
  to 6.x. (also in Perforce though not  always caught up with what I
  have done in -current/P4). The subset allows a number of FIBs
  to be defined at compile time (8 is sufficient for my purposes in 6.x)
  and implements the changes needed to allow IPV4 to use them. I have not
  done the changes for ipv6 simply because I do not need it, and I do not
  have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it.

  Other protocol families are left untouched and should there be
  users with proprietary protocol families, they should continue to work
  and be oblivious to the existence of the extra FIBs.

  To understand how this is done, one must know that the current FIB
  code starts everything off with a single dimensional array of
  pointers to FIB head structures (One per protocol family), each of
  which in turn points to the trie of routes available to that family.

  The basic change in the ABI compatible version of the change is to
  extent that array to be a 2 dimensional array, so that
  instead of protocol family X looking at rt_tables[X] for the
  table it needs, it looks at rt_tables[Y][X] when for all
  protocol families except ipv4 Y is always 0.
  Code that is unaware of the change always just sees the first row
  of the table, which of course looks just like the one dimensional
  array that existed before.

  The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
  are all maintained, but refer only to the first row of the array,
  so that existing callers in proprietary protocols can continue to
  do the "right thing".
  Some new entry points are added, for the exclusive use of ipv4 code
  called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(),
  which have an extra argument which refers the code to the correct row.

  In addition, there are some new entry points (currently called
  rtalloc_fib() and friends) that check the Address family being
  looked up and call either rtalloc() (and friends) if the protocol
  is not IPv4 forcing the action to row 0 or to the appropriate row
  if it IS IPv4 (and that info is available). These are for calling
  from code that is not specific to any particular protocol. The way
  these are implemented would change in the non ABI preserving code
  to be added later.

  One feature of the first version of the code is that for ipv4,
  the interface routes show up automatically on all the FIBs, so
  that no matter what FIB you select you always have the basic
  direct attached hosts available to you. (rtinit() does this
  automatically).

  You CAN delete an interface route from one FIB should you want
  to but by default it's there. ARP information is also available
  in each FIB. It's assumed that the same machine would have the
  same MAC address, regardless of which FIB you are using to get
  to it.

  This brings us as to how the correct FIB is selected for an outgoing
  IPV4 packet.

  Firstly, all packets have a FIB associated with them. if nothing
  has been done to change it, it will be FIB 0. The FIB is changed
  in the following ways.

  Packets fall into one of a number of classes.

  1/ locally generated packets, coming from a socket/PCB.
     Such packets select a FIB from a number associated with the
     socket/PCB. This in turn is inherited from the process,
     but can be changed by a socket option. The process in turn
     inherits it on fork. I have written a utility call setfib
     that acts a bit like nice..

         setfib -3 ping target.example.com # will use fib 3 for ping.

     It is an obvious extension to make it a property of a jail
     but I have not done so. It can be achieved by combining the setfib and
     jail commands.

  2/ packets received on an interface for forwarding.
     By default these packets would use table 0,
     (or possibly a number settable in a sysctl(not yet)).
     but prior to routing the firewall can inspect them (see below).
     (possibly in the future you may be able to associate a FIB
     with packets received on an interface..  An ifconfig arg, but not yet.)

  3/ packets inspected by a packet classifier, which can arbitrarily
     associate a fib with it on a packet by packet basis.
     A fib assigned to a packet by a packet classifier
     (such as ipfw) would over-ride a fib associated by
     a more default source. (such as cases 1 or 2).

  4/ a tcp listen socket associated with a fib will generate
     accept sockets that are associated with that same fib.

  5/ Packets generated in response to some other packet (e.g. reset
     or icmp packets). These should use the FIB associated with the
     packet being reponded to.

  6/ Packets generated during encapsulation.
     gif, tun and other tunnel interfaces will encapsulate using the FIB
     that was in effect withthe proces that set up the tunnel.
     thus setfib 1 ifconfig gif0 [tunnel instructions]
     will set the fib for the tunnel to use to be fib 1.

  Routing messages would be associated with their
  process, and thus select one FIB or another.
  messages from the kernel would be associated with the fib they
  refer to and would only be received by a routing socket associated
  with that fib. (not yet implemented)

  In addition Netstat has been edited to be able to cope with the
  fact that the array is now 2 dimensional. (It looks in system
  memory using libkvm (!)). Old versions of netstat see only the first FIB.

  In addition two sysctls are added to give:
  a) the number of FIBs compiled in (active)
  b) the default FIB of the calling process.

  Early testing experience:
  -------------------------

  Basically our (IronPort's) appliance does this functionality already
  using ipfw fwd but that method has some drawbacks.

  For example,
  It can't fully simulate a routing table because it can't influence the
  socket's choice of local address when a connect() is done.

  Testing during the generating of these changes has been
  remarkably smooth so far. Multiple tables have co-existed
  with no notable side effects, and packets have been routes
  accordingly.

  ipfw has grown 2 new keywords:

  setfib N ip from anay to any
  count ip from any to any fib N

  In pf there seems to be a requirement to be able to give symbolic names to the
  fibs but I do not have that capacity. I am not sure if it is required.

  SCTP has interestingly enough built in support for this, called VRFs
  in Cisco parlance. it will be interesting to see how that handles it
  when it suddenly actually does something.

  Where to next:
  --------------------

  After committing the ABI compatible version and MFCing it, I'd
  like to proceed in a forward direction in -current. this will
  result in some roto-tilling in the routing code.

  Firstly: the current code's idea of having a separate tree per
  protocol family, all of the same format, and pointed to by the
  1 dimensional array is a bit silly. Especially when one considers that
  there is code that makes assumptions about every protocol having the
  same internal structures there. Some protocols don't WANT that
  sort of structure. (for example the whole idea of a netmask is foreign
  to appletalk). This needs to be made opaque to the external code.

  My suggested first change is to add routing method pointers to the
  'domain' structure, along with information pointing the data.
  instead of having an array of pointers to uniform structures,
  there would be an array pointing to the 'domain' structures
  for each protocol address domain (protocol family),
  and the methods this reached would be called. The methods would have
  an argument that gives FIB number, but the protocol would be free
  to ignore it.

  When the ABI can be changed it raises the possibilty of the
  addition of a fib entry into the "struct route". Currently,
  the structure contains the sockaddr of the desination, and the resulting
  fib entry. To make this work fully, one could add a fib number
  so that given an address and a fib, one can find the third element, the
  fib entry.

  Interaction with the ARP layer/ LL layer would need to be
  revisited as well. Qing Li has been working on this already.

  This work was sponsored by Ironport Systems/Cisco

Reviewed by:    several including rwatson, bz and mlair (parts each)
Obtained from:  Ironport systems/Cisco
2008-05-09 23:03:00 +00:00
Robert Watson
8501a69cc9 Convert pcbinfo and inpcb mutexes to rwlocks, and modify macros to
explicitly select write locking for all use of the inpcb mutex.
Update some pcbinfo lock assertions to assert locked rather than
write-locked, although in practice almost all uses of the pcbinfo
rwlock main exclusive, and all instances of inpcb lock acquisition
are exclusive.

This change should introduce (ideally) little functional change.
However, it lays the groundwork for significantly increased
parallelism in the TCP/IP code.

MFC after:	3 months
Tested by:	kris (superset of committered patch)
2008-04-17 21:38:18 +00:00
Robert Watson
7a3244ccb7 Add further TCP inpcb locking assertions to some TCP input code paths.
MFC after:	1 month
2008-04-07 12:41:45 +00:00
Bjoern A. Zeeb
c3b02504bc Some "cleanup" of tcp_mss():
- Move the assigment of the socket down before we first need it.
  No need to do it at the beginning and then drop out the function
  by one of the returns before using it 100 lines further down.
- Use t_maxopd which was assigned the "tcp_mssdflt" for the corrrect
  AF already instead of another #ifdef ? : #endif block doing the same.
- Remove an unneeded (duplicate) assignment of mss to t_maxseg just before
  we possibly change mss and re-do the assignment without using t_maxseg
  in between.

Reviewed by:	silby
No objections:	net@ (silence)
MFC after:	5 days
2008-03-02 08:40:47 +00:00
Bjoern A. Zeeb
af92e6cf95 Fix indentation (whitespace changes only).
MFC after:	6 days
2008-03-01 22:27:15 +00:00
Robert Watson
30d239bc4c Merge first in a series of TrustedBSD MAC Framework KPI changes
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:

  mac_<object>_<method/action>
  mac_<object>_check_<method/action>

The previous naming scheme was inconsistent and mostly
reversed from the new scheme.  Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier.  Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods.  Also simplify, slightly,
some entry point names.

All MAC policy modules will need to be recompiled, and modules
not updates as part of this commit will need to be modified to
conform to the new KPI.

Sponsored by:	SPARTA (original patches against Mac OS X)
Obtained from:	TrustedBSD Project, Apple Computer
2007-10-24 19:04:04 +00:00
Mike Silbersack
4b421e2daa Add FBSDID to all files in netinet so that people can more
easily include file version information in bug reports.

Approved by:	re (kensmith)
2007-10-07 20:44:24 +00:00
Mike Silbersack
e31d8aa3da Improve the debugging message:
TCP: [X.X.X.X]:X to [X.X.X.X]:X tcpflags 0x18<PUSH,ACK>; tcp_do_segment: FIN_WAIT_2: Received data after socket was closed, sending RST and removing tcpcb

So that it also includes how many bytes of data were received.  It now looks
like this:

TCP: [X.X.X.X]:X to [X.X.X.X]:X tcpflags 0x18<PUSH,ACK>; tcp_do_segment: FIN_WAIT_2: Received X bytes of data after socket was closed, sending RST and removing tcpcb

Approved by:	re (gnn)
2007-10-07 00:07:27 +00:00
Ken Smith
a258946554 Make sure that either inp is NULL or we have obtained a lock on it before
jumping to dropunlock to avoid a panic.  While here move the calls to
ipsec4_in_reject() and ipsec6_in_reject() so they are after we obtain
the lock on inp.

Original patch to avoid panic:	pjd
Review of locking adjustments:	gnn, sam
Approved by:			re (rwatson)
2007-09-10 14:49:32 +00:00
Dag-Erling Smørgrav
218cbbea9a Make tcpstates[] static, and make sure TCPSTATES is defined before
<netinet/tcp_fsm.h> is included into any compilation unit that needs
tcpstates[].  Also remove incorrect extern declarations and TCPDEBUG
conditionals.  This allows kernels both with and without TCPDEBUG to
build, and unbreaks the tinderbox.

Approved by:	re (rwatson)
2007-07-30 11:06:42 +00:00
Matt Jacob
24face5416 Fix compilation problems- tcpstates is only available if TCPDEBUG
is set.

Approved by:	re (in spirit)
2007-07-29 01:31:33 +00:00
Andre Oppermann
773673c133 Provide a sysctl to toggle reporting of TCP debug logging:
sys.net.inet.tcp.log_debug = 1

It defaults to enabled for the moment and is to be turned off for
the next release like other diagnostics from development branches.

It is important to note that sysctl sys.net.inet.tcp.log_in_vain
uses the same logging function as log_debug.  Enabling of the former
also causes the latter to engage, but not vice versa.

Use consistent terminology in tcp log messages:

 "ignored" means a segment contains invalid flags/information and
   is dropped without changing state or issuing a reply.

 "rejected" means a segments contains invalid flags/information but
   is causing a reply (usually RST) and may cause a state change.

Approved by:	re (rwatson)
2007-07-28 12:20:39 +00:00
Andre Oppermann
19bc77c549 o Move all detailed checks for RST in LISTEN state from tcp_input() to
syncache_rst().
o Fix tests for flag combinations of RST and SYN, ACK, FIN.  Before
  a RST for a connection in syncache did not properly free the entry.
o Add more detailed logging.

Approved by:	re (rwatson)
2007-07-28 11:51:44 +00:00