Commit Graph

113 Commits

Author SHA1 Message Date
Andre Oppermann
ef39adf007 Consolidate all IP Options handling functions into ip_options.[ch] and
include ip_options.h into all files making use of IP Options functions.

From ip_input.c rev 1.306:
  ip_dooptions(struct mbuf *m, int pass)
  save_rte(m, option, dst)
  ip_srcroute(m0)
  ip_stripoptions(m, mopt)

From ip_output.c rev 1.249:
  ip_insertoptions(m, opt, phlen)
  ip_optcopy(ip, jp)
  ip_pcbopts(struct inpcb *inp, int optname, struct mbuf *m)

No functional changes in this commit.

Discussed with:	rwatson
Sponsored by:	TCP/IP Optimization Fundraise 2005
2005-11-18 20:12:40 +00:00
Andre Oppermann
34333b16cd Retire MT_HEADER mbuf type and change its users to use MT_DATA.
Having an additional MT_HEADER mbuf type is superfluous and redundant
as nothing depends on it.  It only adds a layer of confusion.  The
distinction between header mbuf's and data mbuf's is solely done
through the m->m_flags M_PKTHDR flag.

Non-native code is not changed in this commit.  For compatibility
MT_HEADER is mapped to MT_DATA.

Sponsored by:	TCP/IP Optimization Fundraise 2005
2005-11-02 13:46:32 +00:00
Paul Saab
2cdbfa66ee Replace t_force with a t_flag (TF_FORCEDATA).
Submitted by:   Raja Mukerji.
Reviewed by:    Mohan, Silby, Andre Opperman.
2005-05-21 00:38:29 +00:00
Paul Saab
0077b0163f When looking for the next hole to retransmit from the scoreboard,
or to compute the total retransmitted bytes in this sack recovery
episode, the scoreboard is traversed. While in sack recovery, this
traversal occurs on every call to tcp_output(), every dupack and
every partial ack. The scoreboard could potentially get quite large,
making this traversal expensive.

This change optimizes this by storing hints (for the next hole to
retransmit and the total retransmitted bytes in this sack recovery
episode) reducing the complexity to find these values from O(n) to
constant time.

The debug code that sanity checks the hints against the computed
value will be removed eventually.

Submitted by:   Mohan Srinivasan, Noritoshi Demizu, Raja Mukerji.
2005-05-11 21:37:42 +00:00
Paul Saab
be3f3b5ead Fix for interaction problems between TCP SACK and TCP Signature.
If TCP Signatures are enabled, the maximum allowed sack blocks aren't
going to fit. The fix is to compute how many sack blocks fit and tack
these on last. Also on SYNs, defer padding until after the SACK
PERMITTED option has been added.

Found by:	Mohan Srinivasan.
Submitted by:	Mohan Srinivasan, Noritoshi Demizu.
Reviewed by:	Raja Mukerji.
2005-04-21 20:26:07 +00:00
Andre Oppermann
1600372b6b Ignore ICMP Source Quench messages for TCP sessions. Source Quench is
ineffective, depreciated and can be abused to degrade the performance
of active TCP sessions if spoofed.

Replace a bogus call to tcp_quench() in tcp_output() with the direct
equivalent tcpcb variable assignment.

Security:	draft-gont-tcpm-icmp-attacks-03.txt Section 7.1
MFC after:	3 days
2005-04-21 12:37:12 +00:00
Paul Saab
8d03f2b53b Fix a TCP SACK related crash resulting from incorrect computation
of len in tcp_output(), in the case where the FIN has already been
transmitted. The mis-computation of len is because of a gcc
optimization issue, which this change works around.

Submitted by:	Mohan Srinivasan
2005-01-12 21:40:51 +00:00
Warner Losh
c398230b64 /* -> /*- for license, minor formatting changes 2005-01-07 01:45:51 +00:00
Paul Saab
7d5ed1ceea Fixes a bug in SACK causing us to send data beyond the receive window.
Found by: Pawel Worach and Daniel Hartmeier
Submitted by: Mohan Srinivasan mohans at yahoo-inc dot com
2004-11-29 18:47:27 +00:00
Andre Oppermann
c94c54e4df Remove RFC1644 T/TCP support from the TCP side of the network stack.
A complete rationale and discussion is given in this message
and the resulting discussion:

 http://docs.freebsd.org/cgi/mid.cgi?4177C8AD.6060706

Note that this commit removes only the functional part of T/TCP
from the tcp_* related functions in the kernel.  Other features
introduced with RFC1644 are left intact (socket layer changes,
sendmsg(2) on connection oriented protocols)  and are meant to
be reused by a simpler and less intrusive reimplemention of the
previous T/TCP functionality.

Discussed on:	-arch
2004-11-02 22:22:22 +00:00
Robert Watson
ab5c14d828 Correct a bug in TCP SACK that could result in wedging of the TCP stack
under high load: only set function state to loop and continuing sending
if there is no data left to send.

RELENG_5_3 candidate.

Feet provided:	Peter Losher <Peter underscore Losher at isc dot org>
Diagnosed by:	Aniel Hartmeier <daniel at benzedrine dot cx>
Submitted by:	mohan <mohans at yahoo-inc dot com>
2004-10-30 12:02:50 +00:00
Robert Watson
cf2942b67c Acquire the send socket buffer lock around tcp_output() activities
reaching into the socket buffer.  This prevents a number of potential
races, including dereferencing of sb_mb while unlocked leading to
a NULL pointer deref (how I found it).  Potentially this might also
explain other "odd" TCP behavior on SMP boxes (although  haven't
seen it reported).

RELENG_5 candidate.
2004-10-09 16:48:51 +00:00
Paul Saab
a55db2b6e6 - Estimate the amount of data in flight in sack recovery and use it
to control the packets injected while in sack recovery (for both
  retransmissions and new data).
- Cleanups to the sack codepaths in tcp_output.c and tcp_sack.c.
- Add a new sysctl (net.inet.tcp.sack.initburst) that controls the
  number of sack retransmissions done upon initiation of sack recovery.

Submitted by:	Mohan Srinivasan <mohans@yahoo-inc.com>
2004-10-05 18:36:24 +00:00
John-Mark Gurney
b5d47ff592 fix up socket/ip layer violation... don't assume/know that
SO_DONTROUTE == IP_ROUTETOIF and SO_BROADCAST == IP_ALLOWBROADCAST...
2004-09-05 02:34:12 +00:00
Robert Watson
a4f757cd5d White space cleanup for netinet before branch:
- Trailing tab/space cleanup
- Remove spurious spaces between or before tabs

This change avoids touching files that Andre likely has in his working
set for PFIL hooks changes for IPFW/DUMMYNET.

Approved by:	re (scottl)
Submitted by:	Xin LI <delphij@frontfree.net>
2004-08-16 18:32:07 +00:00
Jayanth Vijayaraghavan
5d3b1b7556 Fix a bug in the sack code that was causing data to be retransmitted
with the FIN bit set for all segments, if a FIN has already been sent before.
The fix will allow the FIN bit to be set for only the last segment, in case
it has to be retransmitted.

Fix another bug that would have caused snd_nxt to be pulled by len if
there was an error from ip_output. snd_nxt should not be touched
during sack retransmissions.
2004-07-28 02:15:14 +00:00
Jayanth Vijayaraghavan
e9f2f80e09 Fix for a SACK bug where the very last segment retransmitted
from the SACK scoreboard could result in the next (untransmitted)
segment to be skipped.
2004-07-26 23:41:12 +00:00
Jayanth Vijayaraghavan
04f0d9a0ea Let IN_FASTREOCOVERY macro decide if we are in recovery mode.
Nuke sackhole_limit for now. We need to add it back to limit the total
number of sack blocks in the system.
2004-07-19 22:37:33 +00:00
Jayanth Vijayaraghavan
f787edd847 Fix a potential panic in the SACK code that was causing
1) data to be sent to the right of snd_recover.
2) send more data then whats in the send buffer.

The fix is to postpone sack retransmit to a subsequent recovery episode
if the current retransmit pointer is beyond snd_recover.

Thanks to Mohan Srinivasan for helping fix the bug.

Submitted by:Daniel Lang
2004-07-19 22:06:01 +00:00
Paul Saab
6d90faf3d8 Add support for TCP Selective Acknowledgements. The work for this
originated on RELENG_4 and was ported to -CURRENT.

The scoreboarding code was obtained from OpenBSD, and many
of the remaining changes were inspired by OpenBSD, but not
taken directly from there.

You can enable/disable sack using net.inet.tcp.do_sack. You can
also limit the number of sack holes that all senders can have in
the scoreboard with net.inet.tcp.sackhole_limit.

Reviewed by:	gnn
Obtained from:	Yahoo! (Mohan Srinivasan, Jayanth Vijayaraghavan)
2004-06-23 21:04:37 +00:00
Bruce M Simpson
f3e0b7ef7f Appease GCC. 2004-06-18 09:53:58 +00:00
Bruce M Simpson
5214cb3f59 If SO_DEBUG is enabled for a TCP socket, and a received segment is
encapsulated within an IPv6 datagram, do not abuse the 'ipov' pointer
when registering trace records.  'ipov' is specific to IPv4, and
will therefore be uninitialized.

[This fandango is only necessary in the first place because of our
host-byte-order IP field pessimization.]

PR:		kern/60856
Submitted by:	Galois Zheng
2004-06-18 03:31:07 +00:00
Bruce M Simpson
da181cc144 Don't set FIN on a retransmitted segment after a FIN has been sent,
unless the segment really contains the last of the data for the stream.

PR:		kern/34619
Obtained from:	OpenBSD (tcp_output.c rev 1.47)
Noticed by:	Joseph Ishac
Reviewed by:	George Neville-Neil
2004-06-18 02:47:59 +00:00
Robert Watson
c18b97c630 Switch to using the inpcb MAC label instead of socket MAC label when
labeling new mbufs created from sockets/inpcbs in IPv4.  This helps avoid
the need for socket layer locking in the lower level network paths
where inpcb locks are already frequently held where needed.  In
particular:

- Use the inpcb for label instead of socket in raw_append().
- Use the inpcb for label instead of socket in tcp_output().
- Use the inpcb for label instead of socket in tcp_respond().
- Use the inpcb for label instead of socket in tcp_twrespond().
- Use the inpcb for label instead of socket in syncache_respond().

While here, modify tcp_respond() to avoid assigning NULL to a stack
variable and centralize assertions about the inpcb when inp is
assigned.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, McAfee Research
2004-05-04 02:11:47 +00:00
Warner Losh
f36cfd49ad Remove advertising clause from University of California Regent's
license, per letter dated July 22, 1999 and email from Peter Wemm,
Alan Cox and Robert Watson.

Approved by: core, peter, alc, rwatson
2004-04-07 20:46:16 +00:00
Bruce M Simpson
265ed01285 Brucification.
Submitted by:	bde
2004-02-13 18:21:45 +00:00
Bruce M Simpson
bca0e5bfc3 style(9) pass; whitespace and comments.
Submitted by:	njl
2004-02-12 20:12:48 +00:00
Bruce M Simpson
45d370ee8b Fix a typo; left out preprocessor conditional for sigoff variable, which
is only used by TCP_SIGNATURE code.

Noticed by:	Roop Nanuwa
2004-02-11 09:46:54 +00:00
Bruce M Simpson
1cfd4b5326 Initial import of RFC 2385 (TCP-MD5) digest support.
This is the first of two commits; bringing in the kernel support first.
This can be enabled by compiling a kernel with options TCP_SIGNATURE
and FAST_IPSEC.

For the uninitiated, this is a TCP option which provides for a means of
authenticating TCP sessions which came into being before IPSEC. It is
still relevant today, however, as it is used by many commercial router
vendors, particularly with BGP, and as such has become a requirement for
interconnect at many major Internet points of presence.

Several parts of the TCP and IP headers, including the segment payload,
are digested with MD5, including a shared secret. The PF_KEY interface
is used to manage the secrets using security associations in the SADB.

There is a limitation here in that as there is no way to map a TCP flow
per-port back to an SPI without polluting tcpcb or using the SPD; the
code to do the latter is unstable at this time. Therefore this code only
supports per-host keying granularity.

Whilst FAST_IPSEC is mutually exclusive with KAME IPSEC (and thus IPv6),
TCP_SIGNATURE applies only to IPv4. For the vast majority of prospective
users of this feature, this will not pose any problem.

This implementation is output-only; that is, the option is honoured when
responding to a host initiating a TCP session, but no effort is made
[yet] to authenticate inbound traffic. This is, however, sufficient to
interwork with Cisco equipment.

Tested with a Cisco 2501 running IOS 12.0(27), and Quagga 0.96.4 with
local patches. Patches for tcpdump to validate TCP-MD5 sessions are also
available from me upon request.

Sponsored by:	sentex.net
2004-02-11 04:26:04 +00:00
Hajimu UMEMOTO
f073c60f73 pass pcb rather than so. it is expected that per socket policy
works again.
2004-02-03 18:20:55 +00:00
Andre Oppermann
201d185b69 Split the overloaded variable 'win' into two for their specific purposes:
recwin and sendwin.  This removes a big source of confusion and makes
following the code much easier.

Reviewed by:	sam (mentor)
Obtained from:	DragonFlyBSD rev 1.6 (hsu)
2004-01-22 23:22:14 +00:00
Andre Oppermann
97d8d152c2 Introduce tcp_hostcache and remove the tcp specific metrics from
the routing table.  Move all usage and references in the tcp stack
from the routing table metrics to the tcp hostcache.

It caches measured parameters of past tcp sessions to provide better
initial start values for following connections from or to the same
source or destination.  Depending on the network parameters to/from
the remote host this can lead to significant speedups for new tcp
connections after the first one because they inherit and shortcut
the learning curve.

tcp_hostcache is designed for multiple concurrent access in SMP
environments with high contention and is hash indexed by remote
ip address.

It removes significant locking requirements from the tcp stack with
regard to the routing table.

Reviewed by:	sam (mentor), bms
Reviewed by:	-net, -current, core@kame.net (IPv6 parts)
Approved by:	re (scottl)
2003-11-20 20:07:39 +00:00
Sam Leffler
fa286d7db2 replace mtx_assert by INP_LOCK_ASSERT
Supported by:	FreeBSD Foundation
2003-11-08 22:55:52 +00:00
Sam Leffler
27a940c9a2 unbreak compilation of FAST_IPSEC
Supported by:	FreeBSD Foundation
2003-11-08 00:34:34 +00:00
Hajimu UMEMOTO
0f9ade718d - cleanup SP refcnt issue.
- share policy-on-socket for listening socket.
- don't copy policy-on-socket at all.  secpolicy no longer contain
  spidx, which saves a lot of memory.
- deep-copy pcb policy if it is an ipsec policy.  assign ID field to
  all SPD entries.  make it possible for racoon to grab SPD entry on
  pcb.
- fixed the order of searching SA table for packets.
- fixed to get a security association header.  a mode is always needed
  to compare them.
- fixed that the incorrect time was set to
  sadb_comb_{hard|soft}_usetime.
- disallow port spec for tunnel mode policy (as we don't reassemble).
- an user can define a policy-id.
- clear enc/auth key before freeing.
- fixed that the kernel crashed when key_spdacquire() was called
  because key_spdacquire() had been implemented imcopletely.
- preparation for 64bit sequence number.
- maintain ordered list of SA, based on SA id.
- cleanup secasvar management; refcnt is key.c responsibility;
  alloc/free is keydb.c responsibility.
- cleanup, avoid double-loop.
- use hash for spi-based lookup.
- mark persistent SP "persistent".
  XXX in theory refcnt should do the right thing, however, we have
  "spdflush" which would touch all SPs.  another solution would be to
  de-register persistent SPs from sptree.
- u_short -> u_int16_t
- reduce kernel stack usage by auto variable secasindex.
- clarify function name confusion.  ipsec_*_policy ->
  ipsec_*_pcbpolicy.
- avoid variable name confusion.
  (struct inpcbpolicy *)pcb_sp, spp (struct secpolicy **), sp (struct
  secpolicy *)
- count number of ipsec encapsulations on ipsec4_output, so that we
  can tell ip_output() how to handle the packet further.
- When the value of the ul_proto is ICMP or ICMPV6, the port field in
  "src" of the spidx specifies ICMP type, and the port field in "dst"
  of the spidx specifies ICMP code.
- avoid from applying IPsec transport mode to the packets when the
  kernel forwards the packets.

Tested by:	nork
Obtained from:	KAME
2003-11-04 16:02:05 +00:00
Hartmut Brandt
91f467d592 The tcp_trace call needs the length of the header. Unfortunately the
code has rotten a bit so that the header length is not correct at
the point when tcp_trace is called. Temporarily compute the correct
value before the call and restore the old value after. This makes
ports/benchmarks/dbs to almost work.

This is a NOP unless you compile with TCPDEBUG.
2003-08-13 08:50:42 +00:00
Jonathan Lemon
7990938421 Convert tcp_fillheaders(tp, ...) -> tcpip_fillheaders(inp, ...) so the
routine does not require a tcpcb to operate.  Since we no longer keep
template mbufs around, move pseudo checksum out of this routine, and
merge it with the length update.

Sponsored by: DARPA, NAI Labs
2003-02-19 22:18:06 +00:00
Jonathan Lemon
3bfd6421c2 Clean up delayed acks and T/TCP interactions:
- delay acks for T/TCP regardless of delack setting
   - fix bug where a single pass through tcp_input might not delay acks
   - use callout_active() instead of callout_pending()

Sponsored by: DARPA, NAI Labs
2003-02-19 21:18:23 +00:00
Warner Losh
a163d034fa Back out M_* changes, per decision of the TRB.
Approved by: trb
2003-02-19 05:47:46 +00:00
Alfred Perlstein
44956c9863 Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.
2003-01-21 08:56:16 +00:00
Jeffrey Hsu
314e5a3daf Optimize away call to bzero() in the common case by directly checking
if a connection has any cached TAO information.
2003-01-18 19:03:26 +00:00
Matthew Dillon
abac41a659 Fix oops in my last commit, I was calculating a new length but then not
using it.  (The code is already correct in -stable).

Found by: silby
2002-10-16 19:16:33 +00:00
Sam Leffler
b9234fafa0 Tie new "Fast IPsec" code into the build. This involves the usual
configuration stuff as well as conditional code in the IPv4 and IPv6
areas.  Everything is conditional on FAST_IPSEC which is mutually
exclusive with IPSEC (KAME IPsec implmentation).

As noted previously, don't use FAST_IPSEC with INET6 at the moment.

Reviewed by:	KAME, rwatson
Approved by:	silence
Supported by:	Vernier Networks
2002-10-16 02:25:05 +00:00
Sam Leffler
5d84645305 Replace aux mbufs with packet tags:
o instead of a list of mbufs use a list of m_tag structures a la openbsd
o for netgraph et. al. extend the stock openbsd m_tag to include a 32-bit
  ABI/module number cookie
o for openbsd compatibility define a well-known cookie MTAG_ABI_COMPAT and
  use this in defining openbsd-compatible m_tag_find and m_tag_get routines
o rewrite KAME use of aux mbufs in terms of packet tags
o eliminate the most heavily used aux mbufs by adding an additional struct
  inpcb parameter to ip_output and ip6_output to allow the IPsec code to
  locate the security policy to apply to outbound packets
o bump __FreeBSD_version so code can be conditionalized
o fixup ipfilter's call to ip_output based on __FreeBSD_version

Reviewed by:	julian, luigi (silent), -arch, -net, darren
Approved by:	julian, silence from everyone else
Obtained from:	openbsd (mostly)
MFC after:	1 month
2002-10-16 01:54:46 +00:00
Matthew Dillon
28257b5ccc Update various comments mainly related to retransmit/FIN that I
documented while working on a previous bug.

Fix a PERSIST bug.  Properly account for a FIN sent during a PERSIST.

MFC after:	7 days
2002-10-10 19:21:50 +00:00
Jennifer Yang
4a03a8a8c7 Tempary fix for inet6. The final fix is to change in6_pcbnotify to take pcbinfo instead
of pcbhead. It is on the way.
2002-09-17 03:19:43 +00:00
Matthew Dillon
1fcc99b5de Implement TCP bandwidth delay product window limiting, similar to (but
not meant to duplicate) TCP/Vegas.  Add four sysctls and default the
implementation to 'off'.

net.inet.tcp.inflight_enable	enable algorithm (defaults to 0=off)
net.inet.tcp.inflight_debug	debugging (defaults to 1=on)
net.inet.tcp.inflight_min	minimum window limit
net.inet.tcp.inflight_max	maximum window limit

MFC after:	1 week
2002-08-17 18:26:02 +00:00
Jennifer Yang
3d6ade3a03 Assert that the inpcb lock is held when calling tcp_output().
Approved by:	hsu
2002-08-12 03:22:46 +00:00
Robert Watson
c488362e1a Introduce support for Mandatory Access Control and extensible
kernel access control.

Instrument the TCP socket code for packet generation and delivery:
label outgoing mbufs with the label of the socket, and check socket and
mbuf labels before permitting delivery to a socket.  Assign labels
to newly accepted connections when the syncache/cookie code has done
its business.  Also set peer labels as convenient.  Currently,
MAC policies cannot influence the PCB matching algorithm, so cannot
implement polyinstantiation.  Note that there is at least one case
where a PCB is not available due to the TCP packet not being associated
with any socket, so we don't label in that case, but need to handle
it in a special manner.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, NAI Labs
2002-07-31 19:06:49 +00:00
Luigi Rizzo
f10e85d797 Slightly restructure the #ifdef INET6 sections to make the code
more readable.

Remove the six "register" attributes from variables tcp_output(), the
compiler surely knows well how to allocate them.
2002-06-23 21:25:36 +00:00