Commit Graph

3086 Commits

Author SHA1 Message Date
silby
5083324c2a Change FreeBSD 7 so that it returns TCP options in
the same order that FreeBSD 6 and before did.  Doug
White and the other bloodhounds at ISC discovered that
while FreeBSD 7's ordering of options was more efficient,
it caused some cable modem routers to ignore the
SYN-ACKs ordered in this fashion.

The placement of sackOK after the timestamp option seems
to be the critical difference:

FreeBSD 6:
<mss 1460,nop,wscale 1,nop,nop,timestamp 3512155768 0,sackOK,eol>

FreeBSD 7.0:
<mss 1460,nop,wscale 3,sackOK,timestamp 1370692577 0>

FreeBSD 7.0 + this change:
<mss 1460,nop,wscale 3,nop,nop,timestamp 7371813 0,sackOK,eol>

MFC after: 1 week
2008-02-24 05:13:20 +00:00
rrs
64d271aebb Fixes a memory leak when VRF's are in play.
Submitted by:	Prasad Narasimha (snprasad@cisco.com)
Reviewed by:	rrs
2008-02-22 15:08:10 +00:00
rrs
22032b7ba8 - Takes out stray ifdef code that should not have been present. 2008-02-22 15:06:25 +00:00
glebius
c845f83019 If the vhid already present, return EEXIST instead of
non-informative EINVAL.
2008-02-07 13:18:59 +00:00
glebius
415a259ae1 Remove unused structure member from struct in_ifadown_arg. 2008-02-07 11:26:52 +00:00
silby
1ff81684ca Replace the random IP ID generation code we
obtained from OpenBSD with an algorithm suggested
by Amit Klein.  The OpenBSD algorithm has a few
flaws; see Amit's paper for more information.

For a description of how this algorithm works,
please see the comments within the code.

Note that this commit does not yet enable random IP ID
generation by default.  There are still some concerns
that doing so will adversely affect performance.

Reviewed by:  rwatson
MFC After: 2 weeks
2008-02-06 15:40:30 +00:00
bz
cfb85f0c07 Rather than passing around a cached 'priv', pass in an ucred to
ipsec*_set_policy and do the privilege check only if needed.

Try to assimilate both ip*_ctloutput code blocks calling ipsec*_set_policy.

Reviewed by:	rwatson
2008-02-02 14:11:31 +00:00
rwatson
c57fa54759 Correct two problems relating to sorflush(), which is called to flush
read socket buffers in shutdown() and close():

- Call socantrcvmore() before sblock() to dislodge any threads that
  might be sleeping (potentially indefinitely) while holding sblock(),
  such as a thread blocked in recv().

- Flag the sblock() call as non-interruptible so that a signal
  delivered to the thread calling sorflush() doesn't cause sblock() to
  fail.  The sblock() is required to ensure that all other socket
  consumer threads have, in fact, left, and do not enter, the socket
  buffer until we're done flushin it.

To implement the latter, change the 'flags' argument to sblock() to
accept two flags, SBL_WAIT and SBL_NOINTR, rather than one M_WAITOK
flag.  When SBL_NOINTR is set, it forces a non-interruptible sx
acquisition, regardless of the setting of the disposition of SB_NOINTR
on the socket buffer; without this change it would be possible for
another thread to clear SB_NOINTR between when the socket buffer mutex
is released and sblock() is invoked.

Reviewed by:	bz, kmacy
Reported by:	Jos Backus <jos at catnook dot com>
2008-01-31 08:22:24 +00:00
rrs
ce5fec50e4 - Fix a comment about prison.
- Fix it so the VRF is captured while locks are held.
MFC after:	1 week
2008-01-28 10:34:38 +00:00
rrs
dbf34dbcc6 - Change back to using prioity 0. Which means don't change the
prioity when running the thread. (this is for the sctp_interator thread).

MFC after:	1 week
2008-01-28 10:33:41 +00:00
rrs
9df3360d89 - Fix a bug where the socket may have been closed which
could cause a crash in the auth code.
Obtained from:	Michael Tuexen
MFC after:	1 week
2008-01-28 10:31:12 +00:00
rrs
13897491cf - Fixes a comparison wrap issue with sack gap ack blocks that
span the 32 bit roll over mark.
2008-01-28 10:25:43 +00:00
rwatson
1dcfe4a494 Hide ipfw internal data structures behind IPFW_INTERNAL rather than
exposing them to all consumers of ip_fw.h.  These structures are
used in both ipfw(8) and ipfw(4), but not part of the user<->kernel
interface for other applications to use, rather, shared
implementation.

MFC after:	3 days
Reported by:	Paul Vixie <paul at vix dot com>
2008-01-25 14:38:27 +00:00
bz
1c376286e0 Replace the last susers calls in netinet6/ with privilege checks.
Introduce a new privilege allowing to set certain IP header options
(hop-by-hop, routing headers).

Leave a few comments to be addressed later.

Reviewed by:	rwatson (older version, before addressing his comments)
2008-01-24 08:25:59 +00:00
bz
ca561e0217 Differentiate between addifaddr and delifaddr for the privilege check.
Reviewed by:	rwatson
MFC after:	2 weeks
2008-01-24 08:14:38 +00:00
rwatson
8aff4dd3cd tcp_usrreq.c:1.313 removed tcbinfo locking from tcp_usr_accept(), which
while in principle a good idea, opened us up to a race inherrent to
the syncache's direct insertion of incoming TCP connections into the
"completed connection" listen queue, as it transpires that the socket
is inserted before the inpcb is fully filled in by syncache_expand().
The bug manifested with the occasional returning of 0.0.0.0:0 in the
address returned by the accept() system call, which occurred if accept
managed to execute tcp_usr_accept() before syncache_expand() had copied
the endpoint addresses into inpcb connection state.

Re-add tcbinfo locking around the address copyout, which has the effect
of delaying the copy until syncache_expand() has finished running, as
it is run while the tcbinfo lock is held.  This is undesirable in that
it increases contention on tcbinfo further, but a more significant
change will be required to how the syncache inserts new sockets in
order to fix this and keep more granular locking here.  In particular,
either more state needs to be passed into sonewconn() so that
pru_attach() can fill in the fields *before* the socket is inserted, or
the socket needs to be inserted in the incomplete connection queue
until it is actually ready to be used.

Reported by:	glebius (and kris)
Tested by:	glebius
2008-01-23 21:15:51 +00:00
rwatson
ba4fb8ac52 In tcp_ctloutput(), don't hold the inpcb lock over sooptcopyin(), rather,
drop the lock and then re-acquire it, revalidating TCP connection state
assumptions when we do so.  This avoids a potential lock order reversal
(and potential deadlock, although none have been reported) due to the
inpcb lock being held over a page fault.

MFC after:	1 week
PR:		102752
Reviewed by:	bz
Reported by:	Václav Haisman <v dot haisman at sh dot cvut dot cz>
2008-01-18 12:19:50 +00:00
julian
3032c5c971 Don't duplicate the whole of arpresolve to arpresolve 2 for the sake
of two compares against 0. The negative effect of cache flushing
is probably more than the gain by not doing the two compares (the
value is almost certainly in register or at worst, cache).
Note that the uses of m_freem() are in error cases and m_freem()
handles NULL anyhow. So fast-path really isn't changed much at all.
2007-12-31 23:48:06 +00:00
oleg
596323ba22 Workaround p->numbytes overflow, which can result in infinite loop inside
dummynet module (prerequisite is using queues with "fat" pipe).

PR:		kern/113548
2007-12-25 09:36:51 +00:00
rwatson
f558a6bfd8 When IPSEC fails to allocate policy state for an inpcb, and MAC is in use,
free the MAC label on the inpcb before freeing the inpcb.

MFC after:	3 days
Submitted by:	tanyong <tanyong at ercist dot iscas dot ac dot cn>,
		zhouzhouyi
2007-12-22 10:06:11 +00:00
ru
82b21d0858 Fix bugs in the TCP syncache timeout code. including:
When system ticks are positive, for entries in the cache
bucket, syncache_timer() ran on every tick (doing nothing
useful) instead of the supposed 3, 6, 12, and 24 seconds
later (when it's time to retransmit SYN,ACK).

When ticks are negative, syncache_timer() was scheduled
for the too far future (up to ~25 days on systems with
HZ=1000), no SYN,ACK retransmits were attempted at all,
and syncache entries added in that period that correspond
to non-established connections stay there forever.

Only HEAD and RELENG_7 are affected.

Reviewed by:	silby, kmacy (earlier version)
Submitted by:	Maxim Dounin, ru
2007-12-19 16:56:28 +00:00
kmacy
fad994205a Remove extraneous debug statements.
Noticed by: Andrey Chernov
2007-12-19 05:17:40 +00:00
kmacy
7a03620a3b Incorporate TCP offload hooks in to core TCP code.
- Rename output routines tcp_gen_* -> tcp_output_*.
  - Rename notification routines that turn in to no-ops in the absence of TOE
    from tcp_gen_* -> tcp_offload_*.
  - Fix some minor comment nits.
  - Add a /* FALLTHROUGH */

Reviewed by: Sam Leffler, Robert Watson, and Mike Silbersack
2007-12-18 22:59:07 +00:00
rrs
285c9ed214 - sctp-iterator should run at PI_NET priority ...not 0.
MFC after:	1 week
2007-12-18 01:24:15 +00:00
kmacy
b45a98500c incorporate feedback since initial commit
- rename tcp_ofld.[ch] to tcp_offload.[ch]
- document usage and locking conventions of the functions in the
  toe_usrreqs function vector
- document tcpcb, inpcb, and socket fields used by toe
- widen the listen interface into 2 functions
- rename DISABLE_TCP_OFFLOAD to TCP_OFFLOAD_DISABLE
- shrink conditional compilation to reduce the likelihood of bitrot
- replace sc->sc_toepcb checks in tcp_syncache.c with TOEPCB_ISSET
2007-12-17 07:56:27 +00:00
kmacy
5d9e84762f widen the routing event interface (arp update, redirect, and eventually pmtu change)
into separate functions

revert previous commit's changes to arpresolve and add a new interface
arpresolve2 which does arp resolution without an mbuf
2007-12-17 07:40:34 +00:00
kmacy
139d7c3fb1 Don't panic in arpresolve if we're given a null mbuf. We could
insist that the caller just pass in an initialized mbuf even
if didn't have any data - but that seems rather contrived.
2007-12-17 04:19:25 +00:00
kmacy
fe47295d85 Update tod_connect call to reflect updated interface 2007-12-16 07:37:48 +00:00
kmacy
60377b3fbd Move arp update upcall to always be called for ARP replies - previous invocation
would not always get called at the appropriate times
2007-12-16 06:42:33 +00:00
kmacy
4b1ded755c Update the toedev's connect interface to reflect the fact that the inpcb
doesn't cache the rtentry in HEAD.
2007-12-16 05:30:21 +00:00
kmacy
93f8e2674a Add socket option for setting and retrieving the congestion control algorithm.
The name used is to allow compatibility with Linux.
2007-12-16 03:30:07 +00:00
kmacy
4925764ba9 make naming prefixes consistent across tom_info 2007-12-15 20:20:08 +00:00
kmacy
225412214c Fix error in previous commit - the style fix changed flag name without
changing references to the flag
2007-12-13 01:24:20 +00:00
kmacy
facc60167b Fix style issues with initial TCP offload commit
Requested by: rwatson
Submitted by: rwatson
2007-12-12 23:31:49 +00:00
kmacy
50706577a4 add interface for allowing consumers to register for ARP updates,
redirects, and path MTU changes

Reviewed by: silby
2007-12-12 20:53:25 +00:00
kmacy
dcdbd55c9a Add interface for tcp offload to syncache:
- make neccessary changes to release offload resources when a syncache
   entry is removed before connection establishment
 - disable checks for offloaded connection where insufficient information
   is available

Reviewed by: silby
2007-12-12 20:35:59 +00:00
kmacy
95a448c7cb Add driver independent interface to offload active established TCP connections
Reviewed by: silby
2007-12-12 20:21:39 +00:00
kmacy
a571860f41 Remove spurious timestamp check. RFC 1323 explicitly states that timestamps MAY
be transmitted if negotiated.
2007-12-12 06:11:50 +00:00
dwmalone
f0253dbb16 If we are walking the IPv6 header chain and we hit an IPPROTO_NONE
header, then don't try to pullup anything, because there is no next
header if we hit IPPROTO_NONE. Set ulp to a non-NULL value so the
search for an upper layer header terinates.

This is based on Pekka's diagnosis, but I chose a simpler fix.

PR:		115261
Submitted by:	Pekka Savola <pekkas@netcore.fi>
Reviewed by:	mlaier
MFC after:	2 weeks
2007-12-09 15:35:09 +00:00
kmacy
12b5f9c8c9 Add padding for anticipated functionality
- vimage
 - TOE
 - multiq
 - host rtentry caching

Rename spare used by 80211 to if_llsoftc

Reviewed by: rwatson, gnn
MFC after: 1 day
2007-12-07 01:46:13 +00:00
rrs
fad90600c1 - More fixes for lock misses on the transfer of data to
the sent_queue. Sometimes I wonder why any code
  ever works :-)
- Fix the pad of the last mbuf routine, It was working improperly
  on non-4 byte aligned chunks which could cause memory overruns.

MFC after:	1 week
2007-12-07 01:32:14 +00:00
des
90c2422b90 Simpler version of the previous commit. 2007-12-06 09:31:13 +00:00
rrs
475b561655 - optimize the initialization of the SB max variables.
- Missing lock when sending data and moving it to the
  outqueue.
- If a mbuf alloc fails during moving to outqueue the
  reassembly of the old mbuf chain was incorrect.
- some_taken becomes a counter in sctputil.c instead of a set to 1.
- Fix a panic to be only under invarients and have a proper recovery.
- msg_flags needed to be set.to the value collected not or'd.

MFC after:	1 week
2007-12-06 00:22:55 +00:00
rrs
9e75c558ad - More fixes for the non-blocking msg send, had the skip of the pre-block
test incorrect.
- Fix the initial buf calculation to be more friendly, calc is the same
  but we use different variable to make it easier amongst the different
  code versions.

MFC after:	1 week
2007-12-04 20:20:42 +00:00
rrs
a6029f7726 - Opps, signedness issue with one of the new var's (this is an issue
mainly in apple but with the right -Wall it could effect us too).

MFC after:	1 week
2007-12-04 14:47:39 +00:00
rrs
f08a32ba97 - Found a problem in non-blocking sends. When
sending, once the locks are all unlocked to
  do the copy's in, its possible that other
  events could then raise the number of bytes
  outstanding pushing it so not all the message
  would fit. This would then cause us to send
  only part of the message. This fix makes it
  so we keep a "reserved" amount that can be
  kept in mind when making calculations to send.
- rcv msg args with a NULL/NULL for to/tolen will return an error incorrectly
  for the 1-2-1 model.
- We were not doing 0 len return correctly and not setting cantrcv more
  correctly. Previouly we "fixed" this area by taking out the socantrcv
  since we then could not get the data out. The correct rix is to still
  flag the socket but alow a by-pass route to continue to read until
  all data is consumed.

MFC after:	1 week
2007-12-04 14:41:48 +00:00
yar
6e65a1fee2 For the sake of convenience, print the name of the network interface
IPv4 address duplication was detected on.

Idea by:	marck
2007-12-04 13:01:12 +00:00
silby
e548183f0a Fix SACK negotiation that was broken in rev 1.105.
Before this fix, FreeBSD would negotiate SACK on outgoing
connections, but would always fail to negotiate it on incoming
connections.

Discovered by: James Healy and Lawrence Stewart
Submitted by: James Healy and Lawrence Stewart
MFC after: 3 days
2007-12-04 07:11:13 +00:00
guido
e371eead9f Consider the following situation:
1. A packet comes in that is to be forwarded
2. The destination of the packet is rewritten by some firewall code
3. The next link's MTU is too small
4. The packet has the DF bit set

Then the current code is such that instead of setting the next
link's MTU in the ICMP error, ip_next_mtu() is called and a guess
is sent as to which MTU is supposed to be tried next. This is because
in this case ip_forward() is called with srcrt set to 1. In that
case the ia pointer remains NULL but it is needed to get the MTU
of the interface the packet is to be sent out from.
Thus, we always set ia to the outgoing interface.

MFC after:	2 weeks
2007-12-02 13:00:47 +00:00
bz
c9229e5969 Centralize and correct computation of TCP-MD5 signature offset within
the packet (tcp header options field).

Reviewed by:	tools/regression/netinet/tcpconnect
MFC after:	3 days
Tested by:	Nick Hilliard (see net@)
2007-11-30 23:46:51 +00:00
bz
376bf60faf Move call to tcp_signature_compute() after we adjusted the payload offset
in the tcp header. With relevant parts of the tcp header changing after
the 'signature' was computed, the signature becomes invalid.

Reviewed by:	tools/regression/netinet/tcpconnect
MFC after:	3 days
Tested by:	Nick Hilliard (see net@)
2007-11-30 23:41:51 +00:00
bz
621536d5d9 Let opt be an array. Though &opt[0] == opt == &opt, &opt is highly
confusing and hard to understand so change it to just opt and
remove the extra cast no longer/not needed.

Discussed with: rwatson
MFC after:      3 days
2007-11-28 13:33:27 +00:00
bz
373ab6f7ab Correctly get the authentication key for TCP-MD5 from the SA.
Submitted by:	Nick Hilliard on net@
MFC after:	8 weeks
2007-11-28 13:23:50 +00:00
rwatson
a32c33d2c7 More carefully handle various cases in sysctl_drop(), such as unlocking
the inpcb when there's an inpcb without associated timewait state, and
not unlocking when the inpcb has been freed.  This avoids a kernel panic
when tcpdrop(8) is run on a socket in the TIMEWAIT state.

MFC after:	3 days
Reported by:	Rako <rako29 at gmail dot com>
2007-11-24 18:43:59 +00:00
jb
0d56ea8bec Fix strict alias warnings. 2007-11-23 23:56:03 +00:00
bz
beb1cbd982 Make TSO work with IPSEC compiled into the kernel.
The lookup hurts a bit for connections but had been there anyway
if IPSEC was compiled in. So moving the lookup up a bit gives us
TSO support at not extra cost.

PR:		kern/115586
Tested by:	gallatin
Discussed with:	kmacy
MFC after:	2 months
2007-11-21 22:30:14 +00:00
silby
99338940b2 Comment out the syncache's test which ensures that hosts which negotiate TCP
timestamps in the initial SYN packet actually use them in the rest of the
connection.  Unfortunately, during the 7.0 testing cycle users have already
found network devices that violate this constraint.

RFC 1323 states 'and may send a TSopt in other segments' rather than
'and MUST send', so we must allow it.

Discovered by: Rob Zietlow
Tracked down by: Kip Macy
PR: bin/118005
2007-11-20 06:56:04 +00:00
oleg
4e6e975846 - New sysctl variable: net.inet.ip.dummynet.io_fast
If it is set to zero value (default) dummynet module will try to emulate
  real link as close as possible (bandwidth & latency): packet will not leave
  pipe faster than it should be on real link with given bandwidth.
  (This is original behaviour of dummynet which was altered in previous commit)
  If it is set to non-zero value only bandwidth is enforced: packet's latency
  can be lower comparing to real link with given bandwidth.

- Document recently introduced dummynet(4) sysctl variables.

Requested by:	luigi, julian
MFC after:	3 month
2007-11-17 21:54:57 +00:00
rrs
f665676ee0 - Fix a bug in sctp_calc_rwnd() which resulted in wrong rwnd predictions.
- Fix a signedness problem that shows up in some 64 bit platforms (macos).

MFC after:	1 week
2007-11-10 00:47:14 +00:00
oleg
7eef73ab3f 1) dummynet_io() declaration has changed.
2) Alter packet flow inside dummynet: allow certain packets to bypass
dummynet scheduler. Benefits are:

- lower latency: if packet flow does not exceed pipe bandwidth, packets
  will not be (up to tick) delayed (due to dummynet's scheduler granularity).
- lower overhead: if packet avoids dummynet scheduler it shouldn't reenter ip
  stack later. Such packets can be fastforwarded.
- recursion (which can lead to kernel stack exhaution) eliminated. This fix
  long existed panic, which can be triggered this way:
  	kldload dummynet
	sysctl net.inet.ip.fw.one_pass=0
	ipfw pipe 1 config bw 0
	for i in `jot 30`; do ipfw add 1 pipe 1 icmp from any to any; done
	ping -c 1 localhost

3) Three new sysctl nodes are added:
net.inet.ip.dummynet.io_pkt -		packets passed to dummynet
net.inet.ip.dummynet.io_pkt_fast - 	packets avoided dummynet scheduler
net.inet.ip.dummynet.io_pkt_drop -	packets dropped by dummynet

P.S. Above comments are true only for layer 3 packets. Layer 2 packet flow
     is not changed yet.

MFC after:	3 month
2007-11-06 23:01:42 +00:00
oleg
dd5717decc style(9) cleanup.
MFC after:	3 month
2007-11-06 22:53:41 +00:00
rrs
814ed57392 - Change the Time Wait of vtags value to match the cookie-life
- Select a tag gains ability to optionally save new tags
  off in the timewait system.
- When looking up associations do not give back a stcb that
  is in the about-to-be-freed state, and instead continue
  looking for other candiates.
- New function to query to see if value is in time-wait.
- Timewait had a time comparison error that caused very
  few vtags to actually stay in time-wait.
- When setting tags in time-wait, we now use the time
  requested NOT a fixed constant value.
- sstat now gets the proper associd when we do the query.
- When we process an association, we expect the tag chosen
  (if we have one from a cookie) to be in time-wait. Before
  we would NOT allow the assoc up by checking if its good.
  In theory this should have caused almost all assoc not
  to come up except for the time-comparison bug above (this
  bug was hidden by the time comparison bug :-D).
- Don't save tags for nonce values in the time-wait cache
  since these are used only during cookie collisions and do
  not matter if they are unique or not.
MFC after:	1 week
2007-10-30 14:09:24 +00:00
rwatson
369fd04f48 Continue to move from generic network entry points in the TrustedBSD MAC
Framework by moving from mac_mbuf_create_netlayer() to more specific
entry points for specific network services:

- mac_netinet_firewall_reply() to be used when replying to in-bound TCP
  segments in pf and ipfw (etc).

- Rename mac_netinet_icmp_reply() to mac_netinet_icmp_replyinplace() and
  add mac_netinet_icmp_reply(), reflecting that in some cases we overwrite
  a label in place, but in others we apply the label to a new mbuf.

Obtained from:	TrustedBSD Project
2007-10-28 17:12:48 +00:00
rwatson
2bca3d4001 Move towards more explicit support for various network protocol stacks
in the TrustedBSD MAC Framework:

- Add mac_atalk.c and add explicit entry point mac_netatalk_aarp_send()
  for AARP packet labeling, rather than using a generic link layer
  entry point.

- Add mac_inet6.c and add explicit entry point mac_netinet6_nd6_send()
  for ND6 packet labeling, rather than using a generic link layer entry
  point.

- Add expliict entry point mac_netinet_arp_send() for ARP packet
  labeling, and mac_netinet_igmp_send() for IGMP packet labeling,
  rather than using a generic link layer entry point.

- Remove previous genering link layer entry point,
  mac_mbuf_create_linklayer() as it is no longer used.

- Add implementations of new entry points to various policies, largely
  by replicating the existing link layer entry point for them; remove
  old link layer entry point implementation.

- Make MAC_IFNET_LOCK(), MAC_IFNET_UNLOCK(), and mac_ifnet_mtx global
  to the MAC Framework rather than static to mac_net.c as it is now
  needed outside of mac_net.c.

Obtained from:	TrustedBSD Project
2007-10-28 15:55:23 +00:00
rwatson
a3b8fc4866 Rename 'mac_mbuf_create_from_firewall' to 'mac_netinet_firewall_send' as
we move towards netinet as a pseudo-object for the MAC Framework.

Rename 'mac_create_mbuf_linklayer' to 'mac_mbuf_create_linklayer' to
reflect general object-first ordering preference.

Sponsored by:	SPARTA (original patches against Mac OS X)
Obtained from:	TrustedBSD Project, Apple Computer
2007-10-26 13:18:38 +00:00
rwatson
ad62572aa2 Normalize TCP syncache-related MAC Framework entry points to match most
other entry points in the form mac_<object>_method().

Discussed with:	csjp
Obtained from:	TrustedBSD Project
2007-10-25 14:37:37 +00:00
rwatson
60570a92bf Merge first in a series of TrustedBSD MAC Framework KPI changes
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:

  mac_<object>_<method/action>
  mac_<object>_check_<method/action>

The previous naming scheme was inconsistent and mostly
reversed from the new scheme.  Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier.  Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods.  Also simplify, slightly,
some entry point names.

All MAC policy modules will need to be recompiled, and modules
not updates as part of this commit will need to be modified to
conform to the new KPI.

Sponsored by:	SPARTA (original patches against Mac OS X)
Obtained from:	TrustedBSD Project, Apple Computer
2007-10-24 19:04:04 +00:00
julian
51d643caa6 Rename the kthread_xxx (e.g. kthread_create()) calls
to kproc_xxx as they actually make whole processes.
Thos makes way for us to add REAL kthread_create() and friends
that actually make theads. it turns out that most of these
calls actually end up being moved back to the thread version
when it's added. but we need to make this cosmetic change first.

I'd LOVE to do this rename in 7.0  so that we can eventually MFC the
new kthread_xxx() calls.
2007-10-20 23:23:23 +00:00
rpaulo
5ca00498b6 Remove IPTOS_CE and IPTOS_ECT constants. They were defined in RFC 2481
but later obsoleted by RFC 3168.
Discussed on freebsd-net with no objections.

Approved by: njl (mentor), rwatson
2007-10-19 12:46:15 +00:00
silby
85eb47c084 Pick the smallest possible TCP window scaling factor that will still allow
us to scale up to sb_max, aka kern.ipc.maxsockbuf.

We do this because there are broken firewalls that will corrupt the window
scale option, leading to the other endpoint believing that our advertised
window is unscaled.  At scale factors larger than 5 the unscaled window will
drop below 1500 bytes, leading to serious problems when traversing these
broken firewalls.

With the default maxsockbuf of 256K, a scale factor of 3 will be chosen by
this algorithm.  Those who choose a larger maxsockbuf should watch out
for the compatiblity problems mentioned above.

Reviewed by:	andre
2007-10-19 08:53:14 +00:00
rrs
ca7dd6ed00 - fix sctp_ifn initial refcount issue (prevents deletion)
- fix a bug during cookie collision that prevented an
  association from coming up in a specific restart case.
- Fix it so the shutdown-pending flag gets removed (this is
  more for correctness then needed) when we enter shutdown-sent
  or shutdown-ack-sent states.
- Fix a bug that caused the receiver to sometimes NOT send
  a SACK when a duplicate TSN arrived. Without this fix
  it was possible for the association to fall down if the
- Deleted primary destination is also stored when SCTP_MOBILITY_BASE.
  (Previously, it is stored when only SCTP_MOBILITY_FASTHANDOFF)
- Fix a locking issue where we might call send_initiate_ack() and
  incorrectly state the lock held/not held. Also fix it so that
  when we release the lock the inp cannot be deleted on us.
- Add the debug option that can cause the stack to panic instead
  of aborting an assoc. This does not and should never show up
  in options but is useful for debugging unexpected aborts.
- Add cumack_log sent to track sending cumack information for
  the debug case where we are running a special log per assoc.
- Added extra () aroudn sctp_sbspace macro to avoid compile warnings.
MFC after:	1 week
2007-10-16 14:05:51 +00:00
kevlo
7a9f1e285b Spelling fix for interupt -> interrupt 2007-10-12 06:03:46 +00:00
silby
f965c7bdc4 Add FBSDID to all files in netinet so that people can more
easily include file version information in bug reports.

Approved by:	re (kensmith)
2007-10-07 20:44:24 +00:00
silby
3faef02860 Improve the debugging message:
TCP: [X.X.X.X]:X to [X.X.X.X]:X tcpflags 0x18<PUSH,ACK>; tcp_do_segment: FIN_WAIT_2: Received data after socket was closed, sending RST and removing tcpcb

So that it also includes how many bytes of data were received.  It now looks
like this:

TCP: [X.X.X.X]:X to [X.X.X.X]:X tcpflags 0x18<PUSH,ACK>; tcp_do_segment: FIN_WAIT_2: Received X bytes of data after socket was closed, sending RST and removing tcpcb

Approved by:	re (gnn)
2007-10-07 00:07:27 +00:00
rrs
880e253277 - Fix the one-2-one model to properly do a socantrecv()
Approved by:	re@freeBSD.org (Ken Smith)
2007-10-06 13:23:42 +00:00
rwatson
54871123ad Disable TCP syncache debug logging by default. While useful in debugging
problems with the syncache, it produces a lot of console noise and has led
to quite a few false positive bug reports.  It can be selectively
re-enabled when debugging specific problems by frobbing the same sysctl.

Discussed with:	silby
Approved by:	re (gnn)
2007-10-05 22:39:44 +00:00
rrs
d602e03477 - We should return error = 0 and the upper processing would
return a zero length read. Otherwise we don't return the
  right error indication.

Approved by:	re@freebsd.org (gnn)
2007-10-04 09:29:33 +00:00
rrs
dfb6039bc1 - Bug fix managing congestion parameter on immediate
retransmittion by handover event (fast mobility code)
- Fixed problem of mobility code which is caused by remaining
  parameters in the deleted primary destination.
- Add a missing lock. When a peer sends an INIT, and while we
  are processing it to send an INIT-ACK the socket is closed,
  we did not hold a lock to keep the socket from going away.
  Add protection for this case.
- Fix so that arwnd is alway uses the minimal rwnd if the user
  has set the socket buffer smaller. Found this when the test
  org decided to see what happens when you set in a rwnd of 10
  bytes (which is not allowed per RFC .. 4k is minimum).
- Fixes so a cookie-echo ootb will NOT cause an abort to
  be sent. This was happening in a MPI collision case.
- Examined all panics and unless there was no recovery, moved
  any that were not already to INVARANTS.

Approved by:	re@freebsd.org (gnn)
2007-10-01 03:22:29 +00:00
maxim
7fef5796dc o For dynamic rules log a parent rule number. Prefix a log message
by 'ipfw: '.

PR:		kern/115755
Submitted by:	sem
Approved by:	re (gnn)
MFC after:	4 weeks
2007-09-29 15:01:41 +00:00
kib
dd74194c9c Revert rev. 1.94. After recent tcp backouts, tcp_close() may return NULL.
Check the return value of tcp_close() being NULL before dereferencing it
in #ifdef TCPDEBUG block.

Reviewed by:	rwatson
Approved by:	re (gnn)
2007-09-24 14:46:27 +00:00
silby
5fb86a6fa7 Two changes:
- Reintegrate the ANSI C function declaration change
  from tcp_timer.c rev 1.92

- Reorganize the tcpcb structure so that it has a single
  pointer to the "tcp_timer" structure which contains all
  of the tcp timer callouts.  This change means that when
  the single tcp timer change is reintegrated, tcpcb will
  not change in size, and therefore the ABI between
  netstat and the kernel will not change.

Neither of these changes should have any functional
impact.

Reviewed by: bmah, rrs
Approved by: re (bmah)
2007-09-24 05:26:24 +00:00
csjp
5328bd11db Certain consumers of rtalloc like gif(4) and if_stf(4) lookup the
route and once they are done with it, call rtfree().  rtfree() should
only be used when we are certain we hold the last reference to the
route.  This bug results in console messages like the following:

rtfree: 0xc40f7000 has 1 refs

This patch switches the rtfree() to use RTFREE_LOCKED() instead,
which should handle the reference counting on the route better.

Approved by:	re@ (gnn)
Reviewed by:	bms
Reported by:	many via net@ and current@
Tested by:	many
2007-09-23 17:50:17 +00:00
rrs
66d80bdf93 - fix (global) address handling in the presence of duplicates, the
last interface should own the address, but the current code
  fumbles the handoff. This fixes that.
- move address related debugs to PCB4 and add additional ones to
  help in debugging address problems.

Approved by:	re@freebsd.org (K Smith)
2007-09-21 04:19:33 +00:00
rrs
af4581daa3 - The address lock is changed to a rwlock. This
also involves macro changes to have a RLOCK and a WLOCK
  and placing the correct version within the code.
- The INP-INFO lock is changed to a rwlock.
- When sctp_shutdown() is called on Mac OS X, the socket lock is held.
  So call sctp_chunk_output with SCTP_SO_LOCKED and
  not SCTP_SO_NOT_LOCKED.
- Add SCTP_IPI_ADDR_[RW]LOCK and SCTP_IPI_ADDR_[RW]UNLOCK for Mac OS X.
- u_int64_t -> uint64_t
- add missing addr unlock for error return path
Approved by:	re@freebsd.org (K Smith)
2007-09-18 15:16:39 +00:00
rrs
44d85d753b - For the 1-to-1 model, fix an off by one error that
allowed an extra connection over the backlog (by one)
Approved by:	re@freebsd.org (B. Mah)
2007-09-16 23:03:38 +00:00
rrs
51cad52bc8 - Get rid of unsused constants for sysctl variables.
- Fix panic from mutex unlock on freed lock when ASCONF-ACK
  aborts an assoc
- Fix panic from addr lock recursion when ASCONFs are queued
  in the front states
- ASCONFs "queued" in the front states should really be
  bundled after the COOKIE-ACK, not in front of it
- Fix issue with addresses deleted in the front states from
  being sent with ASCONF(DELETE)-- replaced
  sctp_asconf_queue_add_sa() with delete specific function
- Comment change in sctp.h the drafts are now RFC's
Approved by:	re@freebsd.org (B Mah)
2007-09-15 19:07:42 +00:00
rrs
6368c8b699 - DF bit was on for COOKIE-ECHO chunks. This is
incorrect and should be OFF letting IP fragment
  large cookie-echos.
- Rename sysctl variable logging to log_level.
- Fix description of sysctl variable stats.
- Add sysctl variable log to make sctp_log readable via sysctl
  mechanism (this is by compile switch and targets non KTR platforms or
  when someone wants to do performance wise tracing).
 - Removed debug code

Approved by:	re@freebsd.org (B Mah)
2007-09-13 14:43:54 +00:00
rrs
73fcd49c86 - Incorrect error EAGAIN returned for invalid send on a locked
stream (using EEOR mode). Changed to EINVAL (in sctp_output.c)
- Static analysis comments added
- fix in mobility code to return a value (static analysis found).
- sctp6_notify function made visible instead of
  static (this is needed for Panda).

Approved by:	re@freebsd.org (B Mah)
2007-09-13 10:36:43 +00:00
rrs
8696d874ba - Removed debug code and more C++ style comments in the mobility
code in sctp_asconf.c
Approved by:	re@freebsd.org (B Mah)
2007-09-10 21:01:56 +00:00
rrs
1b1d8efe7c - Added some comments to tell where the htcp
code comes from.
- Fix a LOR on Mac OS X: Do not hold an stcb lock when
  calling soisconnected for a socket which has the
  SS_INCOMP bit set on so_state.
- fix a comment to be non c++ style.

Approved by:	re@freebsd.org (B Mah)
2007-09-10 17:06:25 +00:00
kensmith
671d1148ba Make sure that either inp is NULL or we have obtained a lock on it before
jumping to dropunlock to avoid a panic.  While here move the calls to
ipsec4_in_reject() and ipsec6_in_reject() so they are after we obtain
the lock on inp.

Original patch to avoid panic:	pjd
Review of locking adjustments:	gnn, sam
Approved by:			re (rwatson)
2007-09-10 14:49:32 +00:00
rwatson
200ce01ddb Further UDPv4 cleanup:
- Resort includes a bit.
- Correct typos and wording problems in comments.
- Rename udpcksum to udp_cksum to be consistent with other UDP-related
  configuration variables.
- Remove indirection of udp_notify through local notify variable in
  udp_ctlinput(), which is presumably due to copying and pasting from TCP,
  where multiple notify routines exist.

Approved by:	re (kensmith)
2007-09-10 14:22:15 +00:00
rrs
e1de0a1eda - send call has a reference to uio->uio_resid in
the recent send code, but uio may be NULL on sendfile
  calls. Change to use sndlen variable.
- EMSGSIZE is not being returned in non-blocking mode
  and needs a small tweak to look if the msg would
  ever fit when returning EWOULDBLOCK.
- FWD-TSN has a bug in stream processing which could
  cause a panic. This is a follow on to the codenomicon
  fix.
- PDAPI level 1 and 2 do not work unless the reader
  gets his returned buffer full. Fix so we can break
  out when at level 1 or 2.
- Fix fast-handoff features to copy across properly on
  accepted sockets
- Fix sctp_peeloff() system call when no true system call
  exists to screen arguments for errors. In cases where a
  real system call exists the system call itself does this.
- Fix raddr leak in recent add-ip code change for bundled
  asconfs (even when non-bundled asconfs are received)
- Make sure ipi_addr lock is held when walking global addr
  list. Need to change this lock type to a rwlock().
- Add don't wake flag on both input and output when the
  socket is closing.
- When deleting an address verify the interface is correct
  before allowing the delete to process. This protects panda
  and unnumbered.
- Clean up old sysctl stuff and get rid of the old Open/Net
  BSD structures.
- Add a function to watch the ranges in the sysctl sets.
- When appending in the reassembly queue, validate that
  the assoc has not gone to about to be freed. If so
  (in the middle) abort out. Note this especially effects
  MAC I think due to the lock/unlock they do (or with
  LOCK testing in place).
- Netstat patch to get rid of warnings.
- Make sure that no data gets queued to inactive/unconfirmed
  destinations. This especially effect CMT but also makes a
  impact on regular SCTP as well.
- During init collision when we detect seq number out
  of sync we need to treat it like Case C and discard
  the cookie (no invarient needed here).
- Atomic access to the random store.
- When we declare a vtag good, we need to shove it
  into the time wait hash to prevent further use. When
  the tag is put into the assoc hash, we need to remove it
  from the twait hash (where it will surely be). This prevents
  duplicate tag assignments.
- Move decr-ref count to better protect sysctl out of
  data.
- ltrace error corrections in sctp6_usrreq.c
- Add hook for interface up/down to be sent to us.
- Make sysctl() exported structures independent of processor
  architecture.
- Fix route and src addr cache clearing for delete address case.
- Make sure address marked SCTP_DEL_IP_ADDRESS is never selected
  as src addr.
- in icmp handling fixed so we actually look at the icmp codes
  to figure out what to do.
- Modified mobility code.
  Reception of DELETE IP ADDRESS for a primary destination and
  SET PRIMARY for a new primary destination is used for
  retransmission trigger to the new primary destination.
  Also, in this case, destination of chunks in send_queue are
  changed to the new primary destination.
- Fix so that we disallow sending by mbuf to ever have EEOR
  mode set upon it.

Approved by:	re@freebsd.org (B Mah)
2007-09-08 17:48:46 +00:00
rrs
4dd82bd675 - Locking compatiability changes. This involves adding
additional flags to many function calls. The flags only
  get used in BSD when we compile with lock testing. These
  flags allow apple to escape the "giant" lock it holds on
  the socket and have more fine-grained locking in the NKE.
  It also allows us to test (with witness) the locking used
  by apple via a compile switch (manually applied).

Approved by:	re@freebsd.org(B Mah)
2007-09-08 11:35:11 +00:00
rwatson
e14f216203 Back out tcp_timer.c:1.93 and associated changes that reimplemented the many
TCP timers as a single timer, but retain the API changes necessary to
reintroduce this change.  This will back out the source of at least two
reported problems: lock leaks in certain timer edge cases, and TCP timers
continuing to fire after a connection has closed (a bug previously fixed and
then reintroduced with the timer rewrite).

In a follow-up commit, some minor restylings and comment changes performed
after the TCP timer rewrite will be reapplied, and a further change to allow
the TCP timer rewrite to be added back without disturbing the ABI.  The new
design is believed to be a good thing, but the outstanding issues are
leading to significant stability/correctness problems that are holding
up 7.0.

This patch was generated by silby, but is being committed by proxy due to
poor network connectivity for silby this week.

Approved by:	re (kensmith)
Submitted by:	silby
Tested by:	rwatson, kris
Problems reported by:	peter, kris, others
2007-09-07 09:19:22 +00:00
green
a2737718b8 Repair ALTQ-tagging rules in IPFW which got broken in the last PF
import.  The PF mbuf-tagging support routines changed to link the
allocated tags into the provided mbuf themselves, so the left-over
m_tag_prepend() was trying to add a bogus (usually NULL) tag.

Reviewed by: mlaier
Approved by: re
2007-08-29 19:34:28 +00:00
rrs
e335457f91 - During shutdown pending, when the last sack came in and
the last message on the send stream was "null" but still
  there, a state we allow, we could get hung and not clean
  it up and wait for the shutdown guard timer to clear the
  association without a graceful close. Fix this so that
  that we properly clean up.
- Added support for Multiple ASCONF per new RFC. We only
  (so far) accept input of these and cannot yet generate
  a multi-asconf.
- Sysctl'd support for experimental Fast Handover feature. Always
  disabled unless sysctl or socket option changes to enable.
- Error case in add-ip where the peer supports AUTH and ADD-IP
  but does NOT require AUTH of ASCONF/ASCONF-ACK. We need to
  ABORT in this case.
- According to the Kyoto summit of socket api developers
  (Solaris, Linux, BSD). We need to have:
   o non-eeor mode messages be atomic - Fixed
   o Allow implicit setup of an assoc in 1-2-1 model if
     using the sctp_**() send calls - Fixed
   o Get rid of HAVE_XXX declarations - Done
   o add a sctp_pr_policy in hole in sndrcvinfo structure - Done
   o add a PR_SCTP_POLICY_VALID type flag - yet to-do in a future patch!
- Optimize sctp6 calls to reuse code in sctp_usrreq. Also optimize
  when we close sending out the data and disabling Nagle.
- Change key concatenation order to match the auth RFC
- When sending OOTB shutdown_complete always do csum.
- Don't send PKT-DROP to a PKT-DROP
- For abort chunks just always checksums same for
  shutdown-complete.
- inpcb_free front state had a bug where in queue
  data could wedge an assoc. We need to just abandon
  ones in front states (free_assoc).
- If a peer sends us a 64k abort, we would try to
  assemble a response packet which may be larger than
  64k. This then would be dropped by IP. Instead make
  a "minimum" size for us 64k-2k (we want at least
  2k for our initack). If we receive such an init
  discard it early without all the processing.
- When we peel off we must increment the tcb ref count
  to keep it from being freed from underneath us.
- handling fwd-tsn had bugs that caused memory overwrites
  when given faulty data, fixed so can't happen and we
  also stop at the first bad stream no.
- Fixed so comm-up generates the adaption indication.
- peeloff did not get the hmac params copied.
- fix it so we lock the addr list when doing src-addr selection
  (in future we need to use a multi-reader/one writer lock here)
- During lowlevel output, we could end up with a _l_addr set
  to null if the iterator is calling the output routine. This
  means we would possibly crash when we gather the MTU info.
  Fix so we only do the gather where we have a src address
  cached.
- we need to be sure to set abort flag on conn state when
  we receive an abort.
- peeloff could leak a socket. Moved code so the close will
  find the socket if the peeloff fails (uipc_syscalls.c)

Approved by:	re@freebsd.org(Ken Smith)
2007-08-27 05:19:48 +00:00
maxim
3eb0fa1342 o Fix bug I introduced in the previous commit (ipfw set extention):
pack a set number correctly.

Submitted by:	oleg

o Plug a memory leak.

Submitted by:	oleg and Andrey V. Elsukov
Approved by:	re (kensmith)
MFC after:	1 week
2007-08-26 18:38:31 +00:00
rrs
1d0af67d1a - Fix address add handling to clear cached routes and source addresses
when peer acks the add in case the routing table changes.
- Fix sctp_lower_sosend to send shutdown chunk for mbuf send
  case when sndlen = 0 and sinfoflag = SCTP_EOF
- Fix sctp_lower_sosend for SCTP_ABORT mbuf send case with null data,
  So that it does not send the "null" data mbuf out and cause
  it to get freed twice.
- Fix so auto-asconf sysctl actually effect the socket's asconf state.
- Do not allow SCTP_AUTO_ASCONF option to be used on subset bound sockets.
- Memset bug in sctp_output.c (arguments were reversed) submitted
  found and reported by Dave Jones (davej@codemonkey.org.uk).
- PD-API point needs to be invoked >= not just > to conform to socket api
  draft this fixes sctp_indata.c in the two places need to be >=.
- move M_NOTIFICATION to use M_PROTO5.
- PEER_ADDR_PARAMS did not fail properly if you specify an address
  that is not in the association with a valid assoc_id. This meant
  you got or set the stcb level values instead of the destination
  you thought you were going to get/set. Now validate if the
  stcb is non-null and the net is NULL that the sa_family is
  set and the address is unspecified otherwise return an error.
- The thread based iterator could crash if associations were freed
  at the exact time it was running. rework the worker thread to
  use the increment/decrement to prevent this and no longer use
  the markers that the timer based iterator uses.
- Fix the memleak in sctp_add_addr_to_vrf() for the case when it is
  detected that ifa is already pointing to a ifn.
- Fix it so that if someone is so insane that they drop the
  send window below the minimal add mark, they still can send.
- Changed all state for associations to use mask safe macro.
- During front states in association freeing in sctp_inpcbfree, we
  had a locking problem where locks were not in place where they
  should have been.
- Free association calls were not testing the return value in
  sctp_inpcb_free() properly... others should be cast  void returns
  where we don't care about the return value.
- If a reference count is held on an assoc, even from the "force free"
  we should not do the actual free.. but instead let the timer
  free it.
- When we enter sctp_input(), if the SCTP_ASOC_ABOUT_TO_BE_FREED
  flag is set, we must NOT process the packet but handle it like
  ootb. This is because while freeing an assoc we release the
  locks to get all the higher order locks so we can purge all
  the hash tables. This leaves a hole if a packet comes in
  just at that point. Now sctp_common_input_processing() will
  call the ootb code in such a case.
- Change MBUF M_NOTIFICATION to use M_PROTO5 (per Sam L). This makes
  it so we don't have a conflict (I think this is a covertity change).
  We made this change AFTER some conversation and looking to make sure
  that M_PROTO5 does not have a problem between SCTP and the 802.11
  stuff (which is the only other place its used).
- Fixed lock order reversal and missing atomic protection around
  locked_tcb during association lookup and the 1-2-1 model.
- Added debug to source address selection.
- V6 output must always do checksum even for loopback.
- Remove more locks around inp that are not needed for an atomically
  added/subtracted ref count.
- slight optimization in the way we zero the array in sctp_sack_check()
- It was possible to respond to a ABORT() with bad checksum with
  a PKT-DROP. This lead to a PKT-DROP/ABORT war. Add code to NOT
  send a PKT-DROP to any ABORT().
- Add an option for local logging (useful for macintosh or when
  you need better performing during debugging). Note no commands
  are here to get the log info, you must just use kgdb.
- The timer code needs to be aware of if it needs to call
  sctp_sack_check() to slide the maps and adjust the cum-ack.
  This is because it may be out of sync cum-ack wise.
- Added threshold managment logging.
- If the user picked just the right size, that just filled the send
  window minus one mtu, we would enter a forever loop not copying and
  at the same time not blocking. Change from < to <= solves this.
- Sysctl added to control the fragment interleave level which defaults
  to 1.
- My rwnd control was not being used to control the rwnd properly (we
  did not add and subtract to it :-() this is now fixed so we handle
  small messages (1 byte etc) better to bring our rwnd down more
  slowly.

Approved by:	re@freebsd.org (Bruce Mah)
2007-08-24 00:53:53 +00:00
rrs
1bcb372970 - Remove extra comment for 7.0 (no GIANT here).
- Remove unneeded WLOCK/UNLOCK of inp for getting TCB lock.
- Fix panic that may occur when freeing an assoc that has partial
  delivery in progress (may dereference null socket pointer when
  queuing partial delivery aborted notification)
- Some spacing and comment fixes.
- Fix address add handling to clear cached routes and source addresses
  when peer acks the add in case the routing table changes.
Approved by:	re@freebsd.org (Bruce Mah)
2007-08-16 01:51:22 +00:00