freebsd-dev

Author	SHA1	Message	Date
Gleb Smirnoff	3e630ef9a9	Add a sysctl net.inet.tcp.nolocaltimewait that allows to suppress creating a compress TIME WAIT states, if both connection endpoints are local. Default is off.	2006-09-08 13:09:15 +00:00
Ruslan Ermilov	751dea2935	Back when we had T/TCP support, we used to apply different timeouts for TCP and T/TCP connections in the TIME_WAIT state, and we had two separate timed wait queues for them. Now that is has gone, the timeout is always 2*MSL again, and there is no reason to keep two queues (the first was unused anyway!). Also, reimplement the remaining queue using a TAILQ (it was technically impossible before, with two queues).	2006-09-07 13:06:00 +00:00
Andre Oppermann	233dcce118	First step of TSO (TCP segmentation offload) support in our network stack. o add IFCAP_TSO[46] for drivers to announce this capability for IPv4 and IPv6 o add CSUM_TSO flag to mbuf pkthdr csum_flags field o add tso_segsz field to mbuf pkthdr o enhance ip_output() packet length check to allow for large TSO packets o extend tcp_maxmtu[46]() with a flag pointer to pass interface capabilities o adjust all callers of tcp_maxmtu[46]() accordingly Discussed on: -current, -net Sponsored by: TCP/IP Optimization Fundraise 2005	2006-09-06 21:51:59 +00:00
Gleb Smirnoff	2c857a9be9	o Backout rev. 1.125 of in_pcb.c. It appeared to behave extremely bad under high load. For example with 40k sockets and 25k tcptw entries, connect() syscall can run for seconds. Debugging showed that it iterates the cycle millions times and purges thousands of tcptw entries at a time. Besides practical unusability this change is architecturally wrong. First, in_pcblookup_local() is used in connect() and bind() syscalls. No stale entries purging shouldn't be done here. Second, it is a layering violation. o Return back the tcptw purging cycle to tcp_timer_2msl_tw(), that was removed in rev. 1.78 by rwatson. The commit log of this revision tells nothing about the reason cycle was removed. Now we need this cycle, since major cleaner of stale tcptw structures is removed. o Disable probably necessary, but now unused tcp_twrecycleable() function. Reviewed by: ru	2006-09-06 13:56:35 +00:00
Gleb Smirnoff	c3e07bf82a	Finally fix rev. 1.256 Pointy hat to: glebius	2006-09-05 14:00:59 +00:00
Gleb Smirnoff	23ebab416c	Remove extra parenthesis in last commit. Nitpicked by: ru	2006-09-05 12:22:54 +00:00
Gleb Smirnoff	1f1f90c3a7	- Make net.inet.tcp.maxtcptw modifiable at run time. - If net.inet.tcp.maxtcptw was ever set explicitly, do not change it if kern.ipc.maxsockets is changed.	2006-09-05 12:08:47 +00:00
Mohan Srinivasan	2374501ca4	Fix for a bug that causes the computation of "len" in tcp_output() to get messed up, resulting in an inconsistency between the TCP state and so_snd.	2006-08-26 17:53:19 +00:00
Mohan Srinivasan	464469c713	Fixes an edge case bug in timewait handling where ticks rolling over causing the timewait expiry to be exactly 0 corrupts the timewait queues (and that entry). Reviewed by: silby	2006-08-11 21:15:23 +00:00
Robert Watson	e850475248	Move soisdisconnected() in tcp_discardcb() to one of its calling contexts, tcp_twstart(), but not to the other, tcp_detach(), as the socket is already being torn down and therefore there are no listeners. This avoids a panic if kqueue state is registered on the socket at close(), and eliminates to XXX comments. There is one case remaining in which tcp_discardcb() reaches up to the socket layer as part of the TCP host cache, which would be good to avoid. Reported by: Goran Gajic <ggajic at afrodita dot rcub dot bg dot ac dot yu>	2006-08-02 16:18:05 +00:00
Robert Watson	a152f8a361	Change semantics of socket close and detach. Add a new protocol switch function, pru_close, to notify protocols that the file descriptor or other consumer of a socket is closing the socket. pru_abort is now a notification of close also, and no longer detaches. pru_detach is no longer used to notify of close, and will be called during socket tear-down by sofree() when all references to a socket evaporate after an earlier call to abort or close the socket. This means detach is now an unconditional teardown of a socket, whereas previously sockets could persist after detach of the protocol retained a reference. This faciliates sharing mutexes between layers of the network stack as the mutex is required during the checking and removal of references at the head of sofree(). With this change, pru_detach can now assume that the mutex will no longer be required by the socket layer after completion, whereas before this was not necessarily true. Reviewed by: gnn	2006-07-21 17:11:15 +00:00
Stephan Uphoff	d915b28015	Fix race conditions on enumerating pcb lists by moving the initialization ( and where appropriate the destruction) of the pcb mutex to the init/finit functions of the pcb zones. This allows locking of the pcb entries and race condition free comparison of the generation count. Rearrange locking a bit to avoid extra locking operation to update the generation count in in_pcballoc(). (in_pcballoc now returns the pcb locked) I am planning to convert pcb list handling from a type safe to a reference count model soon. ( As this allows really freeing the PCBs) Reviewed by: rwatson@, mohans@ MFC after: 1 week	2006-07-18 22:34:27 +00:00
Robert Watson	10702a2840	Abstract inpcb drop logic, previously just setting of INP_DROPPED in TCP, into in_pcbdrop(). Expand logic to detach the inpcb from its bound address/port so that dropping a TCP connection releases the inpcb resource reservation, which since the introduction of socket/pcb reference count updates, has been persisting until the socket closed rather than being released implicitly due to prior freeing of the inpcb on TCP drop. MFC after: 3 months	2006-04-25 11:17:35 +00:00
Robert Watson	9106a6d6b0	Replace isn_mtx direct use with ISN_*() lock macros so that locking details/strategy can be changed without touching every use. MFC after: 3 months	2006-04-23 12:27:42 +00:00
Robert Watson	4c0e8f41f6	Introduce a new TCP mutex, isn_mtx, which protects the initial sequence number state, rather than re-using pcbinfo. This introduces some additional mutex operations during isn query, but avoids hitting the TCP pcbinfo lock out of yet another frequently firing TCP timer. MFC after: 3 months	2006-04-22 19:23:24 +00:00
Paul Saab	4f590175b7	Allow for nmbclusters and maxsockets to be increased via sysctl. An eventhandler is used to update all the various zones that depend on these values.	2006-04-21 09:25:40 +00:00
Gleb Smirnoff	a73b656763	Add a tunable net.inet.tcp.maxtcptw, that allows to set a limit on tcptw zone independently from setting a limit on socket zone.	2006-04-04 14:31:37 +00:00
Robert Watson	ae0e714308	Before dereferencing intotw() when INP_TIMEWAIT, check for inp_ppcb being NULL. We currently do allow this to happen, but may want to remove that possibility in the future. This case can occur when a socket is left open after TCP wraps up, and the timewait state is recycled. This will be cleaned up in the future. Found by: Kazuaki Oda <kaakun at highway dot ne dot jp> MFC after: 3 months	2006-04-04 12:26:07 +00:00
Robert Watson	cb895fb9b0	In TCP notify routines, check inpcb for INP_TIMEWAIT and INP_DROPPED. The INP_DROPPED check replaces the current NULL checks; the INP_TIMEWAIT checks appear to have always been required, but not been there, which is/was a bug. This avoids unconditionally casting of in_ppcb to a tcpcb, when it may be a twtcb, which may have resulted in obscure ICMP-related panics in earlier releases. MFC after: 3 months	2006-04-03 14:07:50 +00:00
Robert Watson	afa39e25c4	Change inp_ppcb from caddr_t to void , fix/remove associated related casts. Consistently use intotw() to cast inp_ppcb pointers to struct tcptw pointers. Consistently use intotcpcb() to cast inp_ppcb pointers to struct tcpcb * pointers. Don't assign tp to the results to intotcpcb() during variable declation at the top of functions, as that is before the asserts relating to locking have been performed. Do this later in the function after appropriate assertions have run to allow that operation to be conisdered safe. MFC after: 3 months	2006-04-03 13:33:55 +00:00
Robert Watson	43f56a32a0	Style tweaks: convert to ANSI from K&R function prototypes. MFC after: 3 months	2006-04-03 12:59:27 +00:00
Robert Watson	2fc5ae87d0	Update comment on tcp_close() for new world order. MFC after: 3 months	2006-04-03 12:52:13 +00:00
Robert Watson	fa38deac65	Fix up locking surrounding tcp_drop sysctl: in the new world order, we don't free inpcbs until after the socket is closed, so we always need to unlock an inpcb after calling tcp_drop() on it. MFC after: 3 months	2006-04-03 11:57:12 +00:00
Robert Watson	34af7bae80	Properly handle an edge case previously not handled correctly: a socket can have a tcp connection that has entered time wait attached to it, in the event that shutdown() is called on the socket and the FINs properly exchange before close(). In this case we don't detach or free the inpcb, just leave the tcptw detached and freed, but we must release the inpcb lock (which we didn't previously). MFC after: 3 months	2006-04-01 23:53:25 +00:00
Robert Watson	623dce13c6	Update TCP for infrastructural changes to the socket/pcb refcount model, pru_abort(), pru_detach(), and in_pcbdetach(): - Universally support and enforce the invariant that so_pcb is never NULL, converting dozens of unnecessary NULL checks into assertions, and eliminating dozens of unnecessary error handling cases in protocol code. - In some cases, eliminate unnecessary pcbinfo locking, as it is no longer required to ensure so_pcb != NULL. For example, the receive code no longer requires the pcbinfo lock, and the send code only requires it if building a new connection on an otherwise unconnected socket triggered via sendto() with an address. This should significnatly reduce tcbinfo lock contention in the receive and send cases. - In order to support the invariant that so_pcb != NULL, it is now necessary for the TCP code to not discard the tcpcb any time a connection is dropped, but instead leave the tcpcb until the socket is shutdown. This case is handled by setting INP_DROPPED, to substitute for using a NULL so_pcb to indicate that the connection has been dropped. This requires the inpcb lock, but not the pcbinfo lock. - Unlike all other protocols in the tree, TCP may need to retain access to the socket after the file descriptor has been closed. Set SS_PROTOREF in tcp_detach() in order to prevent the socket from being freed, and add a flag, INP_SOCKREF, so that the TCP code knows whether or not it needs to free the socket when the connection finally does close. The typical case where this occurs is if close() is called on a TCP socket before all sent data in the send socket buffer has been transmitted or acknowledged. If INP_SOCKREF is found when the connection is dropped, we release the inpcb, tcpcb, and socket instead of flagging INP_DROPPED. - Abort and detach protocol switch methods no longer return failures, nor attempt to free sockets, as the socket layer does this. - Annotate the existence of a long-standing race in the TCP timer code, in which timers are stopped but not drained when the socket is freed, as waiting for drain may lead to deadlocks, or have to occur in a context where waiting is not permitted. This race has been handled by testing to see if the tcpcb pointer in the inpcb is NULL (and vice versa), which is not normally permitted, but may be true of a inpcb and tcpcb have been freed. Add a counter to test how often this race has actually occurred, and a large comment for each instance where we compare potentially freed memory with NULL. This will have to be fixed in the near future, but requires is to further address how to handle the timer shutdown shutdown issue. - Several TCP calls no longer potentially free the passed inpcb/tcpcb, so no longer need to return a pointer to indicate whether the argument passed in is still valid. - Un-macroize debugging and locking setup for various protocol switch methods for TCP, as it lead to more obscurity, and as locking becomes more customized to the methods, offers less benefit. - Assert copyright on tcp_usrreq.c due to significant modifications that have been made as part of this work. These changes significantly modify the memory management and connection logic of our TCP implementation, and are (as such) High Risk Changes, and likely to contain serious bugs. Please report problems to the current@ mailing list ASAP, ideally with simple test cases, and optionally, packet traces. MFC after: 3 months	2006-04-01 16:36:36 +00:00
Andre Oppermann	eaf80179e2	Have TCP Inflight disable itself if the RTT is below a certain threshold. Inflight doesn't make sense on a LAN as it has trouble figuring out the maximal bandwidth because of the coarse tick granularity. The sysctl net.inet.tcp.inflight.rttthresh specifies the threshold in milliseconds below which inflight will disengage. It defaults to 10ms. Tested by: Joao Barros <joao.barros-at-gmail.com>, Rich Murphey <rich-at-whiteoaklabs.com> Sponsored by: TCP/IP Optimization Fundraise 2005	2006-02-16 19:38:07 +00:00
Andre Oppermann	34333b16cd	Retire MT_HEADER mbuf type and change its users to use MT_DATA. Having an additional MT_HEADER mbuf type is superfluous and redundant as nothing depends on it. It only adds a layer of confusion. The distinction between header mbuf's and data mbuf's is solely done through the m->m_flags M_PKTHDR flag. Non-native code is not changed in this commit. For compatibility MT_HEADER is mapped to MT_DATA. Sponsored by: TCP/IP Optimization Fundraise 2005	2005-11-02 13:46:32 +00:00
Philip Paeps	7691747aac	Unbreak the net.inet6.tcp6.getcred sysctl. This makes inetd/auth work again in IPv6 setups. Pointy hat to: ume/KAME	2005-10-12 09:24:18 +00:00
Maxim Konovalov	ac827533df	o Teach sysctl_drop() how to deal with the sockets in TIME_WAIT state. This is a special case because tcp_twstart() destroys a tcp control block via tcp_discardcb() so we cannot call tcp_drop(struct *tcpcb) on such connections. Use tcp_twclose() instead. MFC after: 5 days	2005-10-02 08:43:57 +00:00
Andre Oppermann	ffabe3dce8	In tcp_ctlinput() do not swap ip->ip_len a second time. It has been done in icmp_input() already. This fixes the ICMP_UNREACH_NEEDFRAG case where no MTU was proposed in the ICMP reply. PR: kern/81813 Submitted by: Vitezslav Novy <vita at fio.cz> MFC after: 3 days	2005-09-10 07:43:29 +00:00
Andre Oppermann	e0aec68255	Use the correct mbuf type for MGET().	2005-08-30 16:35:27 +00:00
Hajimu UMEMOTO	4dad226e45	recover the line which was wrongly disappeared during scope cleanup. tcpdrop(8) should work for IPv6, again.	2005-08-01 12:08:49 +00:00
Hajimu UMEMOTO	a1f7e5f8ee	scope cleanup. with this change - most of the kernel code will not care about the actual encoding of scope zone IDs and won't touch "s6_addr16[1]" directly. - similarly, most of the kernel code will not care about link-local scoped addresses as a special case. - scope boundary check will be stricter. For example, the current BSD code allows a packet with src=::1 and dst=(some global IPv6 address) to be sent outside of the node, if the application do: s = socket(AF_INET6); bind(s, "::1"); sendto(s, some_global_IPv6_addr); This is clearly wrong, since ::1 is only meaningful within a single node, but the current implementation of the BSD kernel cannot reject this attempt. Submitted by: JINMEI Tatuya <jinmei__at__isl.rdc.toshiba.co.jp> Obtained from: KAME	2005-07-25 12:31:43 +00:00
Robert Watson	f59a9ebf10	Remove no-op spl's and most comment references to spls, as TCP locking is believed to be basically done (modulo any remaining bugs). MFC after: 3 days	2005-07-19 12:21:26 +00:00
Paul Saab	482ac96888	Fix for a bug in the change that defers sack option processing until after PAWS checks. The symptom of this is an inconsistency in the cached sack state, caused by the fact that the sack scoreboard was not being updated for an ACK handled in the header prediction path. Found by: Andrey Chernov. Submitted by: Noritoshi Demizu, Raja Mukerji. Approved by: re	2005-07-01 22:54:18 +00:00
Robert Watson	e3d5315d01	Assert tcbinfo lock in tcp_drop() due to its call of tcp_close() Assert tcbinfo lock in tcp_close() due to its call to in{,6}_detach() Assert tcbinfo lock in tcp_drop_syn_sent() due to its call to tcp_drop() MFC after: 7 days	2005-06-01 12:06:07 +00:00
Colin Percival	fe2eee8231	Fix two issues which were missed in FreeBSD-SA-05:08.kmem. Reported by: Uwe Doering	2005-05-07 00:41:36 +00:00
Andre Oppermann	9e4ca6315d	If we don't get a suggested MTU during path MTU discovery look up the packet size of the packet that generated the response, step down the MTU by one step through ip_next_mtu() and try again. Suggested by: dwmalone	2005-05-04 13:48:44 +00:00
Paul Saab	a6235da61e	- Make the sack scoreboard logic use the TAILQ macros. This improves code readability and facilitates some anticipated optimizations in tcp_sack_option(). - Remove tcp_print_holes() and TCP_SACK_DEBUG. Submitted by: Raja Mukerji. Reviewed by: Mohan Srinivasan, Noritoshi Demizu.	2005-04-21 20:11:01 +00:00
Andre Oppermann	1aedbd9c80	Move Path MTU discovery ICMP processing from icmp_input() to tcp_ctlinput() and subject it to active tcpcb and sequence number checking. Previously any ICMP unreachable/needfrag message would cause an update to the TCP hostcache. Now only ICMP PMTU messages belonging to an active TCP session with the correct src/dst/port and sequence number will update the hostcache and complete the path MTU discovery process. Note that we don't entirely implement the recommended counter measures of Section 7.2 of the paper. However we close down the possible degradation vector from trivially easy to really complex and resource intensive. In addition we have limited the smallest acceptable MTU with net.inet.tcp.minmss sysctl for some time already, further reducing the effect of any degradation due to an attack. Security: draft-gont-tcpm-icmp-attacks-03.txt Section 7.2 MFC after: 3 days	2005-04-21 14:29:34 +00:00
Andre Oppermann	1600372b6b	Ignore ICMP Source Quench messages for TCP sessions. Source Quench is ineffective, depreciated and can be abused to degrade the performance of active TCP sessions if spoofed. Replace a bogus call to tcp_quench() in tcp_output() with the direct equivalent tcpcb variable assignment. Security: draft-gont-tcpm-icmp-attacks-03.txt Section 7.1 MFC after: 3 days	2005-04-21 12:37:12 +00:00
Paul Saab	e346eeff65	- If the reassembly queue limit was reached or if we couldn't allocate a reassembly queue state structure, don't update (receiver) sack report. - Similarly, if tcp_drain() is called, freeing up all items on the reassembly queue, clean the sack report. Found, Submitted by: Noritoshi Demizu <demizu at dd dot iij4u dot or dot jp> Reviewed by: Mohan Srinivasan (mohans at yahoo-inc dot com), Raja Mukerji (raja at moselle dot com).	2005-04-10 05:21:29 +00:00
Gleb Smirnoff	31199c8463	Use NET_CALLOUT_MPSAFE macro.	2005-03-01 12:01:17 +00:00
Maxim Konovalov	9945c0e21f	o Add handling of an IPv4-mapped IPv6 address. o Use SYSCTL_IN() macro instead of direct call of copyin(9). Submitted by: ume o Move sysctl_drop() implementation to sys/netinet/tcp_subr.c where most of tcp sysctls live. o There are net.inet[6].tcp[6].getcred sysctls already, no needs in a separate struct tcp_ident_mapping. Suggested by: ume	2005-02-14 07:37:51 +00:00
Hajimu UMEMOTO	6d0a982bdf	teach scope of IPv6 address to net.inet6.tcp6.getcred. MFC after: 1 week	2005-02-04 14:43:05 +00:00
Robert Watson	06456da2c6	Update an additional reference to the rate of ISN tick callouts that was missed in tcp_subr.c:1.216: projected_offset must also reflect how often the tcp_isn_tick() callout will fire. MFC after: 2 weeks Submitted by: silby	2005-01-31 01:35:01 +00:00
Robert Watson	54082796aa	Have tcp_isn_tick() fire 100 times a second, rather than HZ times a second; since the default hz has changed to 1000 times a second, this resulted in unecessary work being performed. MFC after: 2 weeks Discussed with: phk, cperciva General head nod: silby	2005-01-30 23:30:28 +00:00
Warner Losh	c398230b64	/* -> /*- for license, minor formatting changes	2005-01-07 01:45:51 +00:00
Robert Watson	452d9f5b1c	Attempt to consistently use () around return values in calls to return() in newer code (sysctl, ISN, timewait).	2004-12-23 01:34:26 +00:00
Robert Watson	06da46b241	Remove an XXXRW comment relating to whether or not the TCP timers are MPSAFE: they are now believed to be. Correct a typo in a second comment. MFC after: 2 weeks	2004-12-23 01:27:13 +00:00
Robert Watson	b9155d92b2	Assert inpcb lock in: tcpip_fillheaders() tcp_discardcb() tcp_close() tcp_notify() tcp_new_isn() tcp_xmit_bandwidth_limit() Fix a locking comment in tcp_twstart(): the pcbinfo will be locked (and is asserted). MFC after: 2 weeks	2004-12-05 22:27:53 +00:00
Robert Watson	cce83ffb5a	tcp_timewait() performs multiple non-atomic reads on the tcptw structure, so assert the inpcb lock associated with the tcptw. Also assert the tcbinfo lock, as tcp_timewait() may call tcp_twclose() or tcp_2msl_rest(), which require it. Since tcp_timewait() is already called with that lock from tcp_input(), this doesn't change current locking, merely documents reasons for it. In tcp_twstart(), assert the tcbinfo lock, as tcp_timer_2msl_rest() is called, which requires that lock. In tcp_twclose(), assert the tcbinfo lock, as tcp_timer_2msl_stop() is called, which requires that lock. Document the locking strategy for the time wait queues in tcp_timer.c, which consists of protecting the time wait queues in the same manner as the tcbinfo structure (using the tcbinfo lock). In tcp_timer_2msl_reset(), assert the tcbinfo lock, as the time wait queues are modified. In tcp_timer_2msl_stop(), assert the tcbinfo lock, as the time wait queues may be modified. In tcp_timer_2msl_tw(), assert the tcbinfo lock, as the time wait queues may be modified. MFC after: 2 weeks	2004-11-23 17:21:30 +00:00
Robert Watson	7258e91f0f	Assert the inpcb lock in tcp_twstart(), which does both read-modify-write on the tcpcb, but also calls into tcp_close() and tcp_twrespond(). Annotate that tcp_twrecycleable() requires the inpcb lock because it does a series of non-atomic reads of the tcpcb, but is currently called without the inpcb lock by the caller. This is a bug. Assert the inpcb lock in tcp_twclose() as it performs a read-modify-write of the timewait structure/inpcb, and calls in_pcbdetach() which requires the lock. Assert the inpcb lock in tcp_twrespond(), as it performs multiple non-atomic reads of the tcptw and inpcb structures, as well as calling mac_create_mbuf_from_inpcb(), tcpip_fillheaders(), which require the inpcb lock. MFC after: 2 weeks	2004-11-23 16:23:13 +00:00
Robert Watson	8263bab34d	Assert inpcb lock in tcp_quench(), tcp_drop_syn_sent(), tcp_mtudisc(), and tcp_drop(), due to read-modify-write of TCP state variables. MFC after: 2 weeks	2004-11-23 16:06:15 +00:00
Robert Watson	8438db0f59	Assert the tcbinfo write lock in tcp_new_isn(), as the tcbinfo lock protects access to the ISN state variables. Acquire the tcbinfo write lock in tcp_isn_tick() to synchronize timer-driven isn bumping. Staticize internal ISN variables since they're not used outside of tcp_subr.c. MFC after: 2 weeks	2004-11-23 15:59:43 +00:00
SUZUKI Shinsuke	3d54848fc2	support TCP-MD5(IPv4) in KAME-IPSEC, too. MFC after: 3 week	2004-11-08 18:49:51 +00:00
Andre Oppermann	c94c54e4df	Remove RFC1644 T/TCP support from the TCP side of the network stack. A complete rationale and discussion is given in this message and the resulting discussion: http://docs.freebsd.org/cgi/mid.cgi?4177C8AD.6060706 Note that this commit removes only the functional part of T/TCP from the tcp_* related functions in the kernel. Other features introduced with RFC1644 are left intact (socket layer changes, sendmsg(2) on connection oriented protocols) and are meant to be reused by a simpler and less intrusive reimplemention of the previous T/TCP functionality. Discussed on: -arch	2004-11-02 22:22:22 +00:00
Robert Watson	81158452be	Push acquisition of the accept mutex out of sofree() into the caller (sorele()/sotryfree()): - This permits the caller to acquire the accept mutex before the socket mutex, avoiding sofree() having to drop the socket mutex and re-order, which could lead to races permitting more than one thread to enter sofree() after a socket is ready to be free'd. - This also covers clearing of the so_pcb weak socket reference from the protocol to the socket, preventing races in clearing and evaluation of the reference such that sofree() might be called more than once on the same socket. This appears to close a race I was able to easily trigger by repeatedly opening and resetting TCP connections to a host, in which the tcp_close() code called as a result of the RST raced with the close() of the accepted socket in the user process resulting in simultaneous attempts to de-allocate the same socket. The new locking increases the overhead for operations that may potentially free the socket, so we will want to revise the synchronization strategy here as we normalize the reference counting model for sockets. The use of the accept mutex in freeing of sockets that are not listen sockets is primarily motivated by the potential need to remove the socket from the incomplete connection queue on its parent (listen) socket, so cleaning up the reference model here may allow us to substantially weaken the synchronization requirements. RELENG_5_3 candidate. MFC after: 3 days Reviewed by: dwhite Discussed with: gnn, dwhite, green Reported by: Marc UBM Bocklet <ubm at u-boot-man dot de> Reported by: Vlad <marchenko at gmail dot com>	2004-10-18 22:19:43 +00:00
Paul Saab	a55db2b6e6	- Estimate the amount of data in flight in sack recovery and use it to control the packets injected while in sack recovery (for both retransmissions and new data). - Cleanups to the sack codepaths in tcp_output.c and tcp_sack.c. - Add a new sysctl (net.inet.tcp.sack.initburst) that controls the number of sack retransmissions done upon initiation of sack recovery. Submitted by: Mohan Srinivasan <mohans@yahoo-inc.com>	2004-10-05 18:36:24 +00:00
John-Mark Gurney	b5d47ff592	fix up socket/ip layer violation... don't assume/know that SO_DONTROUTE == IP_ROUTETOIF and SO_BROADCAST == IP_ALLOWBROADCAST...	2004-09-05 02:34:12 +00:00
Andre Oppermann	6f2d4ea6f8	For IPv6 access pointer to tcpcb only after we have checked it is valid. Found by: Coverity's automated analysis (via Ted Unangst)	2004-08-19 20:16:17 +00:00
Robert Watson	a4f757cd5d	White space cleanup for netinet before branch: - Trailing tab/space cleanup - Remove spurious spaces between or before tabs This change avoids touching files that Andre likely has in his working set for PFIL hooks changes for IPFW/DUMMYNET. Approved by: re (scottl) Submitted by: Xin LI <delphij@frontfree.net>	2004-08-16 18:32:07 +00:00
David Malone	849112666a	In tcp6_ctlinput, lock tcbinfo around the call to syncache_unreach so that the locks held are the same as the IPv4 case. Reviewed by: rwatson	2004-08-12 18:19:36 +00:00
Andre Oppermann	420a281164	Backout removal of UMA_ZONE_NOFREE flag for all zones which are established for structures with timers in them. It might be that a timer might fire even when the associated structure has already been free'd. Having type- stable storage in this case is beneficial for graceful failure handling and debugging. Discussed with: bosko, tegge, rwatson	2004-08-11 20:30:08 +00:00
Andre Oppermann	4efb805c0c	Remove the UMA_ZONE_NOFREE flag to all uma_zcreate() calls in the IP and TCP code. This flag would have prevented giving back excessive free slabs to the global pool after a transient peak usage.	2004-08-11 17:08:31 +00:00
Robert Watson	f31f65a708	Pass pcbinfo structures to in6_pcbnotify() rather than pcbhead structures, allowing in6_pcbnotify() to lock the pcbinfo and each inpcb that it notifies of ICMPv6 events. This prevents inpcb assertions from firing when IPv6 generates and delievers event notifications for inpcbs. Reported by: kuriyama Tested by: kuriyama	2004-08-06 03:45:45 +00:00
Andre Oppermann	24a098ea9b	o Move the inflight sysctls to their own sub-tree under net.inet.tcp to be more consistent with the other sysctls around it.	2004-08-03 13:54:11 +00:00
Colin Percival	56f21b9d74	Rename suser_cred()'s PRISON_ROOT flag to SUSER_ALLOWJAIL. This is somewhat clearer, but more importantly allows for a consistent naming scheme for suser_cred flags. The old name is still defined, but will be removed in a few days (unless I hear any complaints...) Discussed with: rwatson, scottl Requested by: jhb	2004-07-26 07:24:04 +00:00
Jayanth Vijayaraghavan	04f0d9a0ea	Let IN_FASTREOCOVERY macro decide if we are in recovery mode. Nuke sackhole_limit for now. We need to add it back to limit the total number of sack blocks in the system.	2004-07-19 22:37:33 +00:00
Paul Saab	76947e3222	Move the sack sysctl's under net.inet.tcp.sack net.inet.tcp.do_sack -> net.inet.tcp.sack.enable net.inet.tcp.sackhole_limit -> net.inet.tcp.sack.sackhole_limit Requested by: wollman	2004-06-23 21:34:07 +00:00
Paul Saab	6d90faf3d8	Add support for TCP Selective Acknowledgements. The work for this originated on RELENG_4 and was ported to -CURRENT. The scoreboarding code was obtained from OpenBSD, and many of the remaining changes were inspired by OpenBSD, but not taken directly from there. You can enable/disable sack using net.inet.tcp.do_sack. You can also limit the number of sack holes that all senders can have in the scoreboard with net.inet.tcp.sackhole_limit. Reviewed by: gnn Obtained from: Yahoo! (Mohan Srinivasan, Jayanth Vijayaraghavan)	2004-06-23 21:04:37 +00:00
Robert Watson	d330008e3b	If debug.mpsafenet is set, initialize TCP callouts as CALLOUT_MPSAFE.	2004-06-20 21:44:50 +00:00
Robert Watson	395a08c904	Extend coverage of SOCK_LOCK(so) to include so_count, the socket reference count: - Assert SOCK_LOCK(so) macros that directly manipulate so_count: soref(), sorele(). - Assert SOCK_LOCK(so) in macros/functions that rely on the state of so_count: sofree(), sotryfree(). - Acquire SOCK_LOCK(so) before calling these functions or macros in various contexts in the stack, both at the socket and protocol layers. - In some cases, perform soisdisconnected() before sotryfree(), as this could result in frobbing of a non-present socket if sotryfree() actually frees the socket. - Note that sofree()/sotryfree() will release the socket lock even if they don't free the socket. Submitted by: sam Sponsored by: FreeBSD Foundation Obtained from: BSD/OS	2004-06-12 20:47:32 +00:00
Robert Watson	c18b97c630	Switch to using the inpcb MAC label instead of socket MAC label when labeling new mbufs created from sockets/inpcbs in IPv4. This helps avoid the need for socket layer locking in the lower level network paths where inpcb locks are already frequently held where needed. In particular: - Use the inpcb for label instead of socket in raw_append(). - Use the inpcb for label instead of socket in tcp_output(). - Use the inpcb for label instead of socket in tcp_respond(). - Use the inpcb for label instead of socket in tcp_twrespond(). - Use the inpcb for label instead of socket in syncache_respond(). While here, modify tcp_respond() to avoid assigning NULL to a stack variable and centralize assertions about the inpcb when inp is assigned. Obtained from: TrustedBSD Project Sponsored by: DARPA, McAfee Research	2004-05-04 02:11:47 +00:00
Mike Silbersack	c1537ef063	Enhance our RFC1948 implementation to perform better in some pathlogical TIME_WAIT recycling cases I was able to generate with http testing tools. In short, as the old algorithm relied on ticks to create the time offset component of an ISN, two connections with the exact same host, port pair that were generated between timer ticks would have the exact same sequence number. As a result, the second connection would fail to pass the TIME_WAIT check on the server side, and the SYN would never be acknowledged. I've "fixed" this by adding random positive increments to the time component between clock ticks so that ISNs will always be increasing, no matter how quickly the port is recycled. Except in such contrived benchmarking situations, this problem should never come up in normal usage... until networks get faster. No MFC planned, 4.x is missing other optimizations that are needed to even create the situation in which such quick port recycling will occur.	2004-04-20 06:33:39 +00:00
Warner Losh	f36cfd49ad	Remove advertising clause from University of California Regent's license, per letter dated July 22, 1999 and email from Peter Wemm, Alan Cox and Robert Watson. Approved by: core, peter, alc, rwatson	2004-04-07 20:46:16 +00:00
Robert Watson	47f32f6fa6	Two missed in previous commit -- compare pointer with NULL rather than using it as a boolean.	2004-04-05 00:52:05 +00:00
Robert Watson	24459934e9	Prefer NULL to 0 when checking pointer values as integers or booleans.	2004-04-05 00:49:07 +00:00
Robert Watson	a7b6a14aee	Remove now unneeded arguments to tcp_twrespond() -- so and msrc. These were needed by the MAC Framework until inpcbs gained labels. Submitted by: sam	2004-02-28 15:12:20 +00:00
Don Lewis	47934cef8f	Split the mlock() kernel code into two parts, mlock(), which unpacks the syscall arguments and does the suser() permission check, and kern_mlock(), which does the resource limit checking and calls vm_map_wire(). Split munlock() in a similar way. Enable the RLIMIT_MEMLOCK checking code in kern_mlock(). Replace calls to vslock() and vsunlock() in the sysctl code with calls to kern_mlock() and kern_munlock() so that the sysctl code will obey the wired memory limits. Nuke the vslock() and vsunlock() implementations, which are no longer used. Add a member to struct sysctl_req to track the amount of memory that is wired to handle the request. Modify sysctl_wire_old_buffer() to return an error if its call to kern_mlock() fails. Only wire the minimum of the length specified in the sysctl request and the length specified in its argument list. It is recommended that sysctl handlers that use sysctl_wire_old_buffer() should specify reasonable estimates for the amount of data they want to return so that only the minimum amount of memory is wired no matter what length has been specified by the request. Modify the callers of sysctl_wire_old_buffer() to look for the error return. Modify sysctl_old_user to obey the wired buffer length and clean up its implementation. Reviewed by: bms	2004-02-26 00:27:04 +00:00
Andre Oppermann	12e2e97051	Convert the tcp segment reassembly queue to UMA and limit the maximum amount of segments it will hold. The following tuneables and sysctls control the behaviour of the tcp segment reassembly queue: net.inet.tcp.reass.maxsegments (loader tuneable) specifies the maximum number of segments all tcp reassemly queues can hold (defaults to 1/16 of nmbclusters). net.inet.tcp.reass.maxqlen specifies the maximum number of segments any individual tcp session queue can hold (defaults to 48). net.inet.tcp.reass.cursegments (readonly) counts the number of segments currently in all reassembly queues. net.inet.tcp.reass.overflows (readonly) counts how often either the global or local queue limit has been reached. Tested by: bms, silby Reviewed by: bms, silby	2004-02-24 15:27:41 +00:00
Pawel Jakub Dawidek	41fe0c8ad5	Fixed ucred structure leak. Approved by: scottl (mentor) PR: 54163 MFC after: 3 days	2004-02-19 14:13:21 +00:00
Bruce M Simpson	32ff046639	Final brucification pass. Spell types consistently (u_int). Remove bogus casts. Remove unnecessary parenthesis. Submitted by: bde	2004-02-14 21:49:48 +00:00
Bruce M Simpson	265ed01285	Brucification. Submitted by: bde	2004-02-13 18:21:45 +00:00
Hajimu UMEMOTO	efddf5c64d	supported IPV6_RECVPATHMTU socket option. Obtained from: KAME	2004-02-13 14:50:01 +00:00
Bruce M Simpson	b30190b542	Update the prototype for tcpsignature_apply() to reflect the spelling of the types used by m_apply()'s callback function, f, as documented in mbuf(9). Noticed by: njl	2004-02-12 20:16:09 +00:00
Bruce M Simpson	bca0e5bfc3	style(9) pass; whitespace and comments. Submitted by: njl	2004-02-12 20:12:48 +00:00
Bruce M Simpson	1cfd4b5326	Initial import of RFC 2385 (TCP-MD5) digest support. This is the first of two commits; bringing in the kernel support first. This can be enabled by compiling a kernel with options TCP_SIGNATURE and FAST_IPSEC. For the uninitiated, this is a TCP option which provides for a means of authenticating TCP sessions which came into being before IPSEC. It is still relevant today, however, as it is used by many commercial router vendors, particularly with BGP, and as such has become a requirement for interconnect at many major Internet points of presence. Several parts of the TCP and IP headers, including the segment payload, are digested with MD5, including a shared secret. The PF_KEY interface is used to manage the secrets using security associations in the SADB. There is a limitation here in that as there is no way to map a TCP flow per-port back to an SPI without polluting tcpcb or using the SPD; the code to do the latter is unstable at this time. Therefore this code only supports per-host keying granularity. Whilst FAST_IPSEC is mutually exclusive with KAME IPSEC (and thus IPv6), TCP_SIGNATURE applies only to IPv4. For the vast majority of prospective users of this feature, this will not pose any problem. This implementation is output-only; that is, the option is honoured when responding to a host initiating a TCP session, but no effort is made [yet] to authenticate inbound traffic. This is, however, sufficient to interwork with Cisco equipment. Tested with a Cisco 2501 running IOS 12.0(27), and Quagga 0.96.4 with local patches. Patches for tcpdump to validate TCP-MD5 sessions are also available from me upon request. Sponsored by: sentex.net	2004-02-11 04:26:04 +00:00
Andre Oppermann	53369ac9bb	Limiters and sanity checks for TCP MSS (maximum segement size) resource exhaustion attacks. For network link optimization TCP can adjust its MSS and thus packet size according to the observed path MTU. This is done dynamically based on feedback from the remote host and network components along the packet path. This information can be abused to pretend an extremely low path MTU. The resource exhaustion works in two ways: o during tcp connection setup the advertized local MSS is exchanged between the endpoints. The remote endpoint can set this arbitrarily low (except for a minimum MTU of 64 octets enforced in the BSD code). When the local host is sending data it is forced to send many small IP packets instead of a large one. For example instead of the normal TCP payload size of 1448 it forces TCP payload size of 12 (MTU 64) and thus we have a 120 times increase in workload and packets. On fast links this quickly saturates the local CPU and may also hit pps processing limites of network components along the path. This type of attack is particularly effective for servers where the attacker can download large files (WWW and FTP). We mitigate it by enforcing a minimum MTU settable by sysctl net.inet.tcp.minmss defaulting to 256 octets. o the local host is reveiving data on a TCP connection from the remote host. The local host has no control over the packet size the remote host is sending. The remote host may chose to do what is described in the first attack and send the data in packets with an TCP payload of at least one byte. For each packet the tcp_input() function will be entered, the packet is processed and a sowakeup() is signalled to the connected process. For example an attack with 2 Mbit/s gives 4716 packets per second and the same amount of sowakeup()s to the process (and context switches). This type of attack is particularly effective for servers where the attacker can upload large amounts of data. Normally this is the case with WWW server where large POSTs can be made. We mitigate this by calculating the average MSS payload per second. If it goes below 'net.inet.tcp.minmss' and the pps rate is above 'net.inet.tcp.minmssoverload' defaulting to 1000 this particular TCP connection is resetted and dropped. MITRE CVE: CAN-2004-0002 Reviewed by: sam (mentor) MFC after: 1 day	2004-01-08 17:40:07 +00:00
Andre Oppermann	bf87c82ebb	If path mtu discovery is enabled set the DF bit in all cases we send packets on a tcp connection. PR: kern/60889 Tested by: Richard Wendland <richard@wendland.org.uk> Approved by: re (scottl)	2004-01-08 11:17:11 +00:00
Andre Oppermann	dba7bc6a65	Enable the following TCP options by default to give it more exposure: rfc3042 Limited retransmit rfc3390 Increasing TCP's initial congestion Window inflight TCP inflight bandwidth limiting All my production server have it enabled and there have been no issues. I am confident about having them on by default and it gives us better overall TCP performance. Reviewed by: sam (mentor)	2004-01-06 23:29:46 +00:00
John Baldwin	a5b061f9d2	Fix some becuase -> because typos. Reported by: Marco Wertejuk <wertejuk@mwcis.com>	2003-12-17 16:12:01 +00:00
Robert Watson	2d92ec9858	Switch TCP over to using the inpcb label when responding in timed wait, rather than the socket label. This avoids reaching up to the socket layer during connection close, which requires locking changes. To do this, introduce MAC Framework entry point mac_create_mbuf_from_inpcb(), which is called from tcp_twrespond() instead of calling mac_create_mbuf_from_socket() or mac_create_mbuf_netlayer(). Introduce MAC Policy entry point mpo_create_mbuf_from_inpcb(), and implementations for various policies, which generally just copy label data from the inpcb to the mbuf. Assert the inpcb lock in the entry point since we require consistency for the inpcb label reference. Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories	2003-12-17 14:55:11 +00:00
Andre Oppermann	0cfbbe3bde	Make sure all uses of stack allocated struct route's are properly zeroed. Doing a bzero on the entire struct route is not more expensive than assigning NULL to ro.ro_rt and bzero of ro.ro_dst. Reviewed by: sam (mentor) Approved by: re (scottl)	2003-11-26 20:31:13 +00:00
Andre Oppermann	97d8d152c2	Introduce tcp_hostcache and remove the tcp specific metrics from the routing table. Move all usage and references in the tcp stack from the routing table metrics to the tcp hostcache. It caches measured parameters of past tcp sessions to provide better initial start values for following connections from or to the same source or destination. Depending on the network parameters to/from the remote host this can lead to significant speedups for new tcp connections after the first one because they inherit and shortcut the learning curve. tcp_hostcache is designed for multiple concurrent access in SMP environments with high contention and is hash indexed by remote ip address. It removes significant locking requirements from the tcp stack with regard to the routing table. Reviewed by: sam (mentor), bms Reviewed by: -net, -current, core@kame.net (IPv6 parts) Approved by: re (scottl)	2003-11-20 20:07:39 +00:00
Sam Leffler	c29afad673	o correct locking problem: the inpcb must be held across tcp_respond o add assertions in tcp_respond to validate inpcb locking assumptions o use local variable instead of chasing pointers in tcp_respond Supported by: FreeBSD Foundation	2003-11-08 22:59:22 +00:00
Mike Silbersack	4bd4fa3fe6	Add an additional check to the tcp_twrecycleable function; I had previously only considered the send sequence space. Unfortunately, some OSes (windows) still use a random positive increments scheme for their syn-ack ISNs, so I must consider receive sequence space as well. The value of 250000 bytes / second for Microsoft's ISN rate of increase was determined by testing with an XP machine.	2003-11-02 07:47:03 +00:00
Mike Silbersack	96af9ea52b	- Add a new function tcp_twrecycleable, which tells us if the ISN which we will generate for a given ip/port tuple has advanced far enough for the time_wait socket in question to be safely recycled. - Have in_pcblookup_local use tcp_twrecycleable to determine if time_Wait sockets which are hogging local ports can be safely freed. This change preserves proper TIME_WAIT behavior under normal circumstances while allowing for safe and fast recycling whenever ephemeral port space is scarce.	2003-11-01 07:30:08 +00:00
Mike Silbersack	0709c23335	Reduce the number of tcp time_wait structs to maxsockets / 5; this ensures that at most 20% of sockets can be in time_wait at one time, ensuring that time_wait sockets do not starve real connections from inpcb structures. No implementation change is needed, jlemon already implemented a nice LRU-ish algorithm for tcp_tw structure recycling. This should reduce the need for sysadmins to lower the default msl on busy servers.	2003-10-24 05:44:14 +00:00
Mike Silbersack	184dcdc7c8	Change all SYSCTLS which are readonly and have a related TUNABLE from CTLFLAG_RD to CTLFLAG_RDTUN so that sysctl(8) can provide more useful error messages.	2003-10-21 18:28:36 +00:00

1 2 3 4 5 ...

310 Commits