freebsd-dev

Author	SHA1	Message	Date
Andre Oppermann	6489fe6553	Match up SYSCTL declaration style.	2007-03-19 19:00:51 +00:00
Robert Watson	8d0d6d112f	Remove unused and #if 0'd net.inet.tcp.tcp_rttdflt sysctl.	2007-03-16 13:42:26 +00:00
Mohan Srinivasan	7c72af8770	Reap FIN_WAIT_2 connections marked SOCANTRCVMORE faster. This mitigates potential issues where the peer does not close, potentially leaving thousands of connections in FIN_WAIT_2. This is controlled by a new sysctl fast_finwait2_recycle, which is disabled by default. Reviewed by: gnn, silby.	2007-02-26 22:25:21 +00:00
John Baldwin	54e3607de6	Whitespace fix and remove an extra cast.	2006-12-30 17:53:28 +00:00
Robert Watson	acd3428b7d	Sweep kernel replacing suser(9) calls with priv(9) calls, assigning specific privilege names to a broad range of privileges. These may require some future tweaking. Sponsored by: nCircle Network Security, Inc. Obtained from: TrustedBSD Project Discussed on: arch@ Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri, Alex Lyashkov <umka at sevcity dot net>, Skip Ford <skip dot ford at verizon dot net>, Antoine Brodin <antoine dot brodin at laposte dot net>	2006-11-06 13:42:10 +00:00
Robert Watson	aed5570872	Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now contains the userspace and user<->kernel API and definitions, with all in-kernel interfaces moved to mac_framework.h, which is now included across most of the kernel instead. This change is the first step in a larger cleanup and sweep of MAC Framework interfaces in the kernel, and will not be MFC'd. Obtained from: TrustedBSD Project Sponsored by: SPARTA	2006-10-22 11:52:19 +00:00
Maxim Konovalov	acc03ac6bb	o Convert w/spaces to tabs in the previous commit.	2006-09-29 06:46:31 +00:00
Mike Silbersack	d4bdcb16cc	Rather than autoscaling the number of TIME_WAIT sockets to maxsockets / 5, scale it to min(ephemeral port range / 2, maxsockets / 5) so that people with large gobs of memory and/or large maxsockets settings will not exhaust their entire ephemeral port range with sockets in the TIME_WAIT state during periods of heavy load. Those who wish to tweak the size of the TIME_WAIT zone can still do so with net.inet.tcp.maxtcptw. Reviewed by: glebius, ru	2006-09-29 06:24:26 +00:00
Gleb Smirnoff	3e630ef9a9	Add a sysctl net.inet.tcp.nolocaltimewait that allows suppressing the creation of compressed TIME_WAIT states if both connection endpoints are local. Default is off.	2006-09-08 13:09:15 +00:00
Ruslan Ermilov	751dea2935	Back when we had T/TCP support, we used to apply different timeouts for TCP and T/TCP connections in the TIME_WAIT state, and we had two separate timed wait queues for them. Now that it has gone, the timeout is always 2*MSL again, and there is no reason to keep two queues (the first was unused anyway!). Also, reimplement the remaining queue using a TAILQ (it was technically impossible before, with two queues).	2006-09-07 13:06:00 +00:00
Andre Oppermann	233dcce118	First step of TSO (TCP segmentation offload) support in our network stack. o add IFCAP_TSO[46] for drivers to announce this capability for IPv4 and IPv6 o add CSUM_TSO flag to mbuf pkthdr csum_flags field o add tso_segsz field to mbuf pkthdr o enhance ip_output() packet length check to allow for large TSO packets o extend tcp_maxmtu[46]() with a flag pointer to pass interface capabilities o adjust all callers of tcp_maxmtu[46]() accordingly Discussed on: -current, -net Sponsored by: TCP/IP Optimization Fundraise 2005	2006-09-06 21:51:59 +00:00
Gleb Smirnoff	2c857a9be9	o Backout rev. 1.125 of in_pcb.c. It appeared to behave extremely badly under high load. For example, with 40k sockets and 25k tcptw entries, the connect() syscall can run for seconds. Debugging showed that it iterates the cycle millions of times and purges thousands of tcptw entries at a time. Besides being unusable in practice, this change is architecturally wrong. First, in_pcblookup_local() is used in the connect() and bind() syscalls. Purging of stale entries shouldn't be done here. Second, it is a layering violation. o Return the tcptw purging cycle to tcp_timer_2msl_tw(), that was removed in rev. 1.78 by rwatson. The commit log of that revision says nothing about the reason the cycle was removed. Now we need this cycle, since the major cleaner of stale tcptw structures is removed. o Disable the probably necessary, but now unused, tcp_twrecycleable() function. Reviewed by: ru	2006-09-06 13:56:35 +00:00
Gleb Smirnoff	c3e07bf82a	Finally fix rev. 1.256 Pointy hat to: glebius	2006-09-05 14:00:59 +00:00
Gleb Smirnoff	23ebab416c	Remove extra parenthesis in last commit. Nitpicked by: ru	2006-09-05 12:22:54 +00:00
Gleb Smirnoff	1f1f90c3a7	- Make net.inet.tcp.maxtcptw modifiable at run time. - If net.inet.tcp.maxtcptw was ever set explicitly, do not change it if kern.ipc.maxsockets is changed.	2006-09-05 12:08:47 +00:00
Mohan Srinivasan	2374501ca4	Fix for a bug that causes the computation of "len" in tcp_output() to get messed up, resulting in an inconsistency between the TCP state and so_snd.	2006-08-26 17:53:19 +00:00
Mohan Srinivasan	464469c713	Fixes an edge case bug in timewait handling where ticks rolling over causing the timewait expiry to be exactly 0 corrupts the timewait queues (and that entry). Reviewed by: silby	2006-08-11 21:15:23 +00:00
Robert Watson	e850475248	Move soisdisconnected() in tcp_discardcb() to one of its calling contexts, tcp_twstart(), but not to the other, tcp_detach(), as the socket is already being torn down and therefore there are no listeners. This avoids a panic if kqueue state is registered on the socket at close(), and eliminates two XXX comments. There is one case remaining in which tcp_discardcb() reaches up to the socket layer as part of the TCP host cache, which would be good to avoid. Reported by: Goran Gajic <ggajic at afrodita dot rcub dot bg dot ac dot yu>	2006-08-02 16:18:05 +00:00
Robert Watson	a152f8a361	Change semantics of socket close and detach. Add a new protocol switch function, pru_close, to notify protocols that the file descriptor or other consumer of a socket is closing the socket. pru_abort is now a notification of close also, and no longer detaches. pru_detach is no longer used to notify of close, and will be called during socket tear-down by sofree() when all references to a socket evaporate after an earlier call to abort or close the socket. This means detach is now an unconditional teardown of a socket, whereas previously sockets could persist after detach if the protocol retained a reference. This facilitates sharing mutexes between layers of the network stack as the mutex is required during the checking and removal of references at the head of sofree(). With this change, pru_detach can now assume that the mutex will no longer be required by the socket layer after completion, whereas before this was not necessarily true. Reviewed by: gnn	2006-07-21 17:11:15 +00:00
Stephan Uphoff	d915b28015	Fix race conditions on enumerating pcb lists by moving the initialization (and where appropriate the destruction) of the pcb mutex to the init/finit functions of the pcb zones. This allows locking of the pcb entries and race condition free comparison of the generation count. Rearrange locking a bit to avoid extra locking operation to update the generation count in in_pcballoc(). (in_pcballoc now returns the pcb locked) I am planning to convert pcb list handling from a type safe to a reference count model soon. (As this allows really freeing the PCBs) Reviewed by: rwatson@, mohans@ MFC after: 1 week	2006-07-18 22:34:27 +00:00
Robert Watson	10702a2840	Abstract inpcb drop logic, previously just setting of INP_DROPPED in TCP, into in_pcbdrop(). Expand logic to detach the inpcb from its bound address/port so that dropping a TCP connection releases the inpcb resource reservation, which since the introduction of socket/pcb reference count updates, has been persisting until the socket closed rather than being released implicitly due to prior freeing of the inpcb on TCP drop. MFC after: 3 months	2006-04-25 11:17:35 +00:00
Robert Watson	9106a6d6b0	Replace isn_mtx direct use with ISN_*() lock macros so that locking details/strategy can be changed without touching every use. MFC after: 3 months	2006-04-23 12:27:42 +00:00
Robert Watson	4c0e8f41f6	Introduce a new TCP mutex, isn_mtx, which protects the initial sequence number state, rather than re-using pcbinfo. This introduces some additional mutex operations during isn query, but avoids hitting the TCP pcbinfo lock out of yet another frequently firing TCP timer. MFC after: 3 months	2006-04-22 19:23:24 +00:00
Paul Saab	4f590175b7	Allow for nmbclusters and maxsockets to be increased via sysctl. An eventhandler is used to update all the various zones that depend on these values.	2006-04-21 09:25:40 +00:00
Gleb Smirnoff	a73b656763	Add a tunable net.inet.tcp.maxtcptw, that allows setting a limit on the tcptw zone independently from setting a limit on the socket zone.	2006-04-04 14:31:37 +00:00
Robert Watson	ae0e714308	Before dereferencing intotw() when INP_TIMEWAIT, check for inp_ppcb being NULL. We currently do allow this to happen, but may want to remove that possibility in the future. This case can occur when a socket is left open after TCP wraps up, and the timewait state is recycled. This will be cleaned up in the future. Found by: Kazuaki Oda <kaakun at highway dot ne dot jp> MFC after: 3 months	2006-04-04 12:26:07 +00:00
Robert Watson	cb895fb9b0	In TCP notify routines, check inpcb for INP_TIMEWAIT and INP_DROPPED. The INP_DROPPED check replaces the current NULL checks; the INP_TIMEWAIT checks appear to have always been required, but not been there, which is/was a bug. This avoids unconditionally casting of in_ppcb to a tcpcb, when it may be a twtcb, which may have resulted in obscure ICMP-related panics in earlier releases. MFC after: 3 months	2006-04-03 14:07:50 +00:00
Robert Watson	afa39e25c4	Change inp_ppcb from caddr_t to void *, fix/remove associated related casts. Consistently use intotw() to cast inp_ppcb pointers to struct tcptw * pointers. Consistently use intotcpcb() to cast inp_ppcb pointers to struct tcpcb * pointers. Don't assign tp to the result of intotcpcb() during variable declaration at the top of functions, as that is before the asserts relating to locking have been performed. Do this later in the function after appropriate assertions have run to allow that operation to be considered safe. MFC after: 3 months	2006-04-03 13:33:55 +00:00
Robert Watson	43f56a32a0	Style tweaks: convert to ANSI from K&R function prototypes. MFC after: 3 months	2006-04-03 12:59:27 +00:00
Robert Watson	2fc5ae87d0	Update comment on tcp_close() for new world order. MFC after: 3 months	2006-04-03 12:52:13 +00:00
Robert Watson	fa38deac65	Fix up locking surrounding tcp_drop sysctl: in the new world order, we don't free inpcbs until after the socket is closed, so we always need to unlock an inpcb after calling tcp_drop() on it. MFC after: 3 months	2006-04-03 11:57:12 +00:00
Robert Watson	34af7bae80	Properly handle an edge case previously not handled correctly: a socket can have a tcp connection that has entered time wait attached to it, in the event that shutdown() is called on the socket and the FINs properly exchange before close(). In this case we don't detach or free the inpcb, just leave the tcptw detached and freed, but we must release the inpcb lock (which we didn't previously). MFC after: 3 months	2006-04-01 23:53:25 +00:00
Robert Watson	623dce13c6	Update TCP for infrastructural changes to the socket/pcb refcount model, pru_abort(), pru_detach(), and in_pcbdetach(): - Universally support and enforce the invariant that so_pcb is never NULL, converting dozens of unnecessary NULL checks into assertions, and eliminating dozens of unnecessary error handling cases in protocol code. - In some cases, eliminate unnecessary pcbinfo locking, as it is no longer required to ensure so_pcb != NULL. For example, the receive code no longer requires the pcbinfo lock, and the send code only requires it if building a new connection on an otherwise unconnected socket triggered via sendto() with an address. This should significantly reduce tcbinfo lock contention in the receive and send cases. - In order to support the invariant that so_pcb != NULL, it is now necessary for the TCP code to not discard the tcpcb any time a connection is dropped, but instead leave the tcpcb until the socket is shut down. This case is handled by setting INP_DROPPED, to substitute for using a NULL so_pcb to indicate that the connection has been dropped. This requires the inpcb lock, but not the pcbinfo lock. - Unlike all other protocols in the tree, TCP may need to retain access to the socket after the file descriptor has been closed. Set SS_PROTOREF in tcp_detach() in order to prevent the socket from being freed, and add a flag, INP_SOCKREF, so that the TCP code knows whether or not it needs to free the socket when the connection finally does close. The typical case where this occurs is if close() is called on a TCP socket before all sent data in the send socket buffer has been transmitted or acknowledged. If INP_SOCKREF is found when the connection is dropped, we release the inpcb, tcpcb, and socket instead of flagging INP_DROPPED. - Abort and detach protocol switch methods no longer return failures, nor attempt to free sockets, as the socket layer does this. 
- Annotate the existence of a long-standing race in the TCP timer code, in which timers are stopped but not drained when the socket is freed, as waiting for drain may lead to deadlocks, or have to occur in a context where waiting is not permitted. This race has been handled by testing to see if the tcpcb pointer in the inpcb is NULL (and vice versa), which is not normally permitted, but may be true if an inpcb and tcpcb have been freed. Add a counter to test how often this race has actually occurred, and a large comment for each instance where we compare potentially freed memory with NULL. This will have to be fixed in the near future, but requires us to further address how to handle the timer shutdown issue. - Several TCP calls no longer potentially free the passed inpcb/tcpcb, so no longer need to return a pointer to indicate whether the argument passed in is still valid. - Un-macroize debugging and locking setup for various protocol switch methods for TCP, as it led to more obscurity, and as locking becomes more customized to the methods, offers less benefit. - Assert copyright on tcp_usrreq.c due to significant modifications that have been made as part of this work. These changes significantly modify the memory management and connection logic of our TCP implementation, and are (as such) High Risk Changes, and likely to contain serious bugs. Please report problems to the current@ mailing list ASAP, ideally with simple test cases, and optionally, packet traces. MFC after: 3 months	2006-04-01 16:36:36 +00:00
Andre Oppermann	eaf80179e2	Have TCP Inflight disable itself if the RTT is below a certain threshold. Inflight doesn't make sense on a LAN as it has trouble figuring out the maximal bandwidth because of the coarse tick granularity. The sysctl net.inet.tcp.inflight.rttthresh specifies the threshold in milliseconds below which inflight will disengage. It defaults to 10ms. Tested by: Joao Barros <joao.barros-at-gmail.com>, Rich Murphey <rich-at-whiteoaklabs.com> Sponsored by: TCP/IP Optimization Fundraise 2005	2006-02-16 19:38:07 +00:00
Andre Oppermann	34333b16cd	Retire MT_HEADER mbuf type and change its users to use MT_DATA. Having an additional MT_HEADER mbuf type is superfluous and redundant as nothing depends on it. It only adds a layer of confusion. The distinction between header mbuf's and data mbuf's is solely done through the m->m_flags M_PKTHDR flag. Non-native code is not changed in this commit. For compatibility MT_HEADER is mapped to MT_DATA. Sponsored by: TCP/IP Optimization Fundraise 2005	2005-11-02 13:46:32 +00:00
Philip Paeps	7691747aac	Unbreak the net.inet6.tcp6.getcred sysctl. This makes inetd/auth work again in IPv6 setups. Pointy hat to: ume/KAME	2005-10-12 09:24:18 +00:00
Maxim Konovalov	ac827533df	o Teach sysctl_drop() how to deal with the sockets in TIME_WAIT state. This is a special case because tcp_twstart() destroys a tcp control block via tcp_discardcb() so we cannot call tcp_drop(struct *tcpcb) on such connections. Use tcp_twclose() instead. MFC after: 5 days	2005-10-02 08:43:57 +00:00
Andre Oppermann	ffabe3dce8	In tcp_ctlinput() do not swap ip->ip_len a second time. It has been done in icmp_input() already. This fixes the ICMP_UNREACH_NEEDFRAG case where no MTU was proposed in the ICMP reply. PR: kern/81813 Submitted by: Vitezslav Novy <vita at fio.cz> MFC after: 3 days	2005-09-10 07:43:29 +00:00
Andre Oppermann	e0aec68255	Use the correct mbuf type for MGET().	2005-08-30 16:35:27 +00:00
Hajimu UMEMOTO	4dad226e45	Recover the line that wrongly disappeared during scope cleanup. tcpdrop(8) should work for IPv6 again.	2005-08-01 12:08:49 +00:00
Hajimu UMEMOTO	a1f7e5f8ee	Scope cleanup. With this change: - most of the kernel code will not care about the actual encoding of scope zone IDs and won't touch "s6_addr16[1]" directly. - similarly, most of the kernel code will not care about link-local scoped addresses as a special case. - scope boundary check will be stricter. For example, the current BSD code allows a packet with src=::1 and dst=(some global IPv6 address) to be sent outside of the node, if the application does: s = socket(AF_INET6); bind(s, "::1"); sendto(s, some_global_IPv6_addr); This is clearly wrong, since ::1 is only meaningful within a single node, but the current implementation of the BSD kernel cannot reject this attempt. Submitted by: JINMEI Tatuya <jinmei__at__isl.rdc.toshiba.co.jp> Obtained from: KAME	2005-07-25 12:31:43 +00:00
Robert Watson	f59a9ebf10	Remove no-op spl's and most comment references to spls, as TCP locking is believed to be basically done (modulo any remaining bugs). MFC after: 3 days	2005-07-19 12:21:26 +00:00
Paul Saab	482ac96888	Fix for a bug in the change that defers sack option processing until after PAWS checks. The symptom of this is an inconsistency in the cached sack state, caused by the fact that the sack scoreboard was not being updated for an ACK handled in the header prediction path. Found by: Andrey Chernov. Submitted by: Noritoshi Demizu, Raja Mukerji. Approved by: re	2005-07-01 22:54:18 +00:00
Robert Watson	e3d5315d01	Assert tcbinfo lock in tcp_drop() due to its call of tcp_close() Assert tcbinfo lock in tcp_close() due to its call to in{,6}_detach() Assert tcbinfo lock in tcp_drop_syn_sent() due to its call to tcp_drop() MFC after: 7 days	2005-06-01 12:06:07 +00:00
Colin Percival	fe2eee8231	Fix two issues which were missed in FreeBSD-SA-05:08.kmem. Reported by: Uwe Doering	2005-05-07 00:41:36 +00:00
Andre Oppermann	9e4ca6315d	If we don't get a suggested MTU during path MTU discovery look up the packet size of the packet that generated the response, step down the MTU by one step through ip_next_mtu() and try again. Suggested by: dwmalone	2005-05-04 13:48:44 +00:00
Paul Saab	a6235da61e	- Make the sack scoreboard logic use the TAILQ macros. This improves code readability and facilitates some anticipated optimizations in tcp_sack_option(). - Remove tcp_print_holes() and TCP_SACK_DEBUG. Submitted by: Raja Mukerji. Reviewed by: Mohan Srinivasan, Noritoshi Demizu.	2005-04-21 20:11:01 +00:00
Andre Oppermann	1aedbd9c80	Move Path MTU discovery ICMP processing from icmp_input() to tcp_ctlinput() and subject it to active tcpcb and sequence number checking. Previously any ICMP unreachable/needfrag message would cause an update to the TCP hostcache. Now only ICMP PMTU messages belonging to an active TCP session with the correct src/dst/port and sequence number will update the hostcache and complete the path MTU discovery process. Note that we don't entirely implement the recommended countermeasures of Section 7.2 of the paper. However we close down the possible degradation vector from trivially easy to really complex and resource intensive. In addition we have limited the smallest acceptable MTU with net.inet.tcp.minmss sysctl for some time already, further reducing the effect of any degradation due to an attack. Security: draft-gont-tcpm-icmp-attacks-03.txt Section 7.2 MFC after: 3 days	2005-04-21 14:29:34 +00:00
Andre Oppermann	1600372b6b	Ignore ICMP Source Quench messages for TCP sessions. Source Quench is ineffective, deprecated and can be abused to degrade the performance of active TCP sessions if spoofed. Replace a bogus call to tcp_quench() in tcp_output() with the direct equivalent tcpcb variable assignment. Security: draft-gont-tcpm-icmp-attacks-03.txt Section 7.1 MFC after: 3 days	2005-04-21 12:37:12 +00:00
Paul Saab	e346eeff65	- If the reassembly queue limit was reached or if we couldn't allocate a reassembly queue state structure, don't update (receiver) sack report. - Similarly, if tcp_drain() is called, freeing up all items on the reassembly queue, clean the sack report. Found, Submitted by: Noritoshi Demizu <demizu at dd dot iij4u dot or dot jp> Reviewed by: Mohan Srinivasan (mohans at yahoo-inc dot com), Raja Mukerji (raja at moselle dot com).	2005-04-10 05:21:29 +00:00
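
The TIME_WAIT sizing rule quoted in commit d4bdcb16cc, min(ephemeral port range / 2, maxsockets / 5), together with the explicit-override behaviour described in commit 1f1f90c3a7, can be sketched roughly as follows. This is an illustration only, not the kernel's code: the function names tw_autosize() and tw_limit(), the use of 0 to mean "never set explicitly", and the choice to measure the port range as last minus first are all assumptions made for the example.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel function): automatic size for the
 * TIME_WAIT zone, following the rule stated in commit d4bdcb16cc:
 *     min(ephemeral port range / 2, maxsockets / 5)
 */
static uint32_t
tw_autosize(uint32_t port_first, uint32_t port_last, uint32_t maxsockets)
{
	/* Accept the range bounds in either order (an assumption). */
	uint32_t range = port_last > port_first ?
	    port_last - port_first : port_first - port_last;
	uint32_t half_range = range / 2;
	uint32_t sock_share = maxsockets / 5;

	return (half_range < sock_share ? half_range : sock_share);
}

/*
 * Hypothetical helper: a limit that was set explicitly (nonzero here, by
 * assumption) is kept as-is; otherwise the autoscaled value applies.
 */
static uint32_t
tw_limit(uint32_t explicit_limit, uint32_t port_first, uint32_t port_last,
    uint32_t maxsockets)
{
	if (explicit_limit != 0)
		return (explicit_limit);
	return (tw_autosize(port_first, port_last, maxsockets));
}
```

With a port range of 49152 to 65535 and maxsockets of 100000, the port-range term (8191) is the smaller one and caps the zone; with maxsockets of 10000, the maxsockets / 5 term (2000) applies instead.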

268 Commits