freebsd-nq

Author	SHA1	Message	Date
Robert Watson	74d4630b71	Remove an errant blank line apparently introduced in ip_output.c:1.194.	2004-12-25 22:59:42 +00:00
Robert Watson	42cf3289c3	In the dropafterack case of tcp_input(), it's OK to release the TCP pcbinfo lock before calling tcp_output(), as holding just the inpcb lock is sufficient to prevent garbage collection.	2004-12-25 22:26:13 +00:00
Robert Watson	e0bef1cb35	Revert parts of tcp_input.c:1.255 associated with the header predicted cases for tcp_input(): While it is true that the pcbinfo lock provides a pseudo-reference to inpcbs, both the inpcb and pcbinfo locks are required to free an un-referenced inpcb. As such, we can release the pcbinfo lock as long as the inpcb remains locked with the confidence that it will not be garbage-collected. This leads to a less conservative locking strategy that should reduce contention on the TCP pcbinfo lock. Discussed with: sam	2004-12-25 22:23:13 +00:00
Robert Watson	452d9f5b1c	Attempt to consistently use () around return values in calls to return() in newer code (sysctl, ISN, timewait).	2004-12-23 01:34:26 +00:00
Robert Watson	06da46b241	Remove an XXXRW comment relating to whether or not the TCP timers are MPSAFE: they are now believed to be. Correct a typo in a second comment. MFC after: 2 weeks	2004-12-23 01:27:13 +00:00
Robert Watson	db0aae38b6	Remove the now unused tcp_canceltimers() function. tcpcb timers are now stopped as part of tcp_discardcb(). MFC after: 2 weeks	2004-12-23 01:25:59 +00:00
Robert Watson	950ab1e470	Remove an annotation of a minor race relating to the update of multiple MIB entries using sysctl in short order, which might result in unexpected values for tcp_maxidle being generated by tcp_slowtimo. In practice, this will not happen, or at least, doesn't require an explicit comment. MFC after: 2 weeks	2004-12-23 01:21:54 +00:00
Gleb Smirnoff	5e5da86597	In certain cases ip_output() can free our route, so check for its presence before RTFREE(). Noticed by: ru	2004-12-10 07:51:14 +00:00
Gleb Smirnoff	d2a09f901a	Revert last change. Andre: First let's get major new features into the kernel in a clean and nice way, and then start optimizing. In this case we don't have any obfuscation that makes later profiling and/or optimizing difficult in any way. Requested by: csjp, sam	2004-12-10 07:47:17 +00:00
Christian S.J. Peron	fbf2edb6e4	This commit adds a shared locking mechanism very similar to the mechanism used by pfil. This shared locking mechanism will remove a nasty lock order reversal which occurs when ucred based rules are used which results in hard locks while mpsafenet=1. So this removes the debug.mpsafenet=0 requirement when using ucred based rules with IPFW. It should be noted that this locking mechanism does not guarantee fairness between read and write locks, and that it will favor firewall chain readers over writers. This seemed acceptable since write operations to firewall chains protected by this lock tend to be less frequent than reads. Reviewed by: andre, rwatson Tested by: myself, seanc Silence on: ipfw@ MFC after: 1 month	2004-12-10 02:17:18 +00:00
Gleb Smirnoff	f5a19d3909	Check that DUMMYNET_LOADED before seeking dummynet m_tag. Reviewed by: andre MFC after: 1 week	2004-12-09 16:41:47 +00:00
Max Laier	067a8bab8a	More fixing of multiple addresses in the same prefix. This time do not try to arp resolve "secondary" local addresses. Found and submitted by: ru With additions from: OpenBSD (rev. 1.47) Reviewed by: ru	2004-12-09 00:12:41 +00:00
Ruslan Ermilov	5cae05ad33	Time out routes created by redirect.	2004-12-06 22:27:22 +00:00
Gleb Smirnoff	98335aa976	- Make route caching optional, configurable via IFF_LINK0 flag. - Turn it off by default. Requested by: many Reviewed by: andre Approved by: julian (mentor) MFC after: 3 days	2004-12-06 19:02:43 +00:00
Robert Watson	79a9e59c89	Assert the tcptw inpcb lock in tcp_timer_2msl_reset(), as fields in the tcptw undergo non-atomic read-modify-writes. MFC after: 2 weeks	2004-12-05 22:47:29 +00:00
Robert Watson	b9155d92b2	Assert inpcb lock in: tcpip_fillheaders() tcp_discardcb() tcp_close() tcp_notify() tcp_new_isn() tcp_xmit_bandwidth_limit() Fix a locking comment in tcp_twstart(): the pcbinfo will be locked (and is asserted). MFC after: 2 weeks	2004-12-05 22:27:53 +00:00
Robert Watson	6fbed4af22	Minor grammar fix in comment.	2004-12-05 22:20:59 +00:00
Robert Watson	89924e5865	Pass the inpcb reference into ip_getmoptions() rather than just the inp->inp_moptions pointer, so that ip_getmoptions() can perform necessary locking when doing non-atomic reads. Lock the inpcb by default to copy any data to local variables, then unlock before performing sooptcopyout(). MFC after: 2 weeks	2004-12-05 22:08:37 +00:00
Robert Watson	92c71ab30b	Define INP_UNLOCK_ASSERT() to assert that an inpcb is unlocked. MFC after: 2 weeks	2004-12-05 22:07:14 +00:00
Robert Watson	5c918b56d8	Push the inpcb argument into ip_setmoptions() when setting IP multicast socket options, so that it is available for locking.	2004-12-05 21:38:33 +00:00
Robert Watson	993d9505d4	Start working through inpcb locking for ip_ctloutput() by cleaning up modifications to the inpcb IP options mbuf: - Lock the inpcb before passing it into ip_pcbopts() in order to prevent simultaneous reads and read-modify-writes that could result in races. - Pass the inpcb reference into ip_pcbopts() instead of the option chain pointer in the inpcb. - Assert the inpcb lock in ip_pcbopts(). - Convert one or two uses of a pointer as a boolean or an integer comparison to a comparison with NULL for readability.	2004-12-05 19:11:09 +00:00
Paul Saab	7d5ed1ceea	Fixes a bug in SACK causing us to send data beyond the receive window. Found by: Pawel Worach and Daniel Hartmeier Submitted by: Mohan Srinivasan mohans at yahoo-inc dot com	2004-11-29 18:47:27 +00:00
Robert Watson	2be3bf2244	Assert the inpcb lock in tcp_xmit_timer() as it performs read-modify- write of various time/rtt-related fields in the tcpcb.	2004-11-28 11:06:22 +00:00
Robert Watson	18ad5842c5	Expand coverage of the receive socket buffer lock when handling urgent pointer updates: test available space while holding the socket buffer mutex, and continue to hold until the pointer update has been performed. MFC after: 2 weeks	2004-11-28 11:01:31 +00:00
Robert Watson	c8443a1dc0	Do export the advertised receive window via the tcpi_rcv_space field of struct tcp_info.	2004-11-27 20:20:11 +00:00
Robert Watson	b8af5dfa81	Implement parts of the TCP_INFO socket option as found in Linux 2.6. This socket option allows processes to query a TCP socket for some low level transmission details, such as the current send, bandwidth, and congestion windows. Linux provides a 'struct tcp_info' structure containing various variables, rather than separate socket options; this makes the API somewhat fragile as it makes it difficult to add new entries of interest as requirements and implementation evolve. As such, I've included a large pad at the end of the structure. Right now, relatively few of the Linux API fields are filled in, and some contain no logical equivalent on FreeBSD. I've included __'d entries in the structure to make it easier to figure out what is and isn't omitted. This API/ABI should be considered unstable for the time being.	2004-11-26 18:58:46 +00:00
Mike Silbersack	6a220ed80a	Fix a problem where our TCP stack would ignore RST packets if the receive window was 0 bytes in size. This may have been the cause of unsolved "connection not closing" reports over the years. Thanks to Michiel Boland for providing the fix and providing a concise test program for the problem. Submitted by: Michiel Boland MFC after: 2 weeks	2004-11-25 19:04:20 +00:00
Robert Watson	de30ea131f	In tcp_reass(), assert the inpcb lock on the passed tcpcb, since the contents of the tcpcb are read and modified in volume. In tcp_input(), replace th comparison with 0 with a comparison with NULL. At the 'findpcb', 'dropafterack', and 'dropwithreset' labels in tcp_input(), assert 'headlocked'. Try to improve consistency between various assertions regarding headlocked to be more informative. MFC after: 2 weeks	2004-11-23 23:41:20 +00:00
Robert Watson	cce83ffb5a	tcp_timewait() performs multiple non-atomic reads on the tcptw structure, so assert the inpcb lock associated with the tcptw. Also assert the tcbinfo lock, as tcp_timewait() may call tcp_twclose() or tcp_timer_2msl_reset(), which require it. Since tcp_timewait() is already called with that lock from tcp_input(), this doesn't change current locking, merely documents reasons for it. In tcp_twstart(), assert the tcbinfo lock, as tcp_timer_2msl_reset() is called, which requires that lock. In tcp_twclose(), assert the tcbinfo lock, as tcp_timer_2msl_stop() is called, which requires that lock. Document the locking strategy for the time wait queues in tcp_timer.c, which consists of protecting the time wait queues in the same manner as the tcbinfo structure (using the tcbinfo lock). In tcp_timer_2msl_reset(), assert the tcbinfo lock, as the time wait queues are modified. In tcp_timer_2msl_stop(), assert the tcbinfo lock, as the time wait queues may be modified. In tcp_timer_2msl_tw(), assert the tcbinfo lock, as the time wait queues may be modified. MFC after: 2 weeks	2004-11-23 17:21:30 +00:00
Robert Watson	b42ff86e73	De-spl tcp_slowtimo; tcp_maxidle assignment is subject to possible but unlikely races that could be corrected by having tcp_keepcnt and tcp_keepintvl modifications go through handler functions via sysctl, but probably is not worth doing. Updates to multiple sysctls within evaluation of a single addition are unlikely. Annotate that tcp_canceltimers() is currently unused. De-spl tcp_timer_delack(). De-spl tcp_timer_2msl(). MFC after: 2 weeks	2004-11-23 16:45:07 +00:00
Robert Watson	7258e91f0f	Assert the inpcb lock in tcp_twstart(), which both performs read-modify-write on the tcpcb and calls into tcp_close() and tcp_twrespond(). Annotate that tcp_twrecycleable() requires the inpcb lock because it does a series of non-atomic reads of the tcpcb, but is currently called without the inpcb lock by the caller. This is a bug. Assert the inpcb lock in tcp_twclose() as it performs a read-modify-write of the timewait structure/inpcb, and calls in_pcbdetach() which requires the lock. Assert the inpcb lock in tcp_twrespond(), as it performs multiple non-atomic reads of the tcptw and inpcb structures, as well as calling mac_create_mbuf_from_inpcb() and tcpip_fillheaders(), which require the inpcb lock. MFC after: 2 weeks	2004-11-23 16:23:13 +00:00
Robert Watson	8263bab34d	Assert inpcb lock in tcp_quench(), tcp_drop_syn_sent(), tcp_mtudisc(), and tcp_drop(), due to read-modify-write of TCP state variables. MFC after: 2 weeks	2004-11-23 16:06:15 +00:00
Robert Watson	8438db0f59	Assert the tcbinfo write lock in tcp_new_isn(), as the tcbinfo lock protects access to the ISN state variables. Acquire the tcbinfo write lock in tcp_isn_tick() to synchronize timer-driven isn bumping. Staticize internal ISN variables since they're not used outside of tcp_subr.c. MFC after: 2 weeks	2004-11-23 15:59:43 +00:00
Robert Watson	ca127a3e80	Remove "Unlocked read" annotations associated with previously unlocked use of socket buffer fields in the TCP input code. These references are now protected by use of the receive socket buffer lock. MFC after: 1 week	2004-11-22 13:16:27 +00:00
Robert Watson	98734750b4	s/send/sent/ in comment describing TCPS_SYN_RECEIVED.	2004-11-21 14:38:04 +00:00
Gleb Smirnoff	c1384b5ae2	- Since divert protocol is not connection oriented, remove SS_ISCONNECTED flag from divert sockets. - Remove div_disconnect() method, since it shouldn't be called now. - Remove div_abort() method. It was never called directly, since protocol doesn't have listen queue. It was called only from div_disconnect(), which is removed now. Reviewed by: rwatson, maxim Approved by: julian (mentor) MT5 after: 1 week MT4 after: 1 month	2004-11-18 13:49:18 +00:00
Max Laier	9a6a6eeba2	Fix host route addition for more than one address to a loopback interface after allowing more than one address with the same prefix. Reported by: Vladimir Grebenschikov <vova NO fbsd SPAM ru> Submitted by: ru (also NetBSD rev. 1.83) Pointyhat to: mlaier	2004-11-17 23:14:03 +00:00
Max Laier	81d96ce8a4	Merge copyright notices. Requested by: njl	2004-11-13 17:05:40 +00:00
Gleb Smirnoff	ea0bd57615	Fix ng_ksocket(4) operation as a divert socket, which is pretty useful and has been broken twice: - in the beginning of div_output() replace KASSERT with assignment, as it was in rev. 1.83. [1] [to be MFCed] - refactor changes introduced in rev. 1.100: do not prepend a new tag unconditionally. Before doing this check whether we have one. [2] A small note for all hacking in this area: when divert socket is not a real userland, but ng_ksocket(4), we receive _the same_ mbufs, that we transmitted to socket. These mbufs have rcvif, the tags we've put on them. And we should treat them correctly. Discussed with: mlaier [1] Silence from: green [2] Reviewed by: maxim Approved by: julian (mentor) MFC after: 1 week	2004-11-12 22:17:42 +00:00
Max Laier	48321abefe	Change the way we automatically add prefix routes when adding a new address. This makes it possible to have more than one address with the same prefix. The first address added is used for the route. On deletion of an address with IFA_ROUTE set, we try to find a "fallback" address and hand over the route if possible. I plan to MFC this in 4 weeks, hence I keep the - now obsolete - argument to in_ifscrub as it must be considered KAPI as it is not static in in.c. I will clean this after the MFC. Discussed on: arch, net Tested by: many testers of the CARP patches Nits from: ru, Andrea Campi <andrea+freebsd_arch webcom it> Obtained from: WIDE via OpenBSD MFC after: 1 month	2004-11-12 20:53:51 +00:00
Poul-Henning Kamp	e21e4c19c9	Add missing '=' Spotted by: obrien	2004-11-11 19:02:01 +00:00
Andre Oppermann	5e7b233055	Fix a double-free in the 'hlen > m->m_len' sanity check. Bug report by: <james@towardex.com> MFC after: 2 weeks	2004-11-09 09:40:32 +00:00
SUZUKI Shinsuke	3d54848fc2	Support TCP-MD5 (IPv4) in KAME-IPSEC, too. MFC after: 3 weeks	2004-11-08 18:49:51 +00:00
Poul-Henning Kamp	756d52a195	Initialize struct pr_userreqs in new/sparse style and fill in common default elements in net_init_domain(). This makes it possible to grep these structures and see any bogosities.	2004-11-08 14:44:54 +00:00
Robert Watson	d6915262af	Do some re-sorting of TCP pcbinfo locking and assertions: make sure to retain the pcbinfo lock until we're done using a pcb in the in-bound path, as the pcbinfo lock acts as a pseudo-reference to prevent the pcb from potentially being recycled. Clean up assertions and make sure to assert that the pcbinfo is locked at the head of code subsections where it is needed. Free the mbuf at the end of tcp_input after releasing any held locks to reduce the time the locks are held. MFC after: 3 weeks	2004-11-07 19:19:35 +00:00
Andre Oppermann	e9a4cd2426	Fix a double-free in the 'm->m_len < sizeof (struct ip)' sanity check. Bug report by: <james@towardex.com> MFC after: 2 weeks	2004-11-06 10:47:36 +00:00
Poul-Henning Kamp	c83c1318f5	Hide udp_in6 behind #ifdef INET6	2004-11-04 07:14:03 +00:00
Bruce M Simpson	38f061057b	When performing IP fast forwarding, immediately drop traffic which is destined for a blackhole route. This also means that blackhole routes do not need to be bound to lo(4) or disc(4) interfaces for the net.inet.ip.fastforwarding=1 case. Submitted by: james at towardex dot com Sponsored by: eXtensible Open Router Project <URL:http://www.xorp.org/> MFC after: 3 weeks	2004-11-04 02:14:38 +00:00
Robert Watson	d4b509bd7f	Until this change, the UDP input code used global variables udp_in, udp_in6, and udp_ip6 to pass socket address state between udp_input(), udp_append(), and soappendaddr_locked(). While fine in the default configuration, when running with multiple netisrs or direct ithread dispatch, this can result in races wherein user processes using recvmsg() get back the wrong source IP/port. To correct this and related races: - Eliminate udp_ip6, which is believed to be generated but then never used. Eliminate ip_2_ip6_hdr() as it is now unneeded. - Eliminate setting, testing, and existence of 'init' status fields for the IPv6 structures. While with multiple UDP delivery this could lead to amortization of IPv4 -> IPv6 conversion when delivering an IPv4 UDP packet to an IPv6 socket, it added substantial complexity and side effects. - Move global structures into the stack, declaring udp_in in udp_input(), and udp_in6 in udp_append() to be used if a conversion is required. Pass &udp_in into udp_append(). - Re-annotate comments to reflect updates. With this change, UDP appears to operate correctly in the presence of substantial inbound processing parallelism. This solution avoids introducing additional synchronization, but does increase the potential stack depth. Discovered by: kris (Bug Magnet) MFC after: 3 weeks	2004-11-04 01:25:23 +00:00
Andre Oppermann	c94c54e4df	Remove RFC1644 T/TCP support from the TCP side of the network stack. A complete rationale and discussion is given in this message and the resulting discussion: http://docs.freebsd.org/cgi/mid.cgi?4177C8AD.6060706 Note that this commit removes only the functional part of T/TCP from the tcp_* related functions in the kernel. Other features introduced with RFC1644 are left intact (socket layer changes, sendmsg(2) on connection oriented protocols) and are meant to be reused by a simpler and less intrusive reimplementation of the previous T/TCP functionality. Discussed on: -arch	2004-11-02 22:22:22 +00:00
Robert Watson	ab5c14d828	Correct a bug in TCP SACK that could result in wedging of the TCP stack under high load: only set function state to loop and continue sending if there is no data left to send. RELENG_5_3 candidate. Feet provided: Peter Losher <Peter underscore Losher at isc dot org> Diagnosed by: Daniel Hartmeier <daniel at benzedrine dot cx> Submitted by: mohan <mohans at yahoo-inc dot com>	2004-10-30 12:02:50 +00:00
Robert Watson	c427483381	Add a matching tunable for net.inet.tcp.sack.enable sysctl.	2004-10-26 08:59:09 +00:00
Bruce M Simpson	d6fa5d2806	Check that rt_mask(rt) is non-NULL before dereferencing it, in the RTM_ADD case, thus avoiding a panic. Submitted by: Iasen Kostov	2004-10-26 03:31:58 +00:00
Andre Oppermann	84bb6a2e75	IPDIVERT is a module now and tell the other parts of the kernel about it. IPDIVERT depends on IPFIREWALL being loaded or compiled into the kernel.	2004-10-25 20:02:34 +00:00
Ruslan Ermilov	a35d88931c	For variables that are only checked with defined(), don't provide any fake value.	2004-10-24 15:33:08 +00:00
Andre Oppermann	cd109b0d82	Shave 40 unused bytes from struct tcpcb.	2004-10-22 19:55:04 +00:00
Andre Oppermann	21dcc96f4a	When printing the initialization string and IPDIVERT is not compiled into the kernel refer to it as "loadable" instead of "disabled".	2004-10-22 19:18:06 +00:00
Andre Oppermann	24fc79b0a4	Refuse to unload the ipdivert module unless the 'force' flag is given to kldunload. Reflect the fact that IPDIVERT is a loadable module in the divert(4) and ipfw(8) man pages.	2004-10-22 19:12:01 +00:00
Andre Oppermann	57bbe2e1ab	Destroy the UMA zone on unload.	2004-10-19 22:51:20 +00:00
Andre Oppermann	2de1a9eb6e	Slightly extend the locking during unload to fully cover the protocol deregistration. This does not entirely close the race but narrows the even previously extremely small chance of a race some more.	2004-10-19 22:08:13 +00:00
Robert Watson	279128e295	Annotate a newly introduced race present due to the unloading of protocols: it is possible for sockets to be created and attached to the divert protocol between the test for sockets present and successful unload of the registration handler. We will need to explore more mature APIs for unregistering the protocol and then draining consumers, or an atomic test-and-unregister mechanism.	2004-10-19 21:35:42 +00:00
Andre Oppermann	72584fd2c0	Convert IPDIVERT into a loadable module. This makes use of the dynamic loadability of protocols. The call to divert_packet() is done through a function pointer. All semantics of IPDIVERT remain intact. If IPDIVERT is not loaded ipfw will refuse to install divert rules and natd will complain about 'protocol not supported'. Once it is loaded both will work and accept rules and open the divert socket. The module can only be unloaded if no divert sockets are open. It does not close any divert sockets when an unload is requested but will return EBUSY instead.	2004-10-19 21:14:57 +00:00
Andre Oppermann	969bb53e80	Properly declare the "net.inet" sysctl subtree.	2004-10-19 21:06:14 +00:00
Andre Oppermann	539be79a9d	Pre-emptively define IPPROTO_SPACER to 32767, the same value as PROTO_SPACER to document that this value is globally assigned for a special purpose and may not be reused within the IPPROTO number space.	2004-10-19 20:59:01 +00:00
Andre Oppermann	dff3237ee5	Make use of the PROTO_SPACER functionality for dynamically loadable protocols in inetsw[] and define initially eight spacer slots. Remove conflicting declaration 'struct pr_usrreqs nousrreqs'. It is now declared and initialized in kern/uipc_domain.c.	2004-10-19 15:58:22 +00:00
Andre Oppermann	de38924dc0	Support for dynamically loadable and unloadable IP protocols in the ipmux. With pr_proto_register() it has become possible to dynamically load protocols within the PF_INET domain. However the PF_INET domain has a second important structure called ip_protox[] that is derived from the 'struct protosw inetsw[]' and takes care of the de-multiplexing of the various protocols that ride on top of IP packets. The functions ipproto_[un]register() allow to dynamically adjust the ip_protox[] array mux in a consistent and easy way. To register a protocol within ip_protox[] the existence of a corresponding and matching protocol definition in inetsw[] is required. The function does not allow to overwrite an already registered protocol. The unregister function simply replaces the mux slot with the default index pointer to IPPROTO_RAW as it was previously.	2004-10-19 15:45:57 +00:00
Andre Oppermann	1cf15713ed	Add a macro for the destruction of INP_INFO_LOCK's used by loadable modules.	2004-10-19 14:34:13 +00:00
Andre Oppermann	de1c2ac4bf	Make comments more clear. Change the order of one if() statement to check the more likely variable first.	2004-10-19 14:31:56 +00:00
Robert Watson	81158452be	Push acquisition of the accept mutex out of sofree() into the caller (sorele()/sotryfree()): - This permits the caller to acquire the accept mutex before the socket mutex, avoiding sofree() having to drop the socket mutex and re-order, which could lead to races permitting more than one thread to enter sofree() after a socket is ready to be free'd. - This also covers clearing of the so_pcb weak socket reference from the protocol to the socket, preventing races in clearing and evaluation of the reference such that sofree() might be called more than once on the same socket. This appears to close a race I was able to easily trigger by repeatedly opening and resetting TCP connections to a host, in which the tcp_close() code called as a result of the RST raced with the close() of the accepted socket in the user process resulting in simultaneous attempts to de-allocate the same socket. The new locking increases the overhead for operations that may potentially free the socket, so we will want to revise the synchronization strategy here as we normalize the reference counting model for sockets. The use of the accept mutex in freeing of sockets that are not listen sockets is primarily motivated by the potential need to remove the socket from the incomplete connection queue on its parent (listen) socket, so cleaning up the reference model here may allow us to substantially weaken the synchronization requirements. RELENG_5_3 candidate. MFC after: 3 days Reviewed by: dwhite Discussed with: gnn, dwhite, green Reported by: Marc UBM Bocklet <ubm at u-boot-man dot de> Reported by: Vlad <marchenko at gmail dot com>	2004-10-18 22:19:43 +00:00
Robert Watson	6b8e5a9862	Don't release the udbinfo lock until after the last use of UDP inpcb in udp_input(), since the udbinfo lock is used to prevent removal of the inpcb while in use (i.e., as a form of reference count) in the in-bound path. RELENG_5 candidate.	2004-10-12 20:03:56 +00:00
Robert Watson	00fcf9d12d	Modify the thrilling "%D is using my IP address %s!" message so that it isn't printed if the IP address in question is '0.0.0.0', which is used by nodes performing DHCP lookup, and so constitute a false positive as a report of misconfiguration.	2004-10-12 17:10:40 +00:00
Robert Watson	6c67b8b695	When the access control on creating raw sockets was modified so that processes in jail could create raw sockets, additional access control checks were added to raw IP sockets to limit the ways in which those sockets could be used. Specifically, only the socket option IP_HDRINCL was permitted in rip_ctloutput(). Other socket options were protected by a call to suser(). This change was required to prevent processes in a Jail from modifying system properties such as multicast routing and firewall rule sets. However, it also introduced a regression: processes that create a raw socket with root privilege, but then downgraded credential (i.e., a daemon giving up root, or a setuid process switching back to the real uid) could no longer issue other unprivileged generic IP socket option operations, such as IP_TOS, IP_TTL, and the multicast group membership options, which prevented multicast routing daemons (and some other tools) from operating correctly. This change pushes the access control decision down to the granularity of individual socket options, rather than all socket options, on raw IP sockets. When rip_ctloutput() doesn't implement an option, it will now pass the request directly to in_control() without an access control check. This should restore the functionality of the generic IP socket options for raw sockets in the above-described scenarios, which may be confirmed with the ipsockopt regression test. RELENG_5 candidate. Reviewed by: csjp	2004-10-12 16:47:25 +00:00
Robert Watson	cf2942b67c	Acquire the send socket buffer lock around tcp_output() activities reaching into the socket buffer. This prevents a number of potential races, including dereferencing of sb_mb while unlocked leading to a NULL pointer deref (how I found it). Potentially this might also explain other "odd" TCP behavior on SMP boxes (although haven't seen it reported). RELENG_5 candidate.	2004-10-09 16:48:51 +00:00
Robert Watson	fcf4e3a168	When running with debug.mpsafenet=0, initialize IP multicast routing callouts as non-CALLOUT_MPSAFE. Otherwise, they may trigger an assertion regarding Giant if they enter other parts of the stack from the callout. MFC after: 3 days Reported by: Dikshie < dikshie at ppk dot itb dot ac dot id >	2004-10-07 14:13:35 +00:00
Paul Saab	a55db2b6e6	- Estimate the amount of data in flight in sack recovery and use it to control the packets injected while in sack recovery (for both retransmissions and new data). - Cleanups to the sack codepaths in tcp_output.c and tcp_sack.c. - Add a new sysctl (net.inet.tcp.sack.initburst) that controls the number of sack retransmissions done upon initiation of sack recovery. Submitted by: Mohan Srinivasan <mohans@yahoo-inc.com>	2004-10-05 18:36:24 +00:00
Brian Feldman	c99ee9e042	Add support to IPFW for matching by TCP data length.	2004-10-03 00:47:15 +00:00
Brian Feldman	6daf7ebd28	Add support to IPFW for classification based on "diverted" status (that is, input via a divert socket).	2004-10-03 00:26:35 +00:00
Brian Feldman	974dfe3084	Add to IPFW the ability to do ALTQ classification/tagging.	2004-10-03 00:17:46 +00:00
Brian Feldman	88ef2880c1	Validate the action pointer to be within the rule size, so that trying to add corrupt ipfw rules would not potentially panic the system or worse.	2004-09-30 17:42:00 +00:00
Max Laier	d6a8d58875	Add an additional struct inpcb * argument to pfil(9) in order to enable passing along socket information. This is required to work around a LOR with the socket code which results in an easily reproducible hard lockup with debug.mpsafenet=1. This commit does not fix the LOR, but enables us to do so later. The missing piece is to turn the filter locking into a leaf lock and will follow in a separate (later) commit. This will hopefully be MT5'ed in order to fix the problem for RELENG_5 in the foreseeable future. Suggested by: rwatson A lot of work by: csjp (he'd be even more helpful w/o mentor-reviews ;) Reviewed by: rwatson, csjp Tested by: -pf, -ipfw, LINT, csjp and myself MFC after: 3 days LOR IDs: 14 - 17 (not fixed yet)	2004-09-29 04:54:33 +00:00
Robert Watson	48ac555d83	Assign so_pcb to NULL rather than 0 as it's a pointer. Spotted by: dwhite	2004-09-29 04:01:13 +00:00
Maxim Konovalov	4bc37f9836	o Turn net.inet.ip.check_interface sysctl off by default. When net.inet.ip.check_interface was MFCed to RELENG_4 3+ years ago in rev. 1.130.2.17 ip_input.c it was 1 by default but shortly changed to 0 (accidentally?) in rev. 1.130.2.20 in RELENG_4 only. Along with the fact that this knob is not documented, it breaks POLA, especially in bridge environments. OK'ed by: andre Reviewed by: -current	2004-09-24 12:18:40 +00:00
Andre Oppermann	db09bef308	Fix an out of bounds write during the initialization of the PF_INET protocol family to the ip_protox[] array. The protocol number of IPPROTO_DIVERT is larger than IPPROTO_MAX and was initializing memory beyond the array. Catch all these kinds of errors by ignoring protocols that are higher than IPPROTO_MAX or 0 (zero). Add more comments to ip_init().	2004-09-16 18:33:39 +00:00
Andre Oppermann	76ff6dcf46	Clarify some comments for the M_FASTFWD_OURS case in ip_input().	2004-09-15 20:17:03 +00:00
Andre Oppermann	e098266191	Remove the last two global variables that are used to store packet state while it travels through the IP stack. This wasn't much of a problem because IP source routing is disabled by default but when enabled together with SMP and preemption it would have very likely cross-corrupted the IP options in transit. The IP source route options of a packet are now stored in a mtag instead of the global variable.	2004-09-15 20:13:26 +00:00
Andre Oppermann	bda337d05e	Do not allow 'ipfw fwd' command when IPFIREWALL_FORWARD is not compiled into the kernel. Return EINVAL instead.	2004-09-13 19:27:23 +00:00
Andre Oppermann	f91248c1ad	If we have to 'ipfw fwd'-tag a packet the second time in ipfw_pfil_out() don't prepend an already existing tag again. Instead unlink it and prepend it again to have it as the first tag in the chain. PR: kern/71380	2004-09-13 19:20:14 +00:00
Andre Oppermann	f4fca2d8d3	Make comments more clear for the packet changed cases after pfil hooks.	2004-09-13 17:09:06 +00:00
Andre Oppermann	eedc0a7535	Fix ip_input() fallback for the destination modified cases (from the packet filters). After the ipfw to pfil move ip_input() expects M_FASTFWD_OURS tagged packets to have ip_len and ip_off in host byte order instead of network byte order. PR: kern/71652 Submitted by: mlaier (patch)	2004-09-13 17:01:53 +00:00
Andre Oppermann	7c0102f575	Make 'ipfw tee' behave as intended and designed. A tee'd packet is copied and sent to the DIVERT socket while the original packet continues with the next rule. Unlike a normally diverted packet no IP reassembly attempts are made on tee'd packets and they are passed upwards totally unmodified. Note: This will not be MFC'd to 4.x because of major infrastructure changes. PR: kern/64240 (and many others collapsed into that one)	2004-09-13 16:46:05 +00:00
Gleb Smirnoff	324398687f	Check flag do_bridge always, even if kernel was compiled without BRIDGE support. This makes dynamic bridge.ko working. Reviewed by: sam Approved by: julian (mentor) MFC after: 1 week	2004-09-09 12:34:07 +00:00
John-Mark Gurney	cb459254a2	revert comment from rev1.158 now that rev1.225 backed it out.. MFC after: 3 days	2004-09-06 15:48:38 +00:00
Gleb Smirnoff	f46a6aac29	Recover normal behavior: return EINVAL to attempt to add a divert rule when module is built without IPDIVERT. Silence from: andre Approved by: julian (mentor)	2004-09-05 20:06:50 +00:00
John-Mark Gurney	b5d47ff592	fix up socket/ip layer violation... don't assume/know that SO_DONTROUTE == IP_ROUTETOIF and SO_BROADCAST == IP_ALLOWBROADCAST...	2004-09-05 02:34:12 +00:00
Andre Oppermann	3161f583ca	Apply error and success logic consistently to the function netisr_queue() and its users. netisr_queue() now returns (0) on success and ERRNO on failure. At the moment ENXIO (netisr queue not functional) and ENOBUFS (netisr queue full) are supported. Previously it would return (1) on success but the return value of IF_HANDOFF() was interpreted wrongly and (0) was actually returned on success. Due to this schednetisr() was never called to kick the scheduling of the isr. However this was masked by other normal packets coming through netisr_dispatch() causing the dequeueing of waiting packets. PR: kern/70988 Found by: MOROHOSHI Akihiko <moro@remus.dti.ne.jp> MFC after: 3 days	2004-08-27 18:33:08 +00:00
Andre Oppermann	a9c92b54a9	In the case the destination of a packet was changed by the packet filter to point to a local IP address; and the packet was sourced from this host we fill in the m_pkthdr.rcvif with a pointer to the loopback interface. Before the function ifunit("lo0") was used to obtain the ifp. However this is sub-optimal from a performance point of view and might be dangerous if the loopback interface has been renamed. Use the global variable 'loif' instead which always points to the loopback interface. Submitted by: brooks	2004-08-27 15:39:34 +00:00
Andre Oppermann	319c4c256a	Remove a junk line left over from the recent IPFW to PFIL_HOOKS conversion.	2004-08-27 15:32:28 +00:00
Andre Oppermann	c21fd23260	Always compile PFIL_HOOKS into the kernel and remove the associated kernel compile option. All FreeBSD packet filters now use the PFIL_HOOKS API and thus it becomes a standard part of the network stack. If no hooks are connected the entire packet filter hooks section and related activities are jumped over. This removes any performance impact if no hooks are active. Both OpenBSD and DragonFlyBSD have integrated PFIL_HOOKS permanently as well.	2004-08-27 15:16:24 +00:00
Ruslan Ermilov	9bfe6d472a	Revert the last change to sys/modules/ipfw/Makefile and fix a standalone module build in a better way. Silence from: andre MFC after: 3 days	2004-08-26 14:18:30 +00:00
Pawel Jakub Dawidek	a7f3feff1b	Allocate memory when dumping pipes with M_WAITOK flag. On a system with a huge number of pipes, M_NOWAIT fails almost always, because of memory fragmentation. My fix is different than the patch proposed by Pawel Malachowski, because in FreeBSD 5.x we cannot sleep while holding dummynet mutex (in 4.x there is no such lock). My fix is also ugly, but there is no easy way to prepare a nice and clean fix. PR: kern/46557 Submitted by: Eugene Grosbein <eugen@grosbein.pp.ru> Reviewed by: mlaier	2004-08-25 09:31:30 +00:00
Max Laier	ca7a789aa6	Allow early drop for non-ALTQ enabled queues in an ALTQ-enabled kernel. Previously the early drop was disabled unconditionally for ALTQ-enabled kernels. This should give some benefit for the normal gateway + LAN-server case with a busy LAN leg and an ALTQ managed uplink. Reviewed and style help from: cperciva, pjd	2004-08-22 16:42:28 +00:00
Robert Watson	392e840716	When sliding the m_data pointer forward, update m_pkthdr.len as well as m_len, or the pkthdr length will be inconsistent with the actual length of data in the mbuf chain. The symptom of this occurring was "out of data" warnings from in_cksum_skip() on large UDP packets sent via the loopback interface. Foot shot: green	2004-08-22 01:32:48 +00:00
Christian S.J. Peron	5090559b7f	When a prison is given the ability to create raw sockets (when the security.jail.allow_raw_sockets sysctl MIB is set to 1) where privileged access to jails is given out, it is possible for prison root to manipulate various network parameters which affect the host environment. This commit plugs a number of security holes associated with the use of raw sockets and prisons. This commit makes the following changes: - Add a comment to rtioctl warning developers that if they add any ioctl commands, they should use super-user checks where necessary, as it is possible for PRISON root to make it this far in execution. - Add super-user checks for the execution of the SIOCGETVIFCNT and SIOCGETSGCNT IP multicast ioctl commands. - Add a super-user check to rip_ctloutput(). If the calling cred is PRISON root, make sure the socket option name is IP_HDRINCL, otherwise deny the request. Although this patch corrects a number of security problems associated with raw sockets and prisons, the warning in jail(8) should still apply, and by default we should keep the default value of security.jail.allow_raw_sockets MIB to 0 (or disabled) until we are certain that we have tracked down all the problems. Looking forward, we will probably want to eliminate the references to curthread. This may be a MFC candidate for RELENG_5. Reviewed by: rwatson Approved by: bmilekic (mentor)	2004-08-21 17:38:57 +00:00
Robert Watson	e6ccd70936	When prepending space onto outgoing UDP datagram payloads to hold the UDP/IP header, make sure that space is also allocated for the link layer header. If an mbuf must be allocated to hold the UDP/IP header (very likely), then this will avoid an additional mbuf allocation at the link layer. This trick is also used by TCP and other protocols to avoid extra calls to the mbuf allocator in the ethernet (and related) output routines.	2004-08-21 16:14:04 +00:00
Andre Oppermann	ce63226177	Fix a stupid typo which prevented an ipfw KLD unload from successfully cleaning up its remains. Do not terminate 'if' lines with ';'. Spotted by: claudio@OpenBSD.ORG (sitting 3m from my desk) Pointy hat to: andre	2004-08-20 00:36:55 +00:00
Andre Oppermann	70222723f3	When unloading ipfw module use callout_drain() to make absolutely sure that all callouts are stopped and finished. Move it before IPFW_LOCK() to avoid deadlocking when draining callouts.	2004-08-19 23:31:40 +00:00
Andre Oppermann	6f2d4ea6f8	For IPv6 access pointer to tcpcb only after we have checked it is valid. Found by: Coverity's automated analysis (via Ted Unangst)	2004-08-19 20:16:17 +00:00
Andre Oppermann	50ab727669	Give a useful error message if someone tries to compile IPFIREWALL into the kernel without specifying PFIL_HOOKS as well.	2004-08-19 18:38:23 +00:00
Andre Oppermann	9108601915	Do not unconditionally ignore IPDIVERT and IPFIREWALL_FORWARD when building the ipfw KLD. For IPFIREWALL_FORWARD this does not have any side effects. If the module has it but not the kernel it just doesn't do anything. For IPDIVERT the KLD will be unloadable if the kernel doesn't have IPDIVERT compiled in too. However this is the least disturbing behaviour. The user can just recompile either module or the kernel to match the other one. The access to the machine is not denied if ipfw refuses to load.	2004-08-19 17:59:26 +00:00
Andre Oppermann	e4c97eff8e	Bring back the sysctl 'net.inet.ip.fw.enable' to unbreak the startup scripts and to be able to disable ipfw if it was compiled directly into the kernel.	2004-08-19 17:38:47 +00:00
Robert Watson	5c32ea6517	Push down pcbinfo and inpcb locking from udp_send() into udp_output(). This provides greater context for the locking and allows us to avoid locking the pcbinfo structure if no binding operations will take place (i.e., already bound, connected, and no explicit sendto() address).	2004-08-19 01:13:10 +00:00
Robert Watson	4c2bb15a89	In in_pcbrehash(), do assert the inpcb lock as well as the pcbinfo lock.	2004-08-19 01:11:17 +00:00
Robert Watson	0f48e25b63	Fix build of ip_input.c with "options IPSEC" -- the "pass:" label is used with both FAST_IPSEC and IPSEC, but was defined for only FAST_IPSEC.	2004-08-18 03:11:04 +00:00
Peter Wemm	1e5cc10dc2	Make the kernel compile again if you are not using PFIL_HOOKS	2004-08-18 00:37:46 +00:00
Andre Oppermann	9b932e9e04	Convert ipfw to use PFIL_HOOKS. This change is transparent to userland and preserves the ipfw ABI. The ipfw core packet inspection and filtering functions have not been changed, only how ipfw is invoked is different. However there are many changes to how ipfw and its add-ons are handled: In general ipfw is now called through the PFIL_HOOKS and most associated magic, that was in ip_input() or ip_output() previously, is now done in ipfw_check_[in|out]() in the ipfw PFIL handler. IPDIVERT is entirely handled within the ipfw PFIL handlers. A packet to be diverted is checked if it is fragmented, if yes, ip_reass() gets in for reassembly. If not, or all fragments arrived and the packet is complete, divert_packet is called directly. For 'tee' no reassembly attempt is made and a copy of the packet is sent to the divert socket unmodified. The original packet continues its way through ip_input/output(). ipfw 'forward' is done via m_tag's. The ipfw PFIL handlers tag the packet with the new destination sockaddr_in. A check if the new destination is a local IP address is made and the m_flags are set appropriately. ip_input() and ip_output() have some more work to do here. For ip_input() the m_flags are checked and a packet for us is directly sent to the 'ours' section for further processing. Destination changes on the input path are only tagged and the 'srcrt' flag to ip_forward() is set to disable destination checks and ICMP replies at this stage. The tag is going to be handled on output. ip_output() again checks for m_flags and the 'ours' tag. If found, the packet will be dropped back to the IP netisr where it is going to be picked up by ip_input() again and then directly sent to the 'ours' section. When only the destination changes, the route's 'dst' is overwritten with the new destination from the forward m_tag. Then it jumps back at the route lookup again and skips the firewall check because it has been marked with M_SKIP_FIREWALL. ipfw 'forward' has to be compiled into the kernel with 'option IPFIREWALL_FORWARD' to enable it. DUMMYNET is entirely handled within the ipfw PFIL handlers. A packet for a dummynet pipe or queue is directly sent to dummynet_io(). Dummynet will then inject it back into ip_input/ip_output() after it has served its time. Dummynet packets are tagged and will continue from the next rule when they hit the ipfw PFIL handlers again after re-injection. BRIDGING and IPFW_ETHER are not changed yet and use ipfw_chk() directly as they did before. Later this will be changed to dedicated ETHER PFIL_HOOKS. More detailed changes to the code: conf/files Add netinet/ip_fw_pfil.c. conf/options Add IPFIREWALL_FORWARD option. modules/ipfw/Makefile Add ip_fw_pfil.c. net/bridge.c Disable PFIL_HOOKS if ipfw for bridging is active. Bridging ipfw is still directly invoked to handle layer2 headers and packets would get a double ipfw when run through PFIL_HOOKS as well. netinet/ip_divert.c Removed divert_clone() function. It is no longer used. netinet/ip_dummynet.[ch] Neither the route 'ro' nor the destination 'dst' need to be stored while in dummynet transit. Structure members and associated macros are removed. netinet/ip_fastfwd.c Removed all direct ipfw handling code and replace it with the new 'ipfw forward' handling code. netinet/ip_fw.h Removed 'ro' and 'dst' from struct ip_fw_args. netinet/ip_fw2.c (Re)moved some global variables and the module handling. netinet/ip_fw_pfil.c New file containing the ipfw PFIL handlers and module initialization. netinet/ip_input.c Removed all direct ipfw handling code and replace it with the new 'ipfw forward' handling code. ip_forward() no longer requires the 'next_hop' struct sockaddr_in argument. Disable early checks if 'srcrt' is set. netinet/ip_output.c Removed all direct ipfw handling code and replace it with the new 'ipfw forward' handling code. netinet/ip_var.h Add ip_reass() as general function. (Used from ipfw PFIL handlers for IPDIVERT.) netinet/raw_ip.c Directly check if ipfw and dummynet control pointers are active. netinet/tcp_input.c Rework the 'ipfw forward' to local code to work with the new way of forward tags. netinet/tcp_sack.c Remove include 'opt_ipfw.h' which is not needed here. sys/mbuf.h Remove m_claim_next() macro which was exclusively for ipfw 'forward' and is no longer needed. Approved by: re (scottl)	2004-08-17 22:05:54 +00:00
Robert Watson	a4f757cd5d	White space cleanup for netinet before branch: - Trailing tab/space cleanup - Remove spurious spaces between or before tabs This change avoids touching files that Andre likely has in his working set for PFIL hooks changes for IPFW/DUMMYNET. Approved by: re (scottl) Submitted by: Xin LI <delphij@frontfree.net>	2004-08-16 18:32:07 +00:00
David E. O'Brien	5af87d0ea1	Put the 'antispoof' opcode in the proper place in the opcode list such that it doesn't break the ipfw2 ABI.	2004-08-16 12:05:19 +00:00
David Malone	1f44b0a1b5	Get rid of the RANDOM_IP_ID option and make it a sysctl. NetBSD have already done this, so I have styled the patch on their work: 1) introduce a ip_newid() static inline function that checks the sysctl and then decides if it should return a sequential or random IP ID. 2) named the sysctl net.inet.ip.random_id 3) IPv6 flow IDs and fragment IDs are now always random. Flow IDs and frag IDs are significantly less common in the IPv6 world (ie. rarely generated per-packet), so there should be smaller performance concerns. The sysctl defaults to 0 (sequential IP IDs). Reviewed by: andre, silby, mlaier, ume Based on: NetBSD MFC after: 2 months	2004-08-14 15:32:40 +00:00
Poul-Henning Kamp	e7581f0fc2	Fix outgoing ICMP on global instance.	2004-08-14 14:21:09 +00:00
Christian S.J. Peron	31c88a3043	Add the ability to associate ipfw rules with a specific prison ID. Since the only thing truly unique about a prison is its ID, I figured this would be the most granular way of handling this. This commit makes the following changes: - Adds tokenizing and parsing for the ``jail'' command line option to the ipfw(8) userspace utility. - Append the ipfw opcode list with O_JAIL. - While I am here, add a comment informing others that if they want to add additional opcodes, they should append them to the end of the list to avoid ABI breakage. - Add ``fw_prid'' to the ipfw ucred cache structure. - When initializing ucred cache, if the process is jailed, set fw_prid to the prison ID, otherwise set it to -1. - Update man page to reflect these changes. This change was a strong motivator behind the ucred caching mechanism in ipfw. A sample usage of this new functionality could be: ipfw add count ip from any to any jail 2 It should be noted that because ucred based constraints are only implemented for TCP and UDP packets, the same applies for jail associations. Conceptual head nod by: pjd Reviewed by: rwatson Approved by: bmilekic (mentor)	2004-08-12 22:06:55 +00:00
David Malone	849112666a	In tcp6_ctlinput, lock tcbinfo around the call to syncache_unreach so that the locks held are the same as the IPv4 case. Reviewed by: rwatson	2004-08-12 18:19:36 +00:00
Andre Oppermann	9d804f818c	Fix two cases of incorrect IPQ_UNLOCK'ing in the merged ip_reass() function. The first one was going to 'dropfrag', which unlocks the IPQ, before the lock was acquired; the second one was doing an unlock and then a 'goto dropfrag', which led to a double-unlock. Tripped over by: des	2004-08-12 08:37:42 +00:00
Robert Watson	c19c5239a6	When udp_send() fails, make sure to free the control mbufs as well as the data mbuf. This was done in most error cases, but not the case where the inpcb pointer is surprisingly NULL.	2004-08-12 01:34:27 +00:00
Andre Oppermann	420a281164	Backout removal of UMA_ZONE_NOFREE flag for all zones which are established for structures with timers in them. It might be that a timer might fire even when the associated structure has already been free'd. Having type- stable storage in this case is beneficial for graceful failure handling and debugging. Discussed with: bosko, tegge, rwatson	2004-08-11 20:30:08 +00:00
Andre Oppermann	4efb805c0c	Remove the UMA_ZONE_NOFREE flag to all uma_zcreate() calls in the IP and TCP code. This flag would have prevented giving back excessive free slabs to the global pool after a transient peak usage.	2004-08-11 17:08:31 +00:00
Andre Oppermann	67d0b24ed1	Make use of in_localip() function and replace previous direct LIST_FOREACH loops over INADDR_HASH.	2004-08-11 12:32:10 +00:00
Andre Oppermann	2eccc90b61	Add the function in_localip() which returns 1 if an internet address is for the local host and configured on one of its interfaces.	2004-08-11 11:49:48 +00:00
Andre Oppermann	6e234ede37	Only invoke verify_path() for verrevpath and versrcreach when we have an IP packet.	2004-08-11 11:41:11 +00:00
Andre Oppermann	767981878c	Only check for local broadcast addresses if the mbuf is flagged with M_BCAST.	2004-08-11 10:49:56 +00:00
Andre Oppermann	0b17fba7bc	Consistently use NULL for pointer comparisons.	2004-08-11 10:46:15 +00:00
Andre Oppermann	de2e5d1e20	Make IP fastforwarding ALTQ-aware by adding the input traffic conditioner check and disabling the early output interface queue length check.	2004-08-11 10:42:59 +00:00
Andre Oppermann	2f6e6e9b4c	Correct the displayed bandwidth calculation for a readout via sysctl. The saved value does not have to be scaled with HZ; it is already in bytes per second. Only the multiply by eight remains to show bits per second (bps).	2004-08-11 10:12:16 +00:00
Robert Watson	27f74fd0ed	Assert the locks of inpcbinfo's and inpcb's passed into in_pcbconnect() and in_pcbconnect_setup(), since these functions frob the port and address state of inpcbs.	2004-08-11 04:35:20 +00:00
Andre Oppermann	bb7c5b3055	Make a comment that IP source routing is not SMP and PREEMPTION safe.	2004-08-09 16:17:37 +00:00
Andre Oppermann	a5053398d4	Make a comment that "ipfw forward" is not SMP and PREEMPTION safe.	2004-08-09 16:16:10 +00:00
Andre Oppermann	5f9541ecbd	New ipfw option "antispoof": For incoming packets, the packet's source address is checked if it belongs to a directly connected network. If the network is directly connected, then the interface the packet came on in is compared to the interface the network is connected to. When incoming interface and directly connected interface are not the same, the packet does not match. Usage example: ipfw add deny ip from any to any not antispoof in Manpage education by: ru	2004-08-09 16:12:10 +00:00
Robert Watson	f31f65a708	Pass pcbinfo structures to in6_pcbnotify() rather than pcbhead structures, allowing in6_pcbnotify() to lock the pcbinfo and each inpcb that it notifies of ICMPv6 events. This prevents inpcb assertions from firing when IPv6 generates and delivers event notifications for inpcbs. Reported by: kuriyama Tested by: kuriyama	2004-08-06 03:45:45 +00:00
Robert Watson	9c1df6951f	When iterating the UDP inpcb list processing an inbound broadcast or multicast packet, we don't need to acquire the inpcb mutex unless we are actually using inpcb fields other than the bound port and address. Since we hold the pcbinfo lock already, these can't change. Defer acquiring the inpcb mutex until we have a high chance of a match. This avoids about 120 mutex operations per UDP broadcast packet received on one of my work systems. Reviewed by: sam	2004-08-06 02:08:31 +00:00
Robert Watson	98aed8ca56	Now that IPv6 performs basic in6pcb and inpcb locking, enable inpcb lock assertions even if IPv6 is compiled into the kernel. Previously, inclusion of IPv6 and locking assertions would result in a rapid assertion failure as IPv6 was not properly locking inpcbs.	2004-08-04 18:27:55 +00:00
Joe Marcus Clarke	5c7e7e80cc	Fix Skinny and PPTP NAT'ing after the introduction of the {ip,tcp,udp}_next functions. Basically, the ip_next() function was used to get the PPTP and Skinny headers when tcp_next() should have been used instead. Symptoms of this included a segfault in natd when trying to process a PPTP or Skinny packet. Approved by: des	2004-08-04 15:17:08 +00:00
Andre Oppermann	81007fd4eb	o Delayed checksums are now calculated in divert_packet() for diverted packets Remove the XXX-escaped code that did it in ip_output()'s IPHACK section.	2004-08-03 14:13:36 +00:00
Andre Oppermann	24a098ea9b	o Move the inflight sysctls to their own sub-tree under net.inet.tcp to be more consistent with the other sysctls around it.	2004-08-03 13:54:11 +00:00
Andre Oppermann	f0cada84b1	o Move all parts of the IP reassembly process into the function ip_reass() to make it fully self-contained. o ip_reass() now returns a new mbuf with the reassembled packet and ip->ip_len including the IP header. o Computation of the delayed checksum is moved into divert_packet(). Reviewed by: silby	2004-08-03 12:31:38 +00:00
Jeffrey Hsu	2ff39e1543	Fix bug with tracking the previous element in a list. Found by: edrt@citiz.net Submitted by: pavlin@icir.org	2004-08-03 02:01:44 +00:00
Yaroslav Tykhiy	a4eb4405e3	Disallow a particular kind of port theft described by the following scenario: Alice is too lazy to write a server application in PF-independent manner. Therefore she knocks up the server using PF_INET6 only and allows the IPv6 socket to accept mapped IPv4 as well. An evil hacker known on IRC as cheshire_cat has an account in the same system. He starts a process listening on the same port as used by Alice's server, but in PF_INET. As a consequence, cheshire_cat will distract all IPv4 traffic supposed to go to Alice's server. Such sort of port theft was initially enabled by copying the code that implemented the RFC 2553 semantics on IPv4/6 sockets (see inet6(4)) for the implied case of the same owner for both connections. After this change, the above scenario will be impossible. In the same setting, the user who attempts to start his server last will get EADDRINUSE. Of course, using IPv4 mapped to IPv6 leads to security complications in the first place, but there is no reason to make it even more unsafe. This change doesn't apply to KAME since it affects a FreeBSD-specific part of the code. It doesn't modify the out-of-box behaviour of the TCP/IP stack either as long as mapping IPv4 to IPv6 is off by default. MFC after: 1 month	2004-07-28 13:03:07 +00:00
Jayanth Vijayaraghavan	5d3b1b7556	Fix a bug in the sack code that was causing data to be retransmitted with the FIN bit set for all segments, if a FIN has already been sent before. The fix will allow the FIN bit to be set for only the last segment, in case it has to be retransmitted. Fix another bug that would have caused snd_nxt to be pulled by len if there was an error from ip_output. snd_nxt should not be touched during sack retransmissions.	2004-07-28 02:15:14 +00:00
Jayanth Vijayaraghavan	e9f2f80e09	Fix for a SACK bug where the very last segment retransmitted from the SACK scoreboard could result in the next (untransmitted) segment to be skipped.	2004-07-26 23:41:12 +00:00
John-Mark Gurney	0aa8ce5012	compare pointer against NULL, not 0 when inpcb is NULL, this is no longer invalid since jlemon added the tcp_twstart function... this prevents close "failing" w/ EINVAL when it really was successful... Reviewed by: jeremy (NetBSD)	2004-07-26 21:29:56 +00:00
Colin Percival	56f21b9d74	Rename suser_cred()'s PRISON_ROOT flag to SUSER_ALLOWJAIL. This is somewhat clearer, but more importantly allows for a consistent naming scheme for suser_cred flags. The old name is still defined, but will be removed in a few days (unless I hear any complaints...) Discussed with: rwatson, scottl Requested by: jhb	2004-07-26 07:24:04 +00:00
Andre Oppermann	55db762b76	Extend versrcreach by checking against the rt_flags for RTF_REJECT and RTF_BLACKHOLE as well. To quote the submitter: The uRPF loose-check implementation by the industry vendors, at least on Cisco and possibly Juniper, will fail the check if the route of the source address is pointed to Null0 (on Juniper, discard or reject route). What this means is, even if uRPF Loose-check finds the route, if the route is pointed to blackhole, uRPF loose-check must fail. This allows people to utilize uRPF loose-check mode as a pseudo-packet-firewall without using any manual filtering configuration -- one can simply inject a IGP or BGP prefix with next-hop set to a static route that directs to null/discard facility. This results in uRPF Loose-check failing on all packets with source addresses that are within the range of the nullroute. Submitted by: James Jun <james@towardex.com>	2004-07-21 19:55:14 +00:00
Robert Watson	2d01d331c6	M_PREPEND() the IP header on to the front of an outgoing raw IP packet using M_DONTWAIT rather than M_WAITOK to avoid sleeping on memory while holding a mutex.	2004-07-20 20:52:30 +00:00
Jayanth Vijayaraghavan	04f0d9a0ea	Let IN_FASTRECOVERY macro decide if we are in recovery mode. Nuke sackhole_limit for now. We need to add it back to limit the total number of sack blocks in the system.	2004-07-19 22:37:33 +00:00
Jayanth Vijayaraghavan	f787edd847	Fix a potential panic in the SACK code that was causing 1) data to be sent to the right of snd_recover. 2) send more data than what's in the send buffer. The fix is to postpone sack retransmit to a subsequent recovery episode if the current retransmit pointer is beyond snd_recover. Thanks to Mohan Srinivasan for helping fix the bug. Submitted by: Daniel Lang	2004-07-19 22:06:01 +00:00
David Malone	932312d60b	Fix the !INET6 build. Reported by: alc	2004-07-17 21:40:14 +00:00
David Malone	969860f3ed	The tcp syncache code was leaving the IPv6 flowlabel uninitialised for the SYN|ACK packet and then letting in6_pcbconnect set the flowlabel later. Arrange for the syncache/syncookie code to set and recall the flow label so that the flowlabel used for the SYN|ACK is consistent. This is done by using some of the cookie (when tcp cookies are enabled) and by stashing the flowlabel in syncache. Tested and Discovered by: Orla McGann <orly@cnri.dit.ie> Approved by: ume, silby MFC after: 1 month	2004-07-17 19:44:13 +00:00
Max Laier	c550f2206d	Define semantic of M_SKIP_FIREWALL more precisely, i.e. also pass associated icmp_error() packets. While here retire PACKET_TAG_PF_GENERATED (which served the same purpose) and use M_SKIP_FIREWALL in pf as well. This should speed up things a bit as we get rid of the tag allocations. Discussed with: juli	2004-07-17 05:10:06 +00:00
Juli Mallett	765d141c78	Make M_SKIP_FIREWALL a global (and semantic) flag, preventing anything from using M_PROTO6 and possibly shooting someone's foot, as well as allowing the firewall to be used in multiple passes, or with a packet classifier frontend, that may need to explicitly allow a certain packet. Presently this is handled in the ipfw_chk code as before, though I have run with it moved to upper layers, and possibly it should apply to ipfilter and pf as well, though this has not been investigated. Discussed with: luigi, rwatson	2004-07-17 02:40:13 +00:00
Hajimu UMEMOTO	8a59da300c	when IN6P_AUTOFLOWLABEL is set, the flowlabel is not set on outgoing tcp connections. Reported by: Orla McGann <orly@cnri.dit.ie> Reviewed by: Orla McGann <orly@cnri.dit.ie> Obtained from: KAME	2004-07-16 18:08:13 +00:00
Poul-Henning Kamp	3e019deaed	Do a pass over all modules in the kernel and make them return EOPNOTSUPP for unknown events. A number of modules return EINVAL in this instance, and I have left those alone for now and instead taught MOD_QUIESCE to accept this as "didn't do anything".	2004-07-15 08:26:07 +00:00
Stefan Farfeleder	439dfb0c35	Remove erroneous semicolons.	2004-07-13 16:06:19 +00:00
Robert Watson	7cfc690440	After each label in tcp_input(), assert the inpcbinfo and inpcb lock state that we expect.	2004-07-12 19:28:07 +00:00
Brian Somers	0ac4013324	Change the following environment variables to kernel options: bootp -> BOOTP bootp.nfsroot -> BOOTP_NFSROOT bootp.nfsv3 -> BOOTP_NFSV3 bootp.compat -> BOOTP_COMPAT bootp.wired_to -> BOOTP_WIRED_TO - i.e. back out the previous commit. It's already possible to pxeboot(8) with a GENERIC kernel. Pointed out by: dwmalone	2004-07-08 22:35:36 +00:00
Brian Somers	59e1ebc9b5	Change the following kernel options to environment variables: BOOTP -> bootp BOOTP_NFSROOT -> bootp.nfsroot BOOTP_NFSV3 -> bootp.nfsv3 BOOTP_COMPAT -> bootp.compat BOOTP_WIRED_TO -> bootp.wired_to This lets you PXE boot with a GENERIC kernel by putting this sort of thing in loader.conf: bootp="YES" bootp.nfsroot="YES" bootp.nfsv3="YES" bootp.wired_to="bge1" or even setting the variables manually from the OK prompt.	2004-07-08 13:40:33 +00:00
Dag-Erling Smørgrav	de47739e71	Push WARNS back up to 6, but define NO_WERROR; I want the warts out in the open where people can see them and hopefully fix them.	2004-07-06 12:15:24 +00:00
Dag-Erling Smørgrav	9fa0fd2682	Introduce inline {ip,udp,tcp}_next() functions which take a pointer to an {ip,udp,tcp} header and return a void * pointing to the payload (i.e. the first byte past the end of the header and any required padding). Use them consistently throughout libalias to a) reduce code duplication, b) improve code legibility, c) get rid of a bunch of alignment warnings.	2004-07-06 12:13:28 +00:00
Dag-Erling Smørgrav	e3e2c21639	Rewrite twowords() to access its argument through a char pointer and not a short pointer. The previous implementation seems to be in a gray zone of the C standard, and GCC generates incorrect code for it at -O2 or higher on some platforms.	2004-07-06 09:22:18 +00:00
Dag-Erling Smørgrav	95347a8ee0	Temporarily lower WARNS to 3 while I figure out the alignment issues on alpha.	2004-07-06 08:44:41 +00:00
Dag-Erling Smørgrav	ed01a58215	Make libalias WARNS?=6-clean. This mostly involves renaming variables named link, foo_link or link_foo to lnk, foo_lnk or lnk_foo, fixing signed / unsigned comparisons, and shoving unused function arguments under the carpet. I was hoping WARNS?=6 might reveal more serious problems, and perhaps the source of the -O2 breakage, but found no smoking gun.	2004-07-05 11:10:57 +00:00
Dag-Erling Smørgrav	ffcb611a9d	Parenthesize return values.	2004-07-05 10:55:23 +00:00
Dag-Erling Smørgrav	f311ebb4ec	Mechanical whitespace cleanup.	2004-07-05 10:53:28 +00:00
Poul-Henning Kamp	e6bbb69149	Add LibAliasOutTry() which checks a packet for a hit in the tables, but does not create a new entry if none is found.	2004-07-04 12:53:07 +00:00
Ruslan Ermilov	1a0a934547	Mechanically kill hard sentence breaks.	2004-07-02 23:52:20 +00:00
Jayanth Vijayaraghavan	a0445c2e2c	On receiving 3 duplicate acknowledgements, SACK recovery was not being entered correctly. Fix this problem by separating out the SACK and the newreno cases. Also, check if we are in FASTRECOVERY for the sack case and if so, turn off dupacks. Fix an issue where the congestion window was not being incremented by ssthresh. Thanks to Mohan Srinivasan for finding this problem.	2004-07-01 23:34:06 +00:00
Ruslan Ermilov	c9a246418d	Bumped document date. Fixed markup. Fixed examples to match the new API.	2004-07-01 17:51:48 +00:00
Poul-Henning Kamp	e3e244bff6	Rwatson, write 100 times for tomorrow: First unlock, then assign NULL to pointer.	2004-06-27 21:54:34 +00:00
Pawel Jakub Dawidek	0a44517d3a	Those are unneeded too.	2004-06-27 09:06:10 +00:00
Pawel Jakub Dawidek	46e3b1cbe7	Add two missing includes and remove two unneeded. This is quite a serious fix, because even with MAC framework compiled in, MAC entry points in those two files were simply ignored.	2004-06-27 09:03:22 +00:00
Robert Watson	1e4d7da707	Reduce the number of unnecessary unlock-relocks on socket buffer mutexes associated with performing a wakeup on the socket buffer: - When performing an sbappend*() followed by a so[rw]wakeup(), explicitly acquire the socket buffer lock and use the _locked() variants of both calls. Note that the _locked() sowakeup() versions unlock the mutex on return. This is done in uipc_send(), divert_packet(), mroute socket_send(), raw_append(), tcp_reass(), tcp_input(), and udp_append(). - When the socket buffer lock is dropped before a sowakeup(), remove the explicit unlock and use the _locked() sowakeup() variant. This is done in soisdisconnecting(), soisdisconnected() when setting the can't send/receive flags and dropping data, and in uipc_rcvd() when adjusting back-pressure on the sockets. For UNIX domain sockets running mpsafe with a contention-intensive SMP mysql benchmark, this results in a 1.6% query rate improvement due to reduced mutex costs.	2004-06-26 19:10:39 +00:00
Robert Watson	3f9d1ef905	Remove spl's from TCP protocol entry points. While not all locking is merged here yet, this will ease the merge process by bringing the locked and unlocked versions into sync.	2004-06-26 17:50:50 +00:00
Paul Saab	652178a12a	White space & spelling fixes Submitted by: Xin LI <delphij@frontfree.net>	2004-06-25 04:11:26 +00:00
Bruce M Simpson	37332f049f	Whitespace.	2004-06-25 02:29:58 +00:00
Robert Watson	5905999b2f	Broaden scope of the socket buffer lock when processing an ACK so that the read and write of sb_cc are atomic. Call sbdrop_locked() instead of sbdrop() since we already hold the socket buffer lock.	2004-06-24 03:07:27 +00:00
Robert Watson	927c5cea3f	Protect so_oobmark with SOCKBUF_LOCK(&so->so_rcv), and broaden locking in tcp_input() for TCP packets with urgent data pointers to hold the socket buffer lock across testing and updating oobmark from just protecting sb_state. Update socket locking annotations.	2004-06-24 02:57:12 +00:00
Robert Watson	a138d21769	In ip_ctloutput(), acquire the inpcb lock around some of the basic inpcb flag and status updates.	2004-06-24 02:05:47 +00:00
Robert Watson	d67ec3dd48	When asserting non-Giant locks in the network stack, also assert Giant if debug.mpsafenet=0, as any points that require synchronization in the SMPng world also required it in the Giant-world: - inpcb locks (including IPv6) - inpcbinfo locks (including IPv6) - dummynet subsystem lock - ipfw2 subsystem lock	2004-06-24 02:01:48 +00:00
Robert Watson	3f11a2f374	Introduce sbreserve_locked(), which asserts the socket buffer lock on the socket buffer having its limits adjusted. sbreserve() now acquires the lock before calling sbreserve_locked(). In soreserve(), acquire socket buffer locks across read-modify-writes of socket buffer fields, and calls into sbreserve/sbrelease; make sure to acquire in keeping with the socket buffer lock order. In tcp_mss(), acquire the socket buffer lock in the calling context so that we have atomic read-modify-write on buffer sizes.	2004-06-24 01:37:04 +00:00
Paul Saab	76947e3222	Move the sack sysctl's under net.inet.tcp.sack: net.inet.tcp.do_sack -> net.inet.tcp.sack.enable; net.inet.tcp.sackhole_limit -> net.inet.tcp.sack.sackhole_limit. Requested by: wollman	2004-06-23 21:34:07 +00:00
Paul Saab	6d90faf3d8	Add support for TCP Selective Acknowledgements. The work for this originated on RELENG_4 and was ported to -CURRENT. The scoreboarding code was obtained from OpenBSD, and many of the remaining changes were inspired by OpenBSD, but not taken directly from there. You can enable/disable sack using net.inet.tcp.do_sack. You can also limit the number of sack holes that all senders can have in the scoreboard with net.inet.tcp.sackhole_limit. Reviewed by: gnn Obtained from: Yahoo! (Mohan Srinivasan, Jayanth Vijayaraghavan)	2004-06-23 21:04:37 +00:00
Robert Watson	bb7479a613	Acquire socket lock around frobbing of socket state in divert sockets.	2004-06-22 04:00:51 +00:00
Robert Watson	ffcbc0e4c5	Prefer use of the inpcb as a MAC label source for outgoing packets sent via divert sockets, when available.	2004-06-22 03:58:50 +00:00
Robert Watson	d330008e3b	If debug.mpsafenet is set, initialize TCP callouts as CALLOUT_MPSAFE.	2004-06-20 21:44:50 +00:00
Robert Watson	1f82efb3b7	Assert the inpcb lock before letting MAC check whether we can deliver to the inpcb in tcp_input().	2004-06-20 20:17:29 +00:00
Robert Watson	1b83216eda	IP multicast code no longer needs to acquire Giant before appending an mbuf onto a socket buffer. This is left over from debug.mpsafenet affecting the forwarding/bridging plane only.	2004-06-20 20:10:05 +00:00
Robert Watson	4e397bc524	In tcp_ctloutput(), don't hold the inpcb lock over a call to ip_ctloutput(), as it may need to perform blocking memory allocations. This also improves consistency with locking relative to other points that call into ip_ctloutput(). Bumped into by: Grover Lines <grover@ceribus.net>	2004-06-18 20:22:21 +00:00
Bruce M Simpson	4f450ff9a5	Check that m->m_pkthdr.rcvif is not NULL before checking if a packet was received on a broadcast address on the input path. Under certain circumstances this could result in a panic, notably for locally-generated packets which do not have m_pkthdr.rcvif set. This is a similar situation to that which is solved by src/sys/netinet/ip_icmp.c rev 1.66. PR: kern/52935	2004-06-18 12:58:45 +00:00
Bruce M Simpson	f3e0b7ef7f	Appease GCC.	2004-06-18 09:53:58 +00:00
Bruce M Simpson	5214cb3f59	If SO_DEBUG is enabled for a TCP socket, and a received segment is encapsulated within an IPv6 datagram, do not abuse the 'ipov' pointer when registering trace records. 'ipov' is specific to IPv4, and will therefore be uninitialized. [This fandango is only necessary in the first place because of our host-byte-order IP field pessimization.] PR: kern/60856 Submitted by: Galois Zheng	2004-06-18 03:31:07 +00:00
Bruce M Simpson	da181cc144	Don't set FIN on a retransmitted segment after a FIN has been sent, unless the segment really contains the last of the data for the stream. PR: kern/34619 Obtained from: OpenBSD (tcp_output.c rev 1.47) Noticed by: Joseph Ishac Reviewed by: George Neville-Neil	2004-06-18 02:47:59 +00:00
Bruce M Simpson	27de0135ce	Ensure that dst is bzeroed before calling rtalloc_ign(), to avoid possible routing table corruption. PR: kern/40563, freebsd4/432 (KAME) Obtained from: NetBSD (in_gif.c rev 1.26.10.1) Requested by: Jean-Luc Richier	2004-06-18 02:04:07 +00:00
Max Laier	7c1fe95333	Commit pf version 3.5 and link additional files to the kernel build. Version 3.5 brings: - Atomic commits of ruleset changes (reduce the chance of ending up in an inconsistent state). - A 30% reduction in the size of state table entries. - Source-tracking (limit number of clients and states per client). - Sticky-address (the flexibility of round-robin with the benefits of source-hash). - Significant improvements to interface handling. - and many more ...	2004-06-16 23:24:02 +00:00


2300 Commits