Commit Graph

2129 Commits

Author SHA1 Message Date
Paul Saab
7d5ed1ceea Fixes a bug in SACK causing us to send data beyond the receive window.
Found by: Pawel Worach and Daniel Hartmeier
Submitted by: Mohan Srinivasan mohans at yahoo-inc dot com
2004-11-29 18:47:27 +00:00
Robert Watson
2be3bf2244 Assert the inpcb lock in tcp_xmit_timer() as it performs read-modify-
write of various time/rtt-related fields in the tcpcb.
2004-11-28 11:06:22 +00:00
Robert Watson
18ad5842c5 Expand coverage of the receive socket buffer lock when handling urgent
pointer updates: test available space while holding the socket buffer
mutex, and continue to hold until until the pointer update has been
performed.

MFC after:	2 weeks
2004-11-28 11:01:31 +00:00
Robert Watson
c8443a1dc0 Do export the advertised receive window via the tcpi_rcv_space field of
struct tcp_info.
2004-11-27 20:20:11 +00:00
Robert Watson
b8af5dfa81 Implement parts of the TCP_INFO socket option as found in Linux 2.6.
This socket option allows processes query a TCP socket for some low
level transmission details, such as the current send, bandwidth, and
congestion windows.  Linux provides a 'struct tcpinfo' structure
containing various variables, rather than separate socket options;
this makes the API somewhat fragile as it makes it dificult to add
new entries of interest as requirements and implementation evolve.
As such, I've included a large pad at the end of the structure.
Right now, relatively few of the Linux API fields are filled in, and
some contain no logical equivilent on FreeBSD.  I've include __'d
entries in the structure to make it easier to figure ou what is and
isn't omitted.  This API/ABI should be considered unstable for the
time being.
2004-11-26 18:58:46 +00:00
Mike Silbersack
6a220ed80a Fix a problem where our TCP stack would ignore RST packets if the receive
window was 0 bytes in size.  This may have been the cause of unsolved
"connection not closing" reports over the years.

Thanks to Michiel Boland for providing the fix and providing a concise
test program for the problem.

Submitted by:	Michiel Boland
MFC after:	2 weeks
2004-11-25 19:04:20 +00:00
Robert Watson
de30ea131f In tcp_reass(), assert the inpcb lock on the passed tcpcb, since the
contents of the tcpcb are read and modified in volume.

In tcp_input(), replace th comparison with 0 with a comparison with
NULL.

At the 'findpcb', 'dropafterack', and 'dropwithreset' labels in
tcp_input(), assert 'headlocked'.  Try to improve consistency between
various assertions regarding headlocked to be more informative.

MFC after:	2 weeks
2004-11-23 23:41:20 +00:00
Robert Watson
cce83ffb5a tcp_timewait() performs multiple non-atomic reads on the tcptw
structure, so assert the inpcb lock associated with the tcptw.
Also assert the tcbinfo lock, as tcp_timewait() may call
tcp_twclose() or tcp_2msl_rest(), which require it.  Since
tcp_timewait() is already called with that lock from tcp_input(),
this doesn't change current locking, merely documents reasons for
it.

In tcp_twstart(), assert the tcbinfo lock, as tcp_timer_2msl_rest()
is called, which requires that lock.

In tcp_twclose(), assert the tcbinfo lock, as tcp_timer_2msl_stop()
is called, which requires that lock.

Document the locking strategy for the time wait queues in tcp_timer.c,
which consists of protecting the time wait queues in the same manner
as the tcbinfo structure (using the tcbinfo lock).

In tcp_timer_2msl_reset(), assert the tcbinfo lock, as the time wait
queues are modified.

In tcp_timer_2msl_stop(), assert the tcbinfo lock, as the time wait
queues may be modified.

In tcp_timer_2msl_tw(), assert the tcbinfo lock, as the time wait
queues may be modified.

MFC after:	2 weeks
2004-11-23 17:21:30 +00:00
Robert Watson
b42ff86e73 De-spl tcp_slowtimo; tcp_maxidle assignment is subject to possible
but unlikely races that could be corrected by having tcp_keepcnt
and tcp_keepintvl modifications go through handler functions via
sysctl, but probably is not worth doing.  Updates to multiple
sysctls within evaluation of a single addition are unlikely.

Annotate that tcp_canceltimers() is currently unused.

De-spl tcp_timer_delack().

De-spl tcp_timer_2msl().

MFC after:	2 weeks
2004-11-23 16:45:07 +00:00
Robert Watson
7258e91f0f Assert the inpcb lock in tcp_twstart(), which does both read-modify-write
on the tcpcb, but also calls into tcp_close() and tcp_twrespond().

Annotate that tcp_twrecycleable() requires the inpcb lock because it does
a series of non-atomic reads of the tcpcb, but is currently called
without the inpcb lock by the caller.  This is a bug.

Assert the inpcb lock in tcp_twclose() as it performs a read-modify-write
of the timewait structure/inpcb, and calls in_pcbdetach() which requires
the lock.

Assert the inpcb lock in tcp_twrespond(), as it performs multiple
non-atomic reads of the tcptw and inpcb structures, as well as calling
mac_create_mbuf_from_inpcb(), tcpip_fillheaders(), which require the
inpcb lock.

MFC after:	2 weeks
2004-11-23 16:23:13 +00:00
Robert Watson
8263bab34d Assert inpcb lock in tcp_quench(), tcp_drop_syn_sent(), tcp_mtudisc(),
and tcp_drop(), due to read-modify-write of TCP state variables.

MFC after:	2 weeks
2004-11-23 16:06:15 +00:00
Robert Watson
8438db0f59 Assert the tcbinfo write lock in tcp_new_isn(), as the tcbinfo lock
protects access to the ISN state variables.

Acquire the tcbinfo write lock in tcp_isn_tick() to synchronize
timer-driven isn bumping.

Staticize internal ISN variables since they're not used outside of
tcp_subr.c.

MFC after:	2 weeks
2004-11-23 15:59:43 +00:00
Robert Watson
ca127a3e80 Remove "Unlocked read" annotations associated with previously unlocked
use of socket buffer fields in the TCP input code.  These references
are now protected by use of the receive socket buffer lock.

MFC after:	1 week
2004-11-22 13:16:27 +00:00
Robert Watson
98734750b4 s/send/sent/ in comment describing TCPS_SYN_RECEIVED. 2004-11-21 14:38:04 +00:00
Gleb Smirnoff
c1384b5ae2 - Since divert protocol is not connection oriented, remove SS_ISCONNECTED flag
from divert sockets.
- Remove div_disconnect() method, since it shouldn't be called now.
- Remove div_abort() method. It was never called directly, since protocol
  doesn't have listen queue. It was called only from div_disconnect(),
  which is removed now.

Reviewed by:	rwatson, maxim
Approved by:	julian (mentor)
MT5 after:	1 week
MT4 after:	1 month
2004-11-18 13:49:18 +00:00
Max Laier
9a6a6eeba2 Fix host route addition for more than one address to a loopback interface
after allowing more than one address with the same prefix.

Reported by:	Vladimir Grebenschikov <vova NO fbsd SPAM ru>
Submitted by:	ru (also NetBSD rev. 1.83)
Pointyhat to:	mlaier
2004-11-17 23:14:03 +00:00
Max Laier
81d96ce8a4 Merge copyright notices.
Requested by:	njl
2004-11-13 17:05:40 +00:00
Gleb Smirnoff
ea0bd57615 Fix ng_ksocket(4) operation as a divert socket, which is pretty useful
and has been broken twice:

- in the beginning of div_output() replace KASSERT with assignment, as
  it was in rev. 1.83. [1] [to be MFCed]
- refactor changes introduced in rev. 1.100: do not prepend a new tag
  unconditionally. Before doing this check whether we have one. [2]

A small note for all hacking in this area:
when divert socket is not a real userland, but ng_ksocket(4), we receive
_the same_ mbufs, that we transmitted to socket. These mbufs have rcvif,
the tags we've put on them. And we should treat them correctly.

Discussed with:	mlaier [1]
Silence from:	green [2]
Reviewed by:	maxim
Approved by:	julian (mentor)
MFC after:	1 week
2004-11-12 22:17:42 +00:00
Max Laier
48321abefe Change the way we automatically add prefix routes when adding a new address.
This makes it possible to have more than one address with the same prefix.
The first address added is used for the route. On deletion of an address
with IFA_ROUTE set, we try to find a "fallback" address and hand over the
route if possible.
I plan to MFC this in 4 weeks, hence I keep the - now obsolete - argument to
in_ifscrub as it must be considered KAPI as it is not static in in.c. I will
clean this after the MFC.

Discussed on:	arch, net
Tested by:	many testers of the CARP patches
Nits from:	ru, Andrea Campi <andrea+freebsd_arch webcom it>
Obtained from:	WIDE via OpenBSD
MFC after:	1 month
2004-11-12 20:53:51 +00:00
Poul-Henning Kamp
e21e4c19c9 Add missing '='
Spotted by:	obrien
2004-11-11 19:02:01 +00:00
Andre Oppermann
5e7b233055 Fix a double-free in the 'hlen > m->m_len' sanity check.
Bug report by:	<james@towardex.com>
MFC after:	2 weeks
2004-11-09 09:40:32 +00:00
SUZUKI Shinsuke
3d54848fc2 support TCP-MD5(IPv4) in KAME-IPSEC, too.
MFC after: 3 week
2004-11-08 18:49:51 +00:00
Poul-Henning Kamp
756d52a195 Initialize struct pr_userreqs in new/sparse style and fill in common
default elements in net_init_domain().

This makes it possible to grep these structures and see any bogosities.
2004-11-08 14:44:54 +00:00
Robert Watson
d6915262af Do some re-sorting of TCP pcbinfo locking and assertions: make sure to
retain the pcbinfo lock until we're done using a pcb in the in-bound
path, as the pcbinfo lock acts as a pseuo-reference to prevent the pcb
from potentially being recycled.  Clean up assertions and make sure to
assert that the pcbinfo is locked at the head of code subsections where
it is needed.  Free the mbuf at the end of tcp_input after releasing
any held locks to reduce the time the locks are held.

MFC after:	3 weeks
2004-11-07 19:19:35 +00:00
Andre Oppermann
e9a4cd2426 Fix a double-free in the 'm->m_len < sizeof (struct ip)' sanity check.
Bug report by:	<james@towardex.com>
MFC after:	2 weeks
2004-11-06 10:47:36 +00:00
Poul-Henning Kamp
c83c1318f5 Hide udp_in6 behind #ifdef INET6 2004-11-04 07:14:03 +00:00
Bruce M Simpson
38f061057b When performing IP fast forwarding, immediately drop traffic which is
destined for a blackhole route.

This also means that blackhole routes do not need to be bound to lo(4)
or disc(4) interfaces for the net.inet.ip.fastforwarding=1 case.

Submitted by:	james at towardex dot com
Sponsored by:	eXtensible Open Router Project <URL:http://www.xorp.org/>
MFC after:	3 weeks
2004-11-04 02:14:38 +00:00
Robert Watson
d4b509bd7f Until this change, the UDP input code used global variables udp_in,
udp_in6, and udp_ip6 to pass socket address state between udp_input(),
udp_append(), and soappendaddr_locked().  While file in the default
configuration, when running with multiple netisrs or direct ithread
dispatch, this can result in races wherein user processes using
recvmsg() get back the wrong source IP/port.  To correct this and
related races:

- Eliminate udp_ip6, which is believed to be generated but then never
  used.  Eliminate ip_2_ip6_hdr() as it is now unneeded.

- Eliminate setting, testing, and existence of 'init' status fields
  for the IPv6 structures.  While with multiple UDP delivery this
  could lead to amortization of IPv4 -> IPv6 conversion when
  delivering an IPv4 UDP packet to an IPv6 socket, it added
  substantial complexity and side effects.

- Move global structures into the stack, declaring udp_in in
  udp_input(), and udp_in6 in udp_append() to be used if a conversion
  is required.  Pass &udp_in into udp_append().

- Re-annotate comments to reflect updates.

With this change, UDP appears to operate correctly in the presence of
substantial inbound processing parallelism.  This solution avoids
introducing additional synchronization, but does increase the
potential stack depth.

Discovered by:	kris (Bug Magnet)
MFC after:	3 weeks
2004-11-04 01:25:23 +00:00
Andre Oppermann
c94c54e4df Remove RFC1644 T/TCP support from the TCP side of the network stack.
A complete rationale and discussion is given in this message
and the resulting discussion:

 http://docs.freebsd.org/cgi/mid.cgi?4177C8AD.6060706

Note that this commit removes only the functional part of T/TCP
from the tcp_* related functions in the kernel.  Other features
introduced with RFC1644 are left intact (socket layer changes,
sendmsg(2) on connection oriented protocols)  and are meant to
be reused by a simpler and less intrusive reimplemention of the
previous T/TCP functionality.

Discussed on:	-arch
2004-11-02 22:22:22 +00:00
Robert Watson
ab5c14d828 Correct a bug in TCP SACK that could result in wedging of the TCP stack
under high load: only set function state to loop and continuing sending
if there is no data left to send.

RELENG_5_3 candidate.

Feet provided:	Peter Losher <Peter underscore Losher at isc dot org>
Diagnosed by:	Aniel Hartmeier <daniel at benzedrine dot cx>
Submitted by:	mohan <mohans at yahoo-inc dot com>
2004-10-30 12:02:50 +00:00
Robert Watson
c427483381 Add a matching tunable for net.inet.tcp.sack.enable sysctl. 2004-10-26 08:59:09 +00:00
Bruce M Simpson
d6fa5d2806 Check that rt_mask(rt) is non-NULL before dereferencing it, in the
RTM_ADD case, thus avoiding a panic.

Submitted by:	Iasen Kostov
2004-10-26 03:31:58 +00:00
Andre Oppermann
84bb6a2e75 IPDIVERT is a module now and tell the other parts of the kernel about it.
IPDIVERT depends on IPFIREWALL being loaded or compiled into the kernel.
2004-10-25 20:02:34 +00:00
Ruslan Ermilov
a35d88931c For variables that are only checked with defined(), don't provide
any fake value.
2004-10-24 15:33:08 +00:00
Andre Oppermann
cd109b0d82 Shave 40 unused bytes from struct tcpcb. 2004-10-22 19:55:04 +00:00
Andre Oppermann
21dcc96f4a When printing the initialization string and IPDIVERT is not compiled into the
kernel refer to it as "loadable" instead of "disabled".
2004-10-22 19:18:06 +00:00
Andre Oppermann
24fc79b0a4 Refuse to unload the ipdivert module unless the 'force' flag is given to kldunload.
Reflect the fact that IPDIVERT is a loadable module in the divert(4) and ipfw(8)
man pages.
2004-10-22 19:12:01 +00:00
Andre Oppermann
57bbe2e1ab Destroy the UMA zone on unload. 2004-10-19 22:51:20 +00:00
Andre Oppermann
2de1a9eb6e Slightly extend the locking during unload to fully cover the protocol
deregistration.  This does not entirely close the race but narrows the
even previously extremely small chance of a race some more.
2004-10-19 22:08:13 +00:00
Robert Watson
279128e295 Annotate a newly introduced race present due to the unloading of
protocols: it is possible for sockets to be created and attached
to the divert protocol between the test for sockets present and
successful unload of the registration handler.  We will need to
explore more mature APIs for unregistering the protocol and then
draining consumers, or an atomic test-and-unregister mechanism.
2004-10-19 21:35:42 +00:00
Andre Oppermann
72584fd2c0 Convert IPDIVERT into a loadable module. This makes use of the dynamic loadability
of protocols.  The call to divert_packet() is done through a function pointer.  All
semantics of IPDIVERT remain intact.  If IPDIVERT is not loaded ipfw will refuse to
install divert rules and  natd will complain about 'protocol not supported'.  Once
it is loaded both will work and accept rules and open the divert socket.  The module
can only be unloaded if no divert sockets are open.  It does not close any divert
sockets when an unload is requested but will return EBUSY instead.
2004-10-19 21:14:57 +00:00
Andre Oppermann
969bb53e80 Properly declare the "net.inet" sysctl subtree. 2004-10-19 21:06:14 +00:00
Andre Oppermann
539be79a9d Pre-emptively define IPPROTO_SPACER to 32767, the same value as PROTO_SPACER
to document that this value is globally assigned for a special purpose and
may not be reused within the IPPROTO number space.
2004-10-19 20:59:01 +00:00
Andre Oppermann
dff3237ee5 Make use of the PROTO_SPACER functionality for dynamically loadable
protocols in inetsw[] and define initially eight spacer slots.

Remove conflicting declaration 'struct pr_usrreqs nousrreqs'.  It is
now declared and initialized in kern/uipc_domain.c.
2004-10-19 15:58:22 +00:00
Andre Oppermann
de38924dc0 Support for dynamically loadable and unloadable IP protocols in the ipmux.
With pr_proto_register() it has become possible to dynamically load protocols
within the PF_INET domain.  However the PF_INET domain has a second important
structure called ip_protox[] that is derived from the 'struct protosw inetsw[]'
and takes care of the de-multiplexing of the various protocols that ride on
top of IP packets.

The functions ipproto_[un]register() allow to dynamically adjust the ip_protox[]
array mux in a consistent and easy way.  To register a protocol within
ip_protox[] the existence of a corresponding and matching protocol definition
in inetsw[] is required.  The function does not allow to overwrite an already
registered protocol.  The unregister function simply replaces the mux slot with
the default index pointer to IPPROTO_RAW as it was previously.
2004-10-19 15:45:57 +00:00
Andre Oppermann
1cf15713ed Add a macro for the destruction of INP_INFO_LOCK's used by loadable modules. 2004-10-19 14:34:13 +00:00
Andre Oppermann
de1c2ac4bf Make comments more clear. Change the order of one if() statement to check the
more likely variable first.
2004-10-19 14:31:56 +00:00
Robert Watson
81158452be Push acquisition of the accept mutex out of sofree() into the caller
(sorele()/sotryfree()):

- This permits the caller to acquire the accept mutex before the socket
  mutex, avoiding sofree() having to drop the socket mutex and re-order,
  which could lead to races permitting more than one thread to enter
  sofree() after a socket is ready to be free'd.

- This also covers clearing of the so_pcb weak socket reference from
  the protocol to the socket, preventing races in clearing and
  evaluation of the reference such that sofree() might be called more
  than once on the same socket.

This appears to close a race I was able to easily trigger by repeatedly
opening and resetting TCP connections to a host, in which the
tcp_close() code called as a result of the RST raced with the close()
of the accepted socket in the user process resulting in simultaneous
attempts to de-allocate the same socket.  The new locking increases
the overhead for operations that may potentially free the socket, so we
will want to revise the synchronization strategy here as we normalize
the reference counting model for sockets.  The use of the accept mutex
in freeing of sockets that are not listen sockets is primarily
motivated by the potential need to remove the socket from the
incomplete connection queue on its parent (listen) socket, so cleaning
up the reference model here may allow us to substantially weaken the
synchronization requirements.

RELENG_5_3 candidate.

MFC after:	3 days
Reviewed by:	dwhite
Discussed with:	gnn, dwhite, green
Reported by:	Marc UBM Bocklet <ubm at u-boot-man dot de>
Reported by:	Vlad <marchenko at gmail dot com>
2004-10-18 22:19:43 +00:00
Robert Watson
6b8e5a9862 Don't release the udbinfo lock until after the last use of UDP inpcb
in udp_input(), since the udbinfo lock is used to prevent removal of
the inpcb while in use (i.e., as a form of reference count) in the
in-bound path.

RELENG_5 candidate.
2004-10-12 20:03:56 +00:00
Robert Watson
00fcf9d12d Modify the thrilling "%D is using my IP address %s!" message so that
it isn't printed if the IP address in question is '0.0.0.0', which is
used by nodes performing DHCP lookup, and so constitute a false
positive as a report of misconfiguration.
2004-10-12 17:10:40 +00:00