Commit Graph

4098 Commits

Author SHA1 Message Date
Robert Watson
fa046d8774 Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
  inpcb counter.  This lock is now relegated to a small number of
  allocation and free operations, and occasional operations that walk
  all connections (including, awkwardly, certain UDP multicast receive
  operations -- something to revisit).

- A new ipi_hash_lock protects the two inpcbinfo hash tables for
  looking up connections and bound sockets, manipulated using new
  INP_HASH_*() macros.  This lock, combined with inpcb locks, protects
  the 4-tuple address space.

Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required.  As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.

A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb.  Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed.  In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup.  New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:

  INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
  INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb

Callers must pass exactly one of these flags (for the time being).

Some notes:

- All protocols are updated to work within the new regime; especially,
  TCP, UDPv4, and UDPv6.  pcbinfo ipi_lock acquisitions are largely
  eliminated, and global hash lock hold times are dramatically reduced
  compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
  may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
  is no longer available -- hash lookup locks are now held only very
  briefly during inpcb lookup, rather than for potentially extended
  periods.  However, the pcbinfo ipi_lock will still be acquired if a
  connection state might change such that a connection is added or
  removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
  due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
  callers to acquire hash locks and perform one or more lookups atomically
  with 4-tuple allocation: this is required only for TCPv6, as there is no
  in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
  locking, which relates to source address selection.  This needs
  attention, as it likely significantly reduces parallelism in this code
  for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
  somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which
  is no longer sufficient.  A second check once the inpcb lock is held
  should do the trick, keeping the general case from requiring the inpcb
  lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
  which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
  undesirable, and probably another argument is required to take care of
  this (or a char array name field in the pcbinfo?).

This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics.  It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.

Reviewed by:    bz
Sponsored by:   Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
Andrey V. Elsukov
d832ded1a1 Wrap long line.
MFC after:	2 weeks
2011-05-30 05:53:00 +00:00
Andrey V. Elsukov
41b6083752 Add tablearg support for ipfw setfib.
PR:		kern/156410
MFC after:	2 weeks
2011-05-30 05:37:26 +00:00
Michael Tuexen
14cfa970bf Get rid of unused functions.
MFC after: 1 week.
2011-05-29 18:41:06 +00:00
Qing Li
92322284cd Supply the LLE_STATIC flag bit to in_ifscurb() when scrubbing interface
address so that proper clean up will take place in the routing code.
This patch fixes the bootp panic on startup problem. Also, added more
error handling and logging code in function in_scrubprefix().

MFC after:	5 days
2011-05-29 02:21:35 +00:00
Bjoern A. Zeeb
8d5a3ca77b Add FEATURE() definitions for IPv4 and IPv6 so that we can use
feature_present(3) to dynamically decide whether to use one or the
other family.

Reviewed by:	gnn
Sponsored by:	The FreeBSD Foundation
Sponsored by:	iXsystems
MFC after:	10 days
2011-05-25 00:34:25 +00:00
Robert Watson
61401ec2de An inpcb lock is no longer required in in_pcbref() since the move to
refcount(9).

MFC after:      3 weeks
Sponsored by:   Juniper Networks, Inc.
2011-05-24 13:08:59 +00:00
Robert Watson
79bdc6e5d3 Continue to refine inpcb reference counting and locking, in preparation for
reworking of inpcbinfo locking:

(1) Convert inpcb reference counting from manually manipulated integers to
    the refcount(9) KPI.  This allows the refcount to be managed atomically
    with an inpcb read lock rather than write lock, or even with no inpcb
    lock at all.  As a result, in_pcbref() also no longer requires an inpcb
    lock, so can be performed solely using the lock used to look up an
    inpcb.

(2) Shift more inpcb freeing activity from the in_pcbrele() context (via
    in_pcbfree_internal) to the explicit in_pcbfree() context.  This means
    that the inpcb refcount is increasingly used only to maintain memory
    stability, not actually defer the clean up of inpcb protocol parts.
    This is desirable as many of those protocol parts required the pcbinfo
    lock, which we'd like not to acquire in in_pcbrele() contexts.  Document
    this in comments better.

(3) Introduce new read-locked and write-locked in_pcbrele() variations,
    in_pcbrele_rlocked() and in_pcbrele_wlocked(), which allow the inpcb to
    be properly unlocked as needed.  in_pcbrele() is a wrapper around the
    latter, and should probably go away at some point.  This makes it
    easier to use this weak reference model when holding only a read lock,
    as will happen in the future.

This may well be safe to MFC, but some more KBI analysis is required.

Reviewed by:    bz
MFC after:      3 weeks
Sponsored by:   Juniper Networks, Inc.
2011-05-23 19:32:02 +00:00
Robert Watson
68e0d7e06a Move from passing a wildcard boolean to a general set up lookup flags into
in_pcb_lport(), in_pcblookup_local(), and in_pcblookup_hash(), and similarly
for IPv6 functions.  In the future, we would like to support other flags
relating to locking strategy.

This change doesn't appear to modify the KBI in practice, as callers already
passed in INPLOOKUP_WILDCARD rather than a simple boolean.

MFC after:      3 weeks
Reviewed by:    bz
Sponsored by:   Juniper Networks, Inc.
2011-05-23 15:23:18 +00:00
Robert Watson
82a5be494a A number of quite incremental refinements to struct inpcbinfo's definition:
(1) Add a locking guide for inpcbinfo.
(2) Annotate inpcbinfo fields with synchronisation information; not all
    annotations are 100% satisfactory.
(3) Reorder inpcbinfo fields so that the lock is at the head of the
    structure, and close to fields it protects.
(4) Sort fields that will eventually be hashlock/pcbgroup-related together
    even though they remain locked by ipi_lock for now.

Reviewed by:	bz
Sponsored by:	Juniper Networks
X-MFC after:	KBI analysis required
2011-05-23 13:51:57 +00:00
Qing Li
5b84dc789a The statically configured (permanent) ARP entries are removed when an
interface is brought down, even though the interface address is still
valid. This patch maintains the permanent ARP entries as long as the
interface address (having the same prefix as that of the ARP entries)
is valid.

Reviewed by:	delphij
MFC after:	5 days
2011-05-20 19:12:20 +00:00
Michael Tuexen
b7e08865e8 Unbreak INET-less build.
Reported by bz@
MFC after: 1 week
2011-05-18 19:49:39 +00:00
Michael Tuexen
4f36da915f Copy out the mtu when calling getsockopt() with SCTP_GET_PEER_ADDR_INFO.
MFC after: 1 week.
2011-05-17 15:57:31 +00:00
Michael Tuexen
c954cac48b Fix whitespacing.
Reported by scf@

MFC after: 1 week.
2011-05-17 15:46:28 +00:00
Michael Tuexen
96f4bcfff2 Fix the source address selection for boundall sockets
when sending INITs to a global IPv4 address having
only private IPv4 address.
Allow the usage of a private address and make sure
that no other private address will be used by the
association.
Initial work was done by rrs@.

MFC after: 1 week.
2011-05-14 18:22:14 +00:00
John Baldwin
5891ebd6cd Oops, fix order of sequence numbers in KASSERT()'s to catch negative
receive windows to match the labels in the panic message.

Submitted by:	trociny
2011-05-14 14:41:40 +00:00
Alexander Motin
bc7d18ae72 Refactor TCP ISN increment logic. Instead of firing callout at 100Hz to
keep constant ISN growth rate, do the same directly inside tcp_new_isn(),
taking into account how much time (ticks) passed since the last call.

On my test systems this decreases idle interrupt rate from 140Hz to 70Hz.
2011-05-09 07:37:47 +00:00
Michael Tuexen
689e6a5fa3 Fix a locking issue showing up on Mac OS X when subscribing to
authentication events. DTLS/SCTP renegotiations trigger the bug.

MFC after: 2 weeks.
2011-05-08 09:11:59 +00:00
Michael Tuexen
936fc35bb3 Change the name of an internal structure, since the name
is used by a structure of the (new) SCTP API.

MFC after: 1 week.
2011-05-06 20:40:33 +00:00
Andrey V. Elsukov
318b735cc3 Convert delay parameter back to ms when reporting to user.
PR:		156838
MFC after:	1 week
2011-05-06 07:13:34 +00:00
Michael Tuexen
c3d72c80d3 Implement Resource Pooling V2 and an MPTCP like congestion
control.
Based on a patch received from Martin Becke.

MFC after: 2 weeks.
2011-05-04 21:27:05 +00:00
Michael Tuexen
274b0bd51d Remove code with any effect. 2011-05-03 20:34:02 +00:00
Michael Tuexen
1d663b4658 Add a missing break. This bug was introduced in r221249.
MFC after: 1 week
2011-05-03 20:32:21 +00:00
John Baldwin
f701e30d7f Handle a rare edge case with nearly full TCP receive buffers. If a TCP
buffer fills up causing the remote sender to enter into persist mode, but
there is still room available in the receive buffer when a window probe
arrives (either due to window scaling, or due to the local application
very slowing draining data from the receive buffer), then the single byte
of data in the window probe is accepted.  However, this can cause rcv_nxt
to be greater than rcv_adv.  This condition will only last until the next
ACK packet is pushed out via tcp_output(), and since the previous ACK
advertised a zero window, the ACK should be pushed out while the TCP
pcb is write-locked.

During the window while rcv_nxt is greather than rcv_adv, a few places
would compute the remaining receive window via rcv_adv - rcv_nxt.
However, this value was then (uint32_t)-1.  On a 64 bit machine this
could expand to a positive 2^32 - 1 when cast to a long.  In particular,
when calculating the receive window in tcp_output(), the result would be
that the receive window was computed as 2^32 - 1 resulting in advertising
a far larger window to the remote peer than actually existed.

Fix various places that compute the remaining receive window to either
assert that it is not negative (i.e. rcv_nxt <= rcv_adv), or treat the
window as full if rcv_nxt is greather than rcv_adv.

Reviewed by:	bz
MFC after:	1 month
2011-05-02 21:05:52 +00:00
Michael Tuexen
ea5eba1157 Some more cleanups related to an kernel without INET.
MFC after: 1 week
2011-05-02 15:53:00 +00:00
Bjoern A. Zeeb
29bd2010d4 Fix a mismerge from p4 in that in_localaddr() is not available without INET.
Sponsored by:	The FreeBSD Foundation
Sponsored by:	iXsystems
MFC after:	4 days
2011-04-30 16:30:18 +00:00
Michael Tuexen
d085528d04 Remove some leftover debug code.
MFC after: 1 week
2011-04-30 11:22:30 +00:00
Bjoern A. Zeeb
b287c6c70c Make the TCP code compile without INET. Sort #includes and add #ifdef INETs.
Add some comments at #endifs given more nestedness.  To make the compiler
happy, some default initializations were added in accordance with the style
on the files.

Reviewed by:	gnn
Sponsored by:	The FreeBSD Foundation
Sponsored by:	iXsystems
MFC after:	4 days
2011-04-30 11:21:29 +00:00
Michael Tuexen
e6194c2ed4 Improve compilation of SCTP code without INET support.
Some bugs where fixed while doing this:
* ASCONF-ACK messages might use wrong port number when using
  IPv6.
* Checking for additional addresses takes the correct address
  into account and also does not do more comparisons than
  necessary.

This patch is based on one received from bz@ who was
sponsored by The FreeBSD Foundation and iXsystems.

MFC after: 1 week
2011-04-30 11:18:16 +00:00
Bjoern A. Zeeb
79288c112c Make the UDP code compile without INET. Expose udp_usrreq.c to IPv6 only
as well compiling out most functions adding or extending #ifdef INET
coverage.

Reviewed by:	gnn
Sponsored by:	The FreeBSD Foundation
Sponsored by:	iXsystems
MFC after:	4 days
2011-04-30 11:17:00 +00:00
Bjoern A. Zeeb
67107f4594 Make the PCB code compile without INET support by adding #ifdef INETs
and correcting few #includes.

Reviewed by:	gnn
Sponsored by:	The FreeBSD Foundation
Sponsored by:	iXsystems
MFC after:	4 days
2011-04-30 11:04:34 +00:00
John Baldwin
672dc4aea2 TCP reuses t_rxtshift to determine the backoff timer used for both the
persist state and the retransmit timer.  However, the code that implements
"bad retransmit recovery" only checks t_rxtshift to see if an ACK has been
received in during the first retransmit timeout window.  As a result, if
ticks has wrapped over to a negative value and a socket is in the persist
state, it can incorrectly treat an ACK from the remote peer as a
"bad retransmit recovery" and restore saved values such as snd_ssthresh and
snd_cwnd.  However, if the socket has never had a retransmit timeout, then
these saved values will be zero, so snd_ssthresh and snd_cwnd will be set
to 0.

If the socket is in fast recovery (this can be caused by excessive
duplicate ACKs such as those fixed by 220794), then each ACK that arrives
triggers either NewReno or SACK partial ACK handling which clamps snd_cwnd
to be no larger than snd_ssthresh.  In effect, the socket's send window
is permamently stuck at 0 even though the remote peer is advertising a
much larger window and pending data is only sent via TCP window probes
(so one byte every few seconds).

Fix this by adding a new TCP pcb flag (TF_PREVVALID) that indicates that
the various snd_*_prev fields in the pcb are valid and only perform
"bad retransmit recovery" if this flag is set in the pcb.  The flag is set
on the first retransmit timeout that occurs and is cleared on subsequent
retransmit timeouts or when entering the persist state.

Reviewed by:	bz
MFC after:	2 weeks
2011-04-29 15:40:12 +00:00
Bjoern A. Zeeb
b8e463e644 MfP4 CH=192029:
Expose ip_icmp.c to INET6 as well and only export badport_bandlim()
along with the two sysctls in the non-INET case.
The bandlim types work for all cases I reviewed in IPv6 as well and
the sysctls are available as we export net.inet.* from in_proto.c.

Reviewed by:	gnn
Sponsored by:	The FreeBSD Foundation
Sponsored by:	iXsystems
MFC after:	4 days
2011-04-27 19:36:35 +00:00
Bjoern A. Zeeb
74e9dcf786 MfP4 CH=192004:
Move ip_defttl to raw_ip.c where it is actually used.  In an IPv6
only world we do not want to compile ip_input.c in for that and
it is a shared default with INET6.

Reviewed by:	gnn
Sponsored by:	The FreeBSD Foundation
Sponsored by:	iXsystems
MFC after:	4 days
2011-04-27 19:32:27 +00:00
Bjoern A. Zeeb
a0ae8f04e8 Make various (pseudo) interfaces compile without INET in the kernel
adding appropriate #ifdefs.  For module builds the framework needs
adjustments for at least carp.

Reviewed by:	gnn
Sponsored by:	The FreeBSD Foundation
Sponsored by:	iXsystems
MFC after:	4 days
2011-04-27 19:30:44 +00:00
Attilio Rao
2903309aca Add the possibility to verify MD5 hash of incoming TCP packets.
As long as this is a costy function, even when compiled in (along with
the option TCP_SIGNATURE), it can be disabled via the
net.inet.tcp.signature_verify_input sysctl.

Sponsored by:	Sandvine Incorporated
Reviewed by:	emaste, bz
MFC after:	2 weeks
2011-04-25 17:13:40 +00:00
Bjoern A. Zeeb
acaeca65b3 Be less strict on includes than in r220746. We need in.h for both
INET or INET6 as it holds all the IPPROTO_* definitions needed
for the SYSCTL_NODE definitions.

Reviewed by:	gnn
Sponsored by:	The FreeBSD Foundation
Sponsored by:	iXsystems
MFC after:	5 days
2011-04-25 16:36:16 +00:00
Gleb Smirnoff
acdef0460e Use size_t for sopt_valsize.
Submitted by:	Brandon Gooch <jamesbrandongooch gmail.com>
2011-04-21 08:18:55 +00:00
Bjoern A. Zeeb
00c081e908 MFp4 CH=191760:
When compiling out INET we still need the initialization routines
as well as the tuning and montoring sysctls shared with IPv6.

Move the two send/recvspace variables up from the middle of the
file to ease compiling out the INET only code.

Reviewed by:	gnn
Sponsored by:	The FreeBSD Foundation
Sponsored by:	iXsystems
MFC after:	3 days
2011-04-20 08:03:22 +00:00
Bjoern A. Zeeb
aae49dd304 MFp4 CH=191470:
Move the ipport_tick_callout and related functions from ip_input.c
to in_pcb.c.  The random source port allocation code has been merged
and is now local to in_pcb.c only.
Use a SYSINIT to get the callout started and no longer depend on
initialization from the inet code, which would not work in an IPv6
only setup.

Reviewed by:	gnn
Sponsored by:	The FreeBSD Foundation
Sponsored by:	iXsystems
MFC after:	4 days
2011-04-20 08:00:29 +00:00
Bjoern A. Zeeb
ec4f97277f MFp4 CH=191466:
Move fw_one_pass to where it belongs: it is a property of ipfw,
not of ip_input.

Reviewed by:	gnn
Sponsored by:	The FreeBSD Foundation
Sponsored by:	iXsystems
MFC after:	3 days
2011-04-20 07:55:33 +00:00
Gleb Smirnoff
9d0a2ddf69 - Rewrite functions that copyin/out NAT configuration, so that they
calculate required memory size dynamically.
- Fix races on chain re-lock.
- Introduce new field to ip_fw_chain - generation count. Now utilized
  only in the NAT configuration, but can be utilized wider in ipfw.
- Get rid of NAT_BUF_LEN in ip_fw.h

PR:		kern/143653
2011-04-19 15:06:33 +00:00
Andrey V. Elsukov
e3665201f5 Add sysctl handlers for net.inet.ip.dummynet.hash_size, .pipe_byte_limit
and .pipe_slot_limit oids to prevent to set incorrect values.

MFC after:	2 weeks
2011-04-19 11:33:39 +00:00
Andrey V. Elsukov
8ad66025f6 ipdn_bound_var() functions is designed to bound a variable between
specified minimum and maximum. In case when specified default value
is out of bounds it does not work as expected and does not limit
variable. Check that default value is in range and limit it if needed.
Also bump max_hash_size value to 65536 to correspond with manual page.

PR:		kern/152887
MFC after:	2 weeks
2011-04-19 11:29:09 +00:00
Andrey V. Elsukov
3ab4af737d Use M_WAITOK instead M_WAIT for malloc. Remove unneded checks.
MFC after:	1 week
2011-04-19 05:59:37 +00:00
Gleb Smirnoff
ca47294ddf LibAliasInit() should allocate memory with M_WAITOK flag. Modify it
and its callers.
2011-04-18 20:07:08 +00:00
Gleb Smirnoff
d0e16e0d1e Pullup up to TCP header length before matching against 'tcpopts'.
PR:		kern/156180
Reviewed by:	luigi
2011-04-18 18:22:10 +00:00
John Baldwin
da84b2e6c5 When checking to see if a window update should be sent to the remote peer,
don't force a window update if the window would not actually grow due to
window scaling.  Specifically, if the window scaling factor is larger than
2 * MSS, then after the local reader has drained 2 * MSS bytes from the
socket, a window update can end up advertising the same window.  If this
happens, the supposed window update actually ends up being a duplicate ACK.
This can result in an excessive number of duplicate ACKs when using a
higher maximum socket buffer size.

Reviewed by:	bz
MFC after:	1 month
2011-04-18 17:43:16 +00:00
Bjoern A. Zeeb
336d023b2e Make in_proto.c dependent on either inet or inet6.
While it does not provide any functionality for IPv6, it provides
the sysctl nodes for net.inet.* that a lot of functionality shared
between IPv4 and IPv6 depends on.  We cannot change these anymore
without breaking a lot of management and tuning.

In case of IPv6 only, we compile out everything but the sysctl node
declarations.

Reviewed by:	gnn
Sponsored by:	The FreeBSD Foundation
Sponsored by:	iXsystems
MFC After:	5 days
2011-04-17 16:35:16 +00:00
Edward Tomasz Napierala
79bb84fb15 Refactor udp_input(), moving calls to u_tun_func() into udp_append().
Obtained from: Wheel Systems Sp. z o.o.
Reviewed by:	bz@
2011-04-14 10:40:57 +00:00