Commit Graph

4554 Commits

Author SHA1 Message Date
tuexen
f1eb961773 Move from early SSN assignment to late SSN assignment.
This doesn't change functionality, but makes upcoming change
much easier.
Developed with rrs@ at the IETF 85.

MFC after: 1 week
2012-11-05 20:55:17 +00:00
andre
b51779d7ea Back out r242262. The simplified window change/update logic wasn't
complete and ready for production use.

PR:	kern/173309
2012-11-05 09:13:06 +00:00
ae
4354018055 Remove the recently added sysctl variable net.pfil.forward.
Instead, add protocol specific mbuf flags M_IP_NEXTHOP and
M_IP6_NEXTHOP. Use them to indicate that the mbuf's chain
contains the PACKET_TAG_IPFORWARD tag. And do a tag lookup
only when this flag is set.

Suggested by:	andre
2012-11-02 01:20:55 +00:00
tuexen
139b791e20 Whitespace changes due to upstream integration of SCTP changes in the
FreeBSD code base.
2012-10-29 20:47:32 +00:00
tuexen
bd5ecc606d Add braces (as used elsewhere in the SCTP code). 2012-10-29 20:44:29 +00:00
tuexen
02bbac6d05 Use ntohs() and htons() in correct order. However, this doesn't change
functionality.
2012-10-29 20:42:48 +00:00
andre
844d4d2472 Forced commit to provide the correct commit message to r242251:
Defer sending an independent window update if a delayed ACK is pending
  saving a packet.  The window update then gets piggy-backed on the next
  already scheduled ACK.

Added grammar fixes as well.

MFC after:	2 weeks
2012-10-29 13:16:33 +00:00
andre
abf1521166 Define the delayed ACK timeout value directly as hz/10 instead of
obfuscating it by going through PR_FASTHZ.  No functional change.

MFC after:	2 weeks
2012-10-29 12:17:02 +00:00
andre
07dc51f3cc If the user has closed the socket then drop a persisting connection
after a much reduced timeout.

Typically web servers close their sockets quickly under the assumption
that the TCP connections goes away as well.  That is not entirely true
however.  If the peer closed the window we're going to wait for a long
time with lots of data in the send buffer.

MFC after:	2 weeks
2012-10-28 19:58:20 +00:00
andre
b824892b57 Increase the initial CWND to 10 segments as defined in IETF TCPM
draft-ietf-tcpm-initcwnd-05. It explains why the increased initial
window improves the overall performance of many web services without
risking congestion collapse.

As long as it remains a draft it is placed under a sysctl marking it
as experimental:
 net.inet.tcp.experimental.initcwnd10 = 1
When it becomes an official RFC soon the sysctl will be changed to
the RFC number and moved to net.inet.tcp.

This implementation differs from the RFC draft in that it is a bit
more conservative in the case of packet loss on SYN or SYN|ACK because
we haven't reduced the default RTO to 1 second yet.  Also the restart
window isn't yet increased as allowed.  Both will be adjusted with
upcoming changes.

Is is enabled by default.  In Linux it is enabled since kernel 3.0.

MFC after:	2 weeks
2012-10-28 19:47:46 +00:00
andre
36473a548b Update comment to reflect the change made in r242263.
MFC after:	2 weeks
2012-10-28 19:22:18 +00:00
andre
ab8a697d0a Add SACK_PERMIT to the list of TCP options that are switched off after
retransmitting a SYN three times.

MFC after:	2 weeks
2012-10-28 19:20:23 +00:00
andre
b21f6ebbaa Simplify and enhance the window change/update acceptance logic,
especially in the presence of bi-directional data transfers.

snd_wl1 tracks the right edge, including data in the reassembly
queue, of valid incoming data.  This makes it like rcv_nxt plus
reassembly.  It never goes backwards to prevent older, possibly
reordered segments from updating the window.

snd_wl2 tracks the left edge of sent data.  This makes it a duplicate
of snd_una.  However joining them right now is difficult due to
separate update dependencies in different places in the code flow.

snd_wnd tracks the current advertized send window by the peer.  In
tcp_output() the effective window is calculated by subtracting the
already in-flight data, snd_nxt less snd_una, from it.

ACK's become the main clock of window updates and will always update
the window when the left edge of what we sent is advanced.  The ACK
clock is the primary signaling mechanism in ongoing data transfers.
This works reliably even in the presence of reordering, reassembly
and retransmitted segments.  The ACK clock is most important because
it determines how much data we are allowed to inject into the network.

Zero window updates get us out of persistence mode are crucial.  Here
a segment that neither moves ACK nor SEQ but enlarges WND is accepted.

When the ACK clock is not active (that is we're not or no longer
sending any data) any segment that moves the extended right SEQ edge,
including out-of-order segments, updates the window.  This gives us
updates especially during ping-pong transfers where the peer isn't
done consuming the already acknowledged data from the receive buffer
while responding with data.

The SSH protocol is a prime candidate to benefit from the improved
bi-directional window update logic as it has its own windowing
mechanism on top of TCP and is frequently sending back protocol ACK's.

Tcpdump provided by:	darrenr
Tested by:	darrenr
MFC after:	2 weeks
2012-10-28 19:16:22 +00:00
andre
ee161fee4d For retransmits of SYN|ACK from the syncache use the slightly more
aggressive special tcp_syn_backoff[] retransmit schedule instead of
the normal tcp_backoff[] schedule for established connections.

MFC after:	2 weeks
2012-10-28 19:02:07 +00:00
andre
891f33973f When retransmitting SYN in TCPS_SYN_SENT state use TCPTV_RTOBASE,
the default retransmit timeout, as base to calculate the backoff
time until next try instead of the TCP_REXMTVAL() macro which only
works correctly when we already have measured an actual RTT+RTTVAR.

Before it would cause the first retransmit at RTOBASE, the next
four at the same time (!) about 200ms later, and then another one
again RTOBASE later.

MFC after:	2 weeks
2012-10-28 18:56:57 +00:00
andre
06a013a7a6 Remove bogus 'else' in #ifdef that prevented the rttvar from being reset
tcp_timer_rexmt() on retransmit for IPv6 sessions.

MFC after:	2 weeks
2012-10-28 18:45:04 +00:00
andre
ff213d7494 Allow arbitrary MSS sizes and don't mind about the cluster size anymore.
We've got more cluster sizes for quite some time now and the orginally
imposed limits and the previously codified thoughts on efficiency gains
are no longer true.

MFC after:	2 weeks
2012-10-28 18:33:52 +00:00
andre
2d42646150 Change the syncache count reporting the current number of entries
from an unprotected u_int that reports garbage on SMP to a function
based sysctl obtaining the current value from UMA.

Also read back the actual cache_limit after page size rounding by UMA.

PR:		kern/165879
MFC after:	2 weeks
2012-10-28 18:07:34 +00:00
andre
df63a1d6ea Simplify implementation of net.inet.tcp.reass.maxsegments and
net.inet.tcp.reass.cursegments.

MFC after:	2 weeks
2012-10-28 17:59:46 +00:00
andre
a04f01c8df Prevent a flurry of forced window updates when an application is
doing small reads on a (partially) filled receive socket buffer.

Normally one would a send a window update every time the available
space in the socket buffer increases by two times MSS.  This leads
to a flurry of window updates that do not provide any meaningful
new information to the sender.  There still is available space in
the window and the sender can continue sending data.  All window
updates then get carried by the regular ACKs.  Only when the socket
buffer was (almost) full and the window closed accordingly a window
updates delivery new information and allows the sender to start
sending more data again.

Send window updates only every two MSS when the socket buffer
has less than 1/8 space available, or the available space in the
socket buffer increased by 1/4 its full capacity, or the socket
buffer is very small.  The next regular data ACK will carry and
report the exact window size again.

Reported by:	sbruno
Tested by:	darrenr
Tested by:	Darren Baginski
PR:		kern/116335
MFC after:	2 weeks
2012-10-28 17:40:35 +00:00
andre
afe4bf4cff When SYN or SYN/ACK had to be retransmitted RFC5681 requires us to
reduce the initial CWND to one segment.  This reduction got lost
some time ago due to a change in initialization ordering.

Additionally in tcp_timer_rexmt() avoid entering fast recovery when
we're still in TCPS_SYN_SENT state.

MFC after:	2 weeks
2012-10-28 17:30:28 +00:00
andre
79dbdb05fd When SYN or SYN/ACK had to be retransmitted RFC5681 requires us to
reduce the initial CWND to one segment.  This reduction got lost
some time ago due to a change in initialization ordering.

Additionally in tcp_timer_rexmt() avoid entering fast recovery when
we're still in TCPS_SYN_SENT state.

MFC after:	2 weeks
2012-10-28 17:25:08 +00:00
andre
5589a42386 Adjust the initial default CWND upon connection establishment to the
new and increased values specified by RFC5681 Section 3.1.

The even larger initial CWND per RFC3390, if enabled, is not affected.

MFC after:	2 weeks
2012-10-28 17:16:09 +00:00
glebius
f79061ff05 o Remove last argument to ip_fragment(), and obtain all needed information
on checksums directly from mbuf flags. This simplifies code.
o Clear CSUM_IP from the mbuf in ip_fragment() if we did checksums in
  hardware. Some driver may not announce CSUM_IP in theur if_hwassist,
  although try to do checksums if CSUM_IP set on mbuf. Example is em(4).
o While here, consistently use CSUM_IP instead of its alias CSUM_DELAY_IP.
  After this change CSUM_DELAY_IP vanishes from the stack.

Submitted by:	Sebastian Kuzminsky <seb lineratesystems.com>
2012-10-26 21:06:33 +00:00
ae
71112b5a8e Remove the IPFIREWALL_FORWARD kernel option and make possible to turn
on the related functionality in the runtime via the sysctl variable
net.pfil.forward. It is turned off by default.

Sponsored by:	Yandex LLC
Discussed with:	net@
MFC after:	2 weeks
2012-10-25 09:39:14 +00:00
glebius
3d11eb1465 After r241923 the updated ip_len no longer needed. 2012-10-25 09:02:21 +00:00
glebius
a5c4b7118d Fix error in r241913 that had broken fragment reassembly. 2012-10-25 09:00:57 +00:00
glebius
285432154c Use ip_stripoptions() instead of handrolled version. 2012-10-23 10:30:09 +00:00
glebius
e4588fbb85 Simplify ip_stripoptions() reducing number of intermediate
variables.
2012-10-23 10:29:31 +00:00
glebius
fea857f2a8 Do not reduce ip_len by size of IP header in the ip_input()
before passing a packet to protocol input routines.
  For several protocols this mean that now protocol needs to
do subtraction itself, and for another half this means that
we do not need to add header length back to the packet.

  Make ip_stripoptions() to adjust ip_len, since now we enter
this function with a packet header whose ip_len does represent
length of entire packet, not payload only.
2012-10-23 08:33:13 +00:00
delphij
3948ce713c Remove __P.
Submitted by:	kevlo
Reviewed by:	md5(1)
MFC after:	2 months
2012-10-22 21:49:56 +00:00
glebius
5cc3ac5902 Switch the entire IPv4 stack to keep the IP packet header
in network byte order. Any host byte order processing is
done in local variables and host byte order values are
never[1] written to a packet.

  After this change a packet processed by the stack isn't
modified at all[2] except for TTL.

  After this change a network stack hacker doesn't need to
scratch his head trying to figure out what is the byte order
at the given place in the stack.

[1] One exception still remains. The raw sockets convert host
byte order before pass a packet to an application. Probably
this would remain for ages for compatibility.

[2] The ip_input() still subtructs header len from ip->ip_len,
but this is planned to be fixed soon.

Reviewed by:	luigi, Maxim Dounin <mdounin mdounin.ru>
Tested by:	ray, Olivier Cochard-Labbe <olivier cochard.me>
2012-10-22 21:09:03 +00:00
zont
5d9ce2d3e8 - Update cachelimit after hashsize and bucketlimit were set.
Reported by:	az
Reviewed by:	melifaro
Approved by:	kib (mentor)
MFC after:	1 week
2012-10-19 14:00:03 +00:00
andre
34a9a386cb Mechanically remove the last stray remains of spl* calls from net*/*.
They have been Noop's for a long time now.
2012-10-18 13:57:24 +00:00
emaste
da1e109451 Avoid potential bad pointer dereference.
Previously RuleAdd would leave entry->la unset for the first entry in
the proxyList.

Sponsored by: ADARA Networks
MFC After: 1 week
2012-10-17 20:23:07 +00:00
glebius
eecb11a14e We don't need to convert ip6_len to host byte order before
ip6_output(), the IPv6 stack is working in net byte order.

The reason this code worked before is that ip6_output()
doesn't look at ip6_plen at all and recalculates it based
on mbuf length.
2012-10-15 07:57:55 +00:00
glebius
b6c0be02b6 Fix a miss from r241344: in ip_mloopback() we need to go to
net byte order prior to calling in_delayed_cksum().

Reported by:	 Olivier Cochard-Labbe <olivier cochard.me>
2012-10-14 15:08:07 +00:00
melifaro
85ee5d74ce Cleanup documentation: cloning route support has been removed in r186119.
MFC after:	2 weeks
2012-10-13 09:31:01 +00:00
glebius
1e75ca6470 Revert fixup of ip_len from r241480. Now stack isn't yet
ready for that change.
2012-10-12 09:32:38 +00:00
glebius
9879a454af In ip_stripoptions():
- Remove unused argument and incorrect comment.
  - Fixup ip_len after stripping.
2012-10-12 09:24:24 +00:00
melifaro
02e40e1b73 Do not check if found IPv4 rte is dynamic if net.inet.icmp.drop_redirect is
enabled. This eliminates one mtx_lock() per each routing lookup thus improving
performance in several cases (routing to directly connected interface or routing
to default gateway).

Icmp redirects should not be used to provide routing direction nowadays, even
for end hosts. Routers should not use them too (and this is explicitly restricted
in IPv6, see RFC 4861, clause 8.2).

Current commit changes rnh_machaddr function to 'stock' rn_match (and back) for every
AF_INET routing table in given VNET instance on drop_redirect sysctl change.

This change is part of bigger patch eliminating rte locking.

Sponsored by:	Yandex LLC
MFC after:	2 weeks
2012-10-10 19:06:11 +00:00
kevlo
ceb08698f2 Revert previous commit...
Pointyhat to:	kevlo (myself)
2012-10-10 08:36:38 +00:00
kevlo
8747a46991 Prefer NULL over 0 for pointers 2012-10-09 08:27:40 +00:00
glebius
9086143e8c After r241245 it appeared that in_delayed_cksum(), which still expects
host byte order, was sometimes called with net byte order. Since we are
moving towards net byte order throughout the stack, the function was
converted to expect net byte order, and its consumers fixed appropriately:
  - ip_output(), ipfilter(4) not changed, since already call
    in_delayed_cksum() with header in net byte order.
  - divert(4), ng_nat(4), ipfw_nat(4) now don't need to swap byte order
    there and back.
  - mrouting code and IPv6 ipsec now need to switch byte order there and
    back, but I hope, this is temporary solution.
  - In ipsec(4) shifted switch to net byte order prior to in_delayed_cksum().
  - pf_route() catches up on r241245 changes to ip_output().
2012-10-08 08:03:58 +00:00
glebius
ef52d4e591 No reason to play with IP header before calling sctp_delayed_cksum()
with offset beyond the IP header.
2012-10-08 07:21:32 +00:00
glebius
f3a0231bff A step in resolving mess with byte ordering for AF_INET. After this change:
- All packets in NETISR_IP queue are in net byte order.
  - ip_input() is entered in net byte order and converts packet
    to host byte order right _after_ processing pfil(9) hooks.
  - ip_output() is entered in host byte order and converts packet
    to net byte order right _before_ processing pfil(9) hooks.
  - ip_fragment() accepts and emits packet in net byte order.
  - ip_forward(), ip_mloopback() use host byte order (untouched actually).
  - ip_fastforward() no longer modifies packet at all (except ip_ttl).
  - Swapping of byte order there and back removed from the following modules:
    pf(4), ipfw(4), enc(4), if_bridge(4).
  - Swapping of byte order added to ipfilter(4), based on __FreeBSD_version
  - __FreeBSD_version bumped.
  - pfil(9) manual page updated.

Reviewed by:	ray, luigi, eri, melifaro
Tested by:	glebius (LE), ray (BE)
2012-10-06 10:02:11 +00:00
glebius
a73c365c3b There is a complex race in in_pcblookup_hash() and in_pcblookup_group().
Both functions need to obtain lock on the found PCB, and they can't do
classic inter-lock with the PCB hash lock, due to lock order reversal.
To keep the PCB stable, these functions put a reference on it and after PCB
lock is acquired drop it. If the reference was the last one, this means
we've raced with in_pcbfree() and the PCB is no longer valid.

  This approach works okay only if we are acquiring writer-lock on the PCB.
In case of reader-lock, the following scenario can happen:

  - 2 threads locate pcb, and do in_pcbref() on it.
  - These 2 threads drop the inp hash lock.
  - Another thread comes to delete pcb via in_pcbfree(), it obtains hash lock,
    does in_pcbremlists(), drops hash lock, and runs in_pcbrele_wlocked(), which
    doesn't free the pcb due to two references on it. Then it unlocks the pcb.
  - 2 aforementioned threads acquire reader lock on the pcb and run
    in_pcbrele_rlocked(). One gets 1 from in_pcbrele_rlocked() and continues,
    second gets 0 and considers pcb freed, returns.
  - The thread that got 1 continutes working with detached pcb, which later
    leads to panic in the underlying protocol level.

  To plumb that problem an additional INPCB flag introduced - INP_FREED. We
check for that flag in the in_pcbrele_rlocked() and if it is set, we pretend
that that was the last reference.

Discussed with:		rwatson, jhb
Reported by:		Vladimir Medvedkin <medved rambler-co.ru>
2012-10-02 12:03:02 +00:00
glebius
a16cfb3463 carp_send_ad() should never return without rescheduling next run. 2012-09-29 05:52:19 +00:00
glebius
b83730f01b Fix bug in TCP_KEEPCNT setting, which slipped in in the last round
of reviewing of r231025.

Unlike other options from this family TCP_KEEPCNT doesn't specify
time interval, but a count, thus parameter supplied doesn't need
to be multiplied by hz.

Reported & tested by:	amdmi3
2012-09-27 07:13:21 +00:00
tuexen
7e065d782a Whitespace change.
MFC after:	3 days
2012-09-23 07:43:10 +00:00
tuexen
0f22c3a78e Declare a static function as such.
MFC after:	3 days
2012-09-23 07:23:18 +00:00
tuexen
fa9edb685b Fix a bug related to handling Re-config chunks. It is not true that
the association can be removed if the socket is gone.

MFC after:	3 days
2012-09-22 22:04:17 +00:00
tuexen
392169f5f0 Small cleanups. No functional change.
MFC after:	10 days
2012-09-22 14:39:20 +00:00
kevlo
98ccaea0f9 Fix typo: s/pakcet/packet 2012-09-20 03:29:43 +00:00
eadler
752dbba611 s/teh/the/g
Approved by:	cperciva
MFC after:	3 days
2012-09-14 21:59:55 +00:00
tuexen
68d0c9b61d Small cleanups. No functional change.
MFC after:	10 days
2012-09-14 18:32:20 +00:00
glebius
0ccf4838d7 o Create directory sys/netpfil, where all packet filters should
reside, and move there ipfw(4) and pf(4).

o Move most modified parts of pf out of contrib.

Actual movements:

sys/contrib/pf/net/*.c		-> sys/netpfil/pf/
sys/contrib/pf/net/*.h		-> sys/net/
contrib/pf/pfctl/*.c		-> sbin/pfctl
contrib/pf/pfctl/*.h		-> sbin/pfctl
contrib/pf/pfctl/pfctl.8	-> sbin/pfctl
contrib/pf/pfctl/*.4		-> share/man/man4
contrib/pf/pfctl/*.5		-> share/man/man5

sys/netinet/ipfw		-> sys/netpfil/ipfw

The arguable movement is pf/net/*.h -> sys/net. There are
future plans to refactor pf includes, so I decided not to
break things twice.

Not modified bits of pf left in contrib: authpf, ftp-proxy,
tftp-proxy, pflogd.

The ipfw(4) movement is planned to be merged to stable/9,
to make head and stable match.

Discussed with:		bz, luigi
2012-09-14 11:51:49 +00:00
tuexen
3e31feb073 Whitespace changes.
MFC after: 10 days
2012-09-09 08:14:04 +00:00
tuexen
5f95805e1a Whitespace cleanup.
MFC after: 10 days
2012-09-08 20:54:54 +00:00
glebius
5190d38ee3 Merge the projects/pf/head branch, that was worked on for last six months,
into head. The most significant achievements in the new code:

 o Fine grained locking, thus much better performance.
 o Fixes to many problems in pf, that were specific to FreeBSD port.

New code doesn't have that many ifdefs and much less OpenBSDisms, thus
is more attractive to our developers.

  Those interested in details, can browse through SVN log of the
projects/pf/head branch. And for reference, here is exact list of
revisions merged:

r232043, r232044, r232062, r232148, r232149, r232150, r232298, r232330,
r232332, r232340, r232386, r232390, r232391, r232605, r232655, r232656,
r232661, r232662, r232663, r232664, r232673, r232691, r233309, r233782,
r233829, r233830, r233834, r233835, r233836, r233865, r233866, r233868,
r233873, r234056, r234096, r234100, r234108, r234175, r234187, r234223,
r234271, r234272, r234282, r234307, r234309, r234382, r234384, r234456,
r234486, r234606, r234640, r234641, r234642, r234644, r234651, r235505,
r235506, r235535, r235605, r235606, r235826, r235991, r235993, r236168,
r236173, r236179, r236180, r236181, r236186, r236223, r236227, r236230,
r236252, r236254, r236298, r236299, r236300, r236301, r236397, r236398,
r236399, r236499, r236512, r236513, r236525, r236526, r236545, r236548,
r236553, r236554, r236556, r236557, r236561, r236570, r236630, r236672,
r236673, r236679, r236706, r236710, r236718, r237154, r237155, r237169,
r237314, r237363, r237364, r237368, r237369, r237376, r237440, r237442,
r237751, r237783, r237784, r237785, r237788, r237791, r238421, r238522,
r238523, r238524, r238525, r239173, r239186, r239644, r239652, r239661,
r239773, r240125, r240130, r240131, r240136, r240186, r240196, r240212.

I'd like to thank people who participated in early testing:

Tested by:	Florian Smeets <flo freebsd.org>
Tested by:	Chekaluk Vitaly <artemrts ukr.net>
Tested by:	Ben Wilber <ben desync.com>
Tested by:	Ian FREISLICH <ianf cloudseed.co.za>
2012-09-08 06:41:54 +00:00
tuexen
001c4aa6c4 Don't include a structure containing a flexible array in another
structure.

MFC after:	10 days
2012-09-07 13:36:42 +00:00
tuexen
53b4991234 Get rid of a gcc'ism.
MFC after: 10 days
2012-09-06 07:03:56 +00:00
tuexen
1f0bc9debb Using %p in a format string requires a void *.
MFC after: 10 days
2012-09-05 18:52:01 +00:00
tuexen
924bc7cbef Use the consistenly the size of a variable. This helps to keep the code
simpler for the userland implementation.

MFC after: 3 days
2012-09-04 22:45:00 +00:00
tuexen
a58740bf0b Whitespace change.
MFC after: 3 days
2012-09-04 22:40:49 +00:00
melifaro
1fbae66b6e Introduce new link-layer PFIL hook V_link_pfil_hook.
Merge ether_ipfw_chk() and part of bridge_pfil() into
unified ipfw_check_frame() function called by PFIL.
This change was suggested by rwatson? @ DevSummit.

Remove ipfw headers from ether/bridge code since they are unneeded now.

Note this thange introduce some (temporary) performance penalty since
PFIL read lock has to be acquired for every link-level packet.

MFC after:     3 weeks
2012-09-04 19:43:26 +00:00
glebius
9b72c7eaa7 Provide a sysctl switch that allows to install ARP entries
with multicast bit set. FreeBSD refuses to install such
entries since 9.0, and this broke installations running
Microsoft NLB, which are violating standards.

Tested by:	Tarasov Oleg <oleg_tarasov sg-tea.com>
2012-09-03 14:29:28 +00:00
tuexen
03552c901b Fix a typo which results in RTT to be off by a factor of 10, if the RTT is
larger than 1 second.

MFC after:	3 days
2012-09-02 12:37:30 +00:00
eadler
bb5f6cf89c Mark the ipfw interface type as not being ether. This fixes an issue
where uuidgen tried to obtain a ipfw device's mac address which was
    always zero.

    PR:		170460
    Submitted by:	wxs
    Reviewed by:	bdrewery
    Reviewed by:	delphij
    Approved by:	cperciva
    MFC after:	1 week
2012-09-01 23:33:49 +00:00
rrs
4952c1e53e This small change takes care of a race condition
that can occur when both sides close at the same time.
If that occurs, without this fix the connection enters
FIN1 on both sides and they will forever send FIN|ACK at
each other until the connection times out. This is because
we stopped processing the FIN|ACK and thus did not advance
the sequence and so never ACK'd each others FIN. This
fix adjusts it so we *do* process the FIN properly and
the race goes away ;-)

MFC after:	1 month
2012-08-25 09:26:37 +00:00
np
7a7bbaad5a Correctly handle the case where an inp has already been dropped by the time
the TOE driver reports that an active open failed.  toe_connect_failed is
supposed to handle this but it should be provided the inpcb instead of the
tcpcb which may no longer be around.
2012-08-21 18:09:33 +00:00
rrs
09ab09a1b5 Though I disagree, I conceed to jhb & Rui. Note
that we still have a problem with this whole structure of
locks and in_input.c [it does not lock which it should not, but
this *can* lead to crashes]. (I have seen it in our SQA
testbed.. besides the one with a refcnt issue that I will
have SQA work on next week ;-)
2012-08-19 11:54:02 +00:00
rrs
1bcd97d239 Ok jhb, lets move the ifa_free() down to the bottom to
assure that *all* tables and such are removed before
we start to free. This won't protect the Hash in ip_input.c
but in theory should protect any other uses that *do* use locks.

MFC after:	1 week (or more)
2012-08-17 05:51:46 +00:00
lstewart
71de2d67ba The TCP PAWS fix for kernels with fast tick rates (r231767) changed the TCP
timestamp related stack variables to reference ms directly instead of ticks.
The h_ertt(4) Khelp module relies on TCP timestamp information in order to
calculate its enhanced RTT estimates, but was not updated as part of r231767.

Consequently, h_ertt has not been calculating correct RTT estimates since
r231767 was comitted, which in turn broke all delay-based congestion control
algorithms because they rely on the h_ertt RTT estimates.

Fix the breakage by switching h_ertt to use tcp_ts_getticks() in place of all
previous uses of the ticks variable. This ensures all timestamp related
variables in h_ertt use the same units as the TCP stack and therefore results in
meaningful comparisons and RTT estimate calculations.

Reported & tested by:	Naeem Khademi (naeemk at ifi uio no)
Discussed with:	bz
MFC after:	3 days
2012-08-17 01:49:51 +00:00
rrs
7c7c85dcac Its never a good idea to double free the same
address.

MFC after:	1 week (after the other commits ahead of this gets MFC'd)
2012-08-16 17:55:16 +00:00
luigi
1f3be6fa90 s/lenght/length/ in comments 2012-08-07 07:52:25 +00:00
luigi
eed7c1d3a5 move functions outside the SYSBEGIN/SYSEND block
(SYSBEGIN/SYSEND are specific to ipfw/dummynet and are used to
emulate sysctl on platforms that do not have them, and they work
by creating an array which contains all the sysctl-ed symbols.)
2012-08-06 11:02:23 +00:00
luigi
b53e8390d6 use FREE_PKT instead of m_freem to free an mbuf.
The former is the standard form used in ipfw/dummynet, so that
it is easier to remap it to different memory managers depending
on the platform.
2012-08-06 10:50:43 +00:00
tuexen
af3a9a01e1 Fix a bug found by dim@:
Don't use an uninitilized variable, if INVARIANTS is on and an illegal
packet with destination 0 is received.

MFC after:	3 days
X-MFC with:	238003
2012-08-06 10:50:23 +00:00
trociny
6bade2af3b In tcp timers, check INP_DROPPED flag a little later, after
callout_deactivate(), so if INP_DROPPED is set we return with the
timer active flag cleared.

For me this fixes negative keep timer values reported by `netstat -x'
for connections in CLOSE state.

Approved by:	net (silence)
MFC after:	2 weeks
2012-08-05 17:30:17 +00:00
tuexen
f3596e49d4 Fix a refcount issue. The called only decrements is stcb is NULL.
MFC after:	3 days
Discussed with:	rrs
2012-08-05 10:47:18 +00:00
tuexen
3ad906801b Fix a bug reported by Simon L. B. Nielsen:
If an SCTP endpoint receives an ASCONF with a wildcard
lookup address and incorrect verification tag, the system
crashes.

MFC after:	3 days.
2012-08-04 20:40:36 +00:00
tuexen
09a1e2c3bc Testing an interface property should depend on the interface, not
on an address.

MFC after:	3 days
2012-08-04 08:03:30 +00:00
glebius
abf245020a Fix races between in_lltable_prefix_free(), lla_lookup(),
llentry_free() and arptimer():

o Use callout_init_rw() for lle timeout, this allows us safely
  disestablish them.
  - This allows us to simplify the arptimer() and make it
    race safe.
o Consistently use ifp->if_afdata_lock to lock access to
  linked lists in the lle hashes.
o Introduce new lle flag LLE_LINKED, which marks an entry that
  is attached to the hash.
  - Use LLE_LINKED to avoid double unlinking via consequent
    calls to llentry_free().
  - Mark lle with LLE_DELETED via |= operation istead of =,
    so that other flags won't be lost.
o Make LLE_ADDREF(), LLE_REMREF() and LLE_FREE_LOCKED() more
  consistent and provide more informative KASSERTs.

The patch is a collaborative work of all submitters and myself.

PR:		kern/165863
Submitted by:	Andrey Zonov <andrey zonov.org>
Submitted by:	Ryan Stone <rysto32 gmail.com>
Submitted by:	Eric van Gyzen <eric_van_gyzen dell.com>
2012-08-02 13:57:49 +00:00
luigi
55897c521e replace __unused with a portable construct;
fix a couple of signed/unsigned warnings.
2012-08-02 12:45:13 +00:00
luigi
15efb3237e replace inet_ntoa_r with the more standard inet_ntop().
As discussed on -current, inet_ntoa_r() is non standard,
has different arguments in userspace and kernel, and
almost unused (no clients in userspace, only
net/flowtable.c, net/if_llatbl.c, netinet/in_pcb.c, netinet/tcp_subr.c
in the kernel)
2012-08-01 18:52:07 +00:00
luigi
bf14eed582 add a cast to avoid a signed/unsigned warning (to be removed
when we will have TUNABLE_UINT constructors)
2012-08-01 18:49:00 +00:00
glebius
588de42f27 Some more whitespace cleanup. 2012-08-01 09:00:26 +00:00
glebius
53cb168f80 Some style(9) and whitespace changes.
Together with:	Andrey Zonov <andrey zonov.org>
2012-07-31 11:31:12 +00:00
luigi
bf32267c00 nobody uses this file except the userspace ipfw code, but the cast
of a pointer to an integer needs a cast to prevent a warning for
size mismatch.

MFC after:	1 week
2012-07-31 08:04:49 +00:00
tuexen
df16a3505f Fix the sctp_sockstore union such that userland programs don't depend
on INET and/or INET6 to be defined and in-tune with how the kernel
was compiled.

MFC after:	3 days
Discussed with:	rrs
2012-07-26 08:10:29 +00:00
bz
d78628df35 Fix a problem when CARP is enabled on the interface for IPv4
but not for IPv6.  The current checks in nd6_nbr.c along with the
old version will result in ifa being NULL and subsequently the
packet will be dropped.  This prevented NS/NA, from working and
with that IPv6.

Now return the ifa from the carp lookup function in two cases:
1) if the address matches, is a carp address, and we are MASTER
   (as before),
2) if the address matches but it is not a carp address at all (new).

Reported by:	Peter Wemm (new Y! FreeBSD cluster, eating our own dogfood)
Tested on:	New Y! FreeBSD cluster machines
Reviewed by:	glebius
2012-07-25 12:14:39 +00:00
rwatson
bb5e5ce48a Update some stale comments regarding tcbinfo locking in the TCP input
path: read locks on tcbinfo are no longer used, so won't happen.  No
functional change.

MFC after:	3 days
2012-07-22 17:31:36 +00:00
glebius
3b4ff3bafb Plug a reference leak: before doing 'goto again' we need to unref
ia->ia_ifa if there is any.

Submitted by:	Andrey Zonov <andrey zonov.org>
2012-07-18 08:58:30 +00:00
glebius
636d3aa68f When traversing global in_ifaddr list in the IFP_TO_IA() macro, we need
to obtain IN_IFADDR_RLOCK().
2012-07-18 08:41:00 +00:00
tuexen
d0f5f8dadb Fix a refcount bug when freeing an association.
While there: Change code to be consistent.
Discussed with rrs@.
MFC after: 3 days
2012-07-17 13:03:47 +00:00
glebius
b5cd2a8e46 If ip_output() returns EMSGSIZE to tcp_output(), then the latter calls
tcp_mtudisc(), which in its turn may call tcp_output(). Under certain
conditions (must admit they are very special) an infinite recursion can
happen.

To avoid recursion we can pass struct route to ip_output() and obtain
correct mtu. This allows us not to use tcp_mtudisc() but call tcp_mss_update()
directly.

PR:		kern/155585
Submitted by:	Andrey Zonov <andrey zonov.org> (original version of patch)
2012-07-16 07:08:34 +00:00
tuexen
2357a49326 Changes which improve compilation if neither INET nor INET6 is defined.
MFC after: 3 days
2012-07-15 20:16:17 +00:00
tuexen
5895ece053 #ifdef INET and INET6 consistently. This also fixes a bug, where
it was done wrong.

MFC after: 3 days
2012-07-15 11:04:49 +00:00
tuexen
86ea1d09c9 Provide the correct notification type (SCTP_SEND_FAILED_EVENT)
for unsent messages.

MFC after: 3 days
2012-07-14 21:25:14 +00:00