Instead, add protocol specific mbuf flags M_IP_NEXTHOP and
M_IP6_NEXTHOP. Use them to indicate that the mbuf's chain
contains the PACKET_TAG_IPFORWARD tag. And do a tag lookup
only when this flag is set.
Suggested by: andre
Defer sending an independent window update if a delayed ACK is pending
saving a packet. The window update then gets piggy-backed on the next
already scheduled ACK.
Added grammar fixes as well.
MFC after: 2 weeks
after a much reduced timeout.
Typically web servers close their sockets quickly under the assumption
that the TCP connections goes away as well. That is not entirely true
however. If the peer closed the window we're going to wait for a long
time with lots of data in the send buffer.
MFC after: 2 weeks
draft-ietf-tcpm-initcwnd-05. It explains why the increased initial
window improves the overall performance of many web services without
risking congestion collapse.
As long as it remains a draft it is placed under a sysctl marking it
as experimental:
net.inet.tcp.experimental.initcwnd10 = 1
When it becomes an official RFC soon the sysctl will be changed to
the RFC number and moved to net.inet.tcp.
This implementation differs from the RFC draft in that it is a bit
more conservative in the case of packet loss on SYN or SYN|ACK because
we haven't reduced the default RTO to 1 second yet. Also the restart
window isn't yet increased as allowed. Both will be adjusted with
upcoming changes.
Is is enabled by default. In Linux it is enabled since kernel 3.0.
MFC after: 2 weeks
especially in the presence of bi-directional data transfers.
snd_wl1 tracks the right edge, including data in the reassembly
queue, of valid incoming data. This makes it like rcv_nxt plus
reassembly. It never goes backwards to prevent older, possibly
reordered segments from updating the window.
snd_wl2 tracks the left edge of sent data. This makes it a duplicate
of snd_una. However joining them right now is difficult due to
separate update dependencies in different places in the code flow.
snd_wnd tracks the current advertized send window by the peer. In
tcp_output() the effective window is calculated by subtracting the
already in-flight data, snd_nxt less snd_una, from it.
ACK's become the main clock of window updates and will always update
the window when the left edge of what we sent is advanced. The ACK
clock is the primary signaling mechanism in ongoing data transfers.
This works reliably even in the presence of reordering, reassembly
and retransmitted segments. The ACK clock is most important because
it determines how much data we are allowed to inject into the network.
Zero window updates get us out of persistence mode are crucial. Here
a segment that neither moves ACK nor SEQ but enlarges WND is accepted.
When the ACK clock is not active (that is we're not or no longer
sending any data) any segment that moves the extended right SEQ edge,
including out-of-order segments, updates the window. This gives us
updates especially during ping-pong transfers where the peer isn't
done consuming the already acknowledged data from the receive buffer
while responding with data.
The SSH protocol is a prime candidate to benefit from the improved
bi-directional window update logic as it has its own windowing
mechanism on top of TCP and is frequently sending back protocol ACK's.
Tcpdump provided by: darrenr
Tested by: darrenr
MFC after: 2 weeks
the default retransmit timeout, as base to calculate the backoff
time until next try instead of the TCP_REXMTVAL() macro which only
works correctly when we already have measured an actual RTT+RTTVAR.
Before it would cause the first retransmit at RTOBASE, the next
four at the same time (!) about 200ms later, and then another one
again RTOBASE later.
MFC after: 2 weeks
We've got more cluster sizes for quite some time now and the orginally
imposed limits and the previously codified thoughts on efficiency gains
are no longer true.
MFC after: 2 weeks
from an unprotected u_int that reports garbage on SMP to a function
based sysctl obtaining the current value from UMA.
Also read back the actual cache_limit after page size rounding by UMA.
PR: kern/165879
MFC after: 2 weeks
doing small reads on a (partially) filled receive socket buffer.
Normally one would a send a window update every time the available
space in the socket buffer increases by two times MSS. This leads
to a flurry of window updates that do not provide any meaningful
new information to the sender. There still is available space in
the window and the sender can continue sending data. All window
updates then get carried by the regular ACKs. Only when the socket
buffer was (almost) full and the window closed accordingly a window
updates delivery new information and allows the sender to start
sending more data again.
Send window updates only every two MSS when the socket buffer
has less than 1/8 space available, or the available space in the
socket buffer increased by 1/4 its full capacity, or the socket
buffer is very small. The next regular data ACK will carry and
report the exact window size again.
Reported by: sbruno
Tested by: darrenr
Tested by: Darren Baginski
PR: kern/116335
MFC after: 2 weeks
reduce the initial CWND to one segment. This reduction got lost
some time ago due to a change in initialization ordering.
Additionally in tcp_timer_rexmt() avoid entering fast recovery when
we're still in TCPS_SYN_SENT state.
MFC after: 2 weeks
reduce the initial CWND to one segment. This reduction got lost
some time ago due to a change in initialization ordering.
Additionally in tcp_timer_rexmt() avoid entering fast recovery when
we're still in TCPS_SYN_SENT state.
MFC after: 2 weeks
on checksums directly from mbuf flags. This simplifies code.
o Clear CSUM_IP from the mbuf in ip_fragment() if we did checksums in
hardware. Some driver may not announce CSUM_IP in theur if_hwassist,
although try to do checksums if CSUM_IP set on mbuf. Example is em(4).
o While here, consistently use CSUM_IP instead of its alias CSUM_DELAY_IP.
After this change CSUM_DELAY_IP vanishes from the stack.
Submitted by: Sebastian Kuzminsky <seb lineratesystems.com>
on the related functionality in the runtime via the sysctl variable
net.pfil.forward. It is turned off by default.
Sponsored by: Yandex LLC
Discussed with: net@
MFC after: 2 weeks
before passing a packet to protocol input routines.
For several protocols this mean that now protocol needs to
do subtraction itself, and for another half this means that
we do not need to add header length back to the packet.
Make ip_stripoptions() to adjust ip_len, since now we enter
this function with a packet header whose ip_len does represent
length of entire packet, not payload only.
in network byte order. Any host byte order processing is
done in local variables and host byte order values are
never[1] written to a packet.
After this change a packet processed by the stack isn't
modified at all[2] except for TTL.
After this change a network stack hacker doesn't need to
scratch his head trying to figure out what is the byte order
at the given place in the stack.
[1] One exception still remains. The raw sockets convert host
byte order before pass a packet to an application. Probably
this would remain for ages for compatibility.
[2] The ip_input() still subtructs header len from ip->ip_len,
but this is planned to be fixed soon.
Reviewed by: luigi, Maxim Dounin <mdounin mdounin.ru>
Tested by: ray, Olivier Cochard-Labbe <olivier cochard.me>
ip6_output(), the IPv6 stack is working in net byte order.
The reason this code worked before is that ip6_output()
doesn't look at ip6_plen at all and recalculates it based
on mbuf length.
enabled. This eliminates one mtx_lock() per each routing lookup thus improving
performance in several cases (routing to directly connected interface or routing
to default gateway).
Icmp redirects should not be used to provide routing direction nowadays, even
for end hosts. Routers should not use them too (and this is explicitly restricted
in IPv6, see RFC 4861, clause 8.2).
Current commit changes rnh_machaddr function to 'stock' rn_match (and back) for every
AF_INET routing table in given VNET instance on drop_redirect sysctl change.
This change is part of bigger patch eliminating rte locking.
Sponsored by: Yandex LLC
MFC after: 2 weeks
host byte order, was sometimes called with net byte order. Since we are
moving towards net byte order throughout the stack, the function was
converted to expect net byte order, and its consumers fixed appropriately:
- ip_output(), ipfilter(4) not changed, since already call
in_delayed_cksum() with header in net byte order.
- divert(4), ng_nat(4), ipfw_nat(4) now don't need to swap byte order
there and back.
- mrouting code and IPv6 ipsec now need to switch byte order there and
back, but I hope, this is temporary solution.
- In ipsec(4) shifted switch to net byte order prior to in_delayed_cksum().
- pf_route() catches up on r241245 changes to ip_output().
- All packets in NETISR_IP queue are in net byte order.
- ip_input() is entered in net byte order and converts packet
to host byte order right _after_ processing pfil(9) hooks.
- ip_output() is entered in host byte order and converts packet
to net byte order right _before_ processing pfil(9) hooks.
- ip_fragment() accepts and emits packet in net byte order.
- ip_forward(), ip_mloopback() use host byte order (untouched actually).
- ip_fastforward() no longer modifies packet at all (except ip_ttl).
- Swapping of byte order there and back removed from the following modules:
pf(4), ipfw(4), enc(4), if_bridge(4).
- Swapping of byte order added to ipfilter(4), based on __FreeBSD_version
- __FreeBSD_version bumped.
- pfil(9) manual page updated.
Reviewed by: ray, luigi, eri, melifaro
Tested by: glebius (LE), ray (BE)
Both functions need to obtain lock on the found PCB, and they can't do
classic inter-lock with the PCB hash lock, due to lock order reversal.
To keep the PCB stable, these functions put a reference on it and after PCB
lock is acquired drop it. If the reference was the last one, this means
we've raced with in_pcbfree() and the PCB is no longer valid.
This approach works okay only if we are acquiring writer-lock on the PCB.
In case of reader-lock, the following scenario can happen:
- 2 threads locate pcb, and do in_pcbref() on it.
- These 2 threads drop the inp hash lock.
- Another thread comes to delete pcb via in_pcbfree(), it obtains hash lock,
does in_pcbremlists(), drops hash lock, and runs in_pcbrele_wlocked(), which
doesn't free the pcb due to two references on it. Then it unlocks the pcb.
- 2 aforementioned threads acquire reader lock on the pcb and run
in_pcbrele_rlocked(). One gets 1 from in_pcbrele_rlocked() and continues,
second gets 0 and considers pcb freed, returns.
- The thread that got 1 continutes working with detached pcb, which later
leads to panic in the underlying protocol level.
To plumb that problem an additional INPCB flag introduced - INP_FREED. We
check for that flag in the in_pcbrele_rlocked() and if it is set, we pretend
that that was the last reference.
Discussed with: rwatson, jhb
Reported by: Vladimir Medvedkin <medved rambler-co.ru>
of reviewing of r231025.
Unlike other options from this family TCP_KEEPCNT doesn't specify
time interval, but a count, thus parameter supplied doesn't need
to be multiplied by hz.
Reported & tested by: amdmi3
reside, and move there ipfw(4) and pf(4).
o Move most modified parts of pf out of contrib.
Actual movements:
sys/contrib/pf/net/*.c -> sys/netpfil/pf/
sys/contrib/pf/net/*.h -> sys/net/
contrib/pf/pfctl/*.c -> sbin/pfctl
contrib/pf/pfctl/*.h -> sbin/pfctl
contrib/pf/pfctl/pfctl.8 -> sbin/pfctl
contrib/pf/pfctl/*.4 -> share/man/man4
contrib/pf/pfctl/*.5 -> share/man/man5
sys/netinet/ipfw -> sys/netpfil/ipfw
The arguable movement is pf/net/*.h -> sys/net. There are
future plans to refactor pf includes, so I decided not to
break things twice.
Not modified bits of pf left in contrib: authpf, ftp-proxy,
tftp-proxy, pflogd.
The ipfw(4) movement is planned to be merged to stable/9,
to make head and stable match.
Discussed with: bz, luigi
Merge ether_ipfw_chk() and part of bridge_pfil() into
unified ipfw_check_frame() function called by PFIL.
This change was suggested by rwatson? @ DevSummit.
Remove ipfw headers from ether/bridge code since they are unneeded now.
Note this thange introduce some (temporary) performance penalty since
PFIL read lock has to be acquired for every link-level packet.
MFC after: 3 weeks
with multicast bit set. FreeBSD refuses to install such
entries since 9.0, and this broke installations running
Microsoft NLB, which are violating standards.
Tested by: Tarasov Oleg <oleg_tarasov sg-tea.com>
that can occur when both sides close at the same time.
If that occurs, without this fix the connection enters
FIN1 on both sides and they will forever send FIN|ACK at
each other until the connection times out. This is because
we stopped processing the FIN|ACK and thus did not advance
the sequence and so never ACK'd each others FIN. This
fix adjusts it so we *do* process the FIN properly and
the race goes away ;-)
MFC after: 1 month
the TOE driver reports that an active open failed. toe_connect_failed is
supposed to handle this but it should be provided the inpcb instead of the
tcpcb which may no longer be around.
that we still have a problem with this whole structure of
locks and in_input.c [it does not lock which it should not, but
this *can* lead to crashes]. (I have seen it in our SQA
testbed.. besides the one with a refcnt issue that I will
have SQA work on next week ;-)
assure that *all* tables and such are removed before
we start to free. This won't protect the Hash in ip_input.c
but in theory should protect any other uses that *do* use locks.
MFC after: 1 week (or more)
timestamp related stack variables to reference ms directly instead of ticks.
The h_ertt(4) Khelp module relies on TCP timestamp information in order to
calculate its enhanced RTT estimates, but was not updated as part of r231767.
Consequently, h_ertt has not been calculating correct RTT estimates since
r231767 was comitted, which in turn broke all delay-based congestion control
algorithms because they rely on the h_ertt RTT estimates.
Fix the breakage by switching h_ertt to use tcp_ts_getticks() in place of all
previous uses of the ticks variable. This ensures all timestamp related
variables in h_ertt use the same units as the TCP stack and therefore results in
meaningful comparisons and RTT estimate calculations.
Reported & tested by: Naeem Khademi (naeemk at ifi uio no)
Discussed with: bz
MFC after: 3 days
(SYSBEGIN/SYSEND are specific to ipfw/dummynet and are used to
emulate sysctl on platforms that do not have them, and they work
by creating an array which contains all the sysctl-ed symbols.)
callout_deactivate(), so if INP_DROPPED is set we return with the
timer active flag cleared.
For me this fixes negative keep timer values reported by `netstat -x'
for connections in CLOSE state.
Approved by: net (silence)
MFC after: 2 weeks
llentry_free() and arptimer():
o Use callout_init_rw() for lle timeout, this allows us safely
disestablish them.
- This allows us to simplify the arptimer() and make it
race safe.
o Consistently use ifp->if_afdata_lock to lock access to
linked lists in the lle hashes.
o Introduce new lle flag LLE_LINKED, which marks an entry that
is attached to the hash.
- Use LLE_LINKED to avoid double unlinking via consequent
calls to llentry_free().
- Mark lle with LLE_DELETED via |= operation istead of =,
so that other flags won't be lost.
o Make LLE_ADDREF(), LLE_REMREF() and LLE_FREE_LOCKED() more
consistent and provide more informative KASSERTs.
The patch is a collaborative work of all submitters and myself.
PR: kern/165863
Submitted by: Andrey Zonov <andrey zonov.org>
Submitted by: Ryan Stone <rysto32 gmail.com>
Submitted by: Eric van Gyzen <eric_van_gyzen dell.com>
As discussed on -current, inet_ntoa_r() is non standard,
has different arguments in userspace and kernel, and
almost unused (no clients in userspace, only
net/flowtable.c, net/if_llatbl.c, netinet/in_pcb.c, netinet/tcp_subr.c
in the kernel)
but not for IPv6. The current checks in nd6_nbr.c along with the
old version will result in ifa being NULL and subsequently the
packet will be dropped. This prevented NS/NA, from working and
with that IPv6.
Now return the ifa from the carp lookup function in two cases:
1) if the address matches, is a carp address, and we are MASTER
(as before),
2) if the address matches but it is not a carp address at all (new).
Reported by: Peter Wemm (new Y! FreeBSD cluster, eating our own dogfood)
Tested on: New Y! FreeBSD cluster machines
Reviewed by: glebius
tcp_mtudisc(), which in its turn may call tcp_output(). Under certain
conditions (must admit they are very special) an infinite recursion can
happen.
To avoid recursion we can pass struct route to ip_output() and obtain
correct mtu. This allows us not to use tcp_mtudisc() but call tcp_mss_update()
directly.
PR: kern/155585
Submitted by: Andrey Zonov <andrey zonov.org> (original version of patch)