structure is used, but they already have equal fields in the struct
newipsecstat, that was introduced with FAST_IPSEC and then was merged
together with old ipsecstat structure.
This fixes kernel stack overflow on some architectures after migration
ipsecstat to PCPU counters.
Reported by: Taku YAMAMOTO, Maciej Milewski
duplicated sockets a multicast address is bound and either
SO_REUSEPORT or SO_REUSEADDR is set.
But actually it works for the following combinations:
* SO_REUSEPORT is set for the fist socket and SO_REUSEPORT for the new;
* SO_REUSEADDR is set for the fist socket and SO_REUSEADDR for the new;
* SO_REUSEPORT is set for the fist socket and SO_REUSEADDR for the new;
and fails for this:
* SO_REUSEADDR is set for the fist socket and SO_REUSEPORT for the new.
Fix the last case.
PR: 179901
MFC after: 1 month
information into the ISN (initial sequence number) without the additional
use of timestamp bits and switching to the very fast and cryptographically
strong SipHash-2-4 MAC hash algorithm to protect the SYN cookie against
forgeries.
The purpose of SYN cookies is to encode all necessary session state in
the 32 bits of our initial sequence number to avoid storing any information
locally in memory. This is especially important when under heavy spoofed
SYN attacks where we would either run out of memory or the syncache would
fill with bogus connection attempts swamping out legitimate connections.
The original SYN cookies method only stored an indexed MSS values in the
cookie. This isn't sufficient anymore and breaks down in the presence of
WSCALE information which is only exchanged during SYN and SYN-ACK. If we
can't keep track of it then we may severely underestimate the available
send or receive window. This is compounded with large windows whose size
information on the TCP segment header is even lower numerically. A number
of years back SYN cookies were extended to store the additional state in
the TCP timestamp fields, if available on a connection. While timestamps
are common among the BSD, Linux and other *nix systems Windows never enabled
them by default and thus are not present for the vast majority of clients
seen on the Internet.
The common parameters used on TCP sessions have changed quite a bit since
SYN cookies very invented some 17 years ago. Today we have a lot more
bandwidth available making the use window scaling almost mandatory. Also
SACK has become standard making recovering from packet loss much more
efficient.
This change moves all necessary information into the ISS removing the need
for timestamps. Both the MSS (16 bits) and send WSCALE (4 bits) are stored
in 3 bit indexed form together with a single bit for SACK. While this is
significantly less than the original range, it is sufficient to encode all
common values with minimal rounding.
The MSS depends on the MTU of the path and with the dominance of ethernet
the main value seen is around 1460 bytes. Encapsulations for DSL lines
and some other overheads reduce it by a few more bytes for many connections
seen. Rounding down to the next lower value in some cases isn't a problem
as we send only slightly more packets for the same amount of data.
The send WSCALE index is bit more tricky as rounding down under-estimates
the available send space available towards the remote host, however a small
number values dominate and are carefully selected again.
The receive WSCALE isn't encoded at all but recalculated based on the local
receive socket buffer size when a valid SYN cookie returns. A listen socket
buffer size is unlikely to change while active.
The index values for MSS and WSCALE are selected for minimal rounding errors
based on large traffic surveys. These values have to be periodically
validated against newer traffic surveys adjusting the arrays tcp_sc_msstab[]
and tcp_sc_wstab[] if necessary.
In addition the hash MAC to protect the SYN cookies is changed from MD5
to SipHash-2-4, a much faster and cryptographically secure algorithm.
Reviewed by: dwmalone
Tested by: Fabian Keil <fk@fabiankeil.de>
dereferencing, when checking for SO_REUSEPORT option (and SO_REUSEADDR
for multicast), INP_REUSEPORT flag was introduced to cache the socket
option. It was decided then that one flag would be enough to cache
both SO_REUSEPORT and SO_REUSEADDR: when processing SO_REUSEADDR
setsockopt(2), it was checked if it was called for a multicast address
and INP_REUSEPORT was set accordingly.
Unfortunately that approach does not work when setsockopt(2) is called
before binding to a multicast address: the multicast check fails and
INP_REUSEPORT is not set.
Fix this by adding INP_REUSEADDR flag to unconditionally cache
SO_REUSEADDR.
PR: 179901
Submitted by: Michael Gmelin freebsd grem.de (initial version)
Reviewed by: rwatson
MFC after: 1 week
algorithm, which is based on the 2011 v0.1 patch release and described in the
paper "Revisiting TCP Congestion Control using Delay Gradients" by David Hayes
and Grenville Armitage. It is implemented as a kernel module compatible with the
modular congestion control framework.
CDG is a hybrid congestion control algorithm which reacts to both packet loss
and inferred queuing delay. It attempts to operate as a delay-based algorithm
where possible, but utilises heuristics to detect loss-based TCP cross traffic
and will compete effectively as required. CDG is therefore incrementally
deployable and suitable for use on shared networks.
In collaboration with: David Hayes <david.hayes at ieee.org> and
Grenville Armitage <garmitage at swin edu au>
MFC after: 4 days
Sponsored by: Cisco University Research Program and FreeBSD Foundation
increased the pointer, not the memory it points to.
In collaboration with: kib
Reported & tested by: Ian FREISLICH <ianf clue.co.za>
Sponsored by: Nginx, Inc.
limited in the amount of data they can handle at once.
Drivers can set ifp->if_hw_tsomax before calling ether_ifattach() to
change the limit.
The lowest allowable size is IP_MAXPACKET / 8 (8192 bytes) as anything
less wouldn't be very useful anymore. The upper limit is still at
IP_MAXPACKET (65536 bytes). Raising it requires further auditing of
the IPv4/v6 code path's as the length field in the IP header would
overflow leading to confusion in firewalls and others packet handler on
the real size of the packet.
The placement into "struct ifnet" is a bit hackish but the best place
that was found. When the stack/driver boundary is updated it should
be handled in a better way.
Submitted by: cperciva (earlier version)
Reviewed by: cperciva
Tested by: cperciva
MFC after: 1 week (using spare struct members to preserve ABI)
Address. Although KAME implementation used FF02:0:0:0:0:2::/96 based on
older versions of draft-ietf-ipngwg-icmp-name-lookup, it has been changed
in RFC 4620.
The kernel always joins the /104-prefixed address, and additionally does
/96-prefixed one only when net.inet6.icmp6.nodeinfo_oldmcprefix=1.
The default value of the sysctl is 1.
ping6(8) -N flag now uses /104-prefixed one. When this flag is specified
twice, it uses /96-prefixed one instead.
Reviewed by: ume
Based on work by: Thomas Scheffler
PR: conf/174957
MFC after: 2 weeks
same place as dst, or to the sockaddr in the routing table.
The const constraint of gw makes us safe from modifing routing table
accidentially. And "onstantness" of dst allows us to remove several
bandaids, when we switched it back at &ro->ro_dst, now it always
points there.
Reviewed by: rrs
route. What it was is there are two places in ip_output.c
where we do a goto again. One place was fine, it
copies out the new address and then resets dst = ro->rt_dst;
But the other place does *not* do that, which means earlier
when we found the gateway, we have dst pointing there
aka dst = ro->rt_gateway is done.. then we do a
goto again.. bam now we clobber the default route.
The fix is just to move the again so we are always
doing dst = &ro->rt_dst; in the again loop.
PR: 174749,157796
MFC after: 1 week
duplicate ACK make sure we actually have new data to send.
This prevents us from sending unneccessary pure ACKs.
Reported by: Matt Miller <matt@matthewjmiller.net>
Tested by: Matt Miller <matt@matthewjmiller.net>
MFC after: 2 weeks
mbuf allocation fails, as in a case when ip_output() returns error.
To achieve that, move large block of code that updates tcpcb below
the out: label.
This fixes a panic, that requires the following sequence to happen:
1) The SYN was sent to the network, tp->snd_nxt = iss + 1, tp->snd_una = iss
2) The retransmit timeout happened for the SYN we had sent,
tcp_timer_rexmt() sets tp->snd_nxt = tp->snd_una, and calls tcp_output().
In tcp_output m_get() fails.
3) Later on the SYN|ACK for the SYN sent in step 1) came,
tcp_input sets tp->snd_una += 1, which leads to
tp->snd_una > tp->snd_nxt inconsistency, that later panics in
socket buffer code.
For reference, this bug fixed in DragonflyBSD repo:
http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/1ff9b7d322dc5a26f7173aa8c38ecb79da80e419
Reviewed by: andre
Tested by: pho
Sponsored by: Nginx, Inc.
PR: kern/177456
Submitted by: HouYeFei&XiBoLiu <lglion718 163.com>
reside on their own cache line to prevent false sharing with other
nearby structures, especially for those in the .bss segment.
NB: Those mutexes and rwlocks with variables next to them that get
changed on every invocation do not benefit from their own cache line.
Actually it may be net negative because two cache misses would be
incurred in those cases.
connections in the accept queue and contiguous new incoming SYNs.
Compared to the original submitters patch I've moved the test
next to the SYN handling to have it together in a logical unit
and reworded the comment explaining the issue.
Submitted by: Matt Miller <matt@matthewjmiller.net>
Submitted by: Juan Mojica <jmojica@gmail.com>
Reviewed by: Matt Miller (changes)
Tested by: pho
MFC after: 1 week
Convert 'struct ipstat' and 'struct tcpstat' to counter(9).
This speeds up IP forwarding at extreme packet rates, and
makes accounting more precise.
Sponsored by: Nginx, Inc.
For TIMEWAIT handling tcp_input may have to jump back for an additional
pass through pcblookup. Prior to this change the fwd_tag had been
discarded after the first lookup, so a new connection attempt delivered
locally via 'ipfw fwd' would fail to find a match.
As of r248886 the tag will be detached and freed when passed to the
socket buffer.
Setting DSCP support is done via O_SETDSCP which works for both
IPv4 and IPv6 packets. Fast checksum recalculation (RFC 1624) is done for IPv4.
Dscp can be specified by name (AFXY, CSX, BE, EF), by value
(0..63) or via tablearg.
Matching DSCP is done via another opcode (O_DSCP) which accepts several
classes at once (af11,af22,be). Classes are stored in bitmask (2 u32 words).
Many people made their variants of this patch, the ones I'm aware of are
(in alphabetic order):
Dmitrii Tejblum
Marcelo Araujo
Roman Bogorodskiy (novel)
Sergey Matveichuk (sem)
Sergey Ryabin
PR: kern/102471, kern/121122
MFC after: 2 weeks
and that can drive someone crazy. While m_get2() is young and not
documented yet, change its order of arguments to match m_getm2().
Sorry for churn, but better now than later.
Fix a siftr(4) potential memory leak and INVARIANTS triggered kernel panic in
hashdestroy() by ensuring the last array index in the flow counter hash table is
flushed of entries.
MFC after: 3 days
precise time event generation. This greatly improves granularity of
callouts which are not anymore constrained to wait next tick to be
scheduled.
- Extend the callout KPI introducing a set of callout_reset_sbt* functions,
which take a sbintime_t as timeout argument. The new KPI also offers a
way for consumers to specify precision tolerance they allow, so that
callout can coalesce events and reduce number of interrupts as well as
potentially avoid scheduling a SWI thread.
- Introduce support for dispatching callouts directly from hardware
interrupt context, specifying an additional flag. This feature should be
used carefully, as long as interrupt context has some limitations
(e.g. no sleeping locks can be held).
- Enhance mechanisms to gather informations about callwheel, introducing
a new sysctl to obtain stats.
This change breaks the KBI. struct callout fields has been changed, in
particular 'int ticks' (4 bytes) has been replaced with 'sbintime_t'
(8 bytes) and another 'sbintime_t' field was added for precision.
Together with: mav
Reviewed by: attilio, bde, luigi, phk
Sponsored by: Google Summer of Code 2012, iXsystems inc.
Tested by: flo (amd64, sparc64), marius (sparc64), ian (arm),
markj (amd64), mav, Fabian Keil
Specifcially, in_cksum_hdr() returns 0 (not 0xffff) when the IPv4
checksum is correct. Without this fix, the tcp_lro code will reject
good IPv4 traffic from drivers that do not implement IPv4 header
harder csum offload.
Sponsored by: Myricom Inc.
MFC after: 7 days
- fix indentation
- put the operator at the end of the line for long statements
- remove spaces between the type and the variable in a cast
- remove excessive parentheses
Tested by: md5
ABORT cause, since the user can also provide this kind of
information. So the receiver doesn't know who provided the
information.
While there: Fix a bug where the stack would send a malformed
ABORT chunk when using a send() call with SCTP_ABORT|SCT_SENDALL
flags.
MFC after: 3 days
of helper functions:
- carp_master() - boolean function which is true if an address
is in the MASTER state.
- ifa_preferred() - boolean function that compares two addresses,
and is aware of CARP.
Utilize ifa_preferred() in ifa_ifwithnet().
The previous version of patch also changed source address selection
logic in jails using carp_master(), but we failed to negotiate this part
with Bjoern. May be we will approach this problem again later.
Reported & tested by: Anton Yuzhaninov <citrin citrin.ru>
Sponsored by: Nginx, Inc
Since ARP and routing are separated, "proxy only" entries
don't have any meaning, thus we don't need additional field
in sockaddr to pass SIN_PROXY flag.
New kernel is binary compatible with old tools, since sizes
of sockaddr_inarp and sockaddr_in match, and sa_family are
filled with same value.
The structure declaration is left for compatibility with
third party software, but in tree code no longer use it.
Reviewed by: ru, andre, net@
lle_event replaced arp_update_event after the ARP rewrite and ended up
in if_ether.h simply because arp_update_event used to be there too.
IPv6 neighbor discovery is going to grow lle_event support and this is a
good time to move it to if_llatbl.h.
The two in-tree consumers of this event - OFED and toecore - are not
affected.
Reviewed by: bz@
logic (refer to [1] for associated discussion). snd_cwnd and snd_wnd are
unsigned long and on 64 bit hosts, min() will truncate them to 32 bits and could
therefore potentially corrupt the result (although under normal operation,
neither variable should legitmately exceed 32 bits).
[1] http://lists.freebsd.org/pipermail/freebsd-net/2013-January/034297.html
Submitted by: jhb
MFC after: 1 week
SYNs (or SYN/ACK replies) are dropped due to network congestion, then the
remote end of the connection may act as if options such as window scaling
are enabled but the local end will think they are not. This can result in
very slow data transfers in the case of window scaling disagreements.
The old behavior can be obtained by setting the
net.inet.tcp.rexmit_drop_options sysctl to a non-zero value.
Reviewed by: net@
MFC after: 2 weeks
to the current demotion factor instead of assigning it.
This allows external scripts to control demotion factor together
with kernel in a raceless manner.
all interested parties in case if interface flag IFF_UP has changed.
However, not only SIOCSIFFLAGS can raise the flag, but SIOCAIFADDR
and SIOCAIFADDR_IN6 can, too. The actual |= is done not in the protocol
code, but in code of interface drivers. To fix this historical layering
violation, we will check whether ifp->if_ioctl(SIOCSIFADDR) raised the
IFF_UP flag, and if it did, run the if_up() handler.
This fixes configuring an address under CARP control on an interface
that was initially !IFF_UP.
P.S. I intentionally omitted handling the IFF_SMART flag. This flag was
never ever used in any driver since it was introduced, and since it
means another layering violation, it should be garbage collected instead
of pretended to be supported.
entering llentry_free(), and in case if we lose the race, we should simply
perform LLE_FREE_LOCKED(). Otherwise, if the race is lost by the thread
performing arptimer(), it will remove two references from the lle instead
of one.
Reported by: Ian FREISLICH <ianf clue.co.za>
but later after processing and freeing the tag, we need to jump back again
to the findpcb label. Since the fwd_tag pointer wasn't NULL we tried to
process and free the tag for second time.
Reported & tested by: Pawel Tyll <ptyll nitronet.pl>
MFC after: 3 days
While here, also make the code that enforces power-of-two more
forgiving, instead of just resetting to 512, graciously round-down
to the next lower power of two.
chunks for each SCTP outgoing stream are in the send and
sent queue.
While there, improve the naming of NR-SACK related constants
recently introduced.
MFC after: 1 week
Instead, add protocol specific mbuf flags M_IP_NEXTHOP and
M_IP6_NEXTHOP. Use them to indicate that the mbuf's chain
contains the PACKET_TAG_IPFORWARD tag. And do a tag lookup
only when this flag is set.
Suggested by: andre
Defer sending an independent window update if a delayed ACK is pending
saving a packet. The window update then gets piggy-backed on the next
already scheduled ACK.
Added grammar fixes as well.
MFC after: 2 weeks
after a much reduced timeout.
Typically web servers close their sockets quickly under the assumption
that the TCP connections goes away as well. That is not entirely true
however. If the peer closed the window we're going to wait for a long
time with lots of data in the send buffer.
MFC after: 2 weeks
draft-ietf-tcpm-initcwnd-05. It explains why the increased initial
window improves the overall performance of many web services without
risking congestion collapse.
As long as it remains a draft it is placed under a sysctl marking it
as experimental:
net.inet.tcp.experimental.initcwnd10 = 1
When it becomes an official RFC soon the sysctl will be changed to
the RFC number and moved to net.inet.tcp.
This implementation differs from the RFC draft in that it is a bit
more conservative in the case of packet loss on SYN or SYN|ACK because
we haven't reduced the default RTO to 1 second yet. Also the restart
window isn't yet increased as allowed. Both will be adjusted with
upcoming changes.
Is is enabled by default. In Linux it is enabled since kernel 3.0.
MFC after: 2 weeks
especially in the presence of bi-directional data transfers.
snd_wl1 tracks the right edge, including data in the reassembly
queue, of valid incoming data. This makes it like rcv_nxt plus
reassembly. It never goes backwards to prevent older, possibly
reordered segments from updating the window.
snd_wl2 tracks the left edge of sent data. This makes it a duplicate
of snd_una. However joining them right now is difficult due to
separate update dependencies in different places in the code flow.
snd_wnd tracks the current advertized send window by the peer. In
tcp_output() the effective window is calculated by subtracting the
already in-flight data, snd_nxt less snd_una, from it.
ACK's become the main clock of window updates and will always update
the window when the left edge of what we sent is advanced. The ACK
clock is the primary signaling mechanism in ongoing data transfers.
This works reliably even in the presence of reordering, reassembly
and retransmitted segments. The ACK clock is most important because
it determines how much data we are allowed to inject into the network.
Zero window updates get us out of persistence mode are crucial. Here
a segment that neither moves ACK nor SEQ but enlarges WND is accepted.
When the ACK clock is not active (that is we're not or no longer
sending any data) any segment that moves the extended right SEQ edge,
including out-of-order segments, updates the window. This gives us
updates especially during ping-pong transfers where the peer isn't
done consuming the already acknowledged data from the receive buffer
while responding with data.
The SSH protocol is a prime candidate to benefit from the improved
bi-directional window update logic as it has its own windowing
mechanism on top of TCP and is frequently sending back protocol ACK's.
Tcpdump provided by: darrenr
Tested by: darrenr
MFC after: 2 weeks
the default retransmit timeout, as base to calculate the backoff
time until next try instead of the TCP_REXMTVAL() macro which only
works correctly when we already have measured an actual RTT+RTTVAR.
Before it would cause the first retransmit at RTOBASE, the next
four at the same time (!) about 200ms later, and then another one
again RTOBASE later.
MFC after: 2 weeks
We've got more cluster sizes for quite some time now and the orginally
imposed limits and the previously codified thoughts on efficiency gains
are no longer true.
MFC after: 2 weeks
from an unprotected u_int that reports garbage on SMP to a function
based sysctl obtaining the current value from UMA.
Also read back the actual cache_limit after page size rounding by UMA.
PR: kern/165879
MFC after: 2 weeks
doing small reads on a (partially) filled receive socket buffer.
Normally one would a send a window update every time the available
space in the socket buffer increases by two times MSS. This leads
to a flurry of window updates that do not provide any meaningful
new information to the sender. There still is available space in
the window and the sender can continue sending data. All window
updates then get carried by the regular ACKs. Only when the socket
buffer was (almost) full and the window closed accordingly a window
updates delivery new information and allows the sender to start
sending more data again.
Send window updates only every two MSS when the socket buffer
has less than 1/8 space available, or the available space in the
socket buffer increased by 1/4 its full capacity, or the socket
buffer is very small. The next regular data ACK will carry and
report the exact window size again.
Reported by: sbruno
Tested by: darrenr
Tested by: Darren Baginski
PR: kern/116335
MFC after: 2 weeks