Make it harder to exploit certain in_control() related races between the
intiial lookup at the beginning and the time we will remove the entry
from the lists by re-checking that entry is still in the list before
trying to remove it.
(*) It is believed that with the current code and locking strategy we
cannot completely fix all race.
Reported by: Nima Misaghian (nima_misa hotmail.com) on net@ 20100817
Tested by: Nima Misaghian (nima_misa hotmail.com) (original version)
PR: kern/146250
Submitted by: Mikolaj Golub (to.my.trociny gmail.com) (different version)
MFC after: 1 week
too coarse grained to be useful and the default value significantly degrades TCP
performance on moderate to high bandwidth-delay product paths with non-zero loss
(e.g. 5+Mbps connections across the public Internet often suffer).
Replace the outgoing mechanism with an individual per-queue limit based on the
number of MSS segments that fit into the socket's receive buffer. This should
strike a good balance between performance and the potential for resource
exhaustion when FreeBSD is acting as a TCP receiver. With socket buffer
autotuning (which is enabled by default), the reassembly queue tracks the
socket buffer and benefits too.
As the XXX comment suggests, my testing uncovered some unexpected behaviour
which requires further investigation. By using so->so_rcv.sb_hiwat
instead of sbspace(&so->so_rcv), we allow more segments to be held across both
the socket receive buffer and reassembly queue than we probably should. The
tradeoff is better performance in at least one common scenario, versus a devious
sender's ability to consume more resources on a FreeBSD receiver.
Sponsored by: FreeBSD Foundation
Reviewed by: andre, gnn, rpaulo
MFC after: 2 weeks
"net.inet.tcp.reass.maxsegments" sysctl variables to be based on UMA zone
stats. The value returned by the cursegments sysctl is approximate owing to
the way in which uma_zone_get_cur is implemented.
- Discontinue use of V_tcp_reass_qsize as a global reassembly segment count
variable in the reassembly implementation. The variable was used without
proper synchronisation and was duplicating accounting done by UMA already. The
lack of synchronisation was particularly problematic on SMP systems
terminating many TCP sessions, resulting in poor TCP performance for
connections with non-zero packet loss.
Sponsored by: FreeBSD Foundation
Reviewed by: andre, gnn, rpaulo (as part of a larger patch)
MFC after: 2 weeks
This fixes the bug where setting bw > 1 MTU/tick resulted in
infinite bandwidth if io_fast=1
PR: 147245 148429
Obtained from: Riccardo Panicucci
MFC after: 3 days
net.inet.ip.fw.one_pass and always moved to the next rule
in case of a successful nat.
This should fix several related PR (waiting for feedback
before closing them)
PR: 145167 149572 150141
MFC after: 3 days
un-expiring.
The previous version of code have no locking when testing rt_refcnt.
The result of the lack of locking may result in a condition where
a routing entry have a reference count but at the same time have
RTPRF_OURS bit set and an expiration timer. These would eventually
lead to a panic:
panic: rtqkill route really not free
When the system have ICMP redirects accepted from local gateway
in a moderate frequency, for instance.
Commit this workaround for now until we have some better solution.
PR: kern/149804
Reviewed by: bz
Tested by: Zhao Xin, Pete French
MFC after: 2 weeks
not be used outside of the reassembly queue implementation. Provide a new
function to flush all segments from a reassembly queue and call it from the
appropriate places instead of manipulating the queue directly.
Sponsored by: FreeBSD Foundation
Reviewed by: andre, gnn, rpaulo
MFC after: 2 weeks
in the kernel (just as inet_ntoa() and inet_aton()) are and sync their
prototype accordingly with already mentioned functions.
Sponsored by: Sandvine Incorporated
Reviewed by: emaste, rstone
Approved by: dfr
MFC after: 2 weeks
separate the decision logic, of whether we can do TSO, and the
calculation of the burst length into two distinct parts.
Change the way the TSO burst length calculation is done. While
TSO could do bursts of 65535 bytes that can't be represented in
ip_len together with the IP and TCP header. Account for that and
use IP_MAXPACKET instead of TCP_MAXWIN as base constant (both
have the same value of 64K). When more data is available prevent
less than MSS sized segments from being sent during the current
TSO burst.
Add two more KASSERTs to ensure the integrity of the packets.
Tested by: Ben Wilber <ben-at-desync com>
MFC after: 10 days
to give way for the pluggable congestion control framework. It is
the task of the congestion control algorithm to set the congestion
window and amount of inflight data without external interference.
In 'struct tcpcb' the variables previously used by the inflight
limiter are renamed to spares to keep the ABI intact and to have
some more space for future extensions.
In 'struct tcp_info' the variable 'tcpi_snd_bwnd' is not removed to
preserve the ABI. It is always set to 0.
In siftr.c in 'struct pkt_node' the variable 'snd_bwnd' is not removed
to preserve the ABI. It is always set to 0.
These unused variable in the various structures may be reused in the
future or garbage collected before the next release or at some other
point when an ABI change happens anyway for other reasons.
No MFC is planned. The inflight bandwidth limiter stays disabled by
default in the other branches but remains available.
a small difference in the last paragraph though) as suggested by jhb.
Clarify that the 'reviewed by' in r212653 by lstewart was for the
functional change, not the comments in the committed version.
artificial power-of-2 rounded number to their real values specified
in RFC879 and RFC2460.
From the history and existing comments it appears that the rounded
numbers were intended to be advantageous for the kernel and mbuf
system. However this hasn't been the case at for at least a long
time. The mbuf clusters used in tcp_output() have enough space
to hold the larger real value for the default MSS for both IPv4 and
IPv6. Note that the default MSS is only used when path MTU discovery
is disabled.
Update and expand related comments.
Reviewed by: lsteward (including some word-smithing)
MFC after: 2 weeks
using ipproto_{un,}register() and the newly created ip6proto_{un,}register()
so that it can again receive IPPROTO_CARP packets allowing its state machine
to work.
Reviewed by: bz
Approved by: ken (mentor)
In protosw we define pr_protocol as short, while on the wire
it is an uint8_t. That way we can have "internal" protocols
like DIVERT, SEND or gaps for modules (PROTO_SPACER).
Switch ipproto_{un,}register to accept a short protocol number(*)
and do an upfront check for valid boundries. With this we
also consistently report EPROTONOSUPPORT for out of bounds
protocols, as we did for proto == 0. This allows a caller
to not error for this case, which is especially important
if we want to automatically call these from domain handling.
(*) the functions have been without any in-tree consumer
since the initial introducation, so this is considered save.
Implement ip6proto_{un,}register() similarly to their legacy IP
counter parts to allow modules to hook up dynamically.
Reviewed by: philip, will
MFC after: 1 week
pseudo-interface. This leads to a panic due to uninitialized
if_broadcastaddr address. Initialize it and implement ip_output()
method to prevent mbuf leak later.
ipfw pseudo-interface should never send anything therefore call
panic(9) in if_start() method.
PR: kern/149807
Submitted by: Dmitrij Tejblum
MFC after: 2 weeks
Fix the switching on/off of PF and NR-SACKs using sysctl.
Add minor improvement in handling malloc failures.
Improve the address checks when sending.
MFC after: 4 weeks