AppleTalk was a network transport protocol for Apple Macintosh devices
in 80s and then 90s. Starting with Mac OS X in 2000 the AppleTalk was
a legacy protocol and primary networking protocol is TCP/IP. The last
Mac OS X release to support AppleTalk happened in 2009. The same year
routing equipment vendors (namely Cisco) end their support.
Thus, AppleTalk won't be supported in FreeBSD 11.0-RELEASE.
IPX was a network transport protocol in Novell's NetWare network operating
system from late 80s and then 90s. The NetWare itself switched to TCP/IP
as default transport in 1998. Later, in this century the Novell Open
Enterprise Server became successor of Novell NetWare. The last release
that claimed to still support IPX was OES 2 in 2007. Routing equipment
vendors (e.g. Cisco) discontinued support for IPX in 2011.
Thus, IPX won't be supported in FreeBSD 11.0-RELEASE.
The virtio driver needs the offset to be stored in csum_data,
like in the case for UDP and TCP.
The virtio problem was reported by
Niu Zhixiong <kaiaixi@gmail.com>, who helped in debugging
and testing the patch.
MFC after: 3 days
the protocol specific mbuf flags are shared between them.
- Move all M_FOO definitions into a single place: netinet/in6.h, to
avoid future clashes.
- Resolve clash between M_DECRYPTED and M_SKIP_FIREWALL which resulted
in a failure of operation of IPSEC and packet filters.
Thanks to Nicolas and Georgios for all the hard work on bisecting,
testing and finally finding the root of the problem.
PR: kern/186755
PR: kern/185876
In collaboration with: Georgios Amanakis <gamanakis gmail.com>
In collaboration with: Nicolas DEFFAYET <nicolas-ml deffayet.com>
Sponsored by: Nginx, Inc.
- Use counter(9) for rt_pksent (former rt_rmx.rmx_pksent). This
removes another cache trashing ++ from packet forwarding path.
- Create zini/fini methods for the rtentry UMA zone. Via initialize
mutex and counter in them.
- Fix reporting of rmx_pksent to routing socket.
- Fix netstat(1) to report "Use" both in kvm(3) and sysctl(3) mode.
The change is mostly targeted for stable/10 merge. For head,
rt_pksent is expected to just disappear.
Discussed with: melifaro
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
The ifaddr is already referenced by the rtentry, and we are holding
reference on the rtentry throughout the function execution.
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
were primarily used to size the sysctl name list macros that were removed
in r254295. A few other constants either did not have an associated
sysctl node, or the associated node used OID_AUTO instead.
PR: ports/184525 (exp-run)
tested and is unfinished. However, I've tested my version,
it works okay. As before it is unfinished: timeout aren't
driven by TCP session state. To enable the HASH_ALL mode,
one needs in kernel config:
options FLOWTABLE_HASH_ALL
o Reduce the alignment on flentry to 64 bytes. Without
the FLOWTABLE_HASH_ALL option, twice less memory would
be consumed by flows.
o API to ip_output()/ip6_output() got even more thin: 1 liner.
o Remove unused unions. Simply use fle->f_key[].
o Merge all IPv4 code into flowtable_lookup_ipv4(), and do same
flowtable_lookup_ipv6(). Stop copying data to on stack
sockaddr structures, simply use key[] on stack.
o Move code from flowtable_lookup_common() that actually works
on insertion into flowtable_insert().
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
- ip_output() and ip_output6() simply call flowtable_lookup(),
passing mbuf and address family. That's the only code under
#ifdef FLOWTABLE in the protocols code now.
o Revamp statistics gathering and export.
- Remove hand made pcpu stats, and utilize counter(9).
- Snapshot of statistics is available via 'netstat -rs'.
- All sysctls are moved into net.flowtable namespace, since
spreading them over net.inet isn't correct.
o Properly separate at compile time INET and INET6 parts.
o General cleanup.
- Remove chain of multiple flowtables. We simply have one for
IPv4 and one for IPv6.
- Flowtables are allocated in flowtable.c, symbols are static.
- With proper argument to SYSINIT() we no longer need flowtable_ready.
- Hash salt doesn't need to be per-VNET.
- Removed rudimentary debugging, which use quite useless in dtrace era.
The runtime behavior of flowtable shouldn't be changed by this commit.
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
By definition, the very first FIN is not a duplicate. Process it normally
and don't feed it to congestion control as though it were a dupe. Don't
prevent CC from seeing later dupe acks while in a half close state.
the INP_INFO lock from tcp_usr_accept. As the PR/patch states
this was following the advice already in the code.
See the PR below for a full disucssion of this change and its
measured effects.
PR: 183659
Submitted by: Julian Charbon
Reviewed by: jhb
pim_input() properly.
While here, remove extra variable and incorrect condition
before m_pullup().
Reported by: Olivier Cochard-Labbé <olivier cochard.me>
Sponsored by: Nginx, Inc.
* Set ia address/mask values BEFORE attaching to address lists.
Inet6 address assignment is not atomic, so the simplest way to
do this atomically is to fill in ia before attach.
* Validate irfa->ia_addr field before use (we permit ANY sockaddr in old code).
* Do some renamings:
in6_ifinit -> in6_notify_ifa (interaction with other subsystems is here)
in6_setup_ifa -> in6_broadcast_ifa (LLE/Multicast/DaD code)
in6_ifaddloop -> nd6_add_ifa_lle
in6_ifremloop -> nd6_rem_ifa_lle
* Split working with LLE and route announce code for last two.
Add temporary in6_newaddrmsg() function to mimic current rtsock behaviour.
* Call device SIOCSIFADDR handler IFF we're adding first address.
In IPv4 we have to call it on every address change since ARP record
is installed by arp_ifinit() which is called by given handler.
IPv6 stack, on the opposite is responsible to call nd6_add_ifa_lle() so
there is no reason to call SIOCSIFADDR often.
of a syncache connection, copy it into the inp_flowid field.
Without this, an incoming TCP connection won't have an inp_flowid marked
until some data comes in, and this means that things like the per-CPU
TCP timer option will choose a different CPU for the timer work.
(It also means that if one grabbed the flowid via an ioctl from userland,
it won't be available until some data has been received.)
Sponsored by: Netflix, Inc.
Problem case:
Original lookup returns route with GW set, so gw points to
rte->rt_gateway.
After that we're changing dst and performing lookup another time.
Since fwd host is most probably directly reachable, resulting
rte does not contain rt_gateway, so gw is not set. Finally, we
end with packet transmitted to proper interface but wrong
link-layer address.
Found by: lstewart
Discussed with: ae,lstewart
MFC after: 2 weeks
Sponsored by: Yandex LLC
has the same prefix as some other alias on the same interface, use
newly-added rt_addrmsg() instead of hand-rolled in_addralias_rtmsg().
This eliminates the following rtsock messages:
Pinned RTM_ADD for prefix (for alias addition).
Pinned RTM_DELETE for prefix (for alias withdrawal).
Example (got 10.0.0.1/24 on vlan4, playing with 10.0.0.2/24):
before commit, addition:
got message of size 116 on Fri Jan 10 14:13:15 2014
RTM_NEWADDR: address being added to iface: len 116, metric 0, flags:
sockaddrs: <NETMASK,IFP,IFA,BRD>
255.255.255.0 vlan4:8.0.27.c5.29.d4 10.0.0.2 10.0.0.255
got message of size 192 on Fri Jan 10 14:13:15 2014
RTM_ADD: Add Route: len 192, pid: 0, seq 0, errno 0, flags:<UP,PINNED>
locks: inits:
sockaddrs: <DST,GATEWAY,NETMASK>
10.0.0.0 10.0.0.2 (255) ffff ffff ff
after commit, addition:
got message of size 116 on Fri Jan 10 13:56:26 2014
RTM_NEWADDR: address being added to iface: len 116, metric 0, flags:
sockaddrs: <NETMASK,IFP,IFA,BRD>
255.255.255.0 vlan4:8.0.27.c5.29.d4 14.0.0.2 14.0.0.255
before commit, wihdrawal:
got message of size 192 on Fri Jan 10 13:58:59 2014
RTM_DELETE: Delete Route: len 192, pid: 0, seq 0, errno 0, flags:<UP,PINNED>
locks: inits:
sockaddrs: <DST,GATEWAY,NETMASK>
10.0.0.0 10.0.0.2 (255) ffff ffff ff
got message of size 116 on Fri Jan 10 13:58:59 2014
RTM_DELADDR: address being removed from iface: len 116, metric 0, flags:
sockaddrs: <NETMASK,IFP,IFA,BRD>
255.255.255.0 vlan4:8.0.27.c5.29.d4 10.0.0.2 10.0.0.255
adter commit, withdrawal:
got message of size 116 on Fri Jan 10 14:14:11 2014
RTM_DELADDR: address being removed from iface: len 116, metric 0, flags:
sockaddrs: <NETMASK,IFP,IFA,BRD>
255.255.255.0 vlan4:8.0.27.c5.29.d4 10.0.0.2 10.0.0.255
Sending both RTM_ADD/RTM_DELETE messages to rtsock is completely wrong
(and requires some hacks to keep prefix in route table on RTM_DELETE).
I've tested this change with quagga (no change) and bird (*).
bird alias handling is already broken in *BSD sysdep code, so nothing
changes here, too.
I'm going to MFC this change if there will be no complains about behavior
change.
While here, fix some style(9) bugs introduced by r260488
(pointed by glebius and bde).
Sponsored by: Yandex LLC
MFC after: 4 weeks
Thus we can use IF_AFDATA_RLOCK() instead of IF_AFDATA_LOCK() when doing
lla_lookup() without LLE_CREATE flag.
Reviewed by: glebius, adrian
MFC after: 1 week
Sponsored by: Yandex LLC
an interface:
- in in_control() skip over not AF_INET addresses.
- in in_aifaddr_ioctl() and in_difaddr_ioctl() do correct check
of address family, w/o accessing memory beyond struct ifaddr.
Sponsored by: Nginx, Inc.
- Host Identity Protocol (RFC5201)
- Shim6 Protocol (RFC5533)
- 2x experimentation and testing (RFC3692, RFC4727)
This does not indicate interest to implement/support these protocols,
but they are part of the "IPv6 Extension Header Types" [1] based on RFC7045
and might thus be needed by filtering and next header parsing
implementations.
References: [1] http://www.iana.org/assignments/ipv6-parameters
Obtained from: http://www.iana.org/assignments/protocol-numbers
MFC after: 1 week
- Remove locking, since all module(9) events are running under &Giant.
- Use TAILQ for protocol handlers and fix a bug which led to
infinite cycle. Bug found in VirtualBox [1]
- Simplify code everywhere.
- Fix documentation.
[1] https://www.virtualbox.org/pipermail/vbox-dev/2013-November/011936.html
PR: 183792 [1]
Submitted by: Valery Ushakov <uwe NetBSD.org> [1]
Sponsored by: Nginx, Inc.
eneough queue space before queuing a bunch of IP fragments.
As the comment in the committed change says, in the post-if_transmit(),
post-SMP, post-preemption world, there's just too much overlapping
concurrent code paths and different approaches to driver transmit
queue management to have this code even remotely be effective.
The only specific place it could be useful is if ALTQ is enabled
but again it doesn't at all promise that all the fragments will be
transmitted anyway.
The main reason for committing this change is to disable a parallel
place where the drops counter is incremented. This is a side effect
of an upcoming change to ixgbe/cxgbe to handle the queue drops
counter slightly better.
Sponsored by: Netflix, Inc.
- The remote host sends a FIN
- in an ACK for a sequence number for which an ACK has already
been received
- There is still unacked data on route to the remote host
- The packet does not contain a window update
The packet may be dropped without processing the FIN flag.
PR: kern/99188
Submitted by: Staffan Ulfberg <staffan@ulfberg.se>
Discussed with: andre
MFC after: never
http://svnweb.freebsd.org/changeset/base/258221
I introduced a bug which initialized global locks
whenever the SCTP stack initialized. This was fixed in
http://svnweb.freebsd.org/changeset/base/258574
by rodrigc@. He just initialized the locks for
the default vnet. This fix reverts to the old
behaviour before r258221, which explicitly makes
sure it is only called once, because this works also on
other platforms.
MFC after: 3 days
X-MFC with: r258574.
In its stead use the Solaris / illumos approach of emulating '-' (dash)
in probe names with '__' (two consecutive underscores).
Reviewed by: markj
MFC after: 3 weeks
In r208160, sctp_it_ctl was made a global variable, across all VNETs.
However, sctp_init() is called for every VNET that is created. This results
in the same global mutexes which are part of sctp_it_ctl being initialized. This can result
in crashes if many jails are created.
To reproduce the problem:
(1) Take a GENERIC kernel config, and add options for: VIMAGE, WITNESS,
INVARIANTS.
(2) Run this command in a loop:
jail -l -u root -c path=/ name=foo persist vnet && jexec foo ifconfig lo0 127.0.0.1/8 && jail -r foo
(see http://lists.freebsd.org/pipermail/freebsd-current/2010-November/021280.html )
Witness will warn about the same mutex being initialized.
Fix the problem by only initializing these mutexes in the default VNET.
option, unbreak the lock tracing release semantic by embedding
calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined
version of the releasing functions for mutex, rwlock and sxlock.
Failing to do so skips the lockstat_probe_func invokation for
unlocking.
- As part of the LOCKSTAT support is inlined in mutex operation, for
kernel compiled without lock debugging options, potentially every
consumer must be compiled including opt_kdtrace.h.
Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the
dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
is linked there and it is only used as a compile-time stub [0].
[0] immediately shows some new bug as DTRACE-derived support for debug
in sfxge is broken and it was never really tested. As it was not
including correctly opt_kdtrace.h before it was never enabled so it
was kept broken for a while. Fix this by using a protection stub,
leaving sfxge driver authors the responsibility for fixing it
appropriately [1].
Sponsored by: EMC / Isilon storage division
Discussed with: rstone
[0] Reported by: rstone
[1] Discussed with: philip
o Provide separate functions for SIOCAIFADDR and for SIOCDIFADDR, with
clear code flow from beginning to the end. After that the rest of
in_control() gets very small and clear.
o Provide sx(9) lock to protect against parallel ioctl() invocations.
o Reimplement logic from r201282, that tried to keep localhost route in
table when multiple P2P interfaces with same local address are created
and deleted.
Discussed with: pluknet, melifaro
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
default from the very beginning. It was placed in wrong namespace
net.link.ether, originally it had been at another wrong namespace. It was
incorrectly documented at incorrect manual page arp(8). Since new-ARP commit,
the tunable have been consulted only on route addition, and ignored on route
deletion. Behaviour of a system with tunable turned off is not fully correct,
and has no advantages comparing to normal behavior.
to this event, adding if_var.h to files that do need it. Also, include
all includes that now are included due to implicit pollution via if_var.h
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
segments thinking it received only one segment. This causes it to enable
the delay the ACK for 100ms to wait for another segment which may never
come because all the data was received already.
Doing delayed ACK for LRO segments is bogus for two reasons: a) it pushes
us further away from acking every other packet; b) it introduces additional
delay in responding to the sender. The latter is especially bad because it
is in the nature of LRO to aggregated all segments of a burst with no more
coming until an ACK is sent back.
Change the delayed ACK logic to detect LRO segments by being larger than
the MSS for this connection and issuing an immediate ACK for them to keep
the ACK clock ticking without interruption.
Reported by: julian, cperciva
Tested by: cperciva
Reviewed by: lstewart
MFC after: 3 days
four counters to struct ifaddr. This kills '+=' on a variables shared
between processors for every packet.
- Nuke struct if_data from struct ifaddr.
- In ip_input() do not put a reference on ifaddr, instead update statistics
right now in place and do IN_IFADDR_RUNLOCK(). These removes atomic(9)
for every packet. [1]
- To properly support NET_RT_IFLISTL sysctl used by getifaddrs(3), in
rtsock.c fill if_data fields using counter_u64_fetch().
- Accidentially fix bug in COMPAT_32 version of NET_RT_IFLISTL, which
took if_data not from the ifaddr, but from ifaddr's ifnet. [2]
Submitted by: melifaro [1], pluknet[2]
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
sbdrop_locked() to cut acked mbufs from the socket buffer. Free this
chain a batch manner after the socket buffer lock is dropped.
This measurably reduces contention on socket buffer.
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
Approved by: re (marius)
input path. These probes get some of the fields in host order, whereas the
output probes get them in network order, so a single translator isn't
enough. This workaround ensures that the problem is essentially invisble
to users: none of the probe arguments or their fields have changed.
Approved by: re (hrs)
so that fixed TCP_SIGNATURE handling can later be merged.
This is derived from follow-up work to SVN r183001 posted to
net@ on Sep 13 2008.
Approved by: re (gjb)
matches the types used when computing hash indices and the type of the
maximum size of mfchashtbl[].
PR: kern/181821
Submitted by: Sven-Thorsten Dietrich <sven@vyatta.com> (IPv4)
MFC after: 1 week
* Remove non working code related to SHA224.
* Remove support for non-standardised HMAC-IDs using SHA384 and SHA512.
* Prefer SHA256 over SHA1.
* Minor cleanup.
MFC after: 2 weeks
Add a last-modified timestamp to each LRO entry and provide an interface
to flush all inactive entries. Drivers decide when to flush and what
the inactivity threshold should be.
Network drivers that process an rx queue to completion can enter a
livelock type situation when the rate at which packets are received
reaches equilibrium with the rate at which the rx thread is processing
them. When this happens the final LRO flush (normally when the rx
routine is done) does not occur. Pure ACKs and segments with total
payload < 64K can get stuck in an LRO entry. Symptoms are that TCP
tx-mostly connections' performance falls off a cliff during heavy,
unrelated rx on the interface.
Flushing only inactive LRO entries works better than any of these
alternates that I tried:
- don't LRO pure ACKs
- flush _all_ LRO entries periodically (every 'x' microseconds or every
'y' descriptors)
- stop rx processing in the driver periodically and schedule remaining
work for later.
Reviewed by: andre
dynamic translation so that their arguments match the definitions for
these providers in Solaris and illumos. Thus, existing scripts for these
providers should work unmodified on FreeBSD.
Tested by: gnn, hiren
MFC after: 1 month
features. The changes in particular are:
o Remove rarely used "header" pointer and replace it with a 64bit protocol/
layer specific union PH_loc for local use. Protocols can flexibly overlay
their own 8 to 64 bit fields to store information while the packet is
worked on.
o Mechanically convert IP reassembly, IGMP/MLD and ATM to use pkthdr.PH_loc
instead of pkthdr.header.
o Extend csum_flags to 64bits to allow for additional future offload
information to be carried (e.g. iSCSI, IPsec offload, and others).
o Move the RSS hash type enumerator from abusing m_flags to its own 8bit
rsstype field. Adjust accessor macros.
o Add cosqos field to store Class of Service / Quality of Service information
with the packet. It is not yet supported in any drivers but allows us to
get on par with Cisco/Juniper in routing applications (plus MPLS QoS) with
a modernized ALTQ.
o Add four 8 bit fields l[2-5]hlen to store the relative header offsets
from the start of the packet. This is important for various offload
capabilities and to relieve the drivers from having to parse the packet
and protocol headers to find out location of checksums and other
information. Header parsing in drivers is a lot of copy-paste and
unhandled corner cases which we want to avoid.
o Add another flexible 64bit union to map various additional persistent
packet information, like ether_vtag, tso_segsz and csum fields.
Depending on the csum_flags settings some fields may have different usage
making it very flexible and adaptable to future capabilities.
o Restructure the CSUM flags to better signify their outbound (down the
stack) and inbound (up the stack) use. The CSUM flags used to be a bit
chaotic and rather poorly documented leading to incorrect use in many
places. Bring clarity into their use through better naming.
Compatibility mappings are provided to preserve the API. The drivers
can be corrected one by one and MFC'd without issue.
o The size of pkthdr stays the same at 48/56bytes (32/64bit architectures).
Sponsored by: The FreeBSD Foundation
When exporting to xinpcb, just export the lower
32-bit. Using there also 64-bits will break the
ABI and will be committed separetly.
MFC after: 2 weeks
X-MFC with: 254248
together.
Add M_FLAG_PRINTF for use with printf(9) %b indentifier.
Use the generic mbuf flags print names in the net80211 code and adjust
the protocol specific bits for their new positions.
Change SCTP M_PROTO mapping from 5 to 1 to fit within the 16bit field
they use internally to store some additional information.
Discussed with: trociny, glebius
flag instead. The flag is only used within the IP and IPv6 layer 3
protocols.
Because some firewall packages treat IPv4 and IPv6 packets the same the
flag should have the same value for both.
Discussed with: trociny, glebius
using SDT_PROBE_ARGTYPE(). This will make it easy to extend the SDT(9) API
to allow probes with dynamically-translated types.
There is no functional change.
MFC after: 2 weeks
structure is used, but they already have equal fields in the struct
newipsecstat, that was introduced with FAST_IPSEC and then was merged
together with old ipsecstat structure.
This fixes kernel stack overflow on some architectures after migration
ipsecstat to PCPU counters.
Reported by: Taku YAMAMOTO, Maciej Milewski
duplicated sockets a multicast address is bound and either
SO_REUSEPORT or SO_REUSEADDR is set.
But actually it works for the following combinations:
* SO_REUSEPORT is set for the fist socket and SO_REUSEPORT for the new;
* SO_REUSEADDR is set for the fist socket and SO_REUSEADDR for the new;
* SO_REUSEPORT is set for the fist socket and SO_REUSEADDR for the new;
and fails for this:
* SO_REUSEADDR is set for the fist socket and SO_REUSEPORT for the new.
Fix the last case.
PR: 179901
MFC after: 1 month
information into the ISN (initial sequence number) without the additional
use of timestamp bits and switching to the very fast and cryptographically
strong SipHash-2-4 MAC hash algorithm to protect the SYN cookie against
forgeries.
The purpose of SYN cookies is to encode all necessary session state in
the 32 bits of our initial sequence number to avoid storing any information
locally in memory. This is especially important when under heavy spoofed
SYN attacks where we would either run out of memory or the syncache would
fill with bogus connection attempts swamping out legitimate connections.
The original SYN cookies method only stored an indexed MSS values in the
cookie. This isn't sufficient anymore and breaks down in the presence of
WSCALE information which is only exchanged during SYN and SYN-ACK. If we
can't keep track of it then we may severely underestimate the available
send or receive window. This is compounded with large windows whose size
information on the TCP segment header is even lower numerically. A number
of years back SYN cookies were extended to store the additional state in
the TCP timestamp fields, if available on a connection. While timestamps
are common among the BSD, Linux and other *nix systems Windows never enabled
them by default and thus are not present for the vast majority of clients
seen on the Internet.
The common parameters used on TCP sessions have changed quite a bit since
SYN cookies very invented some 17 years ago. Today we have a lot more
bandwidth available making the use window scaling almost mandatory. Also
SACK has become standard making recovering from packet loss much more
efficient.
This change moves all necessary information into the ISS removing the need
for timestamps. Both the MSS (16 bits) and send WSCALE (4 bits) are stored
in 3 bit indexed form together with a single bit for SACK. While this is
significantly less than the original range, it is sufficient to encode all
common values with minimal rounding.
The MSS depends on the MTU of the path and with the dominance of ethernet
the main value seen is around 1460 bytes. Encapsulations for DSL lines
and some other overheads reduce it by a few more bytes for many connections
seen. Rounding down to the next lower value in some cases isn't a problem
as we send only slightly more packets for the same amount of data.
The send WSCALE index is bit more tricky as rounding down under-estimates
the available send space available towards the remote host, however a small
number values dominate and are carefully selected again.
The receive WSCALE isn't encoded at all but recalculated based on the local
receive socket buffer size when a valid SYN cookie returns. A listen socket
buffer size is unlikely to change while active.
The index values for MSS and WSCALE are selected for minimal rounding errors
based on large traffic surveys. These values have to be periodically
validated against newer traffic surveys adjusting the arrays tcp_sc_msstab[]
and tcp_sc_wstab[] if necessary.
In addition the hash MAC to protect the SYN cookies is changed from MD5
to SipHash-2-4, a much faster and cryptographically secure algorithm.
Reviewed by: dwmalone
Tested by: Fabian Keil <fk@fabiankeil.de>
dereferencing, when checking for SO_REUSEPORT option (and SO_REUSEADDR
for multicast), INP_REUSEPORT flag was introduced to cache the socket
option. It was decided then that one flag would be enough to cache
both SO_REUSEPORT and SO_REUSEADDR: when processing SO_REUSEADDR
setsockopt(2), it was checked if it was called for a multicast address
and INP_REUSEPORT was set accordingly.
Unfortunately that approach does not work when setsockopt(2) is called
before binding to a multicast address: the multicast check fails and
INP_REUSEPORT is not set.
Fix this by adding INP_REUSEADDR flag to unconditionally cache
SO_REUSEADDR.
PR: 179901
Submitted by: Michael Gmelin freebsd grem.de (initial version)
Reviewed by: rwatson
MFC after: 1 week
algorithm, which is based on the 2011 v0.1 patch release and described in the
paper "Revisiting TCP Congestion Control using Delay Gradients" by David Hayes
and Grenville Armitage. It is implemented as a kernel module compatible with the
modular congestion control framework.
CDG is a hybrid congestion control algorithm which reacts to both packet loss
and inferred queuing delay. It attempts to operate as a delay-based algorithm
where possible, but utilises heuristics to detect loss-based TCP cross traffic
and will compete effectively as required. CDG is therefore incrementally
deployable and suitable for use on shared networks.
In collaboration with: David Hayes <david.hayes at ieee.org> and
Grenville Armitage <garmitage at swin edu au>
MFC after: 4 days
Sponsored by: Cisco University Research Program and FreeBSD Foundation
increased the pointer, not the memory it points to.
In collaboration with: kib
Reported & tested by: Ian FREISLICH <ianf clue.co.za>
Sponsored by: Nginx, Inc.
limited in the amount of data they can handle at once.
Drivers can set ifp->if_hw_tsomax before calling ether_ifattach() to
change the limit.
The lowest allowable size is IP_MAXPACKET / 8 (8192 bytes) as anything
less wouldn't be very useful anymore. The upper limit is still at
IP_MAXPACKET (65536 bytes). Raising it requires further auditing of
the IPv4/v6 code path's as the length field in the IP header would
overflow leading to confusion in firewalls and others packet handler on
the real size of the packet.
The placement into "struct ifnet" is a bit hackish but the best place
that was found. When the stack/driver boundary is updated it should
be handled in a better way.
Submitted by: cperciva (earlier version)
Reviewed by: cperciva
Tested by: cperciva
MFC after: 1 week (using spare struct members to preserve ABI)
Address. Although KAME implementation used FF02:0:0:0:0:2::/96 based on
older versions of draft-ietf-ipngwg-icmp-name-lookup, it has been changed
in RFC 4620.
The kernel always joins the /104-prefixed address, and additionally does
/96-prefixed one only when net.inet6.icmp6.nodeinfo_oldmcprefix=1.
The default value of the sysctl is 1.
ping6(8) -N flag now uses /104-prefixed one. When this flag is specified
twice, it uses /96-prefixed one instead.
Reviewed by: ume
Based on work by: Thomas Scheffler
PR: conf/174957
MFC after: 2 weeks
same place as dst, or to the sockaddr in the routing table.
The const constraint of gw makes us safe from modifing routing table
accidentially. And "onstantness" of dst allows us to remove several
bandaids, when we switched it back at &ro->ro_dst, now it always
points there.
Reviewed by: rrs
route. What it was is there are two places in ip_output.c
where we do a goto again. One place was fine, it
copies out the new address and then resets dst = ro->rt_dst;
But the other place does *not* do that, which means earlier
when we found the gateway, we have dst pointing there
aka dst = ro->rt_gateway is done.. then we do a
goto again.. bam now we clobber the default route.
The fix is just to move the again so we are always
doing dst = &ro->rt_dst; in the again loop.
PR: 174749,157796
MFC after: 1 week
duplicate ACK make sure we actually have new data to send.
This prevents us from sending unneccessary pure ACKs.
Reported by: Matt Miller <matt@matthewjmiller.net>
Tested by: Matt Miller <matt@matthewjmiller.net>
MFC after: 2 weeks
mbuf allocation fails, as in a case when ip_output() returns error.
To achieve that, move large block of code that updates tcpcb below
the out: label.
This fixes a panic, that requires the following sequence to happen:
1) The SYN was sent to the network, tp->snd_nxt = iss + 1, tp->snd_una = iss
2) The retransmit timeout happened for the SYN we had sent,
tcp_timer_rexmt() sets tp->snd_nxt = tp->snd_una, and calls tcp_output().
In tcp_output m_get() fails.
3) Later on the SYN|ACK for the SYN sent in step 1) came,
tcp_input sets tp->snd_una += 1, which leads to
tp->snd_una > tp->snd_nxt inconsistency, that later panics in
socket buffer code.
For reference, this bug fixed in DragonflyBSD repo:
http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/1ff9b7d322dc5a26f7173aa8c38ecb79da80e419
Reviewed by: andre
Tested by: pho
Sponsored by: Nginx, Inc.
PR: kern/177456
Submitted by: HouYeFei&XiBoLiu <lglion718 163.com>
reside on their own cache line to prevent false sharing with other
nearby structures, especially for those in the .bss segment.
NB: Those mutexes and rwlocks with variables next to them that get
changed on every invocation do not benefit from their own cache line.
Actually it may be net negative because two cache misses would be
incurred in those cases.
connections in the accept queue and contiguous new incoming SYNs.
Compared to the original submitters patch I've moved the test
next to the SYN handling to have it together in a logical unit
and reworded the comment explaining the issue.
Submitted by: Matt Miller <matt@matthewjmiller.net>
Submitted by: Juan Mojica <jmojica@gmail.com>
Reviewed by: Matt Miller (changes)
Tested by: pho
MFC after: 1 week
Convert 'struct ipstat' and 'struct tcpstat' to counter(9).
This speeds up IP forwarding at extreme packet rates, and
makes accounting more precise.
Sponsored by: Nginx, Inc.
For TIMEWAIT handling tcp_input may have to jump back for an additional
pass through pcblookup. Prior to this change the fwd_tag had been
discarded after the first lookup, so a new connection attempt delivered
locally via 'ipfw fwd' would fail to find a match.
As of r248886 the tag will be detached and freed when passed to the
socket buffer.
Setting DSCP support is done via O_SETDSCP which works for both
IPv4 and IPv6 packets. Fast checksum recalculation (RFC 1624) is done for IPv4.
Dscp can be specified by name (AFXY, CSX, BE, EF), by value
(0..63) or via tablearg.
Matching DSCP is done via another opcode (O_DSCP) which accepts several
classes at once (af11,af22,be). Classes are stored in bitmask (2 u32 words).
Many people made their variants of this patch, the ones I'm aware of are
(in alphabetic order):
Dmitrii Tejblum
Marcelo Araujo
Roman Bogorodskiy (novel)
Sergey Matveichuk (sem)
Sergey Ryabin
PR: kern/102471, kern/121122
MFC after: 2 weeks
and that can drive someone crazy. While m_get2() is young and not
documented yet, change its order of arguments to match m_getm2().
Sorry for churn, but better now than later.
Fix a siftr(4) potential memory leak and INVARIANTS triggered kernel panic in
hashdestroy() by ensuring the last array index in the flow counter hash table is
flushed of entries.
MFC after: 3 days
precise time event generation. This greatly improves granularity of
callouts which are not anymore constrained to wait next tick to be
scheduled.
- Extend the callout KPI introducing a set of callout_reset_sbt* functions,
which take a sbintime_t as timeout argument. The new KPI also offers a
way for consumers to specify precision tolerance they allow, so that
callout can coalesce events and reduce number of interrupts as well as
potentially avoid scheduling a SWI thread.
- Introduce support for dispatching callouts directly from hardware
interrupt context, specifying an additional flag. This feature should be
used carefully, as long as interrupt context has some limitations
(e.g. no sleeping locks can be held).
- Enhance mechanisms to gather informations about callwheel, introducing
a new sysctl to obtain stats.
This change breaks the KBI. struct callout fields has been changed, in
particular 'int ticks' (4 bytes) has been replaced with 'sbintime_t'
(8 bytes) and another 'sbintime_t' field was added for precision.
Together with: mav
Reviewed by: attilio, bde, luigi, phk
Sponsored by: Google Summer of Code 2012, iXsystems inc.
Tested by: flo (amd64, sparc64), marius (sparc64), ian (arm),
markj (amd64), mav, Fabian Keil
Specifcially, in_cksum_hdr() returns 0 (not 0xffff) when the IPv4
checksum is correct. Without this fix, the tcp_lro code will reject
good IPv4 traffic from drivers that do not implement IPv4 header
harder csum offload.
Sponsored by: Myricom Inc.
MFC after: 7 days
- fix indentation
- put the operator at the end of the line for long statements
- remove spaces between the type and the variable in a cast
- remove excessive parentheses
Tested by: md5
ABORT cause, since the user can also provide this kind of
information. So the receiver doesn't know who provided the
information.
While there: Fix a bug where the stack would send a malformed
ABORT chunk when using a send() call with SCTP_ABORT|SCT_SENDALL
flags.
MFC after: 3 days
of helper functions:
- carp_master() - boolean function which is true if an address
is in the MASTER state.
- ifa_preferred() - boolean function that compares two addresses,
and is aware of CARP.
Utilize ifa_preferred() in ifa_ifwithnet().
The previous version of patch also changed source address selection
logic in jails using carp_master(), but we failed to negotiate this part
with Bjoern. May be we will approach this problem again later.
Reported & tested by: Anton Yuzhaninov <citrin citrin.ru>
Sponsored by: Nginx, Inc
Since ARP and routing are separated, "proxy only" entries
don't have any meaning, thus we don't need additional field
in sockaddr to pass SIN_PROXY flag.
New kernel is binary compatible with old tools, since sizes
of sockaddr_inarp and sockaddr_in match, and sa_family are
filled with same value.
The structure declaration is left for compatibility with
third party software, but in tree code no longer use it.
Reviewed by: ru, andre, net@