This patch adds a new socket option, SO_REUSEPORT_LB, which allow multiple
programs or threads to bind to the same port and incoming connections will be
load balanced using a hash function.
Most of the code was copied from a similar patch for DragonflyBSD.
However, in DragonflyBSD, load balancing is a global on/off setting and can not
be set per socket. This patch allows for simultaneous use of both the current
SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system.
Required changes to structures
Globally change so_options from 16 to 32 bit value to allow for more options.
Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets.
Limitations
As DragonflyBSD, a load balance group is limited to 256 pcbs
(256 programs or threads sharing the same socket).
Submitted by: Johannes Lundberg <johanlun0@gmail.com>
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D11003
It is the forerunner/foundational work of bringing in both Rack and BBR
which use hpts for pacing out packets. The feature is optional and requires
the TCPHPTS option to be enabled before the feature will be active. TCP
modules that use it must assure that the base component is compile in
the kernel in which they are loaded.
MFC after: Never
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D15020
While Arcnet has some continued deployment in industrial controls, the
lack of drivers for any of the PCI, USB, or PCIe NICs on the market
suggests such users aren't running FreeBSD.
Evidence in the PR database suggests that the cm(4) driver (our sole
Arcnet NIC) was broken in 5.0 and has not worked since.
PR: 182297
Reviewed by: jhibbits, vangyzen
Relnotes: yes
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D15057
Defines in net/if_media.h remain in case code copied from ifconfig is in
use elsewere (supporting non-existant media type is harmless).
Reviewed by: kib, jhb
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D15017
Export additional information which may be helpful to userspace
consumers and rename the sysctl to net.inet.tcp.function_info.
Sponsored by: Netflix, Inc.
opt_compat.h is mentioned in nearly 180 files. In-progress network
driver compabibility improvements may add over 100 more so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.
Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c. A fake _COMPAT_LINUX option ensure opt_compat.h
is created on all architectures.
Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.
Reviewed by: kib, cem, jhb, jtl
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14941
tcp_drop() may unlock the INP. Currently, tcp_usr_abort() does not
check for this case, which results in a panic while trying to unlock
the already-unlocked INP (not to mention, a use-after-free violation).
Make tcp_usr_abort() check the return value of tcp_drop(). In the case
where tcp_drop() returns NULL, tcp_usr_abort() can skip further steps
to abort the connection and simply unlock the INP_INFO lock prior to
returning.
Reviewed by: glebius
MFC after: 2 weeks
Sponsored by: Netflix, Inc.
assumption is violated, "bad things" could follow.
I believe such an assert would have detected some of the problems jch@
was chasing in PR 203175 (see r307551). We also use it in our internal
TCP development efforts. And, in case a bug does slip through to
released code, this change silently ignores subsequent calls to
in_pcbfree().
Reviewed by: rrs
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D14990
strcpy was used to copy a string into a buffer copied to userland, which
left uninitialized data after the terminating 0-byte. Use the same
approach as in tcp_subr.c: strncpy and explicit '\0'.
admbugs: 765, 822
MFC after: 1 day
Reported by: Ilja Van Sprundel <ivansprundel@ioactive.com>
Reported by: Vlad Tsyrklevich
Security: Kernel memory disclosure
Sponsored by: The FreeBSD Foundation
captured and reported the highest window advertisement with the same
SEQ/ACK. However, the window comparison uses modulo 2**16 math, rather
than directly comparing the absolute values. Because windows use
absolute values and not modulo 2**16 math (i.e. they don't wrap), we
need to compare the absolute values.
Reviewed by: gallatin
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D14937
This fixes 32-bit compat (no ioctl command defintions are required
as struct ifreq is the same size). This is believed to be sufficent to
fully support ifconfig on 32-bit systems.
Reviewed by: kib
Obtained from: CheriBSD
MFC after: 1 week
Relnotes: yes
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14900
Singed calculations in cubic_cwnd() can result in negative cwnd
value which is then cast to an unsigned value. Values less than
1 mss are generally bad for other parts of the code, also fixed.
Submitted by: Jason Eggleston <jason@eggnet.com>
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D14141
controlled by the TCP_BLACKBOX option.
Enable this as part of amd64 GENERIC. For now, leave it disabled on
other platforms.
Sponsored by: Netflix, Inc.
Forwarded packets passed through PFIL_OUT, which made it difficult for
firewalls to figure out if they were forwarding or producing packets. This in
turn is an issue for pf for IPv6 fragment handling: it needs to call
ip6_output() or ip6_forward() to handle the fragments. Figuring out which was
difficult (and until now, incorrect).
Having pfil distinguish the two removes an ugly piece of code from pf.
Introduce a new variant of the netpfil callbacks with a flags variable, which
has PFIL_FWD set for forwarded packets. This allows pf to reliably work out if
a packet is forwarded.
Reviewed by: ae, kevans
Differential Revision: https://reviews.freebsd.org/D13715
summits at BSDCan and BSDCam in 2017.
The TCP Blackbox Recorder allows you to capture events on a TCP connection
in a ring buffer. It stores metadata with the event. It optionally stores
the TCP header associated with an event (if the event is associated with a
packet) and also optionally stores information on the sockets.
It supports setting a log ID on a TCP connection and using this to correlate
multiple connections that share a common log ID.
You can log connections in different modes. If you are doing a coordinated
test with a particular connection, you may tell the system to put it in
mode 4 (continuous dump). Or, if you just want to monitor for errors, you
can put it in mode 1 (ring buffer) and dump all the ring buffers associated
with the connection ID when we receive an error signal for that connection
ID. You can set a default mode that will be applied to a particular ratio
of incoming connections. You can also manually set a mode using a socket
option.
This commit includes only basic probes. rrs@ has added quite an abundance
of probes in his TCP development work. He plans to commit those soon.
There are user-space programs which we plan to commit as ports. These read
the data from the log device and output pcapng files, and then let you
analyze the data (and metadata) in the pcapng files.
Reviewed by: gnn (previous version)
Obtained from: Netflix, Inc.
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D11085
from entering the TIME_WAIT state. However, it omits sending the ACK for the
FIN, which results in RST. This becomes a bigger deal if the sysctl
net.inet.tcp.blackhole is 2. In this case RST isn't send, so the other side of
the connection (also local) keeps retransmitting FINs.
To fix that in tcp_twstart() we will not call tcp_close() immediately. Instead
we will allocate a tcptw on stack and proceed to the end of the function all
the way to tcp_twrespond(), to generate the correct ACK, then we will drop the
last PCB reference.
While here, make a few tiny improvements:
- use bools for boolean variable
- staticize nolocaltimewait
- remove pointless acquisiton of socket lock
Reported by: jtl
Reviewed by: jtl
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D14697
through the lock-switching hoops.
A few of the INP lookup operations that lock INPs after the lookup do
so using this mechanism (to maintain lock ordering):
1. Lock lookup structure.
2. Find INP.
3. Acquire reference on INP.
4. Drop lock on lookup structure.
5. Acquire INP lock.
6. Drop reference on INP.
This change provides a slightly shorter path for cases where the INP
lock is uncontested:
1. Lock lookup structure.
2. Find INP.
3. Try to acquire the INP lock.
4. If successful, drop lock on lookup structure.
Of course, if the INP lock is contested, the functions will need to
revert to the previous way of switching locks safely.
This saves a few atomic operations when the INP lock is uncontested.
Discussed with: gallatin, rrs, rwatson
MFC after: 2 weeks
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D12911
ECN (ABE)" proposal to the New Reno congestion control algorithm module.
ABE reduces the amount of congestion window reduction in response to
ECN-signalled congestion relative to the loss-inferred congestion response.
More details about ABE can be found in the Internet-Draft:
https://tools.ietf.org/html/draft-ietf-tcpm-alternativebackoff-ecn
The implementation introduces four new sysctls:
- net.inet.tcp.cc.abe defaults to 0 (disabled) and can be set to non-zero to
enable ABE for ECN-enabled TCP connections.
- net.inet.tcp.cc.newreno.beta and net.inet.tcp.cc.newreno.beta_ecn set the
multiplicative window decrease factor, specified as a percentage, applied to
the congestion window in response to a loss-based or ECN-based congestion
signal respectively. They default to the values specified in the draft i.e.
beta=50 and beta_ecn=80.
- net.inet.tcp.cc.abe_frlossreduce defaults to 0 (disabled) and can be set to
non-zero to enable the use of standard beta (50% by default) when repairing
loss during an ECN-signalled congestion recovery episode. It enables a more
conservative congestion response and is provided for the purposes of
experimentation as a result of some discussion at IETF 100 in Singapore.
The values of beta and beta_ecn can also be set per-connection by way of the
TCP_CCALGOOPT TCP-level socket option and the new CC_NEWRENO_BETA or
CC_NEWRENO_BETA_ECN CC algo sub-options.
Submitted by: Tom Jones <tj@enoti.me>
Tested by: Tom Jones <tj@enoti.me>, Grenville Armitage <garmitage@swin.edu.au>
Relnotes: Yes
Differential Revision: https://reviews.freebsd.org/D11616
Current arp/nd code relies on the feedback from the datapath indicating
that the entry is still used. This mechanism is incorporated into the
arpresolve()/nd6_resolve() routines. After the inpcb route cache
introduction, the packet path for the locally-originated packets changed,
passing cached lle pointer to the ether_output() directly. This resulted
in the arp/ndp entry expire each time exactly after the configured max_age
interval. During the small window between the ARP/NDP request and reply
from the router, most of the packets got lost.
Fix this behaviour by plugging datapath notification code to the packet
path used by route cache. Unify the notification code by using single
inlined function with the per-AF callbacks.
Reported by: sthaug at nethelp.no
Reviewed by: ae
MFC after: 2 weeks
net.inet6.ip6.v6only=0.
Without this patch, the inp_vflag would have INP_IPV4 and the
INP_IPV6 flags for accepted TCP/IPv6 connections if the sysctl
variable net.inet6.ip6.v6only is 0. This resulted in netstat
to report the source and destination addresses as IPv4 addresses,
even they are IPv6 addresses.
PR: 226421
Reviewed by: bz, hiren, kib
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D13514
rrs - Lets make the LRO code look for true dup-acks and window update acks
fly on through and combine.
rrs - Make the LRO engine a bit more aware of ack-only seq space. Lets not
have it incorrectly wipe out newer acks for older acks when we have
out-of-order acks (common in wifi environments).
jeggleston - LRO eating window updates
Based on all of the above I think we are RFC compliant doing it this way:
https://tools.ietf.org/html/rfc1122
section 4.2.2.16
"Note that TCP has a heuristic to select the latest window update despite
possible datagram reordering; as a result, it may ignore a window update with
a smaller window than previously offered if neither the sequence number nor the
acknowledgment number is increased."
Submitted by: Kevin Bowling <kevin.bowling@kev009.com>
Reviewed by: rstone gallatin
Sponsored by: NetFlix and Limelight Networks
Differential Revision: https://reviews.freebsd.org/D14540
for the minimum length.
This fixes a bug where cookies of length 2 bytes (which is smaller
than the minimum length of 4) is provided by the server.
Sponsored by: Netflix, Inc.
The conditional compilation support is now centralized in
tcp_fastopen.h and tcp_var.h. This doesn't provide the minimum
theoretical code/data footprint when TCP_RFC7413 is disabled, but
nearly all the TFO code should wind up being removed by the optimizer,
the additional footprint in the syncache entries is a single pointer,
and the additional overhead in the tcpcb is at the end of the
structure.
This enables the TCP_RFC7413 kernel option by default in amd64 and
arm64 GENERIC.
Reviewed by: hiren
MFC after: 1 month
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D14048
[RFC7413]. It also includes a pre-shared key mode of operation in
which the server requires the client to be in possession of a shared
secret in order to successfully open TFO connections with that server.
The names of some existing fastopen sysctls have changed (e.g.,
net.inet.tcp.fastopen.enabled -> net.inet.tcp.fastopen.server_enable).
Reviewed by: tuexen
MFC after: 1 month
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D14047
The keylist lock was not being acquired early enough. The only side
effect of this bug is that the effective add time of a new key could
be slightly later than it would have been otherwise, as seen by a TFO
client.
Reviewed by: tuexen
MFC after: 1 month
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D14046
later by TCP-MD5 code.
This fixes the problem with broken TCP-MD5 over IPv4 when NIC has
disabled TCP checksum offloading.
PR: 223835
MFC after: 1 week
o added struct ipfw_dyn_info that keeps all needed for ipfw_chk and
for dynamic states implementation information;
o added DYN_LOOKUP_NEEDED() macro that can be used to determine the
need of new lookup of dynamic states;
o ipfw_dyn_rule now becomes obsolete. Currently it used to pass
information from kernel to userland only.
o IPv4 and IPv6 states now described by different structures
dyn_ipv4_state and dyn_ipv6_state;
o IPv6 scope zones support is added;
o ipfw(4) now depends from Concurrency Kit;
o states are linked with "entry" field using CK_SLIST. This allows
lockless lookup and protected by mutex modifications.
o the "expired" SLIST field is used for states expiring.
o struct dyn_data is used to keep generic information for both IPv4
and IPv6;
o struct dyn_parent is used to keep O_LIMIT_PARENT information;
o IPv4 and IPv6 states are stored in different hash tables;
o O_LIMIT_PARENT states now are kept separately from O_LIMIT and
O_KEEP_STATE states;
o per-cpu dyn_hp pointers are used to implement hazard pointers and they
prevent freeing states that are locklessly used by lookup threads;
o mutexes to protect modification of lists in hash tables now kept in
separate arrays. 65535 limit to maximum number of hash buckets now
removed.
o Separate lookup and install functions added for IPv4 and IPv6 states
and for parent states.
o By default now is used Jenkinks hash function.
Obtained from: Yandex LLC
MFC after: 42 days
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D12685
This used to work by accident with ld.bfd even though always_keepalive
was marked as static. LLD honors static more correctly, so export this
variable properly (including moving it into the tcp_* namespace).
Reviewed by: bz, emaste
MFC after: 2 weeks
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D14129
Mechanically replace uses of MALLOC/FREE with appropriate invocations of
malloc(9) / free(9) (a series of sed expressions). Something like:
* MALLOC(a, b, ... -> a = malloc(...
* FREE( -> free(
* free((caddr_t) -> free(
No functional change.
For now, punt on modifying contrib ipfilter code, leaving a definition of
the macro in its KMALLOC().
Reported by: jhb
Reviewed by: cy, imp, markj, rmacklem
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14035
The header passed to these probes has some fields converted to host
order by tcp_fields_to_host(), so the tcpinfo_t translator doesn't do
what we want.
Submitted by: Hannes Mehnert
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D12647
the first mbuf of the reassembled datagram should have a pkthdr.
This was discovered with cxgbe(4) + IPSEC + ping with payload more than
interface MTU. cxgbe can generate !M_WRITEABLE mbufs and this results
in m_unshare being called on the reassembled datagram, and it complains:
panic: m_unshare: m0 0xfffff80020f82600, m 0xfffff8005d054100 has M_PKTHDR
PR: 224922
Reviewed by: ae@
MFC after: 1 week
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D14009
Add a new macro to clear both the L3 and L2 route caches, to
hopefully prevent future instances where only the L3 cache was
cleared when both should have been.
MFC after: 1 week
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13989
Reviewed by: karels
When processing a SACK advancing the cumtsn-ack in fast recovery,
increment the miss-indications for all TSN's reported as missing.
Thanks to Fabian Ising for finding the bug and to Timo Voelker
for provinding a fix.
This fix moves also CMT related initialisation of some variables
to a more appropriate place.
MFC after: 1 week
Make the calloc wrappers behave as expected by using mallocarray.
It is rather weird that the malloc wrappers also zeroes the memory: update
a comment to reflect at least two cases where it is expected.
Reviewed by: tuexen
to ipfw in https://svnweb.freebsd.org/changeset/base/326233,
a dependency on the SCTP stack was added to ipfw by accident.
This was noted by Kevel Bowling in https://reviews.freebsd.org/D13594
where also a solution was suggested. This patch is based on Kevin's
suggestion, but implements the required SCTP checksum computation
without any dependency on other SCTP sources.
While there, do some cleanups and improve comments.
Thanks to Kevin Kevin Browling for reporting the issue and suggesting
a fix.
This reduces noise when kernel is compiled by newer GCC versions,
such as one used by external toolchain ports.
Reviewed by: kib, andrew(sys/arm and sys/arm64), emaste(partial), erj(partial)
Reviewed by: jhb (sys/dev/pci/* sys/kern/vfs_aio.c and sys/kern/kern_synch.c)
Differential Revision: https://reviews.freebsd.org/D10385
This option was used in the early days to allow performance measurements
extrapolating the use of SCTP checksum offloading. Since this feature
is now available, get rid of this option.
This also un-breaks the LINT kernel. Thanks to markj@ for making me
aware of the problem.
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
No functional change intended.
This is similar to the TCP case. where a TCP RST segment can be sent.
There is one limitation: When sending an ABORT in response to an incoming
packet, it should be tested if there is no ABORT chunk in the received
packet. Currently, it is only checked if the first chunk is an ABORT
chunk to avoid parsing the whole packet, which could result in a DOS attack.
Thanks to Timo Voelker for helping me to test this patch.
Reviewed by: bcr@ (man page part), ae@ (generic, non-SCTP part)
Differential Revision: https://reviews.freebsd.org/D13239
Mainly focus on files that use BSD 3-Clause license.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
and other similar socket options.
Provide new control message SCM_TIME_INFO to supply information about
timestamp. Currently it indicates that the timestamp was
hardware-assisted and high-precision, for software timestamps the
message is not returned. Reserved fields are added to ABI to report
additional info about it, it is expected that raw hardware clock value
might be useful for some applications.
Reviewed by: gallatin (previous version), hselasky
Sponsored by: Mellanox Technologies
MFC after: 2 weeks
X-Differential revision: https://reviews.freebsd.org/D12638
related to a signed/unsigned mismatch.
This should most likely fix the issue in sctp_sosend reported by
Dmitry Vyukov on the freebsd-hackers mailing list and found by
running syzkaller.
In all cases where cif_vrs list is modified, two locks are held: per-ifnet
CIF_LOCK and global carp_sx. It means to read that list only one of them
is enough to be held, so we can skip CIF_LOCK when we already have carp_sx.
This fixes kernel panic, caused by attempts of copyout() to sleep while
holding non-sleepable CIF_LOCK mutex.
Discussed with: glebius
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
Ensure that the current behaviour is consistent: stop processing
of the chunk, but finish the processing of the previous chunks.
This behaviour might be changed in a later commit to ABORT the
assoication due to a protocol violation, but changing this
is a separate issue.
MFC after: 3 days
can use them. Gather all TCP tunables in tcp_var.h in one place and
alphabetically sort them, to ease maintainance of the list.
Don't copy and paste declarations in tcp_stacks/fastpath.c.
not extraneous in the TCP Fast Open (TFO) passive-open case. In the
TFO passive-open case, syncache_socket() is being called during
processing of a TFO SYN bearing a valid cookie, and a call to
soisconnected() is required in order to allow the application to
immediately consume any data delivered in the SYN and to have a chance
to generate response data to accompany the SYN-ACK. The removal of
this call to soisconnected() effectively converted all TFO passive
opens to having the same RTT cost as a standard 3WHS.
This commit adds a call to soisconnected() to syncache_tfo_expand() so
that it is only in the TFO passive-open path, thereby restoring TFO
passve-open RTT performance and preserving the non-TFO connection-rate
performance gains realized by r307966.
MFC after: 1 week
Sponsored by: Limelight Networks
been destroyed before its tcptw with INVARIANTS undefined.
This is a symmetric change of r307551:
A INP_TIMEWAIT inp should not be destroyed before its tcptw, and INVARIANTS
will catch this case. If INVARIANTS is undefined it will emit a log(LOG_ERR)
and avoid a hard to debug infinite loop in tcp_tw_2msl_scan().
Reported by: Ben Rubson, hselasky
Submitted by: hselasky
Tested by: Ben Rubson, jch
MFC after: 1 week
Sponsored by: Verisign, inc
Differential Revision: https://reviews.freebsd.org/D12267
* check mbuf length before doing mtod() and accessing to IP header;
* update oip pointer and all depending pointers after m_pullup();
* remove extra checks and extra parentheses, wrap long lines;
PR: 222670
Reported by: Prabhakar Lakhera
MFC after: 1 week
_NO_ OSes actually "negotiate" MSS.
RFC 879:
"... This Maximum Segment Size (MSS) announcement (often mistakenly
called a negotiation) ..."
This negotiation behaviour was introduced 11 years ago by r159955
without any explaination about why FreeBSD had to "negotiate" MSS:
In syncache_respond() do not reply with a MSS that is larger than what
the peer announced to us but make it at least tcp_minmss in size.
Sponsored by: TCP/IP Optimization Fundraise 2005
The tcp_minmss behaviour is still kept.
Syncookie fix was prodded by tuexen, who also helped to test this
patch w/ packetdrill.
Reviewed by: tuexen, karels, bz (previous version)
MFC after: 2 week
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D12430
transmit queues aswell as non-ratelimited ones.
Add the required structure bits in order to support a backpressure
indication with ratelimited connections aswell as non-ratelimited
ones. The backpressure indicator is a value between zero and 65535
inclusivly, indicating if the destination transmit queue is empty or
full respectivly. Applications can use this value as a decision point
for when to stop transmitting data to avoid endless ENOBUFS error
codes upon transmitting an mbuf. This indicator is also useful to
reduce the latency for ratelimited queues.
Reviewed by: gallatin, kib, gnn
Differential Revision: https://reviews.freebsd.org/D11518
Sponsored by: Mellanox Technologies
There were two bugs related to the blackhole detection:
* The smalles size was tried more than two times.
* The restored MSS was not the original one, but the second
candidate.
MFC after: 1 week
Sponsored by: Netflix, Inc.
The check for timestamps are too early to handle SYN-ACK correctly.
So move it down after the corresponing processing has been done.
PR: 216832
Obtained from: antonfb@hesiod.org
MFC after: 1 week
Make sure that the flags INP_IPV4 and INP_IPV6 are consistently set
for inpcbs used for TCP sockets, no matter if the setting is derived
from the net.inet6.ip6.v6only sysctl or the IPV6_V6ONLY socket option.
For UDP this was already done right.
PR: 221385
MFC after: 1 week
flowtable anymore (as flowtable was never considered to be useful in
the forwarding path).
Reviewed by: np
Differential Revision: https://reviews.freebsd.org/D11448
r307901 was reverted in r321480, restoring an incorrect block
delimitation bug present in the original cc_cubic commit. Restore
only the bugfix (brace addition) from r307901.
CID: 1090182
Approved by: sbruno
This was discussed between various transport@ members and it was
requested to be reverted and discussed.
Submitted by: Kevin Bowling <kevin.bowling@kev009.com>
Reported by: lawrence
Reviewed by: hiren
Sponsored by: Limelight Networks
This was discussed between various transport@ members and it was
requested to be reverted and discussed.
Submitted by: kevin
Reported by: lawerence
Reviewed by: hiren
While there, appropriately handle the overhead depending on
the usage of DATA or I-DATA chunks. Take the overhead only
into account, when required.
Joint work with rrs@
MFC after: 1 week
Using the https://github.com/google/capsicum-test/ suite, the
PosixMqueue.CapModeForked test was failing due to an ECAPMODE after
calling kmq_notify(). On further inspection, the dynamically
loaded syscall entry was initialized with sy_flags zeroed out, since
SYSCALL_INIT_HELPER() left sysent.sy_flags with the default value.
Add a new helper SYSCALL{,32}_INIT_HELPER_F() which takes an
additional argument to specify the sy_flags value.
Submitted by: Siva Mahadevan <smahadevan@freebsdfoundation.org>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D11576
sbavail() returns u_int and sendwin is a uint32_t. Therefore, min() (which
operates on two u_int values) is able to correctly calculate the minimum
of these two arguments.
Reported by: rrs
MFC after: 1 week
Sponsored by: Netflix
This allows them to be sent in a non truncated way and addresses a warning
given by newver versions of gcc.
Thanks to Anselm Jonas Scholl for reporting it and providing a patch.
o Separate fields of struct socket that belong to listening from
fields that belong to normal dataflow, and unionize them. This
shrinks the structure a bit.
- Take out selinfo's from the socket buffers into the socket. The
first reason is to support braindamaged scenario when a socket is
added to kevent(2) and then listen(2) is cast on it. The second
reason is that there is future plan to make socket buffers pluggable,
so that for a dataflow socket a socket buffer can be changed, and
in this case we also want to keep same selinfos through the lifetime
of a socket.
- Remove struct struct so_accf. Since now listening stuff no longer
affects struct socket size, just move its fields into listening part
of the union.
- Provide sol_upcall field and enforce that so_upcall_set() may be called
only on a dataflow socket, which has buffers, and for listening sockets
provide solisten_upcall_set().
o Remove ACCEPT_LOCK() global.
- Add a mutex to socket, to be used instead of socket buffer lock to lock
fields of struct socket that don't belong to a socket buffer.
- Allow to acquire two socket locks, but the first one must belong to a
listening socket.
- Make soref()/sorele() to use atomic(9). This allows in some situations
to do soref() without owning socket lock. There is place for improvement
here, it is possible to make sorele() also to lock optionally.
- Most protocols aren't touched by this change, except UNIX local sockets.
See below for more information.
o Reduce copy-and-paste in kernel modules that accept connections from
listening sockets: provide function solisten_dequeue(), and use it in
the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4),
infiniband, rpc.
o UNIX local sockets.
- Removal of ACCEPT_LOCK() global uncovered several races in the UNIX
local sockets. Most races exist around spawning a new socket, when we
are connecting to a local listening socket. To cover them, we need to
hold locks on both PCBs when spawning a third one. This means holding
them across sonewconn(). This creates a LOR between pcb locks and
unp_list_lock.
- To fix the new LOR, abandon the global unp_list_lock in favor of global
unp_link_lock. Indeed, separating these two locks didn't provide us any
extra parralelism in the UNIX sockets.
- Now call into uipc_attach() may happen with unp_link_lock hold if, we
are accepting, or without unp_link_lock in case if we are just creating
a socket.
- Another problem in UNIX sockets is that uipc_close() basicly did nothing
for a listening socket. The vnode remained opened for connections. This
is fixed by removing vnode in uipc_close(). Maybe the right way would be
to do it for all sockets (not only listening), simply move the vnode
teardown from uipc_detach() to uipc_close()?
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D9770
stack modules.
It adds support for mangling symbols exported by a module by prepending
a string to them. (This avoids overlapping symbols in the kernel linker.)
It allows the use of a macro as the module name in the DECLARE_MACRO()
and MACRO_VERSION() macros.
It allows the code to register stack aliases (e.g. both a generic name
["default"] and version-specific name ["default_10_3p1"]).
With these changes, it is trivial to compile TCP stack modules with
the name defined in the Makefile and to load multiple versions of the
same stack simultaneously. This functionality can be used to enable
side-by-side testing of an old and new version of the same TCP stack.
It also could support upgrading the TCP stack without a reboot.
Reviewed by: gnn, sjg (makefiles only)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D11086