The AccECN handshake and TCP header flags are supported,
no support yet for the AccECN option. This minimalistic
implementation is sufficient to support DCTCP while
dramatically cutting the number of ACKs, and provide ECN
response from the receiver to the CC modules.
Reviewed By: #transport, #manpages, rrs, pauamma
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D21011
While a receiver should continue sending SACK blocks for the
duration of a SACK loss recovery, if for some reason the
TCP options no longer contain these SACK blocks, but we
already started maintaining the Scoreboard, keep on handling
incoming ACKs (without SACK) as belonging to the SACK recovery.
Reported by: thj
Reviewed by: tuexen, #transport
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D36046
The mpidr register is 64 bit on arm64 and 32 bit on arm. Fix this by
extending the arm64 definition to include the top 32 bits.
To preserve KBI when MFCing split the value into two 32 bit values.
This will be cleaned up later only on main.
Reviewed by: bz
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D36346
Here go cons of using inpcb for divert:
- divert(4) uses only 16 bits (local port) out of struct inpcb,
which is 424 bytes today.
- The inpcb KPI isn't able to provide hashing for divert(4),
thus it uses global inpcb list for lookups.
- divert(4) uses INET-specific part of the KPI, making INET
a requirement for IPDIVERT.
Maintain our own very simple hash lookup database instead. It
has mutex protection for write and epoch protection for lookups.
Since now so->so_pcb no longer points to struct inpcb, don't
initialize protosw methods to methods that belong to PF_INET.
Also, drop support for setting options on a divert socket. My
review of software in base and ports confirms that this has no
use and unlikely worked before.
Differential revision: https://reviews.freebsd.org/D36382
Instead of incrementing pretty random counters in the IP statistics,
create divert socket statistics structure. Export via netstat(1).
Differential revision: https://reviews.freebsd.org/D36381
Since 4.4BSD the protosw was used to implement socket types created
by socket(2) syscall and at the same to demultiplex incoming IPv4
datagrams (later copied to IPv6). This story ended with 78b1fc05b2.
These entries (e.g. IPPROTO_ICMP) in inetsw that were added to catch
packets in ip_input(), they would also be returned by pffindproto()
if user says socket(AF_INET, SOCK_RAW, IPPROTO_ICMP). Thus, for raw
sockets to work correctly, all the entries were pointing at raw_usrreq
differentiating only in the value of pr_protocol.
With 78b1fc05b2 all these entries are no longer needed, as ip_protox
is independent of protosw. Any socket syscall requesting SOCK_RAW type
would end up with rip_protosw. And this protosw has its pr_protocol
set to 0, allowing to mark socket with any protocol.
For IPv6 raw socket the change required two small fixes:
o Validate user provided protocol value
o Always use protocol number stored in inp in rip6_attach, instead
of protosw value, which is now always 0.
Differential revision: https://reviews.freebsd.org/D36380
The divert(4) is not a protocol of IPv4. It is a socket to
intercept packets from ipfw(4) to userland and re-inject them
back. It can divert and re-inject IPv4 and IPv6 packets today,
but potentially it is not limited to these two protocols. The
IPPROTO_DIVERT does not belong to known IP protocols, it
doesn't even fit into u_char. I guess, the implementation of
divert(4) was done the way it is done basically because it was
easier to do it this way, back when protocols for sockets were
intertwined with IP protocols and domains were statically
compiled in.
Moving divert(4) out of inetsw accomplished two important things:
1) IPDIVERT is getting much closer to be not dependent on INET.
This will be finalized in following changes.
2) Now divert socket no longer aliases with raw IPv4 socket.
Domain/proto selection code won't need a hack for SOCK_RAW and
multiple entries in inetsw implementing different flavors of
raw socket can merge into one without requirement of raw IPv4
being the last member of dom_protosw.
Differential revision: https://reviews.freebsd.org/D36379
Recent problems related to NFSv4 mounts has been traced
to multiple NFSv4 clients using the same /etc/hostid
(or kern.hostuuid, if you prefer).
This patch adds a sentence to the man page noting that
clients must have unique /etc/hostid's.
This is a content change.
Reviewed by: gbe (manpages)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D36392
To avoid issues starting any USB transfers before the open
function is complete.
Differential Revision: https://reviews.freebsd.org/D36391
MFC after: 1 week
Sponsored by: NVIDIA Networking
Some controllers like the XHCI(4) loose track of the data toggle value when
USB receive transfers are cancelled at close. This in turn can lead to to
data loss after the next open.
To avoid data loss, make sure both the receive and transmit data toggles
get reset, before trying to read or write any data.
Differential Revision: https://reviews.freebsd.org/D36391
Submitted by: Dave Baukus <daveb@spectralogic.com>
MFC after: 1 week
Sponsored by: NVIDIA Networking
domain_init() called at SI_SUB_PROTO_DOMAIN/SI_ORDER_SECOND is always
called right after domain_add(), that had been called at SI_ORDER_FIRST.
Note that protocols aren't initialized yet at this point, since they are
usually scheduled to initialize at SI_ORDER_THIRD.
After this merge it becomes clear that DOMF_SUPPORTED / DOMF_INITED
can be garbage collected as they are set & checked in the same function.
For initialization of the domain system itself it is now clear that
domaininit() can be garbage collected and static initializer is enough.
o Statically initialize max_linkhdr to default value without relying
on domain(9) code doing that.
o Statically initialize max_protohdr to a sane value, without relying
on TCP being always compiled in.
o Retire max_datalen. Set, but not used.
o Don't make the domain(9) system responsible in validating these
values and updating max_hdr. Instead provide KPI max_linkhdr_grow()
and max_protohdr_grow().
o Call max_linkhdr_grow() from IEEE802.11 and max_protohdr_grow() from
TCP. Those are the only protocols today that may want to grow.
Reviewed by: tuexen
Differential revision: https://reviews.freebsd.org/D36376
- Ignore I/O requests with insufficiently sized input or output
buffers (those not containing compete request headers).
- Ignore control requests with improperly sized buffers.
- While here, explicitly zero the output header of an I/O request to
avoid leaking malloc garbage from the host if the header is not
fully populated.
PR: 264521
Reported by: Robert Morris <rtm@lcs.mit.edu>
Reviewed by: mav, emaste
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D36271
Use a consistent prefix ("virtio-scsi: ") similar to the e1000 device
model.
Reviewed by: mav, emaste
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D36270
When preparing to transmit pending packets, ensure that the head (TDH)
and tail (TDT) indices are in bounds. Note that validating values
when they are written is not sufficient along as the transmit length
(TDLEN) could be changed turning a value that was valid when written
into an out of bounds value.
While here, add further restrictions to the head register (TDH). The
manual states that writing to this value while transmit is enabled can
cause unexpected behavior and that it should only be written after a
reset. As such, ignore attempts to write while transmit is active,
and also ignore writes of non-zero values. Later e1000 chipsets have
this register as read-only.
Also ignore any attempts to transmit packets if the transmit ring's
size is zero.
PR: 264567
Reported by: Robert Morris <rtm@lcs.mit.edu>
Reviewed by: emaste
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D36269
The io:::start and end probes trace individual I/O requests.
Also remove the unimplemented wait-start and wait-done probes.
PR: 266098
MFC after: 1 week
In RB_INSERT_COLOR and RB_REMOVE_COLOR, avoid reading a parent pointer
from memory, and then reading the left-color bit from memory, and then
reading the right-color bit from memory, since they're all in the same
field. The compiler can't infer that only the first read is really
necessary, so write the code in a way so that it doesn't have to.
Drop RB_RED_LEFT and RB_RED_RIGHT macros that reach into memory to get
those bits. Drop RB_COLOR, the only thing left using RB_RED_LEFT and
RB_RED_RIGHT after the other changes, and go straight to DIAGNOSTIC
code in subr_stats to implement RB_COLOR for its single, dubious use
there.
Reviewed by: alc
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D36353
Add IF_DEBUG_LEVEL() macro to ensure all debug output preparation
is run only if the current debug level is sufficient. Consistently
use it within routing subsystem.
MFC after: 2 weeks
* add nhop_get_unlinked() used to prepare referenced but not
linked nexthop, that can later be used as a clone source.
* add nhop_check_gateway() to check for allowed address family
combinations between the rib family and neighbor family (useful
for 4o6 or direct routes)
* add nhop_set_upper_family() to allow copying IPv6 nexthops to
IPv4 rib.
* add rt_get_rnd() wrapper, returning both nexthop/group and its
weight attached to the rtentry.
* Add CHT_SLIST_FOREACH_SAFE(), allowing to delete items during
iteration.
MFC after: 2 weeks
Multiple consumers in the kernel space want to install IPv4 or IPv6
default route. Provide convenient wrapper to simplify the code
inside the customers.
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D36167
Construct the desired hexthops directly instead of using the
"translation" layer in form of filling rt_addrinfo data.
Simplify V_rt_add_addr_allfibs handling by using recently-added
rib_copy_route() to propagate the routes to the non-primary address
fibs.
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D36166
Further updates based on ways Peter Holm found to corrupt UFS
superblocks in ways that could cause kernel hangs or crashes.
No legitimate superblocks should fail as a result of these changes.
Reported by: Peter Holm
Tested by: Peter Holm
Sponsored by: The FreeBSD Foundation
- Record the physical address for doorbell page region
For supporting RDMA device with multiple user contexts with their
individual doorbell pages, record the start address of doorbell page
region for use by the RDMA driver to allocate user context doorbell IDs.
- Handle vport sharing between devices
For outgoing packets, the PF requires the VF to configure the vport with
corresponding protection domain and doorbell ID for the kernel or user
context. The vport can't be shared between different contexts.
Implement the logic to exclusively take over the vport by either the
Ethernet device or RDMA device.
- Add functions for allocating doorbell page from GDMA
The RDMA device needs to allocate doorbell pages for each user context.
Implement those functions and expose them for use by the RDMA driver.
- Export Work Queue functions for use by RDMA driver
RDMA device may need to create Ethernet device queues for use by Queue
Pair type RAW. This allows a user-mode context accesses Ethernet hardware
queues. Export the supporting functions for use by the RDMA driver.
- Define max values for SGL entries
The number of maximum SGl entries should be computed from the maximum
WQE size for the intended queue type and the corresponding OOB data
size. This guarantees the hardware queue can successfully queue requests
up to the queue depth exposed to the upper layer.
- Define and process GDMA response code GDMA_STATUS_MORE_ENTRIES
When doing memory registration, the PF may respond with
GDMA_STATUS_MORE_ENTRIES to indicate a follow request is needed. This is
not an error and should be processed as expected.
- Define data structures for protection domain and memory registration
The MANA hardware support protection domain and memory registration for use
in RDMA environment. Add those definitions and expose them for use by the
RDMA driver.
MFC after: 2 weeks
Sponsored by: Microsoft