o Introduce a notion of "not ready" mbufs in socket buffers. These
mbufs are now being populated by some I/O in background and are
referenced outside. This forces following implications:
- An mbuf which is "not ready" can't be taken out of the buffer.
- An mbuf that is behind a "not ready" in the queue neither.
- If sockbet buffer is flushed, then "not ready" mbufs shouln't be
freed.
o In struct sockbuf the sb_cc field is split into sb_ccc and sb_acc.
The sb_ccc stands for ""claimed character count", or "committed
character count". And the sb_acc is "available character count".
Consumers of socket buffer API shouldn't already access them directly,
but use sbused() and sbavail() respectively.
o Not ready mbufs are marked with M_NOTREADY, and ready but blocked ones
with M_BLOCKED.
o New field sb_fnrdy points to the first not ready mbuf, to avoid linear
search.
o New function sbready() is provided to activate certain amount of mbufs
in a socket buffer.
A special note on SCTP:
SCTP has its own sockbufs. Unfortunately, FreeBSD stack doesn't yet
allow protocol specific sockbufs. Thus, SCTP does some hacks to make
itself compatible with FreeBSD: it manages sockbufs on its own, but keeps
sb_cc updated to inform the stack of amount of data in them. The new
notion of "not ready" data isn't supported by SCTP. Instead, only a
mechanical substitute is done: s/sb_cc/sb_ccc/.
A proper solution would be to take away struct sockbuf from struct
socket and allow protocols to implement their own socket buffers, like
SCTP already does. This was discussed with rrs@.
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
leave a port permanently disabled when a copper cable is unplugged and
then plugged right back in.
lacp_linkstate goes looking for the current ifmedia on a link state
change and it could get stale information from cxgbe(4) on a module
unplug followed by replug. The fix is to process module events before
link-state events within the driver, and to always rebuild the ifmedia
list on a module change event (instead of rebuilding it lazily).
Thanks to asomers@ for the problem report and detailed analysis to go
with it.
MFC after: 1 week
sb_cc member of struct sockbuf to a couple of inline functions:
sbavail() and sbused()
Right now they are equal, but once notion of "not ready socket buffer data",
will be checked in, they are going to be different.
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
initial MPA exchange must be tracked this way so that t4_tom's state for
the tid is all clean at the time the tid transitions to RDMA mode. Once
it does, t4_tom is out of the way and iw_cxgbe uses the qp endpoints
directly.
Sponsored by: Chelsio Communications
- Free rt in c4iw_connect only if it is allocated.
- Call soclose instead of so_shutdown if there is an abort from the peer.
- Close socket and return failure if TOE is not enabled.
Submitted by: Hariprasad at Chelsio dot com
Sponsored by: Chelsio Communications
- Wrong integer type was specified.
- Wrong or missing "access" specifier. The "access" specifier
sometimes included the SYSCTL type, which it should not, except for
procedural SYSCTL nodes.
- Logical OR where binary OR was expected.
- Properly assert the "access" argument passed to all SYSCTL macros,
using the CTASSERT macro. This applies to both static- and dynamically
created SYSCTLs.
- Properly assert the the data type for both static and dynamic
SYSCTLs. In the case of static SYSCTLs we only assert that the data
pointed to by the SYSCTL data pointer has the correct size, hence
there is no easy way to assert types in the C language outside a
C-function.
- Rewrote some code which doesn't pass a constant "access" specifier
when creating dynamic SYSCTL nodes, which is now a requirement.
- Updated "EXAMPLES" section in SYSCTL manual page.
MFC after: 3 days
Sponsored by: Mellanox Technologies
Mellanox hardware driver(s):
- Properly name an inclusion guard
- Fix compile warnings regarding unsigned enums
- Add two new sysctl nodes
- Remove all empty linux header files
- Make an error printout more verbose
- Use "mod_delayed_work()" instead of
cancelling and starting a timeout.
- Implement more Linux scatterlist
functions.
MFC after: 3 days
Sponsored by: Mellanox Technologies
cannot be sent to the chip because a prerequisite L2 resolution
failed.
Submitted by: Hariprasad at chelsio dot com (original version)
MFC after: 2 weeks.
available before any uppper layer driver (TOE, iWARP, or iSCSI)
registers with the base cxgbe(4) driver.
Submitted by: Hariprasad at chelsio dot com
Reviewed by: np@
addresses. (The chip doesn't really care, it's just that it needs to be
told explicitly if unicast DMACs are checked for "hits" in the hash that
is used after the TCAM entries are all used up).
an mbuf's storage (internal or external).
Add a new M_SIZE() mbuf macro that returns the size of an mbuf's
storage (internal or external).
These contrast with m_data and m_len, which are with respect to data
in the buffer, rather than the buffer itself.
Rewrite M_LEADINGSPACE() and M_TRAILINGSPACE() in terms of M_START()
and M_SIZE().
This is done as we currently have many instances of using mbuf flags
to generate pointers or lengths for internal storage in header and
regular mbufs, as well as to external storage. Rather than replicate
this logic throughout the network stack, centralising the
implementation will make it easier for us to refine mbuf storage.
This should also help reduce bugs by limiting the amount of
mbuf-type-specific pointer arithmetic. Followup changes will
propagate use of the macros throughout the stack.
M_SIZE() conflicts with one macro in the Chelsio driver; rename that
macro in a slightly unsatisfying way to eliminate the collision.
MFC after: 3 days
Obtained from: jeff (with enhancements)
Sponsored by: EMC / Isilon Storage Division
Reviewed by: bz, glebius, np
Differential Revision: https://reviews.freebsd.org/D753
hardware driver update from Mellanox Technologies.
- Remove empty files from the OFED Linux Emulation layer.
- Fix compile warnings related to printf() and the "%lld" and "%llx"
format specifiers.
- Add some missing 2-clause BSD copyrights.
- Add "Mellanox Technologies, Ltd." to list of copyright holders.
- Add some new compatibility files.
- Fix order of uninit in the mlx4ib module to avoid crash at unload
using the new module_exit_order() function.
MFC after: 1 week
Sponsored by: Mellanox Technologies
Mostly bugfixes or features developed in the past 6 months,
so this is a 10.1 candidate.
Basically no user API changes (some bugfixes in sys/net/netmap_user.h).
In detail:
1. netmap support for virtio-net, including in netmap mode.
Under bhyve and with a netmap backend [2] we reach over 1Mpps
with standard APIs (e.g. libpcap), and 5-8 Mpps in netmap mode.
2. (kernel) add support for multiple memory allocators, so we can
better partition physical and virtual interfaces giving access
to separate users. The most visible effect is one additional
argument to the various kernel functions to compute buffer
addresses. All netmap-supported drivers are affected, but changes
are mechanical and trivial
3. (kernel) simplify the prototype for *txsync() and *rxsync()
driver methods. All netmap drivers affected, changes mostly mechanical.
4. add support for netmap-monitor ports. Think of it as a mirroring
port on a physical switch: a netmap monitor port replicates traffic
present on the main port. Restrictions apply. Drive carefully.
5. if_lem.c: support for various paravirtualization features,
experimental and disabled by default.
Most of these are described in our ANCS'13 paper [1].
Paravirtualized support in netmap mode is new, and beats the
numbers in the paper by a large factor (under qemu-kvm,
we measured gues-host throughput up to 10-12 Mpps).
A lot of refactoring and additional documentation in the files
in sys/dev/netmap, but apart from #2 and #3 above, almost nothing
of this stuff is visible to other kernel parts.
Example programs in tools/tools/netmap have been updated with bugfixes
and to support more of the existing features.
This is meant to go into 10.1 so we plan an MFC before the Aug.22 deadline.
A lot of this code has been contributed by my colleagues at UNIPI,
including Giuseppe Lettieri, Vincenzo Maffione, Stefano Garzarella.
MFC after: 3 days.
Reorganize struct sge_iq. Make the iq entry size a compile time
constant. While here, eliminate RX_FL_ESIZE and use EQ_ESIZE directly.
MFC after: 2 weeks
(I'm committing this on behalf of my colleagues in the Storage team
at Chelsio).
Submitted by: Sreenivasa Honnur <shonnur at chelsio dot com>
Sponsored by: Chelsio Communications.
custom free routine (rxb_free) in the driver. Fail MOD_UNLOAD with
EBUSY if any such cluster has been handed up to the kernel but hasn't
been freed yet. This prevents a panic later when the cluster finally
needs to be freed but rxb_free is gone from the kernel.
MFC after: 1 week
firmware allows up to 48B to be read this way but the driver limits
itself to 8B at a time to remain compatible with old cxgbetool
binaries.
MFC after: 1 week
These changes prevent sysctl(8) from returning proper output,
such as:
1) no output from sysctl(8)
2) erroneously returning ENOMEM with tools like truss(1)
or uname(1)
truss: can not get etype: Cannot allocate memory
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.
Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.
MFC after: 2 weeks
Sponsored by: Mellanox Technologies
the queue where to enqueue pages that are going to be unwired.
- Add stronger checks to the enqueue/dequeue for the pagequeues when
adding and removing pages to them.
Of course, for unmanaged pages the queue parameter of vm_page_unwire() will
be ignored, just as the active parameter today.
This makes adding new pagequeues quicker.
This change effectively modifies the KPI. __FreeBSD_version will be,
however, bumped just when the full cache of free pages will be
evicted.
Sponsored by: EMC / Isilon storage division
Reviewed by: alc
Tested by: pho