Mainly focus on files that use BSD 3-Clause license.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
Different compilers may optimise the enum type in different ways. This ensures
coherency when range checking the value of enums in ibcore.
Sponsored by: Mellanox Technologies
MFC after: 1 week
variable. Access the variable directly instead of going through the sysctl()
interface in the kernel.
Sponsored by: Mellanox Technologies
MFC after: 1 week
Set the IPOIB_FLAG_INITIALIZED on dev_open and clear it on dev_stop to
avoid a race between ipoib load and the underlying device driver.
The device module must dispatch the IB_EVENT_PORT_ACTIVE event before ipoib
module is loaded. Otherwise, the flush will fail since no one set the
IPOIB_FLAG_INITIALIZED.
Submitted by: Slava Shwartsman <slavash@mellanox.com>
Sponsored by: Mellanox Technologies
MFC after: 1 week
Else the IPv6 address matching might fail. This change adds support for both
embedded and non-embedded IPv6 scope IDs when passing a IPv6 link-local socket
address to RDMA. Prior to this change only global IPv6 addresses would work
with RDMA.
Sponsored by: Mellanox Technologies
MFC after: 1 week
1) Fail to resolve RDMA address if rtalloc1() returns the loopback
device, lo0, as the gateway interface. Currently RDMA loopback is
not supported.
2) Use ip_dev_find() and ip6_dev_find() to lookup network interfaces
with matching IPv4 and IPv6 addresses, respectivly.
3) In addr_resolve() make sure the "ifa" pointer is always set, also when
the "ifp" is NULL. Else a NULL pointer access might happen trying to
read from the "ifa" pointer later on.
4) In rdma_addr_find_dmac_by_grh() make sure the "bound_dev_if" field
gets set properly instead of passing the scope ID through the IPv6
socket address structure. This is more in line with upstream OFED
in Linux.
5) In rdma_addr_find_smac_by_sgid() there is no need to pass the
scope ID for IPv6. Either it is stored in the "bound_dev_if" field
or ip6_dev_find() will find the correct network device regardless
of the scope ID.
Sponsored by: Mellanox Technologies
MFC after: 1 week
iWarp and RoCE in ibcore. The selection of RDMA_PS_TCP can not be used
to indicate iWarp protocol use. Backport the proper IB device
capabilities from Linux upstream to distinguish between iWarp and
RoCE. Only allocate the additional socket required for iWarp for RDMA
IDs when at least one iWarp device present. This resolves
interopability issues between iWarp and RoCE in ibcore
Reviewed by: np @
Differential Revision: https://reviews.freebsd.org/D12563
Sponsored by: Mellanox Technologies
MFC after: 3 days
- Add more sanity checks.
- Preserve source port number when resolving address.
- Remove no longer needed scope ID hacks for IPv6.
Sponsored by: Mellanox Technologies
This patch fixes an interopability issue between FreeBSD and non-FreeBSD
systems when the connection establishment is aborted. Refer to the
initial commit in Linux, drivers/infiniband/core/cm.c,
for a more detailed description.
Obtained from: Linux
MFC after: 3 days
Sponsored by: Mellanox Technologies
list.h includes a number of FreeBSD headers as a workaround for the
LIST_HEAD name collision. To reduce pollution, avoid including list.h
in commonly used headers when it is not explicitly needed.
Reviewed by: hselasky
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D11249
This patch currently supports:
- ibcore as a kernel module only
- krping as a kernel module only
- ipoib as a kernel module only
Sponsored by: Mellanox Technologies
o Separate fields of struct socket that belong to listening from
fields that belong to normal dataflow, and unionize them. This
shrinks the structure a bit.
- Take out selinfo's from the socket buffers into the socket. The
first reason is to support braindamaged scenario when a socket is
added to kevent(2) and then listen(2) is cast on it. The second
reason is that there is future plan to make socket buffers pluggable,
so that for a dataflow socket a socket buffer can be changed, and
in this case we also want to keep same selinfos through the lifetime
of a socket.
- Remove struct struct so_accf. Since now listening stuff no longer
affects struct socket size, just move its fields into listening part
of the union.
- Provide sol_upcall field and enforce that so_upcall_set() may be called
only on a dataflow socket, which has buffers, and for listening sockets
provide solisten_upcall_set().
o Remove ACCEPT_LOCK() global.
- Add a mutex to socket, to be used instead of socket buffer lock to lock
fields of struct socket that don't belong to a socket buffer.
- Allow to acquire two socket locks, but the first one must belong to a
listening socket.
- Make soref()/sorele() to use atomic(9). This allows in some situations
to do soref() without owning socket lock. There is place for improvement
here, it is possible to make sorele() also to lock optionally.
- Most protocols aren't touched by this change, except UNIX local sockets.
See below for more information.
o Reduce copy-and-paste in kernel modules that accept connections from
listening sockets: provide function solisten_dequeue(), and use it in
the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4),
infiniband, rpc.
o UNIX local sockets.
- Removal of ACCEPT_LOCK() global uncovered several races in the UNIX
local sockets. Most races exist around spawning a new socket, when we
are connecting to a local listening socket. To cover them, we need to
hold locks on both PCBs when spawning a third one. This means holding
them across sonewconn(). This creates a LOR between pcb locks and
unp_list_lock.
- To fix the new LOR, abandon the global unp_list_lock in favor of global
unp_link_lock. Indeed, separating these two locks didn't provide us any
extra parralelism in the UNIX sockets.
- Now call into uipc_attach() may happen with unp_link_lock hold if, we
are accepting, or without unp_link_lock in case if we are just creating
a socket.
- Another problem in UNIX sockets is that uipc_close() basicly did nothing
for a listening socket. The vnode remained opened for connections. This
is fixed by removing vnode in uipc_close(). Maybe the right way would be
to do it for all sockets (not only listening), simply move the vnode
teardown from uipc_detach() to uipc_close()?
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D9770
Support is implemented by mapping Linux's "struct net" into FreeBSD's
"struct vnet". Currently only vnet0 is supported by ibcore.
MFC after: 1 week
Sponsored by: Mellanox Technologies
Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.
Submitted by: Jan Schaumann <jschauma@stevens.edu>
Pull Request: https://github.com/freebsd/freebsd/pull/96
Sockets representing the TCP endpoints for iWARP connections are
allocated by the ibcore module. Before this revision they were closed
either by the ibcore module or the iw_cxgbe hardware driver depending on
the state transitions during connection teardown. This is error prone
and there were cases where both iw_cxgbe and ibcore closed the socket
leading to double-free panics. The fix is to let ibcore close the
sockets it creates and never do it in the driver.
- Use sodisconnect instead of soclose (preceded by solinger = 0) in the
driver to tear down an RDMA connection abruptly. This does what's
intended without releasing the socket's fd reference.
- Close the socket in ibcore when the iWARP iw_cm_id is destroyed. This
works for all kinds of sockets: clients that initiate connections,
listeners, and sockets accepted off of listeners.
Reviewed by: Steve Wise @ Open Grid Computing, hselasky@
MFC after: 3 days
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D9796
like other PCI network drivers. The sys/ofed directory is now mainly
reserved for generic infiniband code, with exception of the mthca driver.
- Add new manual page, mlx4en(4), describing how to configure and load
mlx4en.
- All relevant driver C-files are now prefixed mlx4, mlx4_en and
mlx4_ib respectivly to avoid object filename collisions when compiling
the kernel. This also fixes an issue with proper dependency file
generation for the C-files in question.
- Device mlxen is now device mlx4en and depends on device mlx4, see
mlx4en(4). Only the network device name remains unchanged.
- The mlx4 and mlx4en modules are now built by default on i386 and
amd64 targets. Only building the mlx4ib module depends on
WITH_OFED=YES .
Sponsored by: Mellanox Technologies
by returning success on FIONBIO and FIOASYNC IOCTLs. The actual flags
handling is done by the kern_ioctl() function.
Reported by: Alex Bowden <alex.bowden@outlook.com>
Sponsored by: Mellanox Technologies
MFC after: 1 week
In pathological situations where the master subnet manager becomes
unresponsive for an extended period, we may otherwise end up queuing all
of the system's mbufs while waiting for a response to a path record lookup.
This addresses the same issue as commit 1e85b806f9 in Linux.
Reviewed by: cem, ngie
MFC after: 2 weeks
Sponsored by: EMC / Isilon Storage Division
This addresses a regression from an earlier upstream change which caused
cma_acquire_dev() to bypass the port GID cache and instead query the HCA
for each entry in its GID table. These queries can become extremely slow on
multiport devices, which has a negative impact on connection setup times.
Discussed with: hselasky
Obtained from: Linux
MFC after: 2 weeks
Sponsored by: EMC / Isilon Storage Division
SDP transmit and receive rings are always created in a sleepable context,
so we can use M_WAITOK and remove error checks.
Sponsored by: EMC / Isilon Storage Division
The generic socket may be detached from the PCB before the completion
queue is drained and destroyed, so this change closes a race condition
in connection teardown.
Sponsored by: EMC / Isilon Storage Division
tunable SYSCTL's. Linux module parameters are associated with the
module they belong to. FreeBSD does not share this concept of a parent
module. Instead add macros which define the prefix to use for the
module parameters in the LinuxKPI consumers.
While at it convert all "bool" LinuxKPI module parameters to "byte"
type, because we don't have a "bool" type of SYSCTL in FreeBSD.
Sponsored by: Mellanox Technologies
MFC after: 1 week
- Added check that the SCOPE ID is only restored for IPv6 linklocal
addresses.
- Changes made by r237263 in the "cma_bind_addr()" function did not
check if the socket address was of type IPv6 and used the IPv4
socket address for IPv6 addresses. This caused the function to
fail. Fixed this.
- In the "rdma_gid2ip()" function and some other places the "sin6_len"
and "sin6_scope_id" fields were not set for IPv6 socket
addresses. Fixed this.
- The scope ID is not stored as part of the GID entries and must be
passed as an argument to "rdma_gid2ip()".
- Added new method to "struct ib_device" which returns a pointer to
the network interface which belongs to the given infiniband
device. This is needed to be able to get the scope ID for IPv6
addresses via the associated ethernet interface.
- Added convenience function, "rdma_get_ipv6_scope_id()", to get the
scope ID for IPv6 addresses.
- Implemented new "get_netdev" method for mlx4ib. Other IB controller
drivers which want to support IPv6 addresses needs to implement this
aswell.
- Bumped the FreeBSD version due to changing "struct ib_device".
Sponsored by: Mellanox Technologies
MFC after: 1 week
This fixes a kernel panic when using IPoIB with VIMAGE and infiniband.
PR: 208957
Sponsored by: Mellanox Technologies
Tested by: Justin Clift <justin@postgresql.org>
MFC after: 1 week
IPv6 addresses has a scope ID which sometimes is stored in the
"sin6_scope_id" field of "struct sockaddr_in6" and sometimes as part
of the IPv6 address itself depending on the context. If the scope ID
is not in the expected location, the IPv6 address lookups in the
so-called GID table will fail. Some code factoring has been made to
achieve a clean exit of the "addr_resolve" function via a common
"done" label.
Sponsored by: Mellanox Technologies
Submitted by: Shani Michaeli <shanim@mellanox.com>
MFC after: 1 week
The FreeBSD's TCP/IP stack assumes that the IP-header is 32-bits aligned
when decoding it. Else unaligned 32-bit memory access can happen, which
not all processor architectures support.
Sponsored by: Mellanox Technologies
MFC after: 1 week
When downing a mlxen network adapter we need to check the port_up variable
to ensure we don't continue to transmit data or restart timers which can
reside in freed memory.
Sponsored by: Mellanox Technologies
MFC after: 1 week
And factor out tcp_lro_rx_done, which deduplicates the same logic with
netinet/tcp_lro.c
Reviewed by: gallatin (1st version), hps, zbb, np, Dexuan Cui <decui microsoft com>
Sponsored by: Microsoft OSTC
Differential Revision: https://reviews.freebsd.org/D5725
connection manager. Examining so_comp without synchronization with
iw_so_event_handler is a harmless race.
Submitted by: Krishnamraju Eraparaju @ Chelsio
Reviewed by: Steve Wise @ Open Grid Computing
Sponsored by: Chelsio Communications
Use the Toeplitz hash value as source for the flowid. This makes the
hash value more suitable for so-called hash bucket algorithms which
are used in the FreeBSD's TCP/IP stack when RSS is enabled.
Sponsored by: Mellanox Technologies
MFC after: 1 week
ipoib module.
The bpfdetach() function is trying to turn off promiscious mode on the
network interface it is attached to while holding a mutex. The fix
consists of ignoring any further calls to the ipoib_ioctl() function
when the network interface is going to be detached. The ipoib_ioctl()
function might sleep.
Sponsored by: Mellanox Technologies
MFC after: 1 week
- Add some new hlist macros.
- Update existing hlist macros removing the need for a temporary
iteration variable.
- Properly define the RCU hlist macros to be SMP safe with regard
to RCU.
- Safe list macro arguments by adding a pair of parentheses.
- Prefix the _list_add() and _list_splice() functions with "linux"
to reflect they are LinuxKPI internal functions.
Obtained from: Linux
MFC after: 1 week
Sponsored by: Mellanox Technologies
The iWARP Connection Manager (CM) on FreeBSD creates a TCP socket to
represent an iWARP endpoint when the connection is over TCP. For
servers the current approach is to invoke create_listen callback for
each iWARP RNIC registered with the CM. This doesn't work too well for
INADDR_ANY because a listen on any TCP socket already notifies all
hardware TOEs/RNICs of the new listener. This patch fixes the server
side of things for FreeBSD. We've tried to keep all these modifications
in the iWARP/TCP specific parts of the OFED infrastructure as much as
possible.
Submitted by: Krishnamraju Eraparaju @ Chelsio (with design inputs from Steve Wise)
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D4801
The only piece of information that is required is rt_flags subset.
In particular, if_loop() requires RTF_REJECT and RTF_BLACKHOLE flags
to check if this particular mbuf needs to be dropped (and what
error should be returned).
Note that if_loop() will always return EHOSTUNREACH for "reject" routes
regardless of RTF_HOST flag existence. This is due to upcoming routing
changes where RTF_HOST value won't be available as lookup result.
All other functions require RTF_GATEWAY flag to check if they need
to return EHOSTUNREACH instead of EHOSTDOWN error.
There are 11 places where non-zero 'struct route' is passed to if_output().
For most of the callers (forwarding, bpf, arp) does not care about exact
error value. In fact, the only place where this result is propagated
is ip_output(). (ip6_output() passes NULL route to nd6_output_ifp()).
Given that, add 3 new 'struct route' flags (RT_REJECT, RT_BLACKHOLE and
RT_IS_GW) and inline function (rt_update_ro_flags()) to copy necessary
rte flags to ro_flags. Call this function in ip_output() after looking up/
verifying rte.
Reviewed by: ae
sbappendstream() does. Although, M_NOTREADY may appear only on SOCK_STREAM
sockets, due to sendfile(2) supporting only the latter, there is a corner
case of AF_UNIX/SOCK_STREAM socket, that still uses records for the sake
of control data, albeit being stream socket.
Provide private version of m_clrprotoflags(), which understands PRUS_NOTREADY,
similar to m_demote().
Add if_requestencap() interface method which is capable of calculating
various link headers for given interface. Right now there is support
for INET/INET6/ARP llheader calculation (IFENCAP_LL type request).
Other types are planned to support more complex calculation
(L2 multipath lagg nexthops, tunnel encap nexthops, etc..).
Reshape 'struct route' to be able to pass additional data (with is length)
to prepend to mbuf.
These two changes permits routing code to pass pre-calculated nexthop data
(like L2 header for route w/gateway) down to the stack eliminating the
need for other lookups. It also brings us closer to more complex scenarios
like transparently handling MPLS nexthops and tunnel interfaces.
Last, but not least, it removes layering violation introduced by flowtable
code (ro_lle) and simplifies handling of existing if_output consumers.
ARP/ND changes:
Make arp/ndp stack pre-calculate link header upon installing/updating lle
record. Interface link address change are handled by re-calculating
headers for all lles based on if_lladdr event. After these changes,
arpresolve()/nd6_resolve() returns full pre-calculated header for
supported interfaces thus simplifying if_output().
Move these lookups to separate ether_resolve_addr() function which ether
returs error or fully-prepared link header. Add <arp|nd6_>resolve_addr()
compat versions to return link addresses instead of pre-calculated data.
BPF changes:
Raw bpf writes occupied _two_ cases: AF_UNSPEC and pseudo_AF_HDRCMPLT.
Despite the naming, both of there have ther header "complete". The only
difference is that interface source mac has to be filled by OS for
AF_UNSPEC (controlled via BIOCGHDRCMPLT). This logic has to stay inside
BPF and not pollute if_output() routines. Convert BPF to pass prepend data
via new 'struct route' mechanism. Note that it does not change
non-optimized if_output(): ro_prepend handling is purely optional.
Side note: hackish pseudo_AF_HDRCMPLT is supported for ethernet and FDDI.
It is not needed for ethernet anymore. The only remaining FDDI user is
dev/pdq mostly untouched since 2007. FDDI support was eliminated from
OpenBSD in 2013 (sys/net/if_fddisubr.c rev 1.65).
Flowtable changes:
Flowtable violates layering by saving (and not correctly managing)
rtes/lles. Instead of passing lle pointer, pass pointer to pre-calculated
header data from that lle.
Differential Revision: https://reviews.freebsd.org/D4102
vtophys() when loading mbufs for transmission and reception. While at
it all pointer arithmetic and cast qualifier issues were fixed, mostly
related to transmission and reception.
MFC after: 1 week
Sponsored by: Mellanox Technologies
Differential Revision: https://reviews.freebsd.org/D4284
- Added support for dumping the SFP EEPROM content to dmesg.
- Fixed handling of network interface capability IOCTLs.
- Fixed race when loading and unloading the mlxen driver by applying
appropriate locking.
- Removed two unused C-files.
MFC after: 1 week
Submitted by: Mark Bloch <markb@mellanox.com>
Sponsored by: Mellanox Technologies
Differential Revision: https://reviews.freebsd.org/D4283
using GCC for 32-bit platforms. The integer size in this case is
hardcoded 64-bit while the pointer size is 32-bit.
Sponsored by: Mellanox Technologies
MFC after: 2 weeks
- Move all files related to the LinuxKPI into sys/compat/linuxkpi and
its subfolders.
- Update sys/conf/files and some Makefiles to use new file locations.
- Added description of COMPAT_LINUXKPI to sys/conf/NOTES which in turn
adds the LinuxKPI to all LINT builds.
- The LinuxKPI can be added to the kernel by setting the
COMPAT_LINUXKPI option. The OFED kernel option no longer builds the
LinuxKPI into the kernel. This was done to keep the build rules for
the LinuxKPI in sys/conf/files simple.
- Extend the LinuxKPI module to include support for USB by moving the
Linux USB compat from usb.ko to linuxkpi.ko.
- Bump the FreeBSD_version.
- A universe kernel build has been done.
Reviewed by: np @ (cxgb and cxgbe related changes only)
Sponsored by: Mellanox Technologies
- Remove redundant NBLONG macro and use BIT_WORD()
and BIT_MASK() instead.
- Correctly define BIT_MASK() according to Linux and
update all users of this macro.
- Add missing GENMASK() macro.
- Remove all comments deriving from Linux.
Sponsored by: Mellanox Technologies
- Redefine DIV_ROUND_UP as a function macro taking two arguments
instead of none.
- Implement more Linux kernel functions related to various forms
of DELAY() and basic mathematical operations.
Sponsored by: Mellanox Technologies
- Define the kref structure identical to the one found in Linux.
- Update clients referring inside the kref structure.
- Implement kref_sub() for FreeBSD.
Reviewed by: np @
Sponsored by: Mellanox Technologies
- Reimplement ktime header file to distinguish more from Linux.
- Add new time header file to handle time related Linux functions.
Sponsored by: Mellanox Technologies
- Avoid using PAGE_MASK, because Linux defines it differently.
Use (PAGE_SIZE - 1) instead.
- Add support for for_each_sg_page() and sg_page_iter_dma_address().
Sponsored by: Mellanox Technologies
- Added support for multiple new Linux functions.
- Properly implement DEFINE_WAIT() and init_waitqueue_head() macros.
- Removed FreeBSD specific __wait_queue_head structure definition.
Sponsored by: Mellanox Technologies
Problem description:
How do we currently perform layer 2 resolution and header imposition:
For IPv4 we have the following chain:
ip_output() -> (ether|atm|whatever)_output() -> arpresolve()
Lookup is done in proper place (link-layer output routine) and it is possible
to provide cached lle data.
For IPv6 situation is more complex:
ip6_output() -> nd6_output() -> nd6_output_ifp() -> (whatever)_output() ->
nd6_storelladdr()
We have ip6_ouput() which calls nd6_output() instead of link output routine.
nd6_output() does the following:
* checks if lle exists, creates it if needed (similar to arpresolve())
* performes lle state transitions (similar to arpresolve())
* calls nd6_output_ifp() which pushes packets to link output routine along
with running SeND/MAC hooks regardless of lle state
(e.g. works as run-hooks placeholder).
After that, iface output routine like ether_output() calls nd6_storelladdr()
which performs lle lookup once again.
As a result, we perform lookup twice for each outgoing packet for most types
of interfaces. We also need to maintain runtime-checked table of 'nd6-free'
interfaces (see nd6_need_cache()).
Fix this behavior by eliminating first ND lookup. To be more specific:
* make all nd6_output() consumers use nd6_output_ifp() instead
* rename nd6_output[_slow]() to nd6_resolve_[slow]()
* convert nd6_resolve() and nd6_resolve_slow() to arpresolve() semantics,
e.g. copy L2 address to buffer instead of pushing packet towards lower
layers
* Make all nd6_storelladdr() users use nd6_resolve()
* eliminate nd6_storelladdr()
The resulting callchain is the following:
ip6_output() -> nd6_output_ifp() -> (whatever)_output() -> nd6_resolve()
Error handling:
Currently sending packet to non-existing la results in ip6_<output|forward>
-> nd6_output() -> nd6_output _lle() which returns 0.
In new scenario packet is propagated to <ether|whatever>_output() ->
nd6_resolve() which will return EWOULDBLOCK, and that result
will be converted to 0.
(And EWOULDBLOCK is actually used by IB/TOE code).
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D1469
before proceeding. Otherwise, nothing prevents it from running after the
MAD agent struct has been been freed, and this results in a use-after-free
when the task's ta_pending count is incremented in the callout handler.
MFC after: 2 weeks
Sponsored by: EMC / Isilon Storage Division
operations that map a single page that has an associated vm_page_t.
This does not permit mapping larger regions (such as a PCI memory
BAR) and it does not permit mapping addresses beyond the top of RAM
(such as a 64-bit BAR located above the top of RAM).
Instead of using a single OBJT_DEVICE object and passing the physaddr via
the offset as a hack, create a new sglist and OBJT_SG object for each
mmap request. The requested memory attribute is applied to the object
thus affecting all pages mapped by the request.
Reviewed by: hselasky, np
MFC after: 1 week
Sponsored by: Chelsio
Differential Revision: https://reviews.freebsd.org/D3386
the last OFED update (r278886).
iWARP on FreeBSD is properly integrated with the network stack and the
iWARP drivers _never_ operate out of any private TCP port-space that is
invisible to the kernel. Instead, an iWARP connection shows up as a TCP
socket (which is what it is) fully visible to the kernel and standard
tools like netstat, sockstat, etc.
order, but IN_ZERONET and IN_LOOPBACK expect it in host order.
Submitted by: Tao Liu <Tao.Liu@isilon.com>
MFC after: 1 week
Sponsored by: EMC / Isilon Storage Division
In tf_dequeue(), if we reach the end of the list without finding a
non-cancelled element, "tmp" will be a pointer into the list head, so the
tmp->canceled check is bogus. Use a flag instead.
Submitted by: Tao Liu <Tao.Liu@isilon.com>
Reviewed by: hselasky
MFC after: 1 week
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D3244
kmalloc() call. Make function global instead of static inline to fix
compiler warnings about passing variable argument lists to inline
functions.
MFC after: 1 week
Sponsored by: Mellanox Technologies
Use the same scheme implemented to manage credentials.
Code needing to look at process's credentials (as opposed to thred's) is
provided with *_proc variants of relevant functions.
Places which possibly had to take the proc lock anyway still use the proc
pointer to access limits.
with fresh firmware. The low level code is based on code provided by
Mellanox.
Thanks to Mellanox and their distributor Must (http://mustcompany.ru)
for providing hardware.
In collaboration with: Andre Melkoumian <andre mellanox.com>
Reviewed by: hselasky
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
years for head. However, it is continuously misused as the mpsafe argument
for callout_init(9). Deprecate the flag and clean up callout_init() calls
to make them more consistent.
Differential Revision: https://reviews.freebsd.org/D2613
Reviewed by: jhb
MFC after: 2 weeks
positive value since it adds the current tick to its result. This differs
from the behaviour in Linux, whose implementation does not add the extra
tick, so subtract the extra tick in the OFED compat layer implementation.
This addresses some incorrect handling of IB MAD timeouts, since some IB
code depends on msecs_to_jiffies(0) returning 0.
MFC after: 1 week
Sponsored by: EMC / Isilon Storage Division
lies within the last block of the bit set and no bits are set beyond the
offset, terminate the search immediately instead of continuing as though
there are further blocks in the set and subsequently returning an incorrect
result.
MFC after: 1 week
Sponsored by: EMC / Isilon Storage Division
is present in the tree. Otherwise there exists a window during which the
element could be removed by another thread, triggering an incorrect
assertion failure.
Reviewed by: jeff
MFC after: 1 week
Sponsored by: EMC / Isilon Storage Division
- make sure the timeout computations are always above zero by using
the existing "linux_timer_jiffies_until()" function. Negative timeouts
can result in undefined behaviour.
- declare all completion functions like external symbols and move the
code to the LinuxAPI kernel module.
- add a proper prefix to all LinuxAPI kernel functions to avoid
namespace collision with other parts of the FreeBSD kernel.
- clean up header file inclusions in the linux/completion.h, linux/in.h
and linux/fs.h header files.
MFC after: 1 week
Sponsored by: Mellanox Technologies
conversion:
- The linux compat API layer casts the ticks to unsigned long which
might cause problems when the ticks value is negative.
- Guard against already expired ticks values, by checking if the
passed expiry tick is already elapsed.
- While at it avoid referring the address of an inlined function.
MFC after: 3 days
Sponsored by: Mellanox Technologies
drivers can use it. This avoids some code duplication. Add missing
default case to all switch statements while at it. Also move the
hashing of the IPv6 flow field to layer 4 because the IPv6 flow field
is constant on a per L4 connection basis and not on a per L3 network.
Differential Revision: https://reviews.freebsd.org/D1987
Sponsored by: Mellanox Technologies
MFC after: 1 month
> List of fixes:
* use correct format for GID printouts
* double array indexing
* spelling in printouts
* void pointer arithmetic
* allow more receive rings
* correct maximum number of transmit rings
* use "const" instead of "static" for constants
* check for invalid VLAN tags
* check for lack of IRQ resources
> Added more hardware specific defines
> Added more verbose printouts of firmware status codes
Sponsored by: Mellanox Technologies
MFC after: 3 days
by placing appropriate #ifdefs around otherwise unused variables
or sections with functions called which are not available without
IPv6 support in the kernel.
Introduce fget_fcntl which performs appropriate checks when needed.
This removes a branch from fget_unlocked.
Introduce fget_mmap dealing with cap_rights_to_vmprot conversion.
This removes a branch from _fget.
Modify fget_unlocked to pass sequence counter to interested callers so
that they can perform their own checks and make sure the result was
otained from stable & current state.
Reviewed by: silence on -hackers
this option from all modules that enable it theirselves.
In C mode -fms-extensions option enables anonymous structs and unions,
allowing us to use this C11 feature in kernel. Of course, clang supports
it without any extra options.
Reviewed by: dim
Highlights:
- Multiple verbs API updates
- Support for RoCE, RDMA over ethernet
All hardware drivers depending on the common infiniband stack has been
updated aswell.
Discussed with: np @
Sponsored by: Mellanox Technologies
MFC after: 1 month
FreeBSD developers need more time to review patches in the surrounding
areas like the TCP stack which are using MPSAFE callouts to restore
distribution of callouts on multiple CPUs.
Bump the __FreeBSD_version instead of reverting it.
Suggested by: kmacy, adrian, glebius and kib
Differential Revision: https://reviews.freebsd.org/D1438
"MODULE_VERSION" macro definition. Remove the redefinition of the
"MODULE_VERSION" macro from the Linux kernel compatibility API.
MFC after: 1 month
Reported by: np@
Sponsored by: Mellanox Technologies
by dumbbell@ to be able to compile this layer as a dependency module.
Clean up some Makefiles and remove the no longer used OFED define.
Currently only i386 and amd64 targets are supported.
MFC after: 1 month
Sponsored by: Mellanox Technologies
- Close a migration race where callout_reset() failed to set the
CALLOUT_ACTIVE flag.
- Callout callback functions are now allowed to be protected by
spinlocks.
- Switching the callout CPU number cannot always be done on a
per-callout basis. See the updated timeout(9) manual page for more
information.
- The timeout(9) manual page has been updated to reflect how all the
functions inside the callout API are working. The manual page has
been made function oriented to make it easier to deduce how each of
the functions making up the callout API are working without having
to first read the whole manual page. Group all functions into a
handful of sections which should give a quick top-level overview
when the different functions should be used.
- The CALLOUT_SHAREDLOCK flag and its functionality has been removed
to reduce the complexity in the callout code and to avoid problems
about atomically stopping callouts via callout_stop(). If someone
needs it, it can be re-added. From my quick grep there are no
CALLOUT_SHAREDLOCK clients in the kernel.
- A new callout API function named "callout_drain_async()" has been
added. See the updated timeout(9) manual page for a complete
description.
- Update the callout clients in the "kern/" folder to use the callout
API properly, like cv_timedwait(). Previously there was some custom
sleepqueue code in the callout subsystem, which has been removed,
because we now allow callouts to be protected by spinlocks. This
allows us to tear down the callout like done with regular mutexes,
and a "td_slpmutex" has been added to "struct thread" to atomically
teardown the "td_slpcallout". Further the "TDF_TIMOFAIL" and
"SWT_SLEEPQTIMO" states can now be completely removed. Currently
they are marked as available and will be cleaned up in a follow up
commit.
- Bump the __FreeBSD_version to indicate kernel modules need
recompilation.
- There has been several reports that this patch "seems to squash a
serious bug leading to a callout timeout and panic".
Kernel build testing: all architectures were built
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D1438
Sponsored by: Mellanox Technologies
Reviewed by: jhb, adrian, sbruno and emaste
- Remove unsupported "bus" field from "struct pci_dev".
- Fix logic inside "pci_enable_msix()" when the number of allocated
interrupts are less than the number of available interrupts.
- Update header files included from "list.h".
- Ensure that "idr_destroy()" removes all entries before destroying
the IDR root node(s).
- Set the "device->release" function so that we don't leak memory at
device destruction.
- Use FreeBSD's "log()" function for certain debug printouts.
- Put parenthesis around arguments inside the min, max, min_t and max_t macros.
- Make sure we don't leak file descriptors by dropping the extra file
reference counts done by the FreeBSD kernel when calling falloc()
and fget_unlocked().
MFC after: 1 week
Sponsored by: Mellanox Technologies
Without this fix, a kernel with VIMAGE + Infiniband will panic on bootup.
Certain necessary #include statements require LIST_HEAD.
Add these includes to ofed/include/linux/list.h, because
LIST_HEAD is specifically overridden in this file.
PR: 191468
Differential Revision: D1279
Reviewed by: hselasky
from the FreeBSD network code. The flag is still kept around in the
"sys/mbuf.h" header file, but does no longer have any users. Instead
the "m_pkthdr.rsstype" field in the mbuf structure is now used to
decide the meaning of the "m_pkthdr.flowid" field. To modify the
"m_pkthdr.rsstype" field please use the existing "M_HASHTYPE_XXX"
macros as defined in the "sys/mbuf.h" header file.
This patch introduces new behaviour in the transmit direction.
Previously network drivers checked if "M_FLOWID" was set in "m_flags"
before using the "m_pkthdr.flowid" field. This check has now now been
replaced by checking if "M_HASHTYPE_GET(m)" is different from
"M_HASHTYPE_NONE". In the future more hashtypes will be added, for
example hashtypes for hardware dedicated flows.
"M_HASHTYPE_OPAQUE" indicates that the "m_pkthdr.flowid" value is
valid and has no particular type. This change removes the need for an
"if" statement in TCP transmit code checking for the presence of a
valid flowid value. The "if" statement mentioned above is now a direct
variable assignment which is then later checked by the respective
network drivers like before.
Additional notes:
- The SCTP code changes will be committed as a separate patch.
- Removal of the "M_FLOWID" flag will also be done separately.
- The FreeBSD version has been bumped.
MFC after: 1 month
Sponsored by: Mellanox Technologies
sending not ready data:
o Add new flag to pru_send() flags - PRUS_NOTREADY.
o Add new protocol method pru_ready().
Sponsored by: Nginx, Inc.
Sponsored by: Netflix
sb_cc member of struct sockbuf to a couple of inline functions:
sbavail() and sbused()
Right now they are equal, but once notion of "not ready socket buffer data",
will be checked in, they are going to be different.
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
- Wrong integer type was specified.
- Wrong or missing "access" specifier. The "access" specifier
sometimes included the SYSCTL type, which it should not, except for
procedural SYSCTL nodes.
- Logical OR where binary OR was expected.
- Properly assert the "access" argument passed to all SYSCTL macros,
using the CTASSERT macro. This applies to both static- and dynamically
created SYSCTLs.
- Properly assert the the data type for both static and dynamic
SYSCTLs. In the case of static SYSCTLs we only assert that the data
pointed to by the SYSCTL data pointer has the correct size, hence
there is no easy way to assert types in the C language outside a
C-function.
- Rewrote some code which doesn't pass a constant "access" specifier
when creating dynamic SYSCTL nodes, which is now a requirement.
- Updated "EXAMPLES" section in SYSCTL manual page.
MFC after: 3 days
Sponsored by: Mellanox Technologies
Mellanox hardware driver(s):
- Properly name an inclusion guard
- Fix compile warnings regarding unsigned enums
- Add two new sysctl nodes
- Remove all empty linux header files
- Make an error printout more verbose
- Use "mod_delayed_work()" instead of
cancelling and starting a timeout.
- Implement more Linux scatterlist
functions.
MFC after: 3 days
Sponsored by: Mellanox Technologies
struct kinfo_file.
- Move the various fill_*_info() methods out of kern_descrip.c and into the
various file type implementations.
- Rework the support for kinfo_ofile to generate a suitable kinfo_file object
for each file and then convert that to a kinfo_ofile structure rather than
keeping a second, different set of code that directly manipulates
type-specific file information.
- Remove the shm_path() and ksem_info() layering violations.
Differential Revision: https://reviews.freebsd.org/D775
Reviewed by: kib, glebius (earlier version)
The current TSO limitation feature only takes the total number of
bytes in an mbuf chain into account and does not limit by the number
of mbufs in a chain. Some kinds of hardware is limited by two
factors. One is the fragment length and the second is the fragment
count. Both of these limits need to be taken into account when doing
TSO. Else some kinds of hardware might have to drop completely valid
mbuf chains because they cannot loaded into the given hardware's DMA
engine. The new way of doing TSO limitation has been made backwards
compatible as input from other FreeBSD developers and will use
defaults for values not set.
Reviewed by: adrian, rmacklem
Sponsored by: Mellanox Technologies
MFC after: 1 week
The current TSO limitation feature only takes the total number of
bytes in an mbuf chain into account and does not limit by the number
of mbufs in a chain. Some kinds of hardware is limited by two
factors. One is the fragment length and the second is the fragment
count. Both of these limits need to be taken into account when doing
TSO. Else some kinds of hardware might have to drop completely valid
mbuf chains because they cannot loaded into the given hardware's DMA
engine. The new way of doing TSO limitation has been made backwards
compatible as input from other FreeBSD developers and will use
defaults for values not set.
MFC after: 1 week
Sponsored by: Mellanox Technologies
- Add invfo_rdwr() (for read and write), invfo_ioctl(), invfo_poll(),
and invfo_kqfilter() for use by file types that do not support the
respective operations. Home-grown versions of invfo_poll() were
universally broken (they returned an errno value, invfo_poll()
uses poll_no_poll() to return an appropriate event mask). Home-grown
ioctl routines also tended to return an incorrect errno (invfo_ioctl
returns ENOTTY).
- Use the invfo_*() functions instead of local versions for
unsupported file operations.
- Reorder fileops members to match the order in the structure definition
to make it easier to spot missing members.
- Add several missing methods to linuxfileops used by the OFED shim
layer: fo_write(), fo_truncate(), fo_kqfilter(), and fo_stat(). Most
of these used invfo_*(), but a dummy fo_stat() implementation was
added.
argument but not actually defined anywhere.
This fixes the compile complaining about
"declaration of 'struct kiocb' will not be visible outside of this function".
MFC after: 2 weeks
X-MFC with: whatever changed caused the breakage ;-)
hardware driver update from Mellanox Technologies.
- Remove empty files from the OFED Linux Emulation layer.
- Fix compile warnings related to printf() and the "%lld" and "%llx"
format specifiers.
- Add some missing 2-clause BSD copyrights.
- Add "Mellanox Technologies, Ltd." to list of copyright holders.
- Add some new compatibility files.
- Fix order of uninit in the mlx4ib module to avoid crash at unload
using the new module_exit_order() function.
MFC after: 1 week
Sponsored by: Mellanox Technologies
Make the functions pci_disable_msi, pci_enable_msi and pci_enable_msix
methods of the newbus PCI bus. This code should not include any
functional change.
Sponsored by: Citrix Systems R&D
Reviewed by: imp, jhb
Differential Revision: https://reviews.freebsd.org/D354
dev/pci/pci.c:
- Convert the mentioned functions to newbus methods.
- Fix the callers of the converted functions.
sys/dev/pci/pci_private.h:
dev/pci/pci_if.m:
- Declare the new methods.
dev/pci/pcivar.h:
- Add helpers to call the newbus methods.
ofed/include/linux/pci.h:
- Add define to prevent the ofed version of pci_enable_msix from
clashing with the FreeBSD native version.
trees. This happens because the logic inserting items into the radix
tree is allocating empty radix levels, when index zero does not
contain any items.
- Add proper error case handling, so that the radix tree does not end
up in a bad state, if memory cannot be allocated during insertion of
an item.
- Add check for inserting NULL items into the radix tree.
- Add check for radix tree getting too big.
MFC after: 1 week
Sponsored by: Mellanox Technologies
prior to starting "/sbin/init" which will run all the "/etc/rc.d/xxx"
scripts. Else there can be a race configuring the interfaces via
"/etc/rc.conf".
MFC after: 4 weeks
Sponsored by: Mellanox Technologies
Remove duplicate "debug_ktr.mask" sysctl definition.
Remove now unused variable from "kern_ktr.c".
This fixes build of "ktr" which was broken by r267961.
Let the default value for "vm_kmem_size_scale" be zero. It is setup
after that the sysctl has been initialized from "getenv()" in the
"kmeminit()" function to equal the "VM_KMEM_SIZE_MAX" value, if
zero. On Sparc64 the "VM_KMEM_SIZE_MAX" macro is not a constant. This
fixes build of Sparc64 which was broken by r267961.
Add a special macro to dynamically create SYSCTL root nodes, because
root nodes have a special parent. This fixes build of existing OFED
module and CANBUS module for pc98 which was broken by r267961.
Add missing "sysctl.h" includes to get the needed sysctl header file
declarations. This is needed after r267961.
MFC after: 2 weeks
To reduce the diff struct pcu.cnt field was not renamed, so
PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in
kvm(3) and vmstat(8). The goal was to not affect externally used KPI.
Bump __FreeBSD_version_ in case some out-of-tree module/code relies on the
the global cnt variable.
Exp-run revealed no ports using it directly.
No objection from: arch@
Sponsored by: EMC / Isilon Storage Division
interface, in the r241616 a crutch was provided. It didn't work well, and
finally we decided that it is time to break ABI and simply make if_baudrate
a 64-bit value. Meanwhile, the entire struct if_data was reviewed.
o Remove the if_baudrate_pf crutch.
o Make all fields of struct if_data fixed machine independent size. The
notion of data (packet counters, etc) are by no means MD. And it is a
bug that on amd64 we've got a 64-bit counters, while on i386 32-bit,
which at modern speeds overflow within a second.
This also removes quite a lot of COMPAT_FREEBSD32 code.
o Give 16 bit for the ifi_datalen field. This field was provided to
make future changes to if_data less ABI breaking. Unfortunately the
8 bit size of it had effectively limited sizeof if_data to 256 bytes.
o Give 32 bits to ifi_mtu and ifi_metric.
o Give 64 bits to the rest of fields, since they are counters.
__FreeBSD_version bumped.
Discussed with: emax
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
callback providers. link_init_sdl() function can be used to
fill most of the parameters. Use caller stack instead of
allocation / freing memory for each request. Do not drop support
for extra-long (probably non-existing) link-layer protocols by
introducing link_alloc_sdl() (used by if_resolvemulti() callback)
and link_free_sdl() (used by caller).
Since this change breaks KBI, MFC requires slightly different approach
(link_init_sdl() auto-allocating buffer if necessary to handle cases
with unmodified if_resolvemulti() callers).
MFC after: 2 weeks
modules which require this flag to compile. Use a GCC_MS_EXTENSIONS
variable, defined in kern.pre.mk, which can be used to easily supply the
flag (or not), depending on the compiler type.
MFC after: 3 days
We need to do this because the Linux compat layer uses sx(9) for
mutex, however the lagg code uses rmlocks and calls into the mellanox
driver. This causes deadlock due to sleeping while holding a rmlock.
Submitted by: Shahar Klein (shahark mellanox.com)
MFC After: 3 days.
shifts into the sign bit. Instead use (1U << 31) which gets the
expected result.
This fix is not ideal as it assumes a 32 bit int, but does fix the issue
for most cases.
A similar change was made in OpenBSD.
Discussed with: -arch, rdivacky
Reviewed by: cperciva
to this event, adding if_var.h to files that do need it. Also, include
all includes that now are included due to implicit pollution via if_var.h
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
The order of releasing resources in mlxen was wrong, which caused
panic on reload of the module.
conf_ctx list should be released before stat_ctx list, otherwise
the leafs in conf_ctx list won't be released because of the dependancy.
The fix is to change the order of the releases.
Submitted by: Shahar Klein (shahark at mellanox.com)
The device name was incorrect due to a specific function we ported
from the Linux driver that is not FBSD compatible. This resulted
with a false sysctl registration and some more problematic issues.
The patch basically revokes it all together.
Submitted by: Meny Yossefi (menyy mellanox.com)
Approved by: re
When calling to ib_uverbs_cleanup_ucontext, there is a call to
mutex_lock of xrcd_table_mutex, which was not initialized.
Added missing initialization for xrcd_table_mutex.
Submitted by: Orit Moskovich (oritm mellanox.com)
Approved by: re
Removed the ifdef linux from this function.
Added stub function for contiguous pages to avoid compilation
errors.
Submitted by: Orit Moskovich (oritm mellanox.com)
Approved by: re
srq_pingpong IBverbs
Removed refrences using 'ifdef __linux__' to qpg functions and
related fields in struct
ib_qp_init_attr.
Submitted by: Orit Moskovich (oritm mellanox.com)
Approved by: re
Changed module init from module_init() to module_init_order() with
SI_ORDER_MIDDLE flag
Submitted by: Orit Moskovich (oritm mellanox.com)
Approved by: re
Update the OFED Infiniband core to the version supplied in Linux
version 3.7.
The update to OFED is nearly all additional defines and functions
with the exception of the addition of additional parameters to
ib_register_device() and the reg_user_mr callback.
In addition the ibcore (Infiniband core) and ipoib (IP over Infiniband)
have both been made into completely loadable modules to facilitate
testing of the OFED stack in FreeBSD.
Finally the Mellanox Infiniband drivers are now updated to the
latest version shipping with Linux 3.7.
Submitted by: Mellanox FreeBSD driver team:
Oded Shanoon (odeds mellanox.com),
Meny Yossefi (menyy mellanox.com),
Orit Moskovich (oritm mellanox.com)
Approved by: re
into threads each processing queue in a single domain. The structure
of the pagedaemons and queues is kept intact, most of the changes come
from the need for code to find an owning page queue for given page,
calculated from the segment containing the page.
The tie between NUMA domain and pagedaemon thread/pagequeue split is
rather arbitrary, the multithreaded daemon could be allowed for the
single-domain machines, or one domain might be split into several page
domains, to further increase concurrency.
Right now, each pagedaemon thread tries to reach the global target,
precalculated at the start of the pass. This is not optimal, since it
could cause excessive page deactivation and freeing. The code should
be changed to re-check the global page deficit state in the loop after
some number of iterations.
The pagedaemons reach the quorum before starting the OOM, since one
thread inability to meet the target is normal for split queues. Only
when all pagedaemons fail to produce enough reusable pages, OOM is
started by single selected thread.
Launder is modified to take into account the segments layout with
regard to the region for which cleaning is performed.
Based on the preliminary patch by jeff, sponsored by EMC / Isilon
Storage Division.
Reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
transparent layering and better fragmentation.
- Normalize functions that allocate memory to use kmem_*
- Those that allocate address space are named kva_*
- Those that operate on maps are named kmap_*
- Implement recursive allocation handling for kmem_arena in vmem.
Reviewed by: alc
Tested by: pho
Sponsored by: EMC / Isilon Storage Division
- Remove an incorrect assertion that can trigger when downing an interface.
- Stop the interface during detach to avoid panics when unloading the
driver.
- A few locking fixes to be more consistent with other FreeBSD drivers:
- Protect if_drv_flags with the driver lock, not atomic ops
- Hold the driver lock when adjusting multicast state.
- Hold the driver lock while adjusting if_capenable.
PR: kern/180791 [1,2]
Submitted by: Shakar Klein @ Mellanox [1,2]
MFC after: 3 days
- Only enable UDP/TCP hardware checksums if CSUM_UDP or CSUM_TCP is set.
- Only enable IP hardware checksums if CSUM_IP is set.
PR: kern/180430
Submitted by: Meny Yossefi <menyy@mellanox.com>
MFC after: 1 week
Also directly call swapper() at the end of mi_startup instead of
relying on swapper being the last thing in sysinits order.
Rationale:
- "RUN_SCHEDULER" was misleading, scheduling already takes place at that stage
- "scheduler" was misleading, the function swaps in the swapped out processes
- another SYSINIT(SI_SUB_RUN_SCHEDULER, SI_ORDER_ANY) could never be
invoked depending on its relative order with scheduler; this was not obvious
and the bug actually used to exist
Reviewed by: kib (ealier version)
MFC after: 14 days
generic and apply to all sysfs attributes:
- Use sysctl_handle_string() instead of reimplementing it.
- Remove trailing newline from the current value before passing it to
userland and append a newline to the new string value before passing it
to the attribute's store function.
- Don't leak the temporary buffer if the first error check triggers.
- Revert earlier change to mlx4 port mode handler.
PR: kern/174213
Submitted by: Garrett Cooper
Reviewed by: Shakar Klein @ Mellanox
MFC after: 1 week
to Ethernet and the subsequent port being set to IB.
Submitted by: Shakar Klein @ Mellanox
Tested by: Morgan Robertson <morganrobertson@gmail.com>
MFC after: 1 week