Use the Toeplitz hash value as source for the flowid. This makes the
hash value more suitable for so-called hash bucket algorithms which
are used in the FreeBSD's TCP/IP stack when RSS is enabled.
Sponsored by: Mellanox Technologies
MFC after: 1 week
ipoib module.
The bpfdetach() function is trying to turn off promiscious mode on the
network interface it is attached to while holding a mutex. The fix
consists of ignoring any further calls to the ipoib_ioctl() function
when the network interface is going to be detached. The ipoib_ioctl()
function might sleep.
Sponsored by: Mellanox Technologies
MFC after: 1 week
- Add some new hlist macros.
- Update existing hlist macros removing the need for a temporary
iteration variable.
- Properly define the RCU hlist macros to be SMP safe with regard
to RCU.
- Safe list macro arguments by adding a pair of parentheses.
- Prefix the _list_add() and _list_splice() functions with "linux"
to reflect they are LinuxKPI internal functions.
Obtained from: Linux
MFC after: 1 week
Sponsored by: Mellanox Technologies
The iWARP Connection Manager (CM) on FreeBSD creates a TCP socket to
represent an iWARP endpoint when the connection is over TCP. For
servers the current approach is to invoke create_listen callback for
each iWARP RNIC registered with the CM. This doesn't work too well for
INADDR_ANY because a listen on any TCP socket already notifies all
hardware TOEs/RNICs of the new listener. This patch fixes the server
side of things for FreeBSD. We've tried to keep all these modifications
in the iWARP/TCP specific parts of the OFED infrastructure as much as
possible.
Submitted by: Krishnamraju Eraparaju @ Chelsio (with design inputs from Steve Wise)
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D4801
The only piece of information that is required is rt_flags subset.
In particular, if_loop() requires RTF_REJECT and RTF_BLACKHOLE flags
to check if this particular mbuf needs to be dropped (and what
error should be returned).
Note that if_loop() will always return EHOSTUNREACH for "reject" routes
regardless of RTF_HOST flag existence. This is due to upcoming routing
changes where RTF_HOST value won't be available as lookup result.
All other functions require RTF_GATEWAY flag to check if they need
to return EHOSTUNREACH instead of EHOSTDOWN error.
There are 11 places where non-zero 'struct route' is passed to if_output().
For most of the callers (forwarding, bpf, arp) does not care about exact
error value. In fact, the only place where this result is propagated
is ip_output(). (ip6_output() passes NULL route to nd6_output_ifp()).
Given that, add 3 new 'struct route' flags (RT_REJECT, RT_BLACKHOLE and
RT_IS_GW) and inline function (rt_update_ro_flags()) to copy necessary
rte flags to ro_flags. Call this function in ip_output() after looking up/
verifying rte.
Reviewed by: ae
sbappendstream() does. Although, M_NOTREADY may appear only on SOCK_STREAM
sockets, due to sendfile(2) supporting only the latter, there is a corner
case of AF_UNIX/SOCK_STREAM socket, that still uses records for the sake
of control data, albeit being stream socket.
Provide private version of m_clrprotoflags(), which understands PRUS_NOTREADY,
similar to m_demote().
Add if_requestencap() interface method which is capable of calculating
various link headers for given interface. Right now there is support
for INET/INET6/ARP llheader calculation (IFENCAP_LL type request).
Other types are planned to support more complex calculation
(L2 multipath lagg nexthops, tunnel encap nexthops, etc..).
Reshape 'struct route' to be able to pass additional data (with is length)
to prepend to mbuf.
These two changes permits routing code to pass pre-calculated nexthop data
(like L2 header for route w/gateway) down to the stack eliminating the
need for other lookups. It also brings us closer to more complex scenarios
like transparently handling MPLS nexthops and tunnel interfaces.
Last, but not least, it removes layering violation introduced by flowtable
code (ro_lle) and simplifies handling of existing if_output consumers.
ARP/ND changes:
Make arp/ndp stack pre-calculate link header upon installing/updating lle
record. Interface link address change are handled by re-calculating
headers for all lles based on if_lladdr event. After these changes,
arpresolve()/nd6_resolve() returns full pre-calculated header for
supported interfaces thus simplifying if_output().
Move these lookups to separate ether_resolve_addr() function which ether
returs error or fully-prepared link header. Add <arp|nd6_>resolve_addr()
compat versions to return link addresses instead of pre-calculated data.
BPF changes:
Raw bpf writes occupied _two_ cases: AF_UNSPEC and pseudo_AF_HDRCMPLT.
Despite the naming, both of there have ther header "complete". The only
difference is that interface source mac has to be filled by OS for
AF_UNSPEC (controlled via BIOCGHDRCMPLT). This logic has to stay inside
BPF and not pollute if_output() routines. Convert BPF to pass prepend data
via new 'struct route' mechanism. Note that it does not change
non-optimized if_output(): ro_prepend handling is purely optional.
Side note: hackish pseudo_AF_HDRCMPLT is supported for ethernet and FDDI.
It is not needed for ethernet anymore. The only remaining FDDI user is
dev/pdq mostly untouched since 2007. FDDI support was eliminated from
OpenBSD in 2013 (sys/net/if_fddisubr.c rev 1.65).
Flowtable changes:
Flowtable violates layering by saving (and not correctly managing)
rtes/lles. Instead of passing lle pointer, pass pointer to pre-calculated
header data from that lle.
Differential Revision: https://reviews.freebsd.org/D4102
vtophys() when loading mbufs for transmission and reception. While at
it all pointer arithmetic and cast qualifier issues were fixed, mostly
related to transmission and reception.
MFC after: 1 week
Sponsored by: Mellanox Technologies
Differential Revision: https://reviews.freebsd.org/D4284
- Added support for dumping the SFP EEPROM content to dmesg.
- Fixed handling of network interface capability IOCTLs.
- Fixed race when loading and unloading the mlxen driver by applying
appropriate locking.
- Removed two unused C-files.
MFC after: 1 week
Submitted by: Mark Bloch <markb@mellanox.com>
Sponsored by: Mellanox Technologies
Differential Revision: https://reviews.freebsd.org/D4283
using GCC for 32-bit platforms. The integer size in this case is
hardcoded 64-bit while the pointer size is 32-bit.
Sponsored by: Mellanox Technologies
MFC after: 2 weeks
- Move all files related to the LinuxKPI into sys/compat/linuxkpi and
its subfolders.
- Update sys/conf/files and some Makefiles to use new file locations.
- Added description of COMPAT_LINUXKPI to sys/conf/NOTES which in turn
adds the LinuxKPI to all LINT builds.
- The LinuxKPI can be added to the kernel by setting the
COMPAT_LINUXKPI option. The OFED kernel option no longer builds the
LinuxKPI into the kernel. This was done to keep the build rules for
the LinuxKPI in sys/conf/files simple.
- Extend the LinuxKPI module to include support for USB by moving the
Linux USB compat from usb.ko to linuxkpi.ko.
- Bump the FreeBSD_version.
- A universe kernel build has been done.
Reviewed by: np @ (cxgb and cxgbe related changes only)
Sponsored by: Mellanox Technologies
- Remove redundant NBLONG macro and use BIT_WORD()
and BIT_MASK() instead.
- Correctly define BIT_MASK() according to Linux and
update all users of this macro.
- Add missing GENMASK() macro.
- Remove all comments deriving from Linux.
Sponsored by: Mellanox Technologies
- Redefine DIV_ROUND_UP as a function macro taking two arguments
instead of none.
- Implement more Linux kernel functions related to various forms
of DELAY() and basic mathematical operations.
Sponsored by: Mellanox Technologies
- Define the kref structure identical to the one found in Linux.
- Update clients referring inside the kref structure.
- Implement kref_sub() for FreeBSD.
Reviewed by: np @
Sponsored by: Mellanox Technologies
- Reimplement ktime header file to distinguish more from Linux.
- Add new time header file to handle time related Linux functions.
Sponsored by: Mellanox Technologies
- Avoid using PAGE_MASK, because Linux defines it differently.
Use (PAGE_SIZE - 1) instead.
- Add support for for_each_sg_page() and sg_page_iter_dma_address().
Sponsored by: Mellanox Technologies
- Added support for multiple new Linux functions.
- Properly implement DEFINE_WAIT() and init_waitqueue_head() macros.
- Removed FreeBSD specific __wait_queue_head structure definition.
Sponsored by: Mellanox Technologies
Problem description:
How do we currently perform layer 2 resolution and header imposition:
For IPv4 we have the following chain:
ip_output() -> (ether|atm|whatever)_output() -> arpresolve()
Lookup is done in proper place (link-layer output routine) and it is possible
to provide cached lle data.
For IPv6 situation is more complex:
ip6_output() -> nd6_output() -> nd6_output_ifp() -> (whatever)_output() ->
nd6_storelladdr()
We have ip6_ouput() which calls nd6_output() instead of link output routine.
nd6_output() does the following:
* checks if lle exists, creates it if needed (similar to arpresolve())
* performes lle state transitions (similar to arpresolve())
* calls nd6_output_ifp() which pushes packets to link output routine along
with running SeND/MAC hooks regardless of lle state
(e.g. works as run-hooks placeholder).
After that, iface output routine like ether_output() calls nd6_storelladdr()
which performs lle lookup once again.
As a result, we perform lookup twice for each outgoing packet for most types
of interfaces. We also need to maintain runtime-checked table of 'nd6-free'
interfaces (see nd6_need_cache()).
Fix this behavior by eliminating first ND lookup. To be more specific:
* make all nd6_output() consumers use nd6_output_ifp() instead
* rename nd6_output[_slow]() to nd6_resolve_[slow]()
* convert nd6_resolve() and nd6_resolve_slow() to arpresolve() semantics,
e.g. copy L2 address to buffer instead of pushing packet towards lower
layers
* Make all nd6_storelladdr() users use nd6_resolve()
* eliminate nd6_storelladdr()
The resulting callchain is the following:
ip6_output() -> nd6_output_ifp() -> (whatever)_output() -> nd6_resolve()
Error handling:
Currently sending packet to non-existing la results in ip6_<output|forward>
-> nd6_output() -> nd6_output _lle() which returns 0.
In new scenario packet is propagated to <ether|whatever>_output() ->
nd6_resolve() which will return EWOULDBLOCK, and that result
will be converted to 0.
(And EWOULDBLOCK is actually used by IB/TOE code).
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D1469
before proceeding. Otherwise, nothing prevents it from running after the
MAD agent struct has been been freed, and this results in a use-after-free
when the task's ta_pending count is incremented in the callout handler.
MFC after: 2 weeks
Sponsored by: EMC / Isilon Storage Division
operations that map a single page that has an associated vm_page_t.
This does not permit mapping larger regions (such as a PCI memory
BAR) and it does not permit mapping addresses beyond the top of RAM
(such as a 64-bit BAR located above the top of RAM).
Instead of using a single OBJT_DEVICE object and passing the physaddr via
the offset as a hack, create a new sglist and OBJT_SG object for each
mmap request. The requested memory attribute is applied to the object
thus affecting all pages mapped by the request.
Reviewed by: hselasky, np
MFC after: 1 week
Sponsored by: Chelsio
Differential Revision: https://reviews.freebsd.org/D3386
the last OFED update (r278886).
iWARP on FreeBSD is properly integrated with the network stack and the
iWARP drivers _never_ operate out of any private TCP port-space that is
invisible to the kernel. Instead, an iWARP connection shows up as a TCP
socket (which is what it is) fully visible to the kernel and standard
tools like netstat, sockstat, etc.
order, but IN_ZERONET and IN_LOOPBACK expect it in host order.
Submitted by: Tao Liu <Tao.Liu@isilon.com>
MFC after: 1 week
Sponsored by: EMC / Isilon Storage Division
In tf_dequeue(), if we reach the end of the list without finding a
non-cancelled element, "tmp" will be a pointer into the list head, so the
tmp->canceled check is bogus. Use a flag instead.
Submitted by: Tao Liu <Tao.Liu@isilon.com>
Reviewed by: hselasky
MFC after: 1 week
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D3244
kmalloc() call. Make function global instead of static inline to fix
compiler warnings about passing variable argument lists to inline
functions.
MFC after: 1 week
Sponsored by: Mellanox Technologies
Use the same scheme implemented to manage credentials.
Code needing to look at process's credentials (as opposed to thred's) is
provided with *_proc variants of relevant functions.
Places which possibly had to take the proc lock anyway still use the proc
pointer to access limits.
with fresh firmware. The low level code is based on code provided by
Mellanox.
Thanks to Mellanox and their distributor Must (http://mustcompany.ru)
for providing hardware.
In collaboration with: Andre Melkoumian <andre mellanox.com>
Reviewed by: hselasky
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
years for head. However, it is continuously misused as the mpsafe argument
for callout_init(9). Deprecate the flag and clean up callout_init() calls
to make them more consistent.
Differential Revision: https://reviews.freebsd.org/D2613
Reviewed by: jhb
MFC after: 2 weeks
positive value since it adds the current tick to its result. This differs
from the behaviour in Linux, whose implementation does not add the extra
tick, so subtract the extra tick in the OFED compat layer implementation.
This addresses some incorrect handling of IB MAD timeouts, since some IB
code depends on msecs_to_jiffies(0) returning 0.
MFC after: 1 week
Sponsored by: EMC / Isilon Storage Division
lies within the last block of the bit set and no bits are set beyond the
offset, terminate the search immediately instead of continuing as though
there are further blocks in the set and subsequently returning an incorrect
result.
MFC after: 1 week
Sponsored by: EMC / Isilon Storage Division
is present in the tree. Otherwise there exists a window during which the
element could be removed by another thread, triggering an incorrect
assertion failure.
Reviewed by: jeff
MFC after: 1 week
Sponsored by: EMC / Isilon Storage Division
- make sure the timeout computations are always above zero by using
the existing "linux_timer_jiffies_until()" function. Negative timeouts
can result in undefined behaviour.
- declare all completion functions like external symbols and move the
code to the LinuxAPI kernel module.
- add a proper prefix to all LinuxAPI kernel functions to avoid
namespace collision with other parts of the FreeBSD kernel.
- clean up header file inclusions in the linux/completion.h, linux/in.h
and linux/fs.h header files.
MFC after: 1 week
Sponsored by: Mellanox Technologies
conversion:
- The linux compat API layer casts the ticks to unsigned long which
might cause problems when the ticks value is negative.
- Guard against already expired ticks values, by checking if the
passed expiry tick is already elapsed.
- While at it avoid referring the address of an inlined function.
MFC after: 3 days
Sponsored by: Mellanox Technologies
drivers can use it. This avoids some code duplication. Add missing
default case to all switch statements while at it. Also move the
hashing of the IPv6 flow field to layer 4 because the IPv6 flow field
is constant on a per L4 connection basis and not on a per L3 network.
Differential Revision: https://reviews.freebsd.org/D1987
Sponsored by: Mellanox Technologies
MFC after: 1 month
> List of fixes:
* use correct format for GID printouts
* double array indexing
* spelling in printouts
* void pointer arithmetic
* allow more receive rings
* correct maximum number of transmit rings
* use "const" instead of "static" for constants
* check for invalid VLAN tags
* check for lack of IRQ resources
> Added more hardware specific defines
> Added more verbose printouts of firmware status codes
Sponsored by: Mellanox Technologies
MFC after: 3 days
by placing appropriate #ifdefs around otherwise unused variables
or sections with functions called which are not available without
IPv6 support in the kernel.
Introduce fget_fcntl which performs appropriate checks when needed.
This removes a branch from fget_unlocked.
Introduce fget_mmap dealing with cap_rights_to_vmprot conversion.
This removes a branch from _fget.
Modify fget_unlocked to pass sequence counter to interested callers so
that they can perform their own checks and make sure the result was
otained from stable & current state.
Reviewed by: silence on -hackers
this option from all modules that enable it theirselves.
In C mode -fms-extensions option enables anonymous structs and unions,
allowing us to use this C11 feature in kernel. Of course, clang supports
it without any extra options.
Reviewed by: dim
Highlights:
- Multiple verbs API updates
- Support for RoCE, RDMA over ethernet
All hardware drivers depending on the common infiniband stack has been
updated aswell.
Discussed with: np @
Sponsored by: Mellanox Technologies
MFC after: 1 month
FreeBSD developers need more time to review patches in the surrounding
areas like the TCP stack which are using MPSAFE callouts to restore
distribution of callouts on multiple CPUs.
Bump the __FreeBSD_version instead of reverting it.
Suggested by: kmacy, adrian, glebius and kib
Differential Revision: https://reviews.freebsd.org/D1438
"MODULE_VERSION" macro definition. Remove the redefinition of the
"MODULE_VERSION" macro from the Linux kernel compatibility API.
MFC after: 1 month
Reported by: np@
Sponsored by: Mellanox Technologies
by dumbbell@ to be able to compile this layer as a dependency module.
Clean up some Makefiles and remove the no longer used OFED define.
Currently only i386 and amd64 targets are supported.
MFC after: 1 month
Sponsored by: Mellanox Technologies
- Close a migration race where callout_reset() failed to set the
CALLOUT_ACTIVE flag.
- Callout callback functions are now allowed to be protected by
spinlocks.
- Switching the callout CPU number cannot always be done on a
per-callout basis. See the updated timeout(9) manual page for more
information.
- The timeout(9) manual page has been updated to reflect how all the
functions inside the callout API are working. The manual page has
been made function oriented to make it easier to deduce how each of
the functions making up the callout API are working without having
to first read the whole manual page. Group all functions into a
handful of sections which should give a quick top-level overview
when the different functions should be used.
- The CALLOUT_SHAREDLOCK flag and its functionality has been removed
to reduce the complexity in the callout code and to avoid problems
about atomically stopping callouts via callout_stop(). If someone
needs it, it can be re-added. From my quick grep there are no
CALLOUT_SHAREDLOCK clients in the kernel.
- A new callout API function named "callout_drain_async()" has been
added. See the updated timeout(9) manual page for a complete
description.
- Update the callout clients in the "kern/" folder to use the callout
API properly, like cv_timedwait(). Previously there was some custom
sleepqueue code in the callout subsystem, which has been removed,
because we now allow callouts to be protected by spinlocks. This
allows us to tear down the callout like done with regular mutexes,
and a "td_slpmutex" has been added to "struct thread" to atomically
teardown the "td_slpcallout". Further the "TDF_TIMOFAIL" and
"SWT_SLEEPQTIMO" states can now be completely removed. Currently
they are marked as available and will be cleaned up in a follow up
commit.
- Bump the __FreeBSD_version to indicate kernel modules need
recompilation.
- There has been several reports that this patch "seems to squash a
serious bug leading to a callout timeout and panic".
Kernel build testing: all architectures were built
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D1438
Sponsored by: Mellanox Technologies
Reviewed by: jhb, adrian, sbruno and emaste
- Remove unsupported "bus" field from "struct pci_dev".
- Fix logic inside "pci_enable_msix()" when the number of allocated
interrupts are less than the number of available interrupts.
- Update header files included from "list.h".
- Ensure that "idr_destroy()" removes all entries before destroying
the IDR root node(s).
- Set the "device->release" function so that we don't leak memory at
device destruction.
- Use FreeBSD's "log()" function for certain debug printouts.
- Put parenthesis around arguments inside the min, max, min_t and max_t macros.
- Make sure we don't leak file descriptors by dropping the extra file
reference counts done by the FreeBSD kernel when calling falloc()
and fget_unlocked().
MFC after: 1 week
Sponsored by: Mellanox Technologies
Without this fix, a kernel with VIMAGE + Infiniband will panic on bootup.
Certain necessary #include statements require LIST_HEAD.
Add these includes to ofed/include/linux/list.h, because
LIST_HEAD is specifically overridden in this file.
PR: 191468
Differential Revision: D1279
Reviewed by: hselasky
from the FreeBSD network code. The flag is still kept around in the
"sys/mbuf.h" header file, but does no longer have any users. Instead
the "m_pkthdr.rsstype" field in the mbuf structure is now used to
decide the meaning of the "m_pkthdr.flowid" field. To modify the
"m_pkthdr.rsstype" field please use the existing "M_HASHTYPE_XXX"
macros as defined in the "sys/mbuf.h" header file.
This patch introduces new behaviour in the transmit direction.
Previously network drivers checked if "M_FLOWID" was set in "m_flags"
before using the "m_pkthdr.flowid" field. This check has now now been
replaced by checking if "M_HASHTYPE_GET(m)" is different from
"M_HASHTYPE_NONE". In the future more hashtypes will be added, for
example hashtypes for hardware dedicated flows.
"M_HASHTYPE_OPAQUE" indicates that the "m_pkthdr.flowid" value is
valid and has no particular type. This change removes the need for an
"if" statement in TCP transmit code checking for the presence of a
valid flowid value. The "if" statement mentioned above is now a direct
variable assignment which is then later checked by the respective
network drivers like before.
Additional notes:
- The SCTP code changes will be committed as a separate patch.
- Removal of the "M_FLOWID" flag will also be done separately.
- The FreeBSD version has been bumped.
MFC after: 1 month
Sponsored by: Mellanox Technologies
sending not ready data:
o Add new flag to pru_send() flags - PRUS_NOTREADY.
o Add new protocol method pru_ready().
Sponsored by: Nginx, Inc.
Sponsored by: Netflix
sb_cc member of struct sockbuf to a couple of inline functions:
sbavail() and sbused()
Right now they are equal, but once notion of "not ready socket buffer data",
will be checked in, they are going to be different.
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
- Wrong integer type was specified.
- Wrong or missing "access" specifier. The "access" specifier
sometimes included the SYSCTL type, which it should not, except for
procedural SYSCTL nodes.
- Logical OR where binary OR was expected.
- Properly assert the "access" argument passed to all SYSCTL macros,
using the CTASSERT macro. This applies to both static- and dynamically
created SYSCTLs.
- Properly assert the the data type for both static and dynamic
SYSCTLs. In the case of static SYSCTLs we only assert that the data
pointed to by the SYSCTL data pointer has the correct size, hence
there is no easy way to assert types in the C language outside a
C-function.
- Rewrote some code which doesn't pass a constant "access" specifier
when creating dynamic SYSCTL nodes, which is now a requirement.
- Updated "EXAMPLES" section in SYSCTL manual page.
MFC after: 3 days
Sponsored by: Mellanox Technologies
Mellanox hardware driver(s):
- Properly name an inclusion guard
- Fix compile warnings regarding unsigned enums
- Add two new sysctl nodes
- Remove all empty linux header files
- Make an error printout more verbose
- Use "mod_delayed_work()" instead of
cancelling and starting a timeout.
- Implement more Linux scatterlist
functions.
MFC after: 3 days
Sponsored by: Mellanox Technologies
struct kinfo_file.
- Move the various fill_*_info() methods out of kern_descrip.c and into the
various file type implementations.
- Rework the support for kinfo_ofile to generate a suitable kinfo_file object
for each file and then convert that to a kinfo_ofile structure rather than
keeping a second, different set of code that directly manipulates
type-specific file information.
- Remove the shm_path() and ksem_info() layering violations.
Differential Revision: https://reviews.freebsd.org/D775
Reviewed by: kib, glebius (earlier version)
The current TSO limitation feature only takes the total number of
bytes in an mbuf chain into account and does not limit by the number
of mbufs in a chain. Some kinds of hardware is limited by two
factors. One is the fragment length and the second is the fragment
count. Both of these limits need to be taken into account when doing
TSO. Else some kinds of hardware might have to drop completely valid
mbuf chains because they cannot loaded into the given hardware's DMA
engine. The new way of doing TSO limitation has been made backwards
compatible as input from other FreeBSD developers and will use
defaults for values not set.
Reviewed by: adrian, rmacklem
Sponsored by: Mellanox Technologies
MFC after: 1 week
The current TSO limitation feature only takes the total number of
bytes in an mbuf chain into account and does not limit by the number
of mbufs in a chain. Some kinds of hardware is limited by two
factors. One is the fragment length and the second is the fragment
count. Both of these limits need to be taken into account when doing
TSO. Else some kinds of hardware might have to drop completely valid
mbuf chains because they cannot loaded into the given hardware's DMA
engine. The new way of doing TSO limitation has been made backwards
compatible as input from other FreeBSD developers and will use
defaults for values not set.
MFC after: 1 week
Sponsored by: Mellanox Technologies
- Add invfo_rdwr() (for read and write), invfo_ioctl(), invfo_poll(),
and invfo_kqfilter() for use by file types that do not support the
respective operations. Home-grown versions of invfo_poll() were
universally broken (they returned an errno value, invfo_poll()
uses poll_no_poll() to return an appropriate event mask). Home-grown
ioctl routines also tended to return an incorrect errno (invfo_ioctl
returns ENOTTY).
- Use the invfo_*() functions instead of local versions for
unsupported file operations.
- Reorder fileops members to match the order in the structure definition
to make it easier to spot missing members.
- Add several missing methods to linuxfileops used by the OFED shim
layer: fo_write(), fo_truncate(), fo_kqfilter(), and fo_stat(). Most
of these used invfo_*(), but a dummy fo_stat() implementation was
added.
argument but not actually defined anywhere.
This fixes the compile complaining about
"declaration of 'struct kiocb' will not be visible outside of this function".
MFC after: 2 weeks
X-MFC with: whatever changed caused the breakage ;-)
hardware driver update from Mellanox Technologies.
- Remove empty files from the OFED Linux Emulation layer.
- Fix compile warnings related to printf() and the "%lld" and "%llx"
format specifiers.
- Add some missing 2-clause BSD copyrights.
- Add "Mellanox Technologies, Ltd." to list of copyright holders.
- Add some new compatibility files.
- Fix order of uninit in the mlx4ib module to avoid crash at unload
using the new module_exit_order() function.
MFC after: 1 week
Sponsored by: Mellanox Technologies
Make the functions pci_disable_msi, pci_enable_msi and pci_enable_msix
methods of the newbus PCI bus. This code should not include any
functional change.
Sponsored by: Citrix Systems R&D
Reviewed by: imp, jhb
Differential Revision: https://reviews.freebsd.org/D354
dev/pci/pci.c:
- Convert the mentioned functions to newbus methods.
- Fix the callers of the converted functions.
sys/dev/pci/pci_private.h:
dev/pci/pci_if.m:
- Declare the new methods.
dev/pci/pcivar.h:
- Add helpers to call the newbus methods.
ofed/include/linux/pci.h:
- Add define to prevent the ofed version of pci_enable_msix from
clashing with the FreeBSD native version.
trees. This happens because the logic inserting items into the radix
tree is allocating empty radix levels, when index zero does not
contain any items.
- Add proper error case handling, so that the radix tree does not end
up in a bad state, if memory cannot be allocated during insertion of
an item.
- Add check for inserting NULL items into the radix tree.
- Add check for radix tree getting too big.
MFC after: 1 week
Sponsored by: Mellanox Technologies
prior to starting "/sbin/init" which will run all the "/etc/rc.d/xxx"
scripts. Else there can be a race configuring the interfaces via
"/etc/rc.conf".
MFC after: 4 weeks
Sponsored by: Mellanox Technologies
Remove duplicate "debug_ktr.mask" sysctl definition.
Remove now unused variable from "kern_ktr.c".
This fixes build of "ktr" which was broken by r267961.
Let the default value for "vm_kmem_size_scale" be zero. It is setup
after that the sysctl has been initialized from "getenv()" in the
"kmeminit()" function to equal the "VM_KMEM_SIZE_MAX" value, if
zero. On Sparc64 the "VM_KMEM_SIZE_MAX" macro is not a constant. This
fixes build of Sparc64 which was broken by r267961.
Add a special macro to dynamically create SYSCTL root nodes, because
root nodes have a special parent. This fixes build of existing OFED
module and CANBUS module for pc98 which was broken by r267961.
Add missing "sysctl.h" includes to get the needed sysctl header file
declarations. This is needed after r267961.
MFC after: 2 weeks
To reduce the diff struct pcu.cnt field was not renamed, so
PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in
kvm(3) and vmstat(8). The goal was to not affect externally used KPI.
Bump __FreeBSD_version_ in case some out-of-tree module/code relies on the
the global cnt variable.
Exp-run revealed no ports using it directly.
No objection from: arch@
Sponsored by: EMC / Isilon Storage Division
interface, in the r241616 a crutch was provided. It didn't work well, and
finally we decided that it is time to break ABI and simply make if_baudrate
a 64-bit value. Meanwhile, the entire struct if_data was reviewed.
o Remove the if_baudrate_pf crutch.
o Make all fields of struct if_data fixed machine independent size. The
notion of data (packet counters, etc) are by no means MD. And it is a
bug that on amd64 we've got a 64-bit counters, while on i386 32-bit,
which at modern speeds overflow within a second.
This also removes quite a lot of COMPAT_FREEBSD32 code.
o Give 16 bit for the ifi_datalen field. This field was provided to
make future changes to if_data less ABI breaking. Unfortunately the
8 bit size of it had effectively limited sizeof if_data to 256 bytes.
o Give 32 bits to ifi_mtu and ifi_metric.
o Give 64 bits to the rest of fields, since they are counters.
__FreeBSD_version bumped.
Discussed with: emax
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
callback providers. link_init_sdl() function can be used to
fill most of the parameters. Use caller stack instead of
allocation / freing memory for each request. Do not drop support
for extra-long (probably non-existing) link-layer protocols by
introducing link_alloc_sdl() (used by if_resolvemulti() callback)
and link_free_sdl() (used by caller).
Since this change breaks KBI, MFC requires slightly different approach
(link_init_sdl() auto-allocating buffer if necessary to handle cases
with unmodified if_resolvemulti() callers).
MFC after: 2 weeks
modules which require this flag to compile. Use a GCC_MS_EXTENSIONS
variable, defined in kern.pre.mk, which can be used to easily supply the
flag (or not), depending on the compiler type.
MFC after: 3 days
We need to do this because the Linux compat layer uses sx(9) for
mutex, however the lagg code uses rmlocks and calls into the mellanox
driver. This causes deadlock due to sleeping while holding a rmlock.
Submitted by: Shahar Klein (shahark mellanox.com)
MFC After: 3 days.
shifts into the sign bit. Instead use (1U << 31) which gets the
expected result.
This fix is not ideal as it assumes a 32 bit int, but does fix the issue
for most cases.
A similar change was made in OpenBSD.
Discussed with: -arch, rdivacky
Reviewed by: cperciva
to this event, adding if_var.h to files that do need it. Also, include
all includes that now are included due to implicit pollution via if_var.h
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
The order of releasing resources in mlxen was wrong, which caused
panic on reload of the module.
conf_ctx list should be released before stat_ctx list, otherwise
the leafs in conf_ctx list won't be released because of the dependancy.
The fix is to change the order of the releases.
Submitted by: Shahar Klein (shahark at mellanox.com)
The device name was incorrect due to a specific function we ported
from the Linux driver that is not FBSD compatible. This resulted
with a false sysctl registration and some more problematic issues.
The patch basically revokes it all together.
Submitted by: Meny Yossefi (menyy mellanox.com)
Approved by: re
When calling to ib_uverbs_cleanup_ucontext, there is a call to
mutex_lock of xrcd_table_mutex, which was not initialized.
Added missing initialization for xrcd_table_mutex.
Submitted by: Orit Moskovich (oritm mellanox.com)
Approved by: re
Removed the ifdef linux from this function.
Added stub function for contiguous pages to avoid compilation
errors.
Submitted by: Orit Moskovich (oritm mellanox.com)
Approved by: re
srq_pingpong IBverbs
Removed refrences using 'ifdef __linux__' to qpg functions and
related fields in struct
ib_qp_init_attr.
Submitted by: Orit Moskovich (oritm mellanox.com)
Approved by: re
Changed module init from module_init() to module_init_order() with
SI_ORDER_MIDDLE flag
Submitted by: Orit Moskovich (oritm mellanox.com)
Approved by: re
Update the OFED Infiniband core to the version supplied in Linux
version 3.7.
The update to OFED is nearly all additional defines and functions
with the exception of the addition of additional parameters to
ib_register_device() and the reg_user_mr callback.
In addition the ibcore (Infiniband core) and ipoib (IP over Infiniband)
have both been made into completely loadable modules to facilitate
testing of the OFED stack in FreeBSD.
Finally the Mellanox Infiniband drivers are now updated to the
latest version shipping with Linux 3.7.
Submitted by: Mellanox FreeBSD driver team:
Oded Shanoon (odeds mellanox.com),
Meny Yossefi (menyy mellanox.com),
Orit Moskovich (oritm mellanox.com)
Approved by: re
into threads each processing queue in a single domain. The structure
of the pagedaemons and queues is kept intact, most of the changes come
from the need for code to find an owning page queue for given page,
calculated from the segment containing the page.
The tie between NUMA domain and pagedaemon thread/pagequeue split is
rather arbitrary, the multithreaded daemon could be allowed for the
single-domain machines, or one domain might be split into several page
domains, to further increase concurrency.
Right now, each pagedaemon thread tries to reach the global target,
precalculated at the start of the pass. This is not optimal, since it
could cause excessive page deactivation and freeing. The code should
be changed to re-check the global page deficit state in the loop after
some number of iterations.
The pagedaemons reach the quorum before starting the OOM, since one
thread inability to meet the target is normal for split queues. Only
when all pagedaemons fail to produce enough reusable pages, OOM is
started by single selected thread.
Launder is modified to take into account the segments layout with
regard to the region for which cleaning is performed.
Based on the preliminary patch by jeff, sponsored by EMC / Isilon
Storage Division.
Reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
transparent layering and better fragmentation.
- Normalize functions that allocate memory to use kmem_*
- Those that allocate address space are named kva_*
- Those that operate on maps are named kmap_*
- Implement recursive allocation handling for kmem_arena in vmem.
Reviewed by: alc
Tested by: pho
Sponsored by: EMC / Isilon Storage Division
- Remove an incorrect assertion that can trigger when downing an interface.
- Stop the interface during detach to avoid panics when unloading the
driver.
- A few locking fixes to be more consistent with other FreeBSD drivers:
- Protect if_drv_flags with the driver lock, not atomic ops
- Hold the driver lock when adjusting multicast state.
- Hold the driver lock while adjusting if_capenable.
PR: kern/180791 [1,2]
Submitted by: Shakar Klein @ Mellanox [1,2]
MFC after: 3 days
- Only enable UDP/TCP hardware checksums if CSUM_UDP or CSUM_TCP is set.
- Only enable IP hardware checksums if CSUM_IP is set.
PR: kern/180430
Submitted by: Meny Yossefi <menyy@mellanox.com>
MFC after: 1 week
Also directly call swapper() at the end of mi_startup instead of
relying on swapper being the last thing in sysinits order.
Rationale:
- "RUN_SCHEDULER" was misleading, scheduling already takes place at that stage
- "scheduler" was misleading, the function swaps in the swapped out processes
- another SYSINIT(SI_SUB_RUN_SCHEDULER, SI_ORDER_ANY) could never be
invoked depending on its relative order with scheduler; this was not obvious
and the bug actually used to exist
Reviewed by: kib (ealier version)
MFC after: 14 days
generic and apply to all sysfs attributes:
- Use sysctl_handle_string() instead of reimplementing it.
- Remove trailing newline from the current value before passing it to
userland and append a newline to the new string value before passing it
to the attribute's store function.
- Don't leak the temporary buffer if the first error check triggers.
- Revert earlier change to mlx4 port mode handler.
PR: kern/174213
Submitted by: Garrett Cooper
Reviewed by: Shakar Klein @ Mellanox
MFC after: 1 week
to Ethernet and the subsequent port being set to IB.
Submitted by: Shakar Klein @ Mellanox
Tested by: Morgan Robertson <morganrobertson@gmail.com>
MFC after: 1 week
- Fix sysctl wrapper for sysfs attributes to properly handle new string
values similar to sysctl_handle_string() (only copyin the user's
supplied length and nul-terminate the string).
- Don't check for a trailing newline when evaluating the desired operating
mode of a mlx4 device.
PR: kern/179999
Submitted by: Shahar Klein <shahark@mellanox.com>
MFC after: 1 week
linux_file structure and use it instead of directly accessing td_fpop
when destroying the linux_file structure. The td_fpop pointer is not
valid when a cdevpriv destructor is run, and the type-specific close
method has already been called, so f_vnode may not be valid (and the
vnode might have been recycled without our own reference).
Tested by: Julian Stecklina <jsteckli@os.inf.tu-dresden.de>
MFC after: 1 week
do drain (flush_workqueue() in Linux terms) but instead returns true if
the work was removed before it is run, or false otherwise.
Simulate this by removing the taskqueue_drain() and return the value
derived from taskqueue_cancel()'s return value.
This would solve a witness warning caused by calling taskqueue_drain()
with a non-sleepable lock held, like:
taskqueue_drain with the following non-sleepable locks held:
exclusive rw lle (lle) r = 0 (0xfffffe001450b410) locked @
/usr/src/sys/netinet/in.c:1484
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xffffff848d4f7690
kdb_backtrace() at kdb_backtrace+0x39/frame 0xffffff848d4f7740
witness_warn() at witness_warn+0x4a8/frame 0xffffff848d4f7800
taskqueue_drain() at taskqueue_drain+0x3a/frame 0xffffff848d4f7840
set_timeout() at set_timeout+0x4a/frame 0xffffff848d4f7860
netevent_callback() at netevent_callback+0x16/frame 0xffffff848d4f7870
arpintr() at arpintr+0x9b5/frame 0xffffff848d4f7930
This do not affect kernel without OFED compiled in.
Reported by: Garrett Cooper <yaneurabeya gmail com>
(who also tested an earlier version of this patch,
but bugs are mine)
MFC after: 2 weeks
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.
The change is mostly mechanical but few notes are reported:
* The KPI changes as follow:
- VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
- VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
- VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
- VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
(in order to avoid visibility of implementation details)
- The read-mode operations are added:
VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
sys/mutex.h in consumers directly to cater its inlining functions
using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
the compat layer because the name clash between FreeBSD and solaris
versions must be avoided.
At this purpose zfs redefines the vm_object locking functions
directly, isolating the FreeBSD components in specific compat stubs.
The KPI results heavilly broken by this commit. Thirdy part ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).
Sponsored by: EMC / Isilon storage division
Reviewed by: jeff
Reviewed by: pjd (ZFS specific review)
Discussed with: alc
Tested by: pho
- Capability is no longer separate descriptor type. Now every descriptor
has set of its own capability rights.
- The cap_new(2) system call is left, but it is no longer documented and
should not be used in new code.
- The new syscall cap_rights_limit(2) should be used instead of
cap_new(2), which limits capability rights of the given descriptor
without creating a new one.
- The cap_getrights(2) syscall is renamed to cap_rights_get(2).
- If CAP_IOCTL capability right is present we can further reduce allowed
ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed
ioctls can be retrived with cap_ioctls_get(2) syscall.
- If CAP_FCNTL capability right is present we can further reduce fcntls
that can be used with the new cap_fcntls_limit(2) syscall and retrive
them with cap_fcntls_get(2).
- To support ioctl and fcntl white-listing the filedesc structure was
heavly modified.
- The audit subsystem, kdump and procstat tools were updated to
recognize new syscalls.
- Capability rights were revised and eventhough I tried hard to provide
backward API and ABI compatibility there are some incompatible changes
that are described in detail below:
CAP_CREATE old behaviour:
- Allow for openat(2)+O_CREAT.
- Allow for linkat(2).
- Allow for symlinkat(2).
CAP_CREATE new behaviour:
- Allow for openat(2)+O_CREAT.
Added CAP_LINKAT:
- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
- Allow to be target for renameat(2).
Added CAP_SYMLINKAT:
- Allow for symlinkat(2).
Removed CAP_DELETE. Old behaviour:
- Allow for unlinkat(2) when removing non-directory object.
- Allow to be source for renameat(2).
Removed CAP_RMDIR. Old behaviour:
- Allow for unlinkat(2) when removing directory.
Added CAP_RENAMEAT:
- Required for source directory for the renameat(2) syscall.
Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
- Allow for unlinkat(2) on any object.
- Required if target of renameat(2) exists and will be removed by this
call.
Removed CAP_MAPEXEC.
CAP_MMAP old behaviour:
- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
PROT_WRITE.
CAP_MMAP new behaviour:
- Allow for mmap(2)+PROT_NONE.
Added CAP_MMAP_R:
- Allow for mmap(PROT_READ).
Added CAP_MMAP_W:
- Allow for mmap(PROT_WRITE).
Added CAP_MMAP_X:
- Allow for mmap(PROT_EXEC).
Added CAP_MMAP_RW:
- Allow for mmap(PROT_READ | PROT_WRITE).
Added CAP_MMAP_RX:
- Allow for mmap(PROT_READ | PROT_EXEC).
Added CAP_MMAP_WX:
- Allow for mmap(PROT_WRITE | PROT_EXEC).
Added CAP_MMAP_RWX:
- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).
Renamed CAP_MKDIR to CAP_MKDIRAT.
Renamed CAP_MKFIFO to CAP_MKFIFOAT.
Renamed CAP_MKNODE to CAP_MKNODEAT.
CAP_READ old behaviour:
- Allow pread(2).
- Disallow read(2), readv(2) (if there is no CAP_SEEK).
CAP_READ new behaviour:
- Allow read(2), readv(2).
- Disallow pread(2) (CAP_SEEK was also required).
CAP_WRITE old behaviour:
- Allow pwrite(2).
- Disallow write(2), writev(2) (if there is no CAP_SEEK).
CAP_WRITE new behaviour:
- Allow write(2), writev(2).
- Disallow pwrite(2) (CAP_SEEK was also required).
Added convinient defines:
#define CAP_PREAD (CAP_SEEK | CAP_READ)
#define CAP_PWRITE (CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_R (CAP_MMAP | CAP_SEEK | CAP_READ)
#define CAP_MMAP_W (CAP_MMAP | CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_X (CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
#define CAP_MMAP_RW (CAP_MMAP_R | CAP_MMAP_W)
#define CAP_MMAP_RX (CAP_MMAP_R | CAP_MMAP_X)
#define CAP_MMAP_WX (CAP_MMAP_W | CAP_MMAP_X)
#define CAP_MMAP_RWX (CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
#define CAP_RECV CAP_READ
#define CAP_SEND CAP_WRITE
#define CAP_SOCK_CLIENT \
(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
#define CAP_SOCK_SERVER \
(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
CAP_SETSOCKOPT | CAP_SHUTDOWN)
Added defines for backward API compatibility:
#define CAP_MAPEXEC CAP_MMAP_X
#define CAP_DELETE CAP_UNLINKAT
#define CAP_MKDIR CAP_MKDIRAT
#define CAP_RMDIR CAP_UNLINKAT
#define CAP_MKFIFO CAP_MKFIFOAT
#define CAP_MKNOD CAP_MKNODAT
#define CAP_SOCK_ALL (CAP_SOCK_CLIENT | CAP_SOCK_SERVER)
Sponsored by: The FreeBSD Foundation
Reviewed by: Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with: rwatson, benl, jonathan
ABI compatibility discussed with: kib
of the newer drivers. The basic problem was
that the driver was pulling the mbuf off the
drbr ring and then when sending with xmit(), encounting
a full transmit ring. Thus the lower layer
xmit() function would return an error, and the
drivers would then append the data back on to the ring.
For TCP this is a horrible scenario sure to bring
on a fast-retransmit.
The fix is to use drbr_peek() to pull the data pointer
but not remove it from the ring. If it fails then
we either call the new drbr_putback or drbr_advance
method. Advance moves it forward (we do this sometimes
when the xmit() function frees the mbuf). When
we succeed we always call advance. The
putback will always copy the mbuf back to the top
of the ring. Note that the putback *cannot* be used
with a drbr_dequeue() only with drbr_peek(). We most
of the time, in putback, would not need to copy it
back since most likey the mbuf is still the same, but
sometimes xmit() functions will change the mbuf via
a pullup or other call. So the optimial case for
the single consumer is to always copy it back. If
we ever do a multiple_consumer (for lagg?) we
will need a test and atomic in the put back possibly
a seperate putback_mc() in the ring buf.
Reviewed by: jhb@freebsd.org, jlv@freebsd.org
- In sys/ofed/drivers/infiniband/core/cma.c, an enum struct member is
interpreted as an int, so cast it to an int.
- In sys/ofed/drivers/infiniband/core/ud_header.c, initialize the
packet_length variable in ib_ud_header_init(), to prevent undefined
behaviour.
- In sys/ofed/drivers/infiniband/ulp/sdp/sdp_rx.c, call rdma_notify()
with the correct enum type and value.
- In sys/ofed/include/linux/pci.h, change the PCI_DEVICE and PCI_VDEVICE
macros to use C99 struct initializers, so additional members can be
overridden.
Reviewed by: delphij, Garrett Cooper <yanegomi@gmail.com>
MFC after: 1 week
tree used it incorrectly, which lead to inaccurate overrated
if_obytes accounting. The drbr(9) used to update ifnet stats on
drbr_enqueue(), which is not accurate since enqueuing doesn't
imply successful processing by driver. Dequeuing neither mean
that. Most drivers also called drbr_stats_update() which did
accounting again, leading to doubled if_obytes statistics. And
in case of severe transmitting, when a packet could be several
times enqueued and dequeued it could have been accounted several
times.
o Thus, make drbr(9) API thinner. Now drbr(9) merely chooses between
ALTQ queueing or buf_ring(9) queueing.
- It doesn't touch the buf_ring stats any more.
- It doesn't touch ifnet stats anymore.
- drbr_stats_update() no longer exists.
o buf_ring(9) handles its stats itself:
- It handles br_drops itself.
- br_prod_bytes stats are dropped. Rationale: no one ever
reads them but update of a common counter on every packet
negatively affects performance due to excessive cache
invalidation.
- buf_ring_enqueue_bytes() reduced to buf_ring_enqueue(), since
we no longer account bytes.
o Drivers handle their stats theirselves: if_obytes, if_omcasts.
o mlx4(4), igb(4), em(4), vxge(4), oce(4) and ixv(4) no longer
use drbr_stats_update(), and update ifnet stats theirselves.
o bxe(4) was the most correct driver, it didn't call
drbr_stats_update(), thus it was the only driver accurate under
moderate load. Now it also maintains stats itself.
o ixgbe(4) had already taken stats from hardware, so just
- drop software stats updating.
- take multicast packet count from hardware as well.
o mxge(4) just no longer needs NO_SLOW_STATS define.
o cxgb(4), cxgbe(4) need no change, since they obtain stats
from hardware.
Reviewed by: jfv, gnn
#defines. This also has the advantage that it makes the names more
compact, iand also allows us to correct the non-uniform naming of
the PCIM_LINK_* defines, making them all consistent amongst themselves.
This is a mostly mechanical rename:
s/PCIR_EXPRESS_/PCIER_/g
s/PCIM_EXP_/PCIEM_/g
s/PCIM_LINK_/PCIEM_LINK_/g
When this is MFC'd, #defines will be added for the old names to assist
out-of-tree drivers.
Discussed with: jhb
MFC after: 1 week
1) It is not useful to call "devfs_clear_cdevpriv()" from
"d_close" callbacks, hence for example read, write, ioctl and
so on might be sleeping at the time of "d_close" being called
and then then freed private data can still be accessed.
Examples: dtrace, linux_compat, ksyms (all fixed by this patch)
2) In sys/dev/drm* there are some cases in which memory will
be freed twice, if open fails, first by code in the open
routine, secondly by the cdevpriv destructor. Move registration
of the cdevpriv to the end of the drm open routines.
3) devfs_clear_cdevpriv() is not called if the "d_open" callback
registered cdevpriv data and the "d_open" callback function
returned an error. Fix this.
Discussed with: phk
MFC after: 2 weeks
to pull vm_param.h was removed. Other big dependency of vm_page.h on
vm_param.h are PA_LOCK* definitions, which are only needed for
in-kernel code, because modules use KBI-safe functions to lock the
pages.
Stop including vm_param.h into vm_page.h. Include vm_param.h
explicitely for the kernel code which needs it.
Suggested and reviewed by: alc
MFC after: 2 weeks
- Stateful TCP offload drivers for Terminator 3 and 4 (T3 and T4) ASICs.
These are available as t3_tom and t4_tom modules that augment cxgb(4)
and cxgbe(4) respectively. The cxgb/cxgbe drivers continue to work as
usual with or without these extra features.
- iWARP driver for Terminator 3 ASIC (kernel verbs). T4 iWARP in the
works and will follow soon.
Build-tested with make universe.
30s overview
============
What interfaces support TCP offload? Look for TOE4 and/or TOE6 in the
capabilities of an interface:
# ifconfig -m | grep TOE
Enable/disable TCP offload on an interface (just like any other ifnet
capability):
# ifconfig cxgbe0 toe
# ifconfig cxgbe0 -toe
Which connections are offloaded? Look for toe4 and/or toe6 in the
output of netstat and sockstat:
# netstat -np tcp | grep toe
# sockstat -46c | grep toe
Reviewed by: bz, gnn
Sponsored by: Chelsio communications.
MFC after: ~3 months (after 9.1, and after ensuring MFC is feasible)
Cleaner solution (e.g. adding another header) should be done here.
Original log:
Move several enums and structures required for L2 filtering from ip_fw_private.h to ip_fw.h.
Remove ipfw/ip_fw_private.h header from non-ipfw code.
Requested by: luigi
Approved by: kib(mentor)
- Mark 'sdp' as requiring 'inet'.
- Always include "opt_inet.h" and "opt_inet6.h" and modify the IB
driver Makefiles to honor WITH/WITHOUT_INET/INET6/_SUPPORT options
to determine what should be enabled during a module build.
- Fix the mlxen(4) driver and the core IB code to compile without
if INET is disabled (including when both INET and INET6 are disabled).
Reviewed by: bz
MFC after: 2 weeks
in IFQ_HANDOFF() when the packet is passed to the start routine, so doing
it here resulted in double counting.
Reported by: Alex Tutubalin lexa lexa ru
MFC after: 1 week
The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.
to implement fchown(2) and fchmod(2) support for several file types
that previously lacked it. Add MAC entries for chown/chmod done on
posix shared memory and (old) in-kernel posix semaphores.
Based on the submission by: glebius
Reviewed by: rwatson
Approved by: re (bz)
cpuset_t objects.
That is going to offer the underlying support for a simple bump of
MAXCPU and then support for number of cpus > 32 (as it is today).
Right now, cpumask_t is an int, 32 bits on all our supported architecture.
cpumask_t on the other side is implemented as an array of longs, and
easilly extendible by definition.
The architectures touched by this commit are the following:
- amd64
- i386
- pc98
- arm
- ia64
- XEN
while the others are still missing.
Userland is believed to be fully converted with the changes contained
here.
Some technical notes:
- This commit may be considered an ABI nop for all the architectures
different from amd64 and ia64 (and sparc64 in the future)
- per-cpu members, which are now converted to cpuset_t, needs to be
accessed avoiding migration, because the size of cpuset_t should be
considered unknown
- size of cpuset_t objects is different from kernel and userland (this is
primirally done in order to leave some more space in userland to cope
with KBI extensions). If you need to access kernel cpuset_t from the
userland please refer to example in this patch on how to do that
correctly (kgdb may be a good source, for example).
- Support for other architectures is going to be added soon
- Only MAXCPU for amd64 is bumped now
The patch has been tested by sbruno and Nicholas Esborn on opteron
4 x 12 pack CPUs. More testing on big SMP is expected to came soon.
pluknet tested the patch with his 8-ways on both amd64 and i386.
Tested by: pluknet, sbruno, gianni, Nicholas Esborn
Reviewed by: jeff, jhb, sbruno
set as TCP.
- Eliminate the fully linear non-scatter/gather rx path, there is no
harm in using arrays of clusters for both TCP and UDP.
- Implement support for enabling/disabling per-vlan priority pause and
queues via sysctl.