Commit Graph

88 Commits

Author SHA1 Message Date
alfred
9dd81144ca Fix for bad performance when mtu is increased.
Update the auto moderation behavior in the mlxen driver to match
the new LINUX OFED code.

Submitted by:	odeds
2013-11-08 18:26:28 +00:00
alfred
ac3a1c41ee Use explicit long cast to avoid overflow in bitopts.
This was causing problems with the buddy allocator inside of
ofed.

Submitted by: odeds
2013-11-08 18:20:19 +00:00
alfred
0b3e10d274 Fix API mismatch exposed by lagg.
When destroying a lagg the driver tries to restore the old mac and
fails due to API mismatch
2013-11-02 10:49:47 +00:00
glebius
ff6e113f1b The r48589 promised to remove implicit inclusion of if_var.h soon. Prepare
to this event, adding if_var.h to files that do need it. Also, include
all includes that now are included due to implicit pollution via if_var.h

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-26 17:58:36 +00:00
alfred
000c45f54a Fix resource free.
The order of releasing resources in mlxen was wrong, which caused
panic on reload of the module.

conf_ctx list should be released before stat_ctx list, otherwise
the leafs in conf_ctx list won't be released because of the dependancy.

The fix is to change the order of the releases.

Submitted by:	Shahar Klein (shahark at mellanox.com)
2013-10-17 12:19:36 +00:00
alfred
31d9407b2d Fix __free_pages() in the linux shim.
__free_pages() is actaully supposed to take a "struct page *" not
an address.
2013-10-15 15:50:43 +00:00
alfred
07fbeff5fe Fix for When more than one NIC is present.
The device name was incorrect due to a specific function we ported
from the Linux driver that is not FBSD compatible.  This resulted
with a false sysctl registration and some more problematic issues.

The patch basically revokes it all together.

Submitted by: Meny Yossefi (menyy mellanox.com)

Approved by:	re
2013-10-10 14:03:03 +00:00
dim
255d648aec Remove redundant declaration of cmclass in
sys/ofed/drivers/infiniband/core/ucm.c, to silence a gcc warning.

Approved by:	re (kib)
X-MFC-With:	r255932
2013-10-09 07:02:03 +00:00
dim
1d953d974c Give an unnamed union in sys/ofed/include/rdma/ib_verbs.h a name, to
silence a gcc warning.

Approved by:	re (gjb)
MFC after:      3 days
2013-10-07 16:54:29 +00:00
alfred
9e3370c119 Fixed kernel crash when running devinfo
When calling to ib_uverbs_cleanup_ucontext, there is a call to
mutex_lock of xrcd_table_mutex, which was not initialized.
Added missing initialization for xrcd_table_mutex.

Submitted by: Orit Moskovich (oritm mellanox.com)

Approved by:	re
2013-10-01 15:43:23 +00:00
alfred
7a625ace1a Enable ib_dev.mmap function
Removed the ifdef linux from this function.
Added stub function for contiguous pages to avoid compilation
errors.

Submitted by:	Orit Moskovich (oritm mellanox.com)
Approved by:	re
2013-10-01 15:42:38 +00:00
alfred
732ff0c1ed Fixed 'Couldn't Create QP' issue when running rc_pingpong, uc_pingpong,
srq_pingpong IBverbs

Removed refrences using 'ifdef __linux__' to qpg functions and
related fields in struct
ib_qp_init_attr.

Submitted by: Orit Moskovich (oritm mellanox.com)

Approved by:	re
2013-10-01 15:38:29 +00:00
alfred
282d961db9 Fixed kernel crash when removing IPOIB_CM option from configuration file
Changed module init from module_init() to module_init_order() with
SI_ORDER_MIDDLE flag
Submitted by:	Orit Moskovich (oritm mellanox.com)
Approved by:	re
2013-10-01 15:36:51 +00:00
alfred
689b653e06 Fix mis-merge of upstream fix.
We would accidentally make the string one byte too short.

Submitted by: Orit Moskovich (oritm mellanox.com)

Approved by:	re
2013-10-01 15:33:00 +00:00
alfred
91eb2b78a7 Update OFED to Linux 3.7 and update Mellanox drivers.
Update the OFED Infiniband core to the version supplied in Linux
version 3.7.

The update to OFED is nearly all additional defines and functions
with the exception of the addition of additional parameters to
ib_register_device() and the reg_user_mr callback.

In addition the ibcore (Infiniband core) and ipoib (IP over Infiniband)
have both been made into completely loadable modules to facilitate
testing of the OFED stack in FreeBSD.

Finally the Mellanox Infiniband drivers are now updated to the
latest version shipping with Linux 3.7.

Submitted by: Mellanox FreeBSD driver team:
                Oded Shanoon (odeds mellanox.com),
                Meny Yossefi (menyy mellanox.com),
                Orit Moskovich (oritm mellanox.com)

Approved by: re
2013-09-29 00:35:03 +00:00
pjd
1c7defb76e Handle cases where capability rights are not provided.
Reported by:	kib
2013-09-05 11:58:12 +00:00
andre
c4e489b014 Change m->pkthdr.header to m->pkthdr.PH_loc.ptr after r254804
to transiently store pointers to packet headers.

Sponsored by:	The FreeBSD Foundation
2013-08-25 09:45:26 +00:00
np
7701451891 Fix implementation of sock_getname.
MFC after:	1 week
2013-08-23 18:54:27 +00:00
jhb
dc097a7ee8 Stop an ipoib interface before detaching it.
PR:		kern/181225
Submitted by:	Shahar Klein
Obtained from:	Mellanox
MFC after:	1 week
2013-08-20 18:08:06 +00:00
andre
7cc6cc696c Add m_clrprotoflags() to clear protocol specific mbuf flags at up and
downwards layer crossings.

Consistently use it within IP, IPv6 and ethernet protocols.

Discussed with:	trociny, glebius
2013-08-19 13:27:32 +00:00
glebius
722a1a5e5d Make sendfile() a method in the struct fileops. Currently only
vnode backed file descriptors have this method implemented.

Reviewed by:	kib
Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix
2013-08-15 07:54:31 +00:00
jeff
5f141f7f1f - Reserve a special AF for SDP. The one we were incorrectly using before
was taken by another AF.

Sponsored by:	EMC / Isilon Storage Division
2013-08-09 03:26:17 +00:00
jeff
31941e8d5d - Correctly handle various edge cases in sysfs emulation.
Sponsored by:	EMC / Isilon Storage Division
2013-08-09 03:24:48 +00:00
jeff
fd428c40be - Use the correct type in the linux bitops emulation.
Submitted by:	Maxim Ignatenko <gelraen.ua@gmail.com>
2013-08-09 03:24:12 +00:00
kib
8de1718b60 Split the pagequeues per NUMA domains, and split pageademon process
into threads each processing queue in a single domain.  The structure
of the pagedaemons and queues is kept intact, most of the changes come
from the need for code to find an owning page queue for given page,
calculated from the segment containing the page.

The tie between NUMA domain and pagedaemon thread/pagequeue split is
rather arbitrary, the multithreaded daemon could be allowed for the
single-domain machines, or one domain might be split into several page
domains, to further increase concurrency.

Right now, each pagedaemon thread tries to reach the global target,
precalculated at the start of the pass.  This is not optimal, since it
could cause excessive page deactivation and freeing.  The code should
be changed to re-check the global page deficit state in the loop after
some number of iterations.

The pagedaemons reach the quorum before starting the OOM, since one
thread inability to meet the target is normal for split queues.  Only
when all pagedaemons fail to produce enough reusable pages, OOM is
started by single selected thread.

Launder is modified to take into account the segments layout with
regard to the region for which cleaning is performed.

Based on the preliminary patch by jeff, sponsored by EMC / Isilon
Storage Division.

Reviewed by:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2013-08-07 16:36:38 +00:00
jeff
de4ecca213 Replace kernel virtual address space allocation with vmem. This provides
transparent layering and better fragmentation.

 - Normalize functions that allocate memory to use kmem_*
 - Those that allocate address space are named kva_*
 - Those that operate on maps are named kmap_*
 - Implement recursive allocation handling for kmem_arena in vmem.

Reviewed by:	alc
Tested by:	pho
Sponsored by:	EMC / Isilon Storage Division
2013-08-07 06:21:20 +00:00
jhb
8da4734455 Add a missing prototype.
Pointy hat:	me
2013-07-29 20:48:10 +00:00
jhb
4c0b5de701 Various fixes to the mlxen(4) driver:
- Remove an incorrect assertion that can trigger when downing an interface.
- Stop the interface during detach to avoid panics when unloading the
  driver.
- A few locking fixes to be more consistent with other FreeBSD drivers:
  - Protect if_drv_flags with the driver lock, not atomic ops
  - Hold the driver lock when adjusting multicast state.
  - Hold the driver lock while adjusting if_capenable.

PR:		kern/180791 [1,2]
Submitted by:	Shakar Klein @ Mellanox [1,2]
MFC after:	3 days
2013-07-29 18:44:52 +00:00
jhb
f6535606ea Avoid trashing IP fragments:
- Only enable UDP/TCP hardware checksums if CSUM_UDP or CSUM_TCP is set.
- Only enable IP hardware checksums if CSUM_IP is set.

PR:		kern/180430
Submitted by:	Meny Yossefi <menyy@mellanox.com>
MFC after:	1 week
2013-07-25 16:34:34 +00:00
avg
9e6374b6a9 rename scheduler->swapper and SI_SUB_RUN_SCHEDULER->SI_SUB_LAST
Also directly call swapper() at the end of mi_startup instead of
relying on swapper being the last thing in sysinits order.

Rationale:

- "RUN_SCHEDULER" was misleading, scheduling already takes place at that stage
- "scheduler" was misleading, the function swaps in the swapped out processes
- another SYSINIT(SI_SUB_RUN_SCHEDULER, SI_ORDER_ANY) could never be
  invoked depending on its relative order with scheduler; this was not obvious
  and the bug actually used to exist

Reviewed by:	kib (ealier version)
MFC after:	14 days
2013-07-24 09:45:31 +00:00
jhb
45bc92633d Rework the previous fix for the IB vs Ethernet sysctl handler to be more
generic and apply to all sysfs attributes:
- Use sysctl_handle_string() instead of reimplementing it.
- Remove trailing newline from the current value before passing it to
  userland and append a newline to the new string value before passing it
  to the attribute's store function.
- Don't leak the temporary buffer if the first error check triggers.
- Revert earlier change to mlx4 port mode handler.

PR:		kern/174213
Submitted by:	Garrett Cooper
Reviewed by:	Shakar Klein @ Mellanox
MFC after:	1 week
2013-07-18 14:06:01 +00:00
jhb
04ef6ec7f4 Remove check forbidding requests that would result in one port being set
to Ethernet and the subsequent port being set to IB.

Submitted by:	Shakar Klein @ Mellanox
Tested by:	Morgan Robertson <morganrobertson@gmail.com>
MFC after:	1 week
2013-07-17 13:41:54 +00:00
jhb
3297db39d5 Allow mlx4 devices to switch from Ethernet to Infiniband (and vice versa):
- Fix sysctl wrapper for sysfs attributes to properly handle new string
  values similar to sysctl_handle_string() (only copyin the user's
  supplied length and nul-terminate the string).
- Don't check for a trailing newline when evaluating the desired operating
  mode of a mlx4 device.

PR:		kern/179999
Submitted by:	Shahar Klein <shahark@mellanox.com>
MFC after:	1 week
2013-07-08 21:25:12 +00:00
jhb
da4379cf0b Store a reference to the vnode associated with a file descriptor in the
linux_file structure and use it instead of directly accessing td_fpop
when destroying the linux_file structure.  The td_fpop pointer is not
valid when a cdevpriv destructor is run, and the type-specific close
method has already been called, so f_vnode may not be valid (and the
vnode might have been recycled without our own reference).

Tested by:	Julian Stecklina <jsteckli@os.inf.tu-dresden.de>
MFC after:	1 week
2013-06-11 15:37:07 +00:00
eadler
4f9ab6c580 Fxi a bunch of typos.
PR:	misc/174625
Submitted by:	Jeremy Chadwick <jdc@koitsu.org>
2013-05-10 16:41:26 +00:00
delphij
55fe0cb833 According to the documentation, on Linux, cancel_delayed_work() does not
do drain (flush_workqueue() in Linux terms) but instead returns true if
the work was removed before it is run, or false otherwise.

Simulate this by removing the taskqueue_drain() and return the value
derived from taskqueue_cancel()'s return value.

This would solve a witness warning caused by calling taskqueue_drain()
with a non-sleepable lock held, like:

taskqueue_drain with the following non-sleepable locks held:
exclusive rw lle (lle) r = 0 (0xfffffe001450b410) locked @
/usr/src/sys/netinet/in.c:1484
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xffffff848d4f7690
kdb_backtrace() at kdb_backtrace+0x39/frame 0xffffff848d4f7740
witness_warn() at witness_warn+0x4a8/frame 0xffffff848d4f7800
taskqueue_drain() at taskqueue_drain+0x3a/frame 0xffffff848d4f7840
set_timeout() at set_timeout+0x4a/frame 0xffffff848d4f7860
netevent_callback() at netevent_callback+0x16/frame 0xffffff848d4f7870
arpintr() at arpintr+0x9b5/frame 0xffffff848d4f7930

This do not affect kernel without OFED compiled in.

Reported by:	Garrett Cooper <yaneurabeya gmail com>
		(who also tested an earlier version of this patch,
		but bugs are mine)
MFC after:	2 weeks
2013-05-08 17:45:22 +00:00
glebius
c6c2824a9a Add const qualifier to the dst parameter of the ifnet if_output method. 2013-04-27 08:11:48 +00:00
jhb
cb3494ff31 Check for SS_NBIO in the socket state field rather than socket buffer
flags.

Submitted by:	Vijay Singh
MFC after:	1 week
2013-04-03 20:31:10 +00:00
attilio
bf1dc90446 MFC 2013-03-08 00:03:07 +00:00
mav
b9da6c918f Add protective parentheses for macro argument, missed in r247671. 2013-03-02 22:41:06 +00:00
mav
dc07b9e1fa MFcalloutng:
Give OFED Linux wrapper own "expires" field instead of abusing callout's
c_time, which will change its type and units with calloutng commit.
2013-03-02 22:19:17 +00:00
attilio
e98f58faf6 MFC 2013-03-02 14:48:41 +00:00
pjd
f07ebb8888 Merge Capsicum overhaul:
- Capability is no longer separate descriptor type. Now every descriptor
  has set of its own capability rights.

- The cap_new(2) system call is left, but it is no longer documented and
  should not be used in new code.

- The new syscall cap_rights_limit(2) should be used instead of
  cap_new(2), which limits capability rights of the given descriptor
  without creating a new one.

- The cap_getrights(2) syscall is renamed to cap_rights_get(2).

- If CAP_IOCTL capability right is present we can further reduce allowed
  ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed
  ioctls can be retrived with cap_ioctls_get(2) syscall.

- If CAP_FCNTL capability right is present we can further reduce fcntls
  that can be used with the new cap_fcntls_limit(2) syscall and retrive
  them with cap_fcntls_get(2).

- To support ioctl and fcntl white-listing the filedesc structure was
  heavly modified.

- The audit subsystem, kdump and procstat tools were updated to
  recognize new syscalls.

- Capability rights were revised and eventhough I tried hard to provide
  backward API and ABI compatibility there are some incompatible changes
  that are described in detail below:

	CAP_CREATE old behaviour:
	- Allow for openat(2)+O_CREAT.
	- Allow for linkat(2).
	- Allow for symlinkat(2).
	CAP_CREATE new behaviour:
	- Allow for openat(2)+O_CREAT.

	Added CAP_LINKAT:
	- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
	- Allow to be target for renameat(2).

	Added CAP_SYMLINKAT:
	- Allow for symlinkat(2).

	Removed CAP_DELETE. Old behaviour:
	- Allow for unlinkat(2) when removing non-directory object.
	- Allow to be source for renameat(2).

	Removed CAP_RMDIR. Old behaviour:
	- Allow for unlinkat(2) when removing directory.

	Added CAP_RENAMEAT:
	- Required for source directory for the renameat(2) syscall.

	Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
	- Allow for unlinkat(2) on any object.
	- Required if target of renameat(2) exists and will be removed by this
	  call.

	Removed CAP_MAPEXEC.

	CAP_MMAP old behaviour:
	- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
	  PROT_WRITE.
	CAP_MMAP new behaviour:
	- Allow for mmap(2)+PROT_NONE.

	Added CAP_MMAP_R:
	- Allow for mmap(PROT_READ).
	Added CAP_MMAP_W:
	- Allow for mmap(PROT_WRITE).
	Added CAP_MMAP_X:
	- Allow for mmap(PROT_EXEC).
	Added CAP_MMAP_RW:
	- Allow for mmap(PROT_READ | PROT_WRITE).
	Added CAP_MMAP_RX:
	- Allow for mmap(PROT_READ | PROT_EXEC).
	Added CAP_MMAP_WX:
	- Allow for mmap(PROT_WRITE | PROT_EXEC).
	Added CAP_MMAP_RWX:
	- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).

	Renamed CAP_MKDIR to CAP_MKDIRAT.
	Renamed CAP_MKFIFO to CAP_MKFIFOAT.
	Renamed CAP_MKNODE to CAP_MKNODEAT.

	CAP_READ old behaviour:
	- Allow pread(2).
	- Disallow read(2), readv(2) (if there is no CAP_SEEK).
	CAP_READ new behaviour:
	- Allow read(2), readv(2).
	- Disallow pread(2) (CAP_SEEK was also required).

	CAP_WRITE old behaviour:
	- Allow pwrite(2).
	- Disallow write(2), writev(2) (if there is no CAP_SEEK).
	CAP_WRITE new behaviour:
	- Allow write(2), writev(2).
	- Disallow pwrite(2) (CAP_SEEK was also required).

	Added convinient defines:

	#define	CAP_PREAD		(CAP_SEEK | CAP_READ)
	#define	CAP_PWRITE		(CAP_SEEK | CAP_WRITE)
	#define	CAP_MMAP_R		(CAP_MMAP | CAP_SEEK | CAP_READ)
	#define	CAP_MMAP_W		(CAP_MMAP | CAP_SEEK | CAP_WRITE)
	#define	CAP_MMAP_X		(CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
	#define	CAP_MMAP_RW		(CAP_MMAP_R | CAP_MMAP_W)
	#define	CAP_MMAP_RX		(CAP_MMAP_R | CAP_MMAP_X)
	#define	CAP_MMAP_WX		(CAP_MMAP_W | CAP_MMAP_X)
	#define	CAP_MMAP_RWX		(CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
	#define	CAP_RECV		CAP_READ
	#define	CAP_SEND		CAP_WRITE

	#define	CAP_SOCK_CLIENT \
		(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
		 CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
	#define	CAP_SOCK_SERVER \
		(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
		 CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
		 CAP_SETSOCKOPT | CAP_SHUTDOWN)

	Added defines for backward API compatibility:

	#define	CAP_MAPEXEC		CAP_MMAP_X
	#define	CAP_DELETE		CAP_UNLINKAT
	#define	CAP_MKDIR		CAP_MKDIRAT
	#define	CAP_RMDIR		CAP_UNLINKAT
	#define	CAP_MKFIFO		CAP_MKFIFOAT
	#define	CAP_MKNOD		CAP_MKNODAT
	#define	CAP_SOCK_ALL		(CAP_SOCK_CLIENT | CAP_SOCK_SERVER)

Sponsored by:	The FreeBSD Foundation
Reviewed by:	Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with:	rwatson, benl, jonathan
ABI compatibility discussed with:	kib
2013-03-02 00:53:12 +00:00
attilio
15bf891afe Rename VM_OBJECT_LOCK(), VM_OBJECT_UNLOCK() and VM_OBJECT_TRYLOCK() to
their "write" versions.

Sponsored by:	EMC / Isilon storage division
2013-02-20 12:03:20 +00:00
attilio
658534ed5a Switch vm_object lock to be a rwlock.
* VM_OBJECT_LOCK and VM_OBJECT_UNLOCK are mapped to write operations
* VM_OBJECT_SLEEP() is introduced as a general purpose primitve to
  get a sleep operation using a VM_OBJECT_LOCK() as protection
* The approach must bear with vm_pager.h namespace pollution so many
  files require including directly rwlock.h
2013-02-20 10:38:34 +00:00
delphij
1e94694547 Fix LINT build on amd64. 2013-02-09 04:13:45 +00:00
rrs
75ad250e97 This fixes a out-of-order problem with several
of the newer drivers. The basic problem was
that the driver was pulling the mbuf off the
drbr ring and then when sending with xmit(), encounting
a full transmit ring. Thus the lower layer
xmit() function would return an error, and the
drivers would then append the data back on to the ring.
For TCP this is a horrible scenario sure to bring
on a fast-retransmit.

The fix is to use drbr_peek() to pull the data pointer
but not remove it from the ring. If it fails then
we either call the new drbr_putback or drbr_advance
method. Advance moves it forward (we do this sometimes
when the xmit() function frees the mbuf). When
we succeed we always call advance. The
putback will always copy the mbuf back to the top
of the ring. Note that the putback *cannot* be used
with a drbr_dequeue() only with drbr_peek(). We most
of the time, in putback, would not need to copy it
back since most likey the mbuf is still the same, but
sometimes xmit() functions will change the mbuf via
a pullup or other call. So the optimial case for
the single consumer is to always copy it back. If
we ever do a multiple_consumer (for lagg?) we
will  need a test and atomic in the put back possibly
a seperate putback_mc() in the ring buf.

Reviewed by:	jhb@freebsd.org, jlv@freebsd.org
2013-02-07 15:20:54 +00:00
glebius
8e20fa5ae9 Mechanically substitute flags from historic mbuf allocator with
malloc(9) flags within sys.

Exceptions:

- sys/contrib not touched
- sys/mbuf.h edited manually
2012-12-05 08:04:20 +00:00
dim
45f8cc9a6c Redo r242842, now actually fixing the warnings, as follows:
- In sys/ofed/drivers/infiniband/core/cma.c, an enum struct member is
  interpreted as an int, so cast it to an int.
- In sys/ofed/drivers/infiniband/core/ud_header.c, initialize the
  packet_length variable in ib_ud_header_init(), to prevent undefined
  behaviour.
- In sys/ofed/drivers/infiniband/ulp/sdp/sdp_rx.c, call rdma_notify()
  with the correct enum type and value.
- In sys/ofed/include/linux/pci.h, change the PCI_DEVICE and PCI_VDEVICE
  macros to use C99 struct initializers, so additional members can be
  overridden.

Reviewed by:	delphij, Garrett Cooper <yanegomi@gmail.com>
MFC after:	1 week
2012-11-12 22:01:29 +00:00
delphij
6ac4bcf601 Use %s when calling make_dev with a string pointer. This makes
clang happy.

MFC after:	2 weeks
2012-11-09 21:41:07 +00:00