5070 Commits

Author SHA1 Message Date
Adrian Chadd
8ad1a83b48 Calculate the RSS hash for outbound UDPv4 frames.
Differential Revision:	https://reviews.freebsd.org/D527
Reviewed by:	grehan
2014-09-09 04:19:36 +00:00
Adrian Chadd
b8bc95cd49 Update the IPv4 input path to handle reassembled frames and incoming frames
with no RSS hash.

When doing RSS:

* Create a new IPv4 netisr which expects the frames to have been verified;
  it just directly dispatches to the IPv4 input path.
* Once IPv4 reassembly is done, re-calculate the RSS hash with the new
  IP and L3 header; then reinject it as appropriate.
* Update the IPv4 netisr to be a CPU affinity netisr with the RSS hash
  function (rss_soft_m2cpuid) - this will do a software hash if the
  hardware doesn't provide one.

NICs that don't implement hardware RSS hashing will now benefit from RSS
distribution - it'll inject into the correct destination netisr.

Note: the netisr distribution doesn't work out of the box - netisr doesn't
query RSS for how many CPUs and the affinity setup.  Yes, netisr likely
shouldn't really be doing CPU stuff anymore and should be "some kind of
'thing' that is a workqueue that may or may not have any CPU affinity";
that's for a later commit.

Differential Revision:	https://reviews.freebsd.org/D527
Reviewed by:	grehan
2014-09-09 04:18:20 +00:00
Adrian Chadd
72d33245f5 Implement IPv4 RSS software hash functions to use during packet ingress
and egress.

* rss_mbuf_software_hash_v4 - look at the IPv4 mbuf to fetch the IPv4 details
  + direction to calculate a hash.
* rss_proto_software_hash_v4 - hash the given source/destination IPv4 address,
  port and direction.
* rss_soft_m2cpuid - map the given mbuf to an RSS CPU ("bucket" for now)

These functions are intended to be used by the stack to support
the following:

* Not all NICs do RSS hashing, so we should support some way of doing
  a hash in software;
* The NIC / driver may not hash frames the way we want (eg UDP 4-tuple
  hashing when the stack is only doing 2-tuple hashing for UDP); so we
  may need to re-hash frames;
* .. same with IPv4 fragments - they will need to be re-hashed after
  reassembly;
* .. and same with things like IP tunneling and such;
* The transmit path for things like UDP, RAW and ICMP don't currently
  have any RSS information attached to them - so they'll need an
  RSS calculation performed before transmit.

TODO:

* Counters! Everywhere!
* Add a debug mode that software hashes received frames and compares them
  to the hardware hash provided by the hardware to ensure they match.

The IPv6 part of this is missing - I'm going to do some re-juggling of
where various parts of the RSS framework live before I add the IPv6
code (read: the IPv6 code is going to go into netinet6/in6_rss.[ch],
rather than living here.)

Note: This API is still fluid.  Please keep that in mind.

Differential Revision:	https://reviews.freebsd.org/D527
Reviewed by:	grehan
2014-09-09 03:10:21 +00:00
Adrian Chadd
9d3ddf4384 Add support for receiving and setting flowtype, flowid and RSS bucket
information as part of recvmsg().

This is primarily used for debugging/verification of the various
processing paths in the IP, PCB and driver layers.

Unfortunately the current implementation of the control message path
results in a ~10% or so drop in UDP frame throughput when it's used.

Differential Revision:	https://reviews.freebsd.org/D527
Reviewed by:	grehan
2014-09-09 01:45:39 +00:00
Adrian Chadd
061a4b4c36 Add a flag to ip_output() - IP_NODEFAULTFLOWID - which prevents it from
overriding an existing flowid/flowtype field in the outbound mbuf with
the inp_flowid/inp_flowtype details.

The upcoming RSS UDP support calculates a valid RSS value for outbound
mbufs and since it may change per send, it doesn't cache it in the inpcb.
So overriding it here would be wrong.

Differential Revision:	https://reviews.freebsd.org/D527
Reviewed by:	grehan
2014-09-09 00:19:02 +00:00
Alexander V. Chernikov
d6164b77f8 Make ipfw_nat module use IP_FW3 codes.
Kernel changes:
* Split kernel/userland nat structures eliminating IPFW_INTERNAL hack.
* Add IP_FW_NAT44_* codes resemblin old ones.
* Assume that instances can be named (no kernel support currently).
* Use both UH+WLOCK locks for all configuration changes.
* Provide full ABI support for old sockopts.

Userland changes:
* Use IP_FW_NAT44_* codes for nat operations.
* Remove undocumented ability to show ranges of nat "log" entries.
2014-09-07 18:30:29 +00:00
Michael Tuexen
ad234e3c3d Address warnings generated by the clang analyzer.
MFC after: 1 week
2014-09-07 18:05:37 +00:00
Michael Tuexen
23602b60fb Address another warnings reported by Patrick Laimbock when compiling
in userspace. While there, improve consistency.

MFC after: 1 week
2014-09-07 17:07:19 +00:00
Michael Tuexen
24aaac8d59 Use union sctp_sockstore instead of struct sockaddr_storage. This
eliminiates some warnings when building in userland.
Thanks to Patrick Laimbock for reporting this issue.
Remove also some unnecessary casts.
There should be no functional change.

MFC after: 1 week
2014-09-07 09:06:26 +00:00
Michael Tuexen
95e550801c Use SYSCTL_PROC instead of SYSCTL_VNET_PROC.
Suggested by: glebius@
MFC after: 1 week
2014-09-07 07:49:49 +00:00
Michael Tuexen
24110da033 Fix a leak of an address, if the address is scheduled for removal
and the stack is torn down.
Thanks to Peter Bostroem and Jiayang Liu from Google for reporting the
issue.

MFC after: 1 week
2014-09-06 20:03:24 +00:00
Michael Tuexen
f47f328dc5 Fix the handling of sysctl variables when used with VIMAGE.
While there do some cleanup of the code.

MFC after: 1 week
2014-09-06 19:12:14 +00:00
Alexander V. Chernikov
c9daea0b86 Sync to HEAD@r271160. 2014-09-05 13:52:39 +00:00
Gleb Smirnoff
770aa6cb25 Satisfy assertion in m_demote().
Sponsored by:	Nginx, Inc.
2014-09-04 19:28:02 +00:00
John Baldwin
a7c7f2a7e2 In tcp_input(), don't acquire the pcbinfo global write lock for SYN
packets targeting a listening socket.  Permit to reduce TCP input
processing starvation in context of high SYN load (e.g. short-lived TCP
connections or SYN flood).

Submitted by:	Julien Charbon <jcharbon@verisign.com>
Reviewed by:	adrian, hiren, jhb, Mike Bentkofsky
2014-09-04 19:09:08 +00:00
Gleb Smirnoff
07e845a3f4 Fixes for tcp_respond() comment. 2014-09-04 17:05:57 +00:00
Gleb Smirnoff
ba32fcfff9 Improve r265338. When inserting mbufs into TCP reassembly queue,
try to collapse adjacent pieces using m_catpkt(). In best case
scenario it copies data and frees mbufs, making mbuf exhaustion
attack harder.

Suggested by:		Jonathan Looney <jonlooney gmail.com>
Security:		Hardens against remote mbuf exhaustion attack.
Sponsored by:		Netflix
Sponsored by:		Nginx, Inc.
2014-09-04 09:15:44 +00:00
Gleb Smirnoff
bf7dcda366 Clean up unused CSUM_FRAGMENT.
Sponsored by:	Nginx, Inc.
2014-09-03 08:30:18 +00:00
Gleb Smirnoff
c26544aa7f Make SOCK_RAW sockets to be truly raw, not modifying received and sent
packets at all. Swapping byte order on SOCK_RAW was actually a bug, an
artifact from the BSD network stack, that used to convert a packet to
native byte order once it is received by kernel.

Other operating systems didn't follow this, and later other BSD
descendants fixed this, leaving us alone with the bug. Now it is
clear that we should fix the bug.

In collaboration with:	Olivier Cochard-Labbé <olivier cochard.me>
See also:		https://wiki.freebsd.org/SOCK_RAW
Sponsored by:		Nginx, Inc.
2014-09-01 14:04:51 +00:00
Alexander V. Chernikov
0cba2b2802 Add support for multi-field values inside ipfw tables.
This is the last major change in given branch.

Kernel changes:
* Use 64-bytes structures to hold multi-value variables.
* Use shared array to hold values from all tables (assume
  each table algo is capable of holding 32-byte variables).
* Add some placeholders to support per-table value arrays in future.
* Use simple eventhandler-style API to ease the process of adding new
  table items. Currently table addition may required multiple UH drops/
  acquires which is quite tricky due to atomic table modificatio/swap
  support, shared array resize, etc. Deal with it by calling special
  notifier capable of rolling back state before actually performing
  swap/resize operations. Original operation then restarts itself after
  acquiring UH lock.
* Bump all objhash users default values to at least 64
* Fix custom hashing inside objhash.

Userland changes:
* Add support for dumping shared value array via "vlist" internal cmd.
* Some small print/fill_flags dixes to support u32 values.
* valtype is now bitmask of
  <skipto|pipe|fib|nat|dscp|tag|divert|netgraph|limit|ipv4|ipv6>.
  New values can hold distinct values for each of this types.
* Provide special "legacy" type which assumes all values are the same.
* More helpers/docs following..

Some examples:

3:41 [1] zfscurr0# ipfw table mimimi create valtype skipto,limit,ipv4,ipv6
3:41 [1] zfscurr0# ipfw table mimimi info
+++ table(mimimi), set(0) +++
 kindex: 2, type: addr
 references: 0, valtype: skipto,limit,ipv4,ipv6
 algorithm: addr:radix
 items: 0, size: 296
3:42 [1] zfscurr0# ipfw table mimimi add 10.0.0.5 3000,10,10.0.0.1,2a02:978:2::1
added: 10.0.0.5/32 3000,10,10.0.0.1,2a02:978:2::1
3:42 [1] zfscurr0# ipfw table mimimi list
+++ table(mimimi), set(0) +++
10.0.0.5/32 3000,0,10.0.0.1,2a02:978:2::1
2014-08-31 23:51:09 +00:00
Gleb Smirnoff
546451a2e5 Use macros instead of referencing struct if_data that resides in ifnet.
Sponsored by:	Nginx, Inc.
2014-08-31 06:30:50 +00:00
Michael Tuexen
76031b19ef Announce SCTP support in the kern.features sysctl variables.
MFC after: 3 days
2014-08-26 21:15:34 +00:00
Alexander V. Chernikov
832fd78087 Sync to HEAD@r270409. 2014-08-23 14:58:31 +00:00
Xin LI
a7f77a3950 Restore historical behavior of in_control, which, when no matching address
is found, the first usable address is returned for legacy ioctls like
SIOCGIFBRDADDR, SIOCGIFDSTADDR, SIOCGIFNETMASK and SIOCGIFADDR.

While there also fix a subtle issue that a caller from a jail asking for
INADDR_ANY may get the first IP of the host that do not belong to the jail.

Submitted by:	glebius
Differential Revision: https://reviews.freebsd.org/D667
2014-08-22 19:08:12 +00:00
Lawrence Stewart
8b0fe327e8 Destroy the "qdiffsample_zone" UMA zone on unload to avoid a use-after-unload
panic easily triggered by running "sysctl -a" after unload.

Reported and tested by:	Grenville Armitage <garmitage@swin.edu.au>
MFC after:	1 week
2014-08-19 02:19:53 +00:00
Alexander V. Chernikov
4bbd15771b Make room for multi-type values in struct tentry. 2014-08-15 12:58:32 +00:00
Kevin Lo
73d76e77b6 Change pr_output's prototype to avoid the need for explicit casts.
This is a follow up to r269699.

Phabric:	D564
Reviewed by:	jhb
2014-08-15 02:43:02 +00:00
Alexander V. Chernikov
c21034b744 Replace "cidr" table type with "addr" type.
Suggested by:	luigi
2014-08-14 21:43:20 +00:00
Alexander V. Chernikov
18ad419788 * Fix displaying dynamic rules for large rulesets.
* Clean up some comments.
2014-08-14 08:21:22 +00:00
Alexander V. Chernikov
1b833d535b Sync to HEAD@r269943. 2014-08-13 16:20:41 +00:00
Michael Tuexen
f0396ad15e Add support for the SCTP_PR_STREAM_STATUS and SCTP_PR_ASSOC_STATUS
socket options. This includes managing the correspoing stat counters.
Add the SCTP_DETAILED_STR_STATS kernel option to control per policy
counters on every stream. The default is off and only an aggregated
counter is available. This is sufficient for the RTCWeb usecase.

MFC after: 1 week
2014-08-13 15:50:16 +00:00
Alexander V. Chernikov
1940fa7727 Change tablearg value to be 0 (try #2).
Most of the tablearg-supported opcodes does not accept 0 as valid value:
 O_TAG, O_TAGGED, O_PIPE, O_QUEUE, O_DIVERT, O_TEE, O_SKIPTO, O_CALLRET,
 O_NETGRAPH, O_NGTEE, O_NAT treats 0 as invalid input.

The rest are O_SETDSCP and O_SETFIB.
'Fix' them by adding high-order bit (0x8000) set for non-tablearg values.
Do translation in kernel for old clients (import_rule0 / export_rule0),
teach current ipfw(8) binary to add/remove given bit.

This change does not affect handling SETDSCP values, but limit
O_SETFIB values to 32767 instead of 65k. Since currently we have either
old (16) or new (2^32) max fibs, this should not be a big deal:
we're definitely OK for former and have to add another opcode to deal
with latter, regardless of tablearg value.
2014-08-12 15:51:48 +00:00
Michael Tuexen
97a0ca5b3e Change SCTP sysctl from auth_disable to auth_enable. This is
consistent with other similar sysctl variable used in SCTP.
2014-08-12 13:13:11 +00:00
Michael Tuexen
c79bec9c75 Add support for the SCTP_AUTH_SUPPORTED and SCTP_ASCONF_SUPPORTED
socket options. Add also a sysctl to control the support of ASCONF.

MFC after: 1 week
2014-08-12 11:30:16 +00:00
Alexander V. Chernikov
4f43138ade * Add the abilify to lock/unlock given table from changes.
Example:

# ipfw table si lock
# ipfw table si info
+++ table(si), set(0) +++
 kindex: 0, type: cidr, locked
 valtype: number, references: 0
 algorithm: cidr:radix
 items: 0, size: 288
# ipfw table si add 4.5.6.7
ignored: 4.5.6.7/32 0
ipfw: Adding record failed: table is locked
# ipfw table si unlock
# ipfw table si add 4.5.6.7
added: 4.5.6.7/32 0
# ipfw table si lock
# ipfw table si delete 4.5.6.7
ignored: 4.5.6.7/32 0
ipfw: Deleting record failed: table is locked
# ipfw table si unlock
# ipfw table si delete 4.5.6.7
deleted: 4.5.6.7/32 0
2014-08-11 18:09:37 +00:00
Alexander V. Chernikov
3a845e1076 * Add support for batched add/delete for ipfw tables
* Add support for atomic batches add (all or none).
* Fix panic on deleting non-existing entry in radix algo.

Examples:

# si is empty
# ipfw table si add 1.1.1.1/32 1111 2.2.2.2/32 2222
added: 1.1.1.1/32 1111
added: 2.2.2.2/32 2222
# ipfw table si add 2.2.2.2/32 2200 4.4.4.4/32 4444
exists: 2.2.2.2/32 2200
added: 4.4.4.4/32 4444
ipfw: Adding record failed: record already exists
^^^^^ Returns error but keeps inserted items
# ipfw table si list
+++ table(si), set(0) +++
1.1.1.1/32 1111
2.2.2.2/32 2222
4.4.4.4/32 4444
# ipfw table si atomic add 3.3.3.3/32 3333 4.4.4.4/32 4400 5.5.5.5/32 5555
added(reverted): 3.3.3.3/32 3333
exists: 4.4.4.4/32 4400
ignored: 5.5.5.5/32 5555
ipfw: Adding record failed: record already exists
^^^^^ Returns error and reverts added records
# ipfw table si list
+++ table(si), set(0) +++
1.1.1.1/32 1111
2.2.2.2/32 2222
4.4.4.4/32 4444
2014-08-11 17:34:25 +00:00
Hans Petter Selasky
e167cb89a2 Fix string length argument passed to "sysctl_handle_string()" so that
the complete string is returned by the function and not just only one
byte.

PR:	192544
MFC after:	2 weeks
2014-08-10 07:51:55 +00:00
Hiren Panchasara
f7469d3e52 Improve comments by listing a criteria for automatic increment of receive socket
buffer.

Reviewed by:	jmg
2014-08-09 21:01:24 +00:00
Michael Tuexen
82eaf95e8d Small modification of the sctp_input() cleanup to avoid having
code between declariations.
2014-08-09 14:33:44 +00:00
Konstantin Belousov
1216eb3320 Fix one more compiler warning, m is not initialized. 2014-08-08 15:50:02 +00:00
Alexander V. Chernikov
8bd1921248 Partially revert previous commit:
"0" value is perfectly valid for O_SETFIB and O_SETDSCP,
  so tablearg remains to be 655535 for now.
2014-08-08 15:33:26 +00:00
Alexander V. Chernikov
2c452b20dd * Switch tablearg value from 65535 to 0.
* Use u16 table kidx instead of integer on for iface opcode.
* Provide compability layer for old clients.
2014-08-08 14:23:20 +00:00
Alexander V. Chernikov
adf3b2b9d8 * Add IP_FW_TABLE_XMODIFY opcode
* Since there seems to be lack of consensus on strict value typing,
  remove non-default value types. Use userland-only "value format type"
  to print values.

Kernel changes:
* Add IP_FW_XMODIFY to permit table run-time modifications.
  Currently we support changing limit and value format type.

Userland changes:
* Support IP_FW_XMODIFY opcode.
* Support specifying value format type (ftype) in tablble create/modify req
* Fine-print value type/value format type.
2014-08-08 09:27:49 +00:00
Bjoern A. Zeeb
eb5eb08820 Fix argument to KTR after r269699 to unbreak LINT builds. 2014-08-08 09:17:02 +00:00
Alexander V. Chernikov
28ea4fa355 Remove IP_FW_TABLES_XGETSIZE opcode.
It is superseded by IP_FW_TABLES_XLIST.
2014-08-08 06:36:26 +00:00
Kevin Lo
8f5a8818f5 Merge 'struct ip6protosw' and 'struct protosw' into one. Now we have
only one protocol switch structure that is shared between ipv4 and ipv6.

Phabric:	D476
Reviewed by:	jhb
2014-08-08 01:57:15 +00:00
Alexander V. Chernikov
a73d728d31 Kernel changes:
* Implement proper checks for switching between global and set-aware tables
* Split IP_FW_DEL mess into the following opcodes:
  * IP_FW_XDEL (del rules matching pattern)
  * IP_FW_XMOVE (move rules matching pattern to another set)
  * IP_FW_SET_SWAP (swap between 2 sets)
  * IP_FW_SET_MOVE (move one set to another one)
  * IP_FW_SET_ENABLE (enable/disable sets)
* Add IP_FW_XZERO / IP_FW_XRESETLOG to finish IP_FW3 migration.
* Use unified ipfw_range_tlv as range description for all of the above.
* Check dynamic states IFF there was non-zero number of deleted dyn rules,
* Del relevant dynamic states with singe traversal instead of per-rule one.

Userland changes:
* Switch ipfw(8) to use new opcodes.
2014-08-07 21:37:31 +00:00
Michael Tuexen
317e00ef86 Add support for the SCTP_RECONFIG_SUPPORTED and the corresponding
sysctl controlling the negotiation of the RE-CONFIG extension.

MFC after: 3 days
2014-08-04 20:07:35 +00:00
Hiren Panchasara
76504ce978 Add a comment for easier code understanding. 2014-08-04 19:42:48 +00:00
Alexander V. Chernikov
46d5200874 Implement atomic ipfw table swap.
Kernel changes:
* Add opcode IP_FW_TABLE_XSWAP
* Add support for swapping 2 tables with the same type/ftype/vtype.
* Make skipto cache init after ipfw locks init.

Userland changes:
* Add "table X swap Y" command.
2014-08-03 21:37:12 +00:00