5247 Commits

Author SHA1 Message Date
tuexen
9794079730 Fix a bug where messages would not be sent in SHUTDOWN_RECEIVED state.
This problem was reported by Mark Bonnekessel and Markus Boese.
Thanks to Irene Ruengeler for helping me to fix the cause of
the problem. It can be tested with the following packetdrill script:

+0.0 socket(..., SOCK_STREAM, IPPROTO_SCTP) = 3
+0.0 fcntl(3, F_GETFL) = 0x2 (flags O_RDWR)
+0.0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
// Check the handshake with an empty(!) cookie
+0.1 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
+0.0 > sctp: INIT[flgs=0, tag=1, a_rwnd=..., os=..., is=..., tsn=0, ...]
+0.1 < sctp: INIT_ACK[flgs=0, tag=2, a_rwnd=10000, os=1, is=1, tsn=0, STATE_COOKIE[len=4, val=...]]
+0.0 > sctp: COOKIE_ECHO[flgs=0, len=4, val=...]
+0.1 < sctp: COOKIE_ACK[flgs=0]
+0.0 getsockopt(3, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
+0.0 write(3, ..., 1024) = 1024
+0.0 > sctp: DATA[flgs=BE, len=1040, tsn=0, sid=0, ssn=0, ppid=0]
+0.0 write(3, ..., 1024) = 1024 // Pending due to Nagle
+0.0 < sctp: SHUTDOWN[flgs=0, cum_tsn=0]
+0.0 > sctp: DATA[flgs=BE, len=1040, tsn=1, sid=0, ssn=1, ppid=0]
+0.0 < sctp: SACK[flgs=0, cum_tsn=1, a_rwnd=10000, gaps=[], dups=[]] // Do we need another SHUTDOWN here?
+0.0 > sctp: SHUTDOWN_ACK[flgs=0]
+0.0 < sctp: SHUTDOWN_COMPLETE[flgs=0]
+0.0 close(3) = 0

MFC after: 3 days
2015-05-28 18:34:02 +00:00
tuexen
ec768a1460 Use macros for overhead in a consistent way. No functional change.
Thanks to Irene Ruengeler for suggesting the change.

MFC after: 3 days
2015-05-28 17:57:56 +00:00
tuexen
67b3bbe09c Some more debug info cleanup.
MFC after: 3 days
2015-05-28 16:39:22 +00:00
tuexen
a82f33e60c Fix and cleanup the debug information. This has no user-visible changes.
Thanks to Irene Ruengeler for providing a patch.

MFC after: 3 days
2015-05-28 16:00:23 +00:00
tuexen
d4fad6a818 Address some compiler warnings. No functional change.
MFC after: 3 days
2015-05-28 14:24:21 +00:00
jkim
318c4f97e6 CALLOUT_MPSAFE has lost its meaning since r141428, i.e., for more than ten
years for head.  However, it is continuously misused as the mpsafe argument
for callout_init(9).  Deprecate the flag and clean up callout_init() calls
to make them more consistent.

Differential Revision:	https://reviews.freebsd.org/D2613
Reviewed by:	jhb
MFC after:	2 weeks
2015-05-22 17:05:21 +00:00
hiren
0e45b26b69 Add a new sysctl net.inet.tcp.hostcache.purgenow=1 to expire and purge all
entries in hostcache immediately.

In collaboration with:	bz, rwatson
MFC after:	1 week
Relnotes:	yes
Sponsored by:	Limelight Networks
2015-05-20 01:08:01 +00:00
hiren
b33b449313 Correct the wording as we are increasing the window size.
Reviewed by:	jhb
Sponsored by:	Limelight Networks
2015-05-19 19:17:20 +00:00
ae
cbc4e577f0 Add the ability to accept encapsulated packets from different sources on one
gif(4) interface. Add a new option "ignore_source" for the gif(4) interface.
When it is enabled, gif's encapcheck function requires a match only for the
packet's destination address.

Differential Revision:	https://reviews.freebsd.org/D2004
Obtained from:	Yandex LLC
MFC after:	2 weeks
Sponsored by:	Yandex LLC
2015-05-15 12:19:45 +00:00
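Note on cbc4e577f0: the matching rule described in that message can be sketched in userland C as follows. This is only an illustration, not the gif(4) code. The struct, its fields and the function name are hypothetical, and addresses are modelled as plain IPv4 integers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified tunnel configuration (IPv4 addresses as integers). */
struct tunnel_cfg {
	uint32_t local;		/* our outer address */
	uint32_t remote;	/* configured peer outer address */
	bool ignore_source;	/* models the "ignore_source" option */
};

/* Returns true if an encapsulated packet should be accepted by the tunnel. */
static bool
tunnel_accepts(const struct tunnel_cfg *t, uint32_t pkt_src, uint32_t pkt_dst)
{
	if (pkt_dst != t->local)
		return (false);		/* the destination must always match */
	if (t->ignore_source)
		return (true);		/* any source is acceptable */
	return (pkt_src == t->remote);	/* default: the source must match too */
}
```

With ignore_source set, one tunnel endpoint can therefore terminate traffic from many peers, at the cost of no longer checking who sent the packet.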
tuexen
02ec72fed7 Ensure that the COOKIE-ACK can be sent over UDP if the COOKIE-ECHO was
received over UDP.
Thanks to Felix Weinrank for making me aware of the problem and to
Irene Ruengeler for providing the fix.

MFC after: 1 week
2015-05-12 08:08:16 +00:00
gnn
ac5008de11 Add a state transition call to show that we have entered TIME_WAIT.
Although this is not important to the rest of the TCP processing
it is a convenient way to make the DTrace state-transition probe
catch this important state change.

MFC after:	1 week
2015-05-01 12:49:03 +00:00
gnn
d333d3e495 Move the SIFTR DTrace probe out of the writing thread context
and directly into the place where the data is collected.
2015-04-30 17:43:40 +00:00
gnn
08d35a248b Brief demo script showing the various values that can be read via
the new SIFTR statically defined tracepoint (SDT).

Differential Revision:	https://reviews.freebsd.org/D2387
Reviewed by:	bz, markj
2015-04-29 17:19:55 +00:00
melifaro
9f3d7ccd07 Make rule table kernel-index rewriting support any kind of objects.
Currently we have tables identified by their names in userland
with internal kernel-assigned indices. This works the following way:

When userland wishes to communicate with kernel to add or change rule(s),
it makes indexed sorted array of table names
(internally ipfw_obj_ntlv entries), and refer to indices in that
array in rule manipulation.
Prior to committing new rule to the ruleset kernel
a) finds all referenced tables, bump their refcounts and change
 values inside the opcodes to be real kernel indices
b) auto-creates all referenced but not existing tables and then
 do a) for them.

Kernel does almost the same when exporting rules to userland:
 prepares array of used tables in all rules in range, and
 prepends it before the actual ruleset retaining actual in-kernel
 indexes for that.

There is also special translation layer for legacy clients which is
able to provide 'real' indices for table names (basically doing atoi()).

While it is arguable that every subsystem really needs names instead of
numbers, there are several things that should be noted:

1) every non-singleton subsystem needs to store its runtime state
somewhere inside ipfw chain (and be able to get it fast)
2) we can't assume object numbers provided by humans will be dense.

Existing nat implementation (O(n) access and LIST inside chain) is a
good example.

Hence the following:
* Convert table-centric rewrite code to be more generic, callback-based
* Move most of the code from ip_fw_table.c to ip_fw_sockopt.c
* Provide abstract API to permit subsystems convert their objects
  between userland string identifier and in-kernel index.
  (See struct opcode_obj_rewrite) for more details
* Create another per-chain index (in next commit) shared among all subsystems
* Convert current NAT44 implementation to use new API, O(1) lookups,
 shared index and names instead of numbers (in next commit).

Sponsored by:	Yandex LLC
2015-04-27 08:29:39 +00:00
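Note on 9f3d7ccd07: the name-to-index rewriting described in that message can be modelled roughly as below. This is a userland sketch under stated assumptions, not the ipfw implementation. The object table, its size limits and both function names are hypothetical, and there is no locking and no rollback on failure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_OBJS 8	/* hypothetical capacity of the object table */
#define NAME_LEN 16	/* hypothetical maximum name size, including NUL */

/* Hypothetical in-kernel object: a name plus a reference count. */
struct kobj {
	char name[NAME_LEN];
	unsigned refcnt;
	bool used;
};

static struct kobj kobjs[MAX_OBJS];

/* Find an object by name, auto-creating it if missing. Returns its index or -1. */
static int
find_or_create(const char *name)
{
	int free_slot = -1;

	assert(strlen(name) < NAME_LEN);
	for (int i = 0; i < MAX_OBJS; i++) {
		if (kobjs[i].used && strcmp(kobjs[i].name, name) == 0)
			return (i);
		if (!kobjs[i].used && free_slot < 0)
			free_slot = i;
	}
	if (free_slot < 0)
		return (-1);
	strcpy(kobjs[free_slot].name, name);
	kobjs[free_slot].refcnt = 0;
	kobjs[free_slot].used = true;
	return (free_slot);
}

/*
 * Each refs[i] arrives as an index into the userland names[] array and is
 * rewritten in place to the matching kernel index, bumping the refcount.
 */
static int
rewrite_refs(const char *const *names, size_t nnames, uint16_t *refs,
    size_t nrefs)
{
	for (size_t i = 0; i < nrefs; i++) {
		int kidx;

		if (refs[i] >= nnames)
			return (-1);
		kidx = find_or_create(names[refs[i]]);
		if (kidx < 0)
			return (-1);
		kobjs[kidx].refcnt++;
		refs[i] = (uint16_t)kidx;
	}
	return (0);
}
```

Exporting rules to userland would run the same mapping in reverse: collect the kernel indices in use, emit their names as an array, and rewrite the references to positions in that array.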
ae
b38774ffc9 Remove now unneeded KEY_FREESP() for case when ipsec[46]_process_packet()
returns EJUSTRETURN.

Sponsored by:	Yandex LLC
2015-04-27 01:11:09 +00:00
ae
5a6412a276 Fix possible use after free due to security policy deletion.
When we pass an mbuf to IPSec processing via ipsec[46]_process_packet(),
we hold one reference to the security policy and release it just after returning
from this function. But IPSec processing can be deferred, and once we release
the reference after ipsec[46]_process_packet() returns, the user can
delete this security policy from the SPDB. When the deferred IPSec processing
completes, the xform's callback function then accesses already freed memory.

To fix this, move KEY_FREESP() into the callback function. Now the IPSec code
releases its reference to the SP after processing has finished.

Differential Revision:	https://reviews.freebsd.org/D2324
No objections from:	#network
Sponsored by:	Yandex LLC
2015-04-27 00:55:56 +00:00
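Note on 5a6412a276: the lifetime rule behind that fix (drop the reference in the completion callback, not when the submitting function returns) can be sketched as below. All names are hypothetical, and a "freed" flag stands in for actually releasing memory so that the state stays inspectable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a security policy with a reference count. */
struct policy {
	int refcnt;
	bool freed;	/* set instead of calling free() */
};

static void
policy_ref(struct policy *sp)
{
	sp->refcnt++;
}

static void
policy_unref(struct policy *sp)
{
	assert(sp->refcnt > 0);
	if (--sp->refcnt == 0)
		sp->freed = true;
}

/* Hypothetical deferred request that keeps its own reference to the policy. */
struct request {
	struct policy *sp;
	bool done;
};

/* Submit work; returns before the (possibly deferred) processing finishes. */
static void
request_start(struct request *rq, struct policy *sp)
{
	policy_ref(sp);
	rq->sp = sp;
	rq->done = false;
}

/* Completion callback: the policy is still valid here, then released. */
static void
request_done(struct request *rq)
{
	assert(!rq->sp->freed);
	rq->done = true;
	policy_unref(rq->sp);
	rq->sp = NULL;
}
```

Deleting the policy while a request is in flight only drops the database's reference, so the object survives until request_done() runs.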
tuexen
5106da568a Don't panic under INVARIANTS when receiving a SACK which cumacks
a TSN never sent.
While there, fix two typos.

MFC after: 1 week
2015-04-26 21:47:15 +00:00
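Note on 5106da568a: a SACK whose cumulative TSN ack lies beyond anything sent can be detected with serial-number comparison, since TSNs wrap modulo 2^32. The sketch below shows one such check. The function names are hypothetical and this is not the SCTP stack's own code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Serial-number comparison: true if TSN a is "after" TSN b, modulo 2^32. */
static bool
tsn_gt(uint32_t a, uint32_t b)
{
	return ((int32_t)(a - b) > 0);
}

/*
 * Hypothetical check: a cumulative ack is plausible only if it does not
 * lie beyond the highest TSN that has been sent so far.
 */
static bool
cumack_is_plausible(uint32_t cum_tsn, uint32_t highest_tsn_sent)
{
	return (!tsn_gt(cum_tsn, highest_tsn_sent));
}
```

A receiver of an implausible SACK can then discard it or abort the association instead of tripping an assertion.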
bapt
9ec79fef30 mdoc: fix rendering issues
2015-04-26 11:39:25 +00:00
ae
2af9b531aa Fix possible reference leak.
Sponsored by:	Yandex LLC
2015-04-24 21:05:29 +00:00
glebius
2ac0a031ed Improve carp(4) locking:
- Use the carp_sx to serialize not only CARP ioctls, but also carp_attach()
  and carp_detach().
- Use cif_mtx to lock only access to the linked list.
- These locking changes allow us to do some memory allocations with M_WAITOK
  and also properly call callout_drain() in carp_destroy().
- In carp_attach() assert that ifaddr isn't attached. We always come here
  with a pristine address from in[6]_control().

Reviewed by:	oleg
Sponsored by:	Nginx, Inc.
2015-04-21 20:25:12 +00:00
glebius
14b7122d6d Provide functions to determine presence of a given address
configured on a given interface.

Discussed with:	np
Sponsored by:	Nginx, Inc.
2015-04-17 11:57:06 +00:00
jch
b227cb3d85 Fix an old and well-documented use-after-free race condition in
TCP timers:
 - Add a reference from tcpcb to its inpcb
 - Defer tcpcb deletion until TCP timers have finished

Differential Revision:	https://reviews.freebsd.org/D2079
Submitted by:		jch, Marc De La Gueronniere <mdelagueronniere@verisign.com>
Reviewed by:		imp, rrs, adrian, jhb, bz
Approved by:		jhb
Sponsored by:		Verisign, Inc.
2015-04-16 10:00:06 +00:00
adrian
efbeb67101 Fix RSS build - netisr input / NETISR_IP_DIRECT is used here.
2015-04-15 00:57:21 +00:00
mjg
7f65d178b4 Replace struct filedesc argument in getsock_cap with struct thread
This is a step towards removal of spurious arguments.
2015-04-11 16:00:33 +00:00
mjg
22da590f11 fd: remove filedesc argument from fdclose
Just accept a thread instead. This makes it consistent with fdalloc.

No functional changes.
2015-04-11 15:40:28 +00:00
delphij
34b5443a48 Attempt to fix build after 281351 by defining full prototype for the
functions that were moved to ip_reass.c.
2015-04-11 01:06:59 +00:00
glebius
d2eb1d4ad8 o Use Jenkins hash. With previous hash, for a single source IP address and
sequential IP ID case (e.g. ping -f), distribution fell into 8-10 buckets
  out of 64. With Jenkins hash, distribution is even.
o Add random seed to the hash.

Sponsored by:	Nginx, Inc.
2015-04-10 06:55:43 +00:00
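Note on d2eb1d4ad8: a seeded Jenkins-style hash over (source address, IP ID) mapped onto 64 buckets can be sketched as follows. The kernel's own hash routine is not reproduced here. Bob Jenkins' simpler one-at-a-time function stands in for it, and the key layout and function names are assumptions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SK_NBUCKETS 64	/* bucket count quoted in the commit message */

/* Bob Jenkins' one-at-a-time hash, started from a caller-supplied seed. */
static uint32_t
jenkins_oaat(const uint8_t *key, size_t len, uint32_t seed)
{
	uint32_t h = seed;

	for (size_t i = 0; i < len; i++) {
		h += key[i];
		h += h << 10;
		h ^= h >> 6;
	}
	h += h << 3;
	h ^= h >> 11;
	h += h << 15;
	return (h);
}

/* Hypothetical bucket selection keyed on source address and IP ID. */
static unsigned
frag_bucket(uint32_t src, uint16_t ip_id, uint32_t seed)
{
	uint8_t key[sizeof(src) + sizeof(ip_id)];

	memcpy(key, &src, sizeof(src));
	memcpy(key + sizeof(src), &ip_id, sizeof(ip_id));
	return (jenkins_oaat(key, sizeof(key), seed) & (SK_NBUCKETS - 1));
}
```

Choosing the seed at random at boot makes it harder for a remote sender to predict which bucket a given fragment lands in.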
glebius
a0f9f3c303 Move all code related to IP fragment reassembly to ip_reass.c. Some
function names have changed and comments are reformatted or added, but
there is no functional change.

Claim copyright for me and Adrian.

Sponsored by:	Nginx, Inc.
2015-04-10 06:02:37 +00:00
glebius
1f2f31e87d Now that IP reassembly is no longer under single lock, book-keeping amount
of allocations in V_nipq is racy.  To fix that, we would simply stop doing
book-keeping ourselves, and rely on UMA doing that.  There could be a
slight overcommit due to caches, but that isn't a big deal.

o V_nipq and V_maxnipq go away.
o net.inet.ip.fragpackets is now just SYSCTL_UMA_CUR()
o net.inet.ip.maxfragpackets could have been just SYSCTL_UMA_MAX(), but
  historically it has special semantics about values of 0 and -1, so
  provide sysctl_maxfragpackets() to handle these special cases.
o If zone limit lowers either due to net.inet.ip.maxfragpackets or due to
  kern.ipc.nmbclusters, then new function ipq_drain_tomax() goes over
  buckets and frees the oldest packets until we are in the limit.
  The code that (incorrectly) did that in ip_slowtimo() is removed.
o ip_reass() doesn't check any limits and calls uma_zalloc(M_NOWAIT).
  If it fails, a new function ipq_reuse() is called. This function will
  find the oldest packet in the currently locked bucket, and if there is
  none, it will search in other buckets until success.

Sponsored by:	Nginx, Inc.
2015-04-09 22:13:27 +00:00
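Note on 1f2f31e87d: the reuse strategy described for ipq_reuse() (take the oldest packet from the current bucket, otherwise search the other buckets) can be modelled as below. This is a simplified sketch: fixed-size arrays replace the kernel's queues, a creation timestamp defines "oldest", locking is omitted, and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SK_BUCKET_CAP 4	/* hypothetical per-bucket capacity */

/* Hypothetical reassembly entry; a smaller "created" value means older. */
struct frag_entry {
	bool in_use;
	unsigned created;
};

struct frag_bucket {
	struct frag_entry e[SK_BUCKET_CAP];
};

/* Return the oldest in-use entry of one bucket, or NULL if it is empty. */
static struct frag_entry *
oldest_in_bucket(struct frag_bucket *b)
{
	struct frag_entry *old = NULL;

	for (size_t i = 0; i < SK_BUCKET_CAP; i++) {
		struct frag_entry *e = &b->e[i];

		if (e->in_use && (old == NULL || e->created < old->created))
			old = e;
	}
	return (old);
}

/*
 * Reclaim one entry: try the current bucket first, then the others in
 * order. Returns the reclaimed entry, or NULL if every bucket is empty.
 */
static struct frag_entry *
reuse_entry(struct frag_bucket *buckets, size_t nbuckets, size_t cur)
{
	assert(cur < nbuckets);
	for (size_t n = 0; n < nbuckets; n++) {
		struct frag_entry *e;

		e = oldest_in_bucket(&buckets[(cur + n) % nbuckets]);
		if (e != NULL) {
			e->in_use = false;	/* caller may now reuse it */
			return (e);
		}
	}
	return (NULL);
}
```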
glebius
8685b42463 In the ip_reass() do packet examination and adjusting before acquiring
locks and doing lookups.

Sponsored by:	Nginx, Inc.
2015-04-09 21:32:32 +00:00
glebius
a68631b0e4 Make ip reassembly queue mutexes per-vnet, putting them into the structure
that they protect.

Sponsored by:	Nginx, Inc.
2015-04-09 21:17:07 +00:00
glebius
977624bc79 Use TAILQ_FOREACH_SAFE() instead of implementing it ourselves.
Sponsored by:	Nginx, Inc.
2015-04-09 09:00:32 +00:00
glebius
03afa52780 If V_maxnipq is set to zero, drain the queue here and now, instead of
relying on timeouts.

Sponsored by:	Nginx, Inc.
2015-04-09 08:56:23 +00:00
glebius
ddd581998f o Since we always update either fragdrop or fragtimeout stat counter when we
free a fragment, provide two inline functions that do that for us:
  ipq_drop() and ipq_timeout().
o Rename ip_free_f() to ipq_free() to match the name scheme of IP reassembly.
o Remove assertion from ipq_free(), since it requires extra argument to be
  passed, but locking scheme is simple enough and function is static.

Sponsored by:	Nginx, Inc.
2015-04-09 08:52:02 +00:00
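Note on ddd581998f: pairing "update the counter" with "free the queue" in small inline helpers might look like the sketch below. The statistics struct, the queue type and the helper names are hypothetical stand-ins, and counting by number of fragments is an assumption made for illustration.

```c
#include <stdbool.h>

/* Hypothetical counters mirroring the two statistics named in the commit. */
struct frag_stats {
	unsigned long fragdropped;
	unsigned long fragtimeout;
};

/* Hypothetical reassembly queue holding a number of fragments. */
struct frag_queue {
	unsigned nfrags;
	bool freed;	/* set instead of releasing memory */
};

static void
fq_free(struct frag_queue *q)
{
	q->freed = true;
}

/* Free a queue because it was dropped, accounting for its fragments. */
static inline void
fq_drop(struct frag_stats *st, struct frag_queue *q)
{
	st->fragdropped += q->nfrags;
	fq_free(q);
}

/* Free a queue because it timed out, accounting for its fragments. */
static inline void
fq_timeout(struct frag_stats *st, struct frag_queue *q)
{
	st->fragtimeout += q->nfrags;
	fq_free(q);
}
```

Callers then cannot free a queue without picking one of the two counters, which is the invariant the commit message describes.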
glebius
89392dc82f Rename ip_drain_locked() to ip_drain_vnet(), since the function differs
from ip_drain() not in locking, but in the scope of its work.

Sponsored by:	Nginx, Inc.
2015-04-09 08:37:16 +00:00
adrian
f1a634931a Move the IPv4 reassembly queue locking from a single lock to be per-bucket (global).
This significantly improves performance on multi-core servers where there
is any kind of IPv4 reassembly going on.

glebius@ would like to see the locking moved to be attached to the reassembly
bucket, which would make it per-bucket + per-VNET, instead of being global.
I decided to keep it global for now as it's the minimal useful change;
if people agree / wish to migrate it to be per-bucket / per-VNET then please
do feel free to do so.  I won't complain.

Thanks to Norse Corp for giving me access to much larger servers
to test this on, beyond the 4-core boxes I have at home.

Differential Revision:	https://reviews.freebsd.org/D2095
Reviewed by:	glebius (initial comments incorporated into this patch)
MFC after:	2 weeks
Sponsored by:	Norse Corp, Inc (hardware)
2015-04-07 23:09:34 +00:00
delphij
4836e6055c Improve patch for SA-15:04.igmp to solve a potential buffer overflow.
Reported by:	bde
Submitted by:	oshogbo
2015-04-07 20:20:03 +00:00
glebius
42eded17aa Add sleepable lock to protect at least against two parallel SIOCSVHs.
Sponsored by:	Nginx, Inc.
2015-04-06 15:31:19 +00:00
hselasky
5d01d90225 Extend fixes made in r278103 and r38754 by copying the complete packet
header and not only partial flags and fields. Firewalls can attach
classification tags to the outgoing mbufs which should be copied to
all the new fragments. Else only the first fragment will be let
through by the firewall. This can easily be tested by sending a large
ping packet through a firewall. It was also discovered that VLAN
related flags and fields should be copied for packets traversing
through VLANs. This is all handled by "m_dup_pkthdr()".

Regarding the MAC policy check in ip_fragment(), the tag provided by
the originating mbuf is copied instead of using the default one
provided by m_gethdr().

Tested by:		Karim Fodil-Lemelin <fodillemlinkarim at gmail.com>
MFC after:		2 weeks
Sponsored by:		Mellanox Technologies
PR:			7802
2015-04-02 15:47:37 +00:00
jch
990bf230ee Provide better debugging information in tcp_timer_activate() and
tcp_timer_active()

Differential Revision:	https://reviews.freebsd.org/D2179
Suggested by:		bz
Reviewed by:		jhb
Approved by:		jhb
2015-04-02 14:43:07 +00:00
glebius
0514a3075b Provide a comment explaining issues with the counter(9) trick, so that
people won't copy and paste it blindly.

Prodded by:	ian
Sponsored by:	Nginx, Inc.
2015-04-02 14:22:59 +00:00
bz
868c1ec5c2 Try to unbreak the build after r280971 by providing the missing
#include header for SYSINIT.
2015-04-02 00:30:53 +00:00
glebius
7c22152af0 o Use new function ip_fillid() in all places throughout the kernel,
where we want to create a new IP datagram.
o Add support for RFC6864, which allows the IP ID of atomic IP
  datagrams to be set to any value, to improve performance. The behaviour is
  controlled by the net.inet.ip.rfc6864 sysctl knob, which is enabled by
  default.
o When we do generate an IP ID, use counter(9) to improve performance.
o Gather all code related to IP ID into ip_id.c.

Differential Revision:		https://reviews.freebsd.org/D2177
Reviewed by:			adrian, cy, rpaulo
Tested by:			Emeric POUPON <emeric.poupon stormshield.eu>
Sponsored by:			Netflix
Sponsored by:			Nginx, Inc.
Relnotes:			yes
2015-04-01 22:26:39 +00:00
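Note on 7c22152af0: RFC 6864 defines an atomic datagram as one with DF set, MF clear and a fragment offset of zero, and the ID field of such a datagram carries no meaning for reassembly. A possible shape for the decision is sketched below. The function names and the choice of 0 as the "any value" are assumptions, a plain counter stands in for counter(9), and the flag constants mirror the usual IPv4 header bit layout.

```c
#include <stdbool.h>
#include <stdint.h>

/* IPv4 flags/offset field layout (host byte order). */
#define SK_IP_DF	0x4000	/* don't fragment */
#define SK_IP_MF	0x2000	/* more fragments */
#define SK_IP_OFFMASK	0x1fff	/* fragment offset */

/* Atomic datagram per RFC 6864: DF set, MF clear, offset zero. */
static bool
is_atomic_datagram(uint16_t ip_off)
{
	return ((ip_off & SK_IP_DF) != 0 &&
	    (ip_off & (SK_IP_MF | SK_IP_OFFMASK)) == 0);
}

/*
 * Hypothetical ID selection: atomic datagrams skip ID generation when the
 * knob is enabled; everything else takes the next counter value.
 */
static uint16_t
fill_ip_id(uint16_t ip_off, bool rfc6864_enabled, uint16_t *counter)
{
	if (rfc6864_enabled && is_atomic_datagram(ip_off))
		return (0);
	return ((*counter)++);
}
```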
jch
e6365e0ee8 Use appropriate timeout_t* instead of void* in tcp_timer_activate()
Suggested by:		imp
Differential Revision:	https://reviews.freebsd.org/D2154
Reviewed by:		imp, jhb
Approved by:		jhb
2015-03-31 10:17:13 +00:00
glebius
5292c6a2d2 VNETalize random IP ID engine.
Sponsored by:	Nginx, Inc.
2015-03-28 16:59:57 +00:00
glebius
96ef75c4d1 Initialize random IP ID engine via SYSINIT() instead of doing that on
first packet.  This allows the use of M_WAITOK and cuts down on some error handling.

Sponsored by:	Nginx, Inc.
2015-03-28 16:06:46 +00:00
fabient
d819d79785 On multi CPU systems, we may emit successive packets with the same id.
Fix the race by using an atomic operation.

Differential Revision:	https://reviews.freebsd.org/D2141
Obtained from:	emeric.poupon@stormshield.eu
MFC after:	1 week
Sponsored by:	Stormshield
2015-03-27 13:26:59 +00:00
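Note on d819d79785: the race arises when two CPUs read and then increment a shared ID counter in separate steps, so both can observe the same value. A single atomic fetch-and-add closes that window. The sketch below uses C11 <stdatomic.h> rather than the kernel's atomic(9) primitives, and the variable and function names are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Shared counter; C11 atomics stand in for the kernel's atomic(9) API. */
static atomic_uint ip_id_counter;

/*
 * Hypothetical helper: read and increment happen as one indivisible step,
 * so two concurrent callers can never receive the same value.
 */
static uint16_t
next_ip_id(void)
{
	return ((uint16_t)atomic_fetch_add(&ip_id_counter, 1));
}
```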
tuexen
65bba872bd Improve the selection of the destination address of SACK chunks.
This fixes
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=196755
and is joint work with rrs@.

MFC after: 1 week
2015-03-26 22:05:31 +00:00
tuexen
7ca13bb5b0 Make sure that we don't free an SCTP shared key too early.
Thanks to Pouyan Sepehrdad from Qualcomm Product Security Initiative
for reporting the issue.
MFC after: 3 days
2015-03-25 22:45:54 +00:00
tuexen
34cbdad427 Use the reference count of the right SCTP inp.
Joint work with rrs@

MFC after: 3 days
2015-03-25 21:41:20 +00:00