Commit Graph

5289 Commits

Author SHA1 Message Date
George V. Neville-Neil
5d06879adb dd DTrace probe points, translators and a corresponding script
to provide the TCPDEBUG functionality with pure DTrace.

Reviewed by:	rwatson
MFC after:	2 weeks
Sponsored by:	Limelight Networks
Differential Revision:	D3530
2015-09-13 15:50:55 +00:00
Michael Tuexen
30811e70d9 Fix compilation issue introduced in r287717.
Thanks to bz@ for making me aware of it.

MFC after:	1 week
2015-09-12 21:23:24 +00:00
Michael Tuexen
6802b0904f Address a compile warning.
MFC after:	1 week
2015-09-12 18:00:06 +00:00
Michael Tuexen
86eda749af Cleanup the handling of error causes for ERROR chunks. This fixes
an inconsistency of the padding handling. The final padding is
now considered to be a chunk padding.

MFC after:	1 week
2015-09-12 17:08:51 +00:00
Michael Tuexen
e629b9fc56 Ensure that ERROR chunks are always padded by implementing this
in the routine, which queues an ERROR chunk, instead on relyinh
on the callers to do so. Since one caller missed this, this actially
fixes a bug.

MFC after:	1 week
2015-09-11 13:54:33 +00:00
Michael Tuexen
0941640f34 RFC 4960 requires that packets containing an INIT chunk bundled with
another chunk are silently discarded. Do so, instead of sending an
ABORT.

MFC after:	1 week
2015-09-07 14:00:38 +00:00
Allan Jude
32d321fa4a missed file that should have been included in r287528
PR:		184110
Submitted by:	Marie Helene Kvello-Aune <marieheleneka@gmail.com>
Approved by:	wblock (mentor)
2015-09-07 02:00:05 +00:00
Adrian Chadd
499baf0aa7 Replace rss_m2cpuid with rss_soft_m2cpuid_v4 for ip_direct_nh.nh_m2cpuid,
because the RSS hash may need to be recalculated.

Submitted by:	Tiwei Bie <btw@mail.ustc.edu.cn>
Differential Revision:	https://reviews.freebsd.org/D3564
2015-09-06 20:20:48 +00:00
Alexander V. Chernikov
26deb8826c Do not pass lle to nd6_ns_output(). Use newly-added
nd6_llinfo_get_holdsrc() to extract desired IPv6 source
  from holdchain and pass it to the nd6_ns_output().
2015-09-05 14:14:03 +00:00
Gleb Smirnoff
388909a12a Use Jenkins hash for TCP syncache.
o Unlike xor, in Jenkins hash every bit of input affects virtually
  every bit of output, thus salting the hash actually works. With
  xor salting only provides a false sense of security, since if
  hash(x) collides with hash(y), then of course, hash(x) ^ salt
  would also collide with hash(y) ^ salt. [1]
o Jenkins provides much better distribution than xor, very close to
  ideal.

TCP connection setup/teardown benchmark has shown a 10% increase
with default hash size, and with bigger hashes that still provide
possibility for collisions. With enormous hash size, when dataset is
by an order of magnitude smaller than hash size, the benchmark has
shown 4% decrease in performance decrease, which is expected and
acceptable.

Noticed by:	Jeffrey Knockel <jeffk cs.unm.edu> [1]
Benchmarks by:	jch
Reviewed by:	jch, pkelsey, delphij
Security:	strengthens protection against hash collision DoS
Sponsored by:	Nginx, Inc.
2015-09-05 10:15:19 +00:00
Gleb Smirnoff
24067db8ca Make tcp_mtudisc() static and void. No functional changes.
Sponsored by:	Nginx, Inc.
2015-09-04 12:02:12 +00:00
Michael Tuexen
6fb9db98b3 Don't leak memory in an error case.
MFC after:	1 week
2015-09-04 09:24:07 +00:00
Michael Tuexen
59713bbf27 Add a NULL pointer check to silence the clang code analyzer.
MFC after:	1 week
2015-09-04 09:22:16 +00:00
Michael Tuexen
aa1cfca969 Fix a bug where two SHUTDOWN_ACK chunks were sent if a SHUTDOWN chunk was
received acking all outstanding data.
2015-09-03 22:15:56 +00:00
Julien Charbon
d6de19ac2f Put r284245 back in place: If at first this fix was seen as a temporary
workaround for a callout(9) issue, it turns out it is instead the right
way to use callout in mpsafe mode without using callout_drain().

r284245 commit message:

Fix a callout race condition introduced in TCP timers callouts with r281599.
In TCP timer context, it is not enough to check callout_stop() return value
to decide if a callout is still running or not, previous callout_reset()
return values have also to be checked.

Differential Revision:	https://reviews.freebsd.org/D2763
2015-08-30 13:44:39 +00:00
Michael Tuexen
2e2d67945a Use 5 times RTO.Max as the default for the shutdown guard timer
as required by RFC 4960. The sysctl variable can be used to
overwrite this.

Discussed with:	rrs
MFC after:	1 week
2015-08-29 17:26:29 +00:00
Michael Tuexen
e92c2a8d6a Fix the exporting of SCTP association states to userland. Without this,
associations in SHUTDOWN-PENDING were never reported correctly.

MFC after:	3 weeks
2015-08-29 09:14:32 +00:00
Adrian Chadd
2527ccad2d Rename rss_soft_m2cpuid() -> rss_soft_m2cpuid_v4() in preparation for
an IPv6 version to show up.

Submitted by:	Tiwei Bie <btw@mail.ustc.edu.cn>
Differential Revision:	https://reviews.freebsd.org/D3504
2015-08-29 06:58:30 +00:00
Adrian Chadd
e5562eb934 Replace the printf()s with optional rate limited debugging for RSS.
Submitted by:	Tiwei Bie <btw@mail.ustc.edu.cn>
Differential Revision:	https://reviews.freebsd.org/D3471
2015-08-28 05:58:16 +00:00
Bjoern A. Zeeb
a86e5c96af get_inpcbinfo() and get_pcblist() are UDP local functions and
do not do what one would expect by name. Prefix them with "udp_"
to at least obviously limit the scope.

This is a non-functional change.

Reviewed by:		gnn, rwatson
MFC after:		2 weeks
Differential Revision:	https://reviews.freebsd.org/D3505
2015-08-27 15:27:41 +00:00
Julien Charbon
bcf9b91395 Revert r284245: "Fix a callout race condition introduced in TCP
timers callouts with r281599."

r281599 fixed a TCP timer race condition, but due a callout(9) bug
it also introduced another race condition workaround-ed with r284245.
The callout(9) bug being fixed with r286880, we can now revert the
workaround (r284245).

Differential Revision:	https://reviews.freebsd.org/D2079 (Initial change)
Differential Revision:	https://reviews.freebsd.org/D2763 (Workaround)
Differential Revision:	https://reviews.freebsd.org/D3078 (Fix)
Sponsored by:		Verisign, Inc.
MFC after:		2 weeks
2015-08-24 09:30:27 +00:00
Alexander V. Chernikov
5a2555160f * Split allocation and table linking for lle's.
Before that, the logic besides lle_create() was the following:
  return existing if found, create if not. This behaviour was error-prone
  since we had to deal with 'sudden' static<>dynamic lle changes.
  This commit fixes bunch of different issues like:
  - refcount leak when lle is converted to static.
    Simple check case:
    console 1:
    while true;
      do for i in `arp -an|awk '$4~/incomp/{print$2}'|tr -d '()'`;
        do arp -s $i 00:22:44:66:88:00 ; arp -d $i;
      done;
    done
   console 2:
    ping -f any-dead-host-in-L2
   console 3:
    # watch for memory consumption:
    vmstat -m | awk '$1~/lltable/{print$2}'
  - possible problems in arptimer() / nd6_timer() when dropping/reacquiring
   lock.
  New logic explicitly handles use-or-create cases in every lla_create
  user. Basically, most of the changes are purely mechanical. However,
  we explicitly avoid using existing lle's for interface/static LLE records.
* While here, call lle_event handlers on all real table lle change.
* Create lltable_free_entry() calling existing per-lltable
  lle_free_t callback for entry deletion
2015-08-20 12:05:17 +00:00
Alexander V. Chernikov
a4141c63c5 Check value return from lle_create() for NULL.
This bug sneaked unnoticed in r286722.

Reported by:	adrian
2015-08-19 21:08:42 +00:00
Julien Charbon
31a7749d4b Make clear that TIME_WAIT timeout expiration is managed solely by
tcp_tw_2msl_scan().

Sponsored by:	Verisign, Inc.
2015-08-18 08:27:26 +00:00
Alexander V. Chernikov
0c4210f984 Fix panic when handling non-inet arp message introduced in r286825.
Submitted by:	delphij
2015-08-18 06:16:19 +00:00
Alexander V. Chernikov
512e30ef9f Split arpresolve() into fast/slow path.
This change isolates the most common case (e.g. successful lookup)
  from more complicates scenarios. It also (tries to) make code
  more simple by avoiding retry: cycle.

The actual goal is to prepare code to the upcoming change that will
  allow LL address retrieval without acquiring LLE lock at all.

Reviewed by:		ae
Differential Revision:	https://reviews.freebsd.org/D3383
2015-08-16 12:23:58 +00:00
Michael Tuexen
faadc1b492 Allow the path MTU to grow up to the outgoing interface MTU.
MFC after: 3 days
2015-08-14 14:26:13 +00:00
Alexander V. Chernikov
f3bfa7d1cf Move lle update code from from gigantic ip_arpinput() to
separate bunch of functions. The goal is to isolate actual lle
updates to permit more fine-grained locking.

Do all lle link-level update under AFDATA wlock.

Sponsored by:	Yandex LLC
2015-08-13 13:38:09 +00:00
Hiren Panchasara
ad389a8c3b Remove unused TCPTV_SRTTDFLT. We initialize srtt with TCPTV_SRTTBASE when we
don't have any rtt estimate.

Differential Revision:	D3334
Sponsored by:		Limelight Networks
2015-08-12 16:08:37 +00:00
Alexander V. Chernikov
0447c1367a Use single 'lle_timer' callout in lltable instead of
two different names of the same timer.
2015-08-11 12:38:54 +00:00
Alexander V. Chernikov
314294de5c Store addresses instead of sockaddrs inside llentry.
This permits us having all (not fully true yet) all the info
needed in lookup process in first 64 bytes of 'struct llentry'.

struct llentry layout:
BEFORE:
[rwlock .. state .. state .. MAC ] (lle+1) [sockaddr_in[6]]
AFTER
[ in[6]_addr MAC .. state .. rwlock ]

Currently, address part of struct llentry has only 16 bytes for the key.
However, lltable does not restrict any custom lltable consumers with long
keys use the previous approach (store key at (lle+1)).

Sponsored by:	Yandex LLC
2015-08-11 09:26:11 +00:00
Alexander V. Chernikov
41cb42a633 MFP r276712.
* Split lltable_init() into lltable_allocate_htbl() (alloc
  hash table with default callbacks) and lltable_link() (
  links any lltable to the list).
* Switch from LLTBL_HASHTBL_SIZE to per-lltable hash size field.
* Move lltable setup to separate functions in in[6]_domifattach.
2015-08-11 05:51:00 +00:00
Alexander V. Chernikov
2caee4be35 Rename rt_foreach_fib() to rt_foreach_fib_walk().
Suggested by:	julian
2015-08-10 20:50:31 +00:00
Alexander V. Chernikov
11cdad9873 Partially merge r274887,r275334,r275577,r275578,r275586 to minimize
differences between projects/routing and HEAD.

This commit tries to keep code logic the same while changing underlying
code to use unified callbacks.

* Add llt_foreach_entry method to traverse all entries in given llt
* Add llt_dump_entry method to export particular lle entry in sysctl/rtsock
  format (code is not indented properly to minimize diff). Will be fixed
  in the next commits.
* Add llt_link_entry/llt_unlink_entry methods to link/unlink particular lle.
* Add llt_fill_sa_entry method to export address in the lle to sockaddr
  format.
* Add llt_hash method to use in generic hash table support code.
* Add llt_free_entry method which is used in llt_prefix_free code.

* Prepare for fine-grained locking by separating lle unlink and deletion in
  lltable_free() and lltable_prefix_free().

* Provide lltable_get<ifp|af>() functions to reduce direct 'struct lltable'
 access by external callers.

* Remove @llt agrument from lle_free() lle callback since it was unused.
* Temporarily add L3_CADDR() macro for 'const' sockaddr typecasting.
* Switch to per-af hashing code.
* Rename LLE_FREE_LOCKED() callback from in[6]_lltable_free() to
  in_[6]lltable_destroy() to avoid clashing with llt_free_entry() method.
  Update description from these functions.
* Use unified lltable_free_entry() function instead of per-af one.

Reviewed by:	ae
2015-08-10 12:03:59 +00:00
Kristof Provost
30edc5385e tcp_reass_zone is not a VNET variable.
This fixes a panic during 'sysctl -a' on VIMAGE kernels.

The tcp_reass_zone variable is not VNET_DEFINE() so we can not mark it as a VNET
variable (with CTLFLAG_VNET).
2015-08-09 19:07:24 +00:00
Marius Strobl
d2b5ade3f4 Fix compilation after r286458. 2015-08-08 21:42:15 +00:00
Marius Strobl
6e4cd74673 Fix compilation after r286457 w/o INVARIANTS or INVARIANT_SUPPORT. 2015-08-08 21:41:59 +00:00
Alexander V. Chernikov
4bdf0b6a9a MFP r274295:
* Move interface route cleanup to route.c:rt_flushifroutes()
* Convert most of "for (fibnum = 0; fibnum < rt_numfibs; fibnum++)" users
  to use new rt_foreach_fib() instead of hand-rolling cycles.
2015-08-08 18:14:59 +00:00
Alexander V. Chernikov
e362cf0e9f MFP r274553:
* Move lle creation/deletion from lla_lookup to separate functions:
  lla_lookup(LLE_CREATE) -> lla_create
  lla_lookup(LLE_DELETE) -> lla_delete
lla_create now returns with LLE_EXCLUSIVE lock for lle.
* Provide typedefs for new/existing lltable callbacks.

Reviewed by:	ae
2015-08-08 17:48:54 +00:00
Alexander V. Chernikov
331dff0737 Simplify ip[6] simploop:
Do not pass 'dst' sockaddr to ip[6]_mloopback:
  - We have explicit check for AF_INET in ip_output()
  - We assume ip header inside passed mbuf in ip_mloopback
  - We assume ip6 header inside passed mbuf in ip6_mloopback
2015-08-08 15:58:35 +00:00
Julien Charbon
079672cb07 Fix a kernel assertion issue introduced with r286227:
Avoid too strict INP_INFO_RLOCK_ASSERT checks due to
tcp_notify() being called from in6_pcbnotify().

Reported by:	Larry Rosenman <ler@lerctr.org>
Submitted by:	markj, jch
2015-08-08 08:40:36 +00:00
Mark Johnston
8f980c016b The mbuf parameter to ip_output_pfil() must be an output parameter since
pfil(9) hooks may modify the chain.

X-MFC-With:	r286028
2015-08-03 17:47:02 +00:00
Julien Charbon
ff9b006d61 Decompose TCP INP_INFO lock to increase short-lived TCP connections scalability:
- The existing TCP INP_INFO lock continues to protect the global inpcb list
  stability during full list traversal (e.g. tcp_pcblist()).

- A new INP_LIST lock protects inpcb list actual modifications (inp allocation
  and free) and inpcb global counters.

It allows to use TCP INP_INFO_RLOCK lock in critical paths (e.g. tcp_input())
and INP_INFO_WLOCK only in occasional operations that walk all connections.

PR:			183659
Differential Revision:	https://reviews.freebsd.org/D2599
Reviewed by:		jhb, adrian
Tested by:		adrian, nitroboost-gmail.com
Sponsored by:		Verisign, Inc.
2015-08-03 12:13:54 +00:00
Michael Tuexen
e7e71dd7f3 Don't take the port numbers for packets containing ABORT chunks from
a freed mbuf. Just use them from the stcb.

MFC after: 3 days
2015-08-02 16:07:30 +00:00
Andrey V. Elsukov
cf14ccb0f7 Remove unneded #include "opt_inet.h". 2015-07-31 09:02:28 +00:00
Hiren Panchasara
03041aaac8 Update snd_una description to make it more readable.
Differential Revision:	https://reviews.freebsd.org/D3179
Reviewed by:		gnn
Sponsored by:		Limelight Networks
2015-07-30 19:24:49 +00:00
Ermal Luçi
3c40232395 Avoid double reference decrement when firewalls force relooping of packets
When firewalls force a reloop of packets and the caller supplied a route the reference to the route might be reduced twice creating issues.
This is especially the scenario when a packet is looped because of operation in the firewall but the new route lookup gives a down route.

Differential Revision:	https://reviews.freebsd.org/D3037
Reviewed by:	gnn
Approved by:	gnn(mentor)
2015-07-29 20:10:36 +00:00
Ermal Luçi
d9f2a78249 ip_output normalization and fixes
ip_output has a big chunk of code used to handle special cases with pfil consumers which also forces a reloop on it.
Gather all this code together to make it readable and properly handle the reloop cases.

Some of the issues identified:

M_IP_NEXTHOP is not handled properly in existing code.
route reference leaking is possible with in FIB number change
route flags checking is not consistent in the function

Differential Revision:	https://reviews.freebsd.org/D3022
Reviewed by:	gnn
Approved by:	gnn(mentor)
MFC after:	4 weeks
2015-07-29 18:04:01 +00:00
Patrick Kelsey
4741bfcb57 Revert r265338, r271089 and r271123 as those changes do not handle
non-inline urgent data and introduce an mbuf exhaustion attack vector
similar to FreeBSD-SA-15:15.tcp, but not requiring VNETs.

Address the issue described in FreeBSD-SA-15:15.tcp.

Reviewed by:	glebius
Approved by:	so
Approved by:	jmallett (mentor)
Security:	FreeBSD-SA-15:15.tcp
Sponsored by:	Norse Corp, Inc.
2015-07-29 17:59:13 +00:00
Andrey V. Elsukov
10a0e0bf0a Eliminate the use of m_copydata() in gif_encapcheck().
ip_encap already has inspected mbuf's data, at least an IP header.
And it is safe to use mtod() and do direct access to needed fields.
Add M_ASSERTPKTHDR() to gif_encapcheck(), since the code expects that
mbuf has a packet header.
Move the code from gif_validate[46] into in[6]_gif_encapcheck(), also
remove "martian filters" checks. According to RFC 4213 it is enough to
verify that the source address is the address of the encapsulator, as
configured on the decapsulator.

Reviewed by:	melifaro
Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2015-07-29 14:07:43 +00:00