Commit Graph

4824 Commits

Author SHA1 Message Date
Gleb Smirnoff
76039bc84f The r48589 promised to remove implicit inclusion of if_var.h soon. Prepare
to this event, adding if_var.h to files that do need it. Also, include
all includes that now are included due to implicit pollution via if_var.h

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-26 17:58:36 +00:00
John Baldwin
3380883230 Finish r254925 and remove the last remaining sysctl name list macro. The
one port that used it has been fixed to use the more portable
getprotoent(3) instead.
2013-10-23 13:22:50 +00:00
Andre Oppermann
c1e5a6e5e8 The TCP delayed ACK logic isn't aware of LRO passing up large aggregated
segments thinking it received only one segment. This causes it to enable
the delay the ACK for 100ms to wait for another segment which may never
come because all the data was received already.

Doing delayed ACK for LRO segments is bogus for two reasons: a) it pushes
us further away from acking every other packet; b) it introduces additional
delay in responding to the sender.  The latter is especially bad because it
is in the nature of LRO to aggregated all segments of a burst with no more
coming until an ACK is sent back.

Change the delayed ACK logic to detect LRO segments by being larger than
the MSS for this connection and issuing an immediate ACK for them to keep
the ACK clock ticking without interruption.

Reported by:	julian, cperciva
Tested by:	cperciva
Reviewed by:	lstewart
MFC after:	3 days
2013-10-22 18:24:34 +00:00
Kevin Lo
9768475eb0 - Add parentheses to all internet addresses
- All the casts to uint32_t should be to in_addr_t

Suggested by:	bde
Reviewed by:	bde
2013-10-19 18:13:32 +00:00
Michael Tuexen
77dabf96d9 Remove a buggy comparision when setting manually the path MTU.
After fixing, the comparision would have become redundant.
Thanks to Andrew Galante for reporting the issue.

MFC after:	3 days
2013-10-15 20:21:27 +00:00
Gleb Smirnoff
7caf4ab7ac - Utilize counter(9) to accumulate statistics on interface addresses. Add
four counters to struct ifaddr. This kills '+=' on a variables shared
  between processors for every packet.
- Nuke struct if_data from struct ifaddr.
- In ip_input() do not put a reference on ifaddr, instead update statistics
  right now in place and do IN_IFADDR_RUNLOCK(). These removes atomic(9)
  for every packet. [1]
- To properly support NET_RT_IFLISTL sysctl used by getifaddrs(3), in
  rtsock.c fill if_data fields using counter_u64_fetch().
- Accidentially fix bug in COMPAT_32 version of NET_RT_IFLISTL, which
  took if_data not from the ifaddr, but from ifaddr's ifnet. [2]

Submitted by:	melifaro [1], pluknet[2]
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-15 11:37:57 +00:00
Gleb Smirnoff
4675896098 Remove ifa_init() and provide ifa_alloc() that will allocate and setup
struct ifaddr internally.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-15 10:31:42 +00:00
Gleb Smirnoff
6ed910fabe Hide 'struct ifaddr' definition from userland. Two tools left that use it,
namely ipftest(1) and ifmcstat(1). These sniff structure definition using
_WANT_IFADDR define.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-15 10:19:24 +00:00
Kevin Lo
b298a86678 Treat INADDR_NONE as uint32_t.
Reviewed by:	glebius
2013-10-15 07:35:39 +00:00
Gleb Smirnoff
c11a15bf8d When processing ACK in tcp_do_segment, use sbcut_locked() instead of
sbdrop_locked() to cut acked mbufs from the socket buffer. Free this
chain a batch manner after the socket buffer lock is dropped.

This measurably reduces contention on socket buffer.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
Approved by:	re (marius)
2013-10-09 12:00:38 +00:00
Mark Johnston
8298c17c6c Add a separate translator for headers passed to the TCP probes in the
input path. These probes get some of the fields in host order, whereas the
output probes get them in network order, so a single translator isn't
enough. This workaround ensures that the problem is essentially invisble
to users: none of the probe arguments or their fields have changed.

Approved by:	re (hrs)
2013-10-02 17:14:12 +00:00
Bjoern A. Zeeb
a5f44cd7a1 Introduce spares in the TCP syncache and timewait structures
so that fixed TCP_SIGNATURE handling can later be merged.

This is derived from follow-up work to SVN r183001 posted to
net@ on Sep 13 2008.

Approved by:	re (gjb)
2013-09-21 10:01:51 +00:00
Mikolaj Golub
4d3dfd450a Unregister inet/inet6 pfil hooks on vnet destroy.
Discussed with:	andre
Approved by:	re (rodrigc)
2013-09-13 18:45:10 +00:00
Michael Tuexen
5dc80df9c5 Fix the aborting of association with the iterator using an empty
user initiated error cause (using SCTP_ABORT|SCTP_SENDALL).

Approved by: re (delphij)
MFC after: 1 week
2013-09-09 21:40:07 +00:00
Mikolaj Golub
1f6addd92c Relese the interface in the last.
Reviewed by:	glebius
Approved by:	re (kib)
2013-09-08 18:19:40 +00:00
Michael Tuexen
d4d23375d3 When computing the partial delivery point, take the
receiver socket buffer size correctly into account.

MFC after: 1 week
2013-09-07 00:45:24 +00:00
John Baldwin
86d93a15ff Use LIST_FOREACH_SAFE() instead of doing it by hand. 2013-09-05 14:26:37 +00:00
John Baldwin
fa302f207f Use an unsigned long when indexing into mfchashtbl[] and mf6ctable[]. This
matches the types used when computing hash indices and the type of the
maximum size of mfchashtbl[].

PR:		kern/181821
Submitted by:	Sven-Thorsten Dietrich <sven@vyatta.com> (IPv4)
MFC after:	1 week
2013-09-05 14:16:37 +00:00
Andrey V. Elsukov
d983befd2f Remove unused code and sort variables declarations.
PR:		kern/181822
MFC after:	1 week
2013-09-05 08:12:36 +00:00
Michael Tuexen
0ddb429900 Remove redundant field pr_sctp_on.
MFC after: 1 week
2013-09-03 19:31:59 +00:00
Michael Tuexen
a28c9ff0b7 Use uint16_t instead of in_port_t for consistency with the SCTP code.
MFC after: 1 week
2013-09-02 23:27:53 +00:00
Michael Tuexen
e6b2b4b65b All changes affect only SCTP-AUTH:
* Remove non working code related to SHA224.
* Remove support for non-standardised HMAC-IDs using SHA384 and SHA512.
* Prefer SHA256 over SHA1.
* Minor cleanup.

MFC after: 2 weeks
2013-09-02 22:48:41 +00:00
Navdeep Parhar
7127e6acf0 Merge r254336 from user/np/cxl_tuning.
Add a last-modified timestamp to each LRO entry and provide an interface
to flush all inactive entries.  Drivers decide when to flush and what
the inactivity threshold should be.

Network drivers that process an rx queue to completion can enter a
livelock type situation when the rate at which packets are received
reaches equilibrium with the rate at which the rx thread is processing
them.  When this happens the final LRO flush (normally when the rx
routine is done) does not occur.  Pure ACKs and segments with total
payload < 64K can get stuck in an LRO entry.  Symptoms are that TCP
tx-mostly connections' performance falls off a cliff during heavy,
unrelated rx on the interface.

Flushing only inactive LRO entries works better than any of these
alternates that I tried:
- don't LRO pure ACKs
- flush _all_ LRO entries periodically (every 'x' microseconds or every
  'y' descriptors)
- stop rx processing in the driver periodically and schedule remaining
  work for later.

Reviewed by:	andre
2013-08-28 23:00:34 +00:00
John Baldwin
fd77bbb967 Remove most of the remaining sysctl name list macros. They were only
ever intended for use in sysctl(8) and it has not used them for many
years.

Reviewed by:	bde
Tested by:	exp-run by bdrewery
2013-08-26 18:16:05 +00:00
Mark Johnston
1ad19fb657 The second last argument of udp:::receive is supposed to contain the
connection state, not the IP header.

X-MFC with:	r254889
2013-08-26 00:28:57 +00:00
Mark Johnston
57f6086735 Implement the ip, tcp, and udp DTrace providers. The probe definitions use
dynamic translation so that their arguments match the definitions for
these providers in Solaris and illumos. Thus, existing scripts for these
providers should work unmodified on FreeBSD.

Tested by:	gnn, hiren
MFC after:	1 month
2013-08-25 21:54:41 +00:00
Michael Tuexen
1a94cdbea7 Provide human readable debug output. 2013-08-25 12:44:03 +00:00
Andre Oppermann
9850f95989 For now limit printf(9) %x of the 64bit pkthdr.csum_flags field to 32bits.
The upper 32bits are not occupied for now.

Sponsored by:	The FreeBSD Foundation
2013-08-25 09:49:00 +00:00
Andre Oppermann
1b4381afbb Restructure the mbuf pkthdr to make it fit for upcoming capabilities and
features.  The changes in particular are:

o Remove rarely used "header" pointer and replace it with a 64bit protocol/
  layer specific union PH_loc for local use.  Protocols can flexibly overlay
  their own 8 to 64 bit fields to store information while the packet is
  worked on.

o Mechanically convert IP reassembly, IGMP/MLD and ATM to use pkthdr.PH_loc
  instead of pkthdr.header.

o Extend csum_flags to 64bits to allow for additional future offload
  information to be carried (e.g. iSCSI, IPsec offload, and others).

o Move the RSS hash type enumerator from abusing m_flags to its own 8bit
  rsstype field.  Adjust accessor macros.

o Add cosqos field to store Class of Service / Quality of Service information
  with the packet.  It is not yet supported in any drivers but allows us to
  get on par with Cisco/Juniper in routing applications (plus MPLS QoS) with
  a modernized ALTQ.

o Add four 8 bit fields l[2-5]hlen to store the relative header offsets
  from the start of the packet.  This is important for various offload
  capabilities and to relieve the drivers from having to parse the packet
  and protocol headers to find out location of checksums and other
  information.  Header parsing in drivers is a lot of copy-paste and
  unhandled corner cases which we want to avoid.

o Add another flexible 64bit union to map various additional persistent
  packet information, like ether_vtag, tso_segsz and csum fields.
  Depending on the csum_flags settings some fields may have different usage
  making it very flexible and adaptable to future capabilities.

o Restructure the CSUM flags to better signify their outbound (down the
  stack) and inbound (up the stack) use.  The CSUM flags used to be a bit
  chaotic and rather poorly documented leading to incorrect use in many
  places.  Bring clarity into their use through better naming.
  Compatibility mappings are provided to preserve the API.  The drivers
  can be corrected one by one and MFC'd without issue.

o The size of pkthdr stays the same at 48/56bytes (32/64bit architectures).

Sponsored by:	The FreeBSD Foundation
2013-08-24 19:51:18 +00:00
Michael Tuexen
6be15a24c4 Export the inpcb features as a 64-bit entity.
Bump __FreeBSD_version to 1000048 since the
modified structure is user visible and used
by netstat, for example.
2013-08-22 20:29:57 +00:00
Michael Tuexen
06c9f9bddf Make also the features of the association 64-bit.
When exporting to xinpcb, just export the lower
32-bit. Using there also 64-bits will break the
ABI and will be committed separetly.

MFC after: 2 weeks
X-MFC with: 254248
2013-08-22 19:28:13 +00:00
Xin LI
acde2476c4 Fix an integer overflow in computing the size of a temporary buffer
can result in a buffer which is too small for the requested
operation.

Security:	CVE-2013-3077
Security:	FreeBSD-SA-13:09.ip_multicast
2013-08-22 00:51:37 +00:00
Andre Oppermann
5fc98a7895 Reorder the mbuf defines to make more sense and group related flags
together.

Add M_FLAG_PRINTF for use with printf(9) %b indentifier.

Use the generic mbuf flags print names in the net80211 code and adjust
the protocol specific bits for their new positions.

Change SCTP M_PROTO mapping from 5 to 1 to fit within the 16bit field
they use internally to store some additional information.

Discussed with:	trociny, glebius
2013-08-19 14:25:11 +00:00
Andre Oppermann
86bd049144 Add m_clrprotoflags() to clear protocol specific mbuf flags at up and
downwards layer crossings.

Consistently use it within IP, IPv6 and ethernet protocols.

Discussed with:	trociny, glebius
2013-08-19 13:27:32 +00:00
Andre Oppermann
678d7b9461 Move the SCTP specific definition of M_NOTIFICATION onto a protocol
specific mbuf flag from sys/mbuf.h to netinet/sctp_os_bsd.h.  It is
only relevant within SCTP.

Discussed with:	tuexen
2013-08-19 12:30:18 +00:00
Andre Oppermann
88388bdcbe Move the global M_SKIP_FIREWALL mbuf flags to a protocol layer specific
flag instead.  The flag is only used within the IP and IPv6 layer 3
protocols.

Because some firewall packages treat IPv4 and IPv6 packets the same the
flag should have the same value for both.

Discussed with:	trociny, glebius
2013-08-19 11:08:36 +00:00
Andre Oppermann
b09dc7e328 Move ip_reassemble()'s use of the global M_FRAG mbuf flag to a protocol layer
specific flag instead.  The flag is only relevant while the packet stays in
the IP reassembly queue.

Discussed with:	trociny, glebius
2013-08-19 10:34:10 +00:00
Andre Oppermann
fb86dfcd2f Remove unused M_FRAG, M_FIRSTFRAG and M_LASTFRAG tagging from ip_fragment().
There wasn't any real driver (and hardware) support for it.  Modern hardware
does full fragmentation/segmentation offload instead.
2013-08-19 10:30:15 +00:00
Mark Johnston
7b77e1fe0f Specify SDT probe argument types in the probe definition itself rather than
using SDT_PROBE_ARGTYPE(). This will make it easy to extend the SDT(9) API
to allow probes with dynamically-translated types.

There is no functional change.

MFC after:	2 weeks
2013-08-15 04:08:55 +00:00
Michael Tuexen
0e05fbded9 Don't send uninitialized memory (two instances of 4 bytes) in
every cookie on the wire. This bug was reported in
https://bugzilla.mozilla.org/show_bug.cgi?id=905080

MFC after: 3 days
2013-08-14 21:51:32 +00:00
Mikolaj Golub
c5c392e7ed Virtualize carp(4) variables to have per vnet control.
Reviewed by:	ae, glebius
2013-08-13 19:59:49 +00:00
Michael Tuexen
2c9c61defa Make the features a 64-bit value instead of 32-bit.
This will allow an easier integration of the support
for NDATA.
While there, do also some minor cleanups.
Obtained from:	rrs@
MFC after: 2 weeks
2013-08-12 13:52:15 +00:00
Michael Tuexen
bfd1666aad Micro-optimization suggested in
https://bugzilla.mozilla.org/show_bug.cgi?id=898234
by pchang9. While there simplify the code.

MFC after: 1 week
2013-08-01 12:05:23 +00:00
Andrey V. Elsukov
6794f46021 Remove the large part of struct ipsecstat. Only few fields of this
structure is used, but they already have equal fields in the struct
newipsecstat, that was introduced with FAST_IPSEC and then was merged
together with old ipsecstat structure.

This fixes kernel stack overflow on some architectures after migration
ipsecstat to PCPU counters.

Reported by:	Taku YAMAMOTO, Maciej Milewski
2013-07-23 14:14:24 +00:00
Michael Tuexen
88a95b1f25 Allow the code to be compiled without warnings for any combination
of INET, INET6 and SCTP_DEBUG defines.
The issue was reported by Lally Singh.

MFC after: 2 weeks
2013-07-20 13:14:59 +00:00
Michael Tuexen
da24cfcb35 Get the code compiling without INET and INET6 being defined.
This is not possible in FreeBSD, but in the upstream code.

MFC after: 2 weeks
2013-07-19 21:16:59 +00:00
Andre Oppermann
ccd040ab18 Free the non-fatal "timestamp missing" debug string manually as it is
not covered by the catch-all free for the error cases.

Found by:	Coverity
2013-07-16 16:37:08 +00:00
Mikolaj Golub
f122b319eb A complete duplication of binding should be allowed if on both new and
duplicated sockets a multicast address is bound and either
SO_REUSEPORT or SO_REUSEADDR is set.

But actually it works for the following combinations:

  * SO_REUSEPORT is set for the fist socket and SO_REUSEPORT for the new;
  * SO_REUSEADDR is set for the fist socket and SO_REUSEADDR for the new;
  * SO_REUSEPORT is set for the fist socket and SO_REUSEADDR for the new;

and fails for this:

  * SO_REUSEADDR is set for the fist socket and SO_REUSEPORT for the new.

Fix the last case.

PR:		179901
MFC after:	1 month
2013-07-12 19:08:33 +00:00
Andre Oppermann
10c982958c Unbreak VIMAGE by correctly naming the vnet pointer in struct tcp_syncache.
Reported by:	trociny, rodrigc
2013-07-12 07:43:56 +00:00
Andre Oppermann
81d392a09d Improve SYN cookies by encoding the MSS, WSCALE (window scaling) and SACK
information into the ISN (initial sequence number) without the additional
use of timestamp bits and switching to the very fast and cryptographically
strong SipHash-2-4 MAC hash algorithm to protect the SYN cookie against
forgeries.

The purpose of SYN cookies is to encode all necessary session state in
the 32 bits of our initial sequence number to avoid storing any information
locally in memory.  This is especially important when under heavy spoofed
SYN attacks where we would either run out of memory or the syncache would
fill with bogus connection attempts swamping out legitimate connections.

The original SYN cookies method only stored an indexed MSS values in the
cookie.  This isn't sufficient anymore and breaks down in the presence of
WSCALE information which is only exchanged during SYN and SYN-ACK.  If we
can't keep track of it then we may severely underestimate the available
send or receive window. This is compounded with large windows whose size
information on the TCP segment header is even lower numerically.  A number
of years back SYN cookies were extended to store the additional state in
the TCP timestamp fields, if available on a connection.  While timestamps
are common among the BSD, Linux and other *nix systems Windows never enabled
them by default and thus are not present for the vast majority of clients
seen on the Internet.

The common parameters used on TCP sessions have changed quite a bit since
SYN cookies very invented some 17 years ago.  Today we have a lot more
bandwidth available making the use window scaling almost mandatory.  Also
SACK has become standard making recovering from packet loss much more
efficient.

This change moves all necessary information into the ISS removing the need
for timestamps.  Both the MSS (16 bits) and send WSCALE (4 bits) are stored
in 3 bit indexed form together with a single bit for SACK.  While this is
significantly less than the original range, it is sufficient to encode all
common values with minimal rounding.

The MSS depends on the MTU of the path and with the dominance of ethernet
the main value seen is around 1460 bytes.  Encapsulations for DSL lines
and some other overheads reduce it by a few more bytes for many connections
seen.  Rounding down to the next lower value in some cases isn't a problem
as we send only slightly more packets for the same amount of data.

The send WSCALE index is bit more tricky as rounding down under-estimates
the available send space available towards the remote host, however a small
number values dominate and are carefully selected again.

The receive WSCALE isn't encoded at all but recalculated based on the local
receive socket buffer size when a valid SYN cookie returns.  A listen socket
buffer size is unlikely to change while active.

The index values for MSS and WSCALE are selected for minimal rounding errors
based on large traffic surveys.  These values have to be periodically
validated against newer traffic surveys adjusting the arrays tcp_sc_msstab[]
and tcp_sc_wstab[] if necessary.

In addition the hash MAC to protect the SYN cookies is changed from MD5
to SipHash-2-4, a much faster and cryptographically secure algorithm.

Reviewed by:	dwmalone
Tested by:	Fabian Keil <fk@fabiankeil.de>
2013-07-11 15:29:25 +00:00
Andre Oppermann
07dacf031e Extend debug logging of TCP timestamp related specification
violations.

Update related comments and style.
2013-07-10 12:06:01 +00:00
Michael Tuexen
e5aeb83c42 Use IPSECSTAT_INC() and IPSEC6STAT_INC() macros for ipsec statistics
accounting.

X-MFC with: r252026
2013-07-09 14:38:26 +00:00
Andrey V. Elsukov
69edf037d7 Migrate struct carpstats to PCPU counters. 2013-07-09 10:02:51 +00:00
Andrey V. Elsukov
2841260cd6 Migrate structs in6_ifstat and icmp6_ifstat to PCPU counters. 2013-07-09 09:59:46 +00:00
Andrey V. Elsukov
a786f67981 Migrate structs ip6stat, icmp6stat and rip6stat to PCPU counters. 2013-07-09 09:54:54 +00:00
Andrey V. Elsukov
5b7cb97c2b Migrate structs arpstat, icmpstat, mrtstat, pimstat and udpstat to PCPU
counters.
2013-07-09 09:50:15 +00:00
Andrey V. Elsukov
5da0521fce Use new macros to implement ipstat and tcpstat using PCPU counters.
Change interface of kread_counters() similar ot kread() in the netstat(1).
2013-07-09 09:43:03 +00:00
Andrey V. Elsukov
c80211e3cf Prepare network statistics structures for migration to PCPU counters.
Use uint64_t as type for all fields of structures.

Changed structures: ahstat, arpstat, espstat, icmp6_ifstat, icmp6stat,
in6_ifstat, ip6stat, ipcompstat, ipipstat, ipsecstat, mrt6stat, mrtstat,
pfkeystat, pim6stat, pimstat, rip6stat, udpstat.

Discussed with:	arch@
2013-07-09 09:32:06 +00:00
Michael Tuexen
ee1ccd9258 Fix a bug were only 2048 streams where usable even though more than
2048 streams were negotiated on the wire. While there, remove the
hard coded limit of 2048 streams.

MFC after: 3 days
2013-07-05 10:08:49 +00:00
Michael Tuexen
5db47b3def When processing an incoming ABORT, SHUTDOWN_COMPLETE or ERROR (NAT related)
chunk, take always the T-bit into account, when checking the verification
tag.

MFC after: 3 days
2013-07-04 19:47:46 +00:00
Mikolaj Golub
efdf104bca In r227207, to fix the issue with possible NULL inp_socket pointer
dereferencing, when checking for SO_REUSEPORT option (and SO_REUSEADDR
for multicast), INP_REUSEPORT flag was introduced to cache the socket
option.  It was decided then that one flag would be enough to cache
both SO_REUSEPORT and SO_REUSEADDR: when processing SO_REUSEADDR
setsockopt(2), it was checked if it was called for a multicast address
and INP_REUSEPORT was set accordingly.

Unfortunately that approach does not work when setsockopt(2) is called
before binding to a multicast address: the multicast check fails and
INP_REUSEPORT is not set.

Fix this by adding INP_REUSEADDR flag to unconditionally cache
SO_REUSEADDR.

PR:		179901
Submitted by:	Michael Gmelin freebsd grem.de (initial version)
Reviewed by:	rwatson
MFC after:	1 week
2013-07-04 18:38:00 +00:00
Michael Tuexen
56f778aadf Code cleanups.
MFC after: 3 days
2013-07-03 18:48:43 +00:00
Navdeep Parhar
e364d8c44a Catch up with r238990. LLE_DELETED does not clobber everything else in
la_flags since said revision.
2013-07-03 17:27:32 +00:00
Hiroki Sato
e32d93954d Fix a panic when leaving MC group in a kernel with VIMAGE enabled.
in_leavegroup() is called from an asynchronous task, and
igmp_change_state() requires that curvnet is set by the caller.
2013-07-02 16:39:12 +00:00
Lawrence Stewart
92a0637f73 Import an implementation of the CAIA Delay-Gradient (CDG) congestion control
algorithm, which is based on the 2011 v0.1 patch release and described in the
paper "Revisiting TCP Congestion Control using Delay Gradients" by David Hayes
and Grenville Armitage. It is implemented as a kernel module compatible with the
modular congestion control framework.

CDG is a hybrid congestion control algorithm which reacts to both packet loss
and inferred queuing delay. It attempts to operate as a delay-based algorithm
where possible, but utilises heuristics to detect loss-based TCP cross traffic
and will compete effectively as required. CDG is therefore incrementally
deployable and suitable for use on shared networks.

In collaboration with:	David Hayes <david.hayes at ieee.org> and
		Grenville Armitage <garmitage at swin edu au>
MFC after:	4 days
Sponsored by:	Cisco University Research Program and FreeBSD Foundation
2013-07-02 08:44:56 +00:00
Gleb Smirnoff
42a253e6a1 Fix kmod_*stat_inc() after r249276. The incorrect code actually
increased the pointer, not the memory it points to.

In collaboration with:	kib
Reported & tested by:	Ian FREISLICH <ianf clue.co.za>
Sponsored by:		Nginx, Inc.
2013-06-21 06:36:26 +00:00
Andrey V. Elsukov
6659296cb0 Use IPSECSTAT_INC() and IPSEC6STAT_INC() macros for ipsec statistics
accounting.

MFC after:	2 weeks
2013-06-20 09:55:53 +00:00
Bruce M Simpson
c91950082d Disable IGMPv3 link timers on a transition to IGMPv2.
Submitted by:	Alan Smithee
2013-06-07 17:12:08 +00:00
Andre Oppermann
3c914c547e Allow drivers to specify a maximum TSO length in bytes if they are
limited in the amount of data they can handle at once.

Drivers can set ifp->if_hw_tsomax before calling ether_ifattach() to
change the limit.

The lowest allowable size is IP_MAXPACKET / 8 (8192 bytes) as anything
less wouldn't be very useful anymore.  The upper limit is still at
IP_MAXPACKET (65536 bytes).  Raising it requires further auditing of
the IPv4/v6 code path's as the length field in the IP header would
overflow leading to confusion in firewalls and others packet handler on
the real size of the packet.

The placement into "struct ifnet" is a bit hackish but the best place
that was found.  When the stack/driver boundary is updated it should
be handled in a better way.

Submitted by:	cperciva (earlier version)
Reviewed by:	cperciva
Tested by:	cperciva
MFC after:	1 week (using spare struct members to preserve ABI)
2013-06-03 12:55:13 +00:00
Michael Tuexen
fe1831e06f Use LIST_EMPTY when appropriate.
MFC after: 1 week
2013-06-02 10:35:08 +00:00
Michael Tuexen
fb4a67d207 Remove redundant checks.
MFC after: 2 weeks
2013-05-28 09:25:58 +00:00
Michael Tuexen
3f61f926ea Withdraw http://svnweb.freebsd.org/changeset/base/250809
since the real fix is in http://svnweb.freebsd.org/changeset/base/250952.
2013-05-24 09:21:18 +00:00
Michael Tuexen
e3581df21e Initialize the fibnum for outgoing packets to 0. This avoids
crashing due to the usage of uninitialized fibnum.
This bugs became visiable after
http://svnweb.freebsd.org/changeset/base/250700

MFC after: 2 weeks
2013-05-19 16:06:43 +00:00
Michael Tuexen
553bb0688c Set errno to ETIMEDOUT if an SCTP association times out during
setup.

MFC after: 1 week
2013-05-17 22:26:05 +00:00
Michael Tuexen
b05fbf171e Don't send an ABORT chunk with verification 0.
MFC after: 1 week
2013-05-17 21:45:52 +00:00
Jim Harris
d13fc9954b Fix typo in net.inet.tcp.minmss sysctl description.
MFC after:	3 days
2013-05-13 19:55:27 +00:00
Hiroki Sato
b8992a6792 Add IFF_MONITOR support to gre(4).
Tested by:	Chip Marshall
MFC after:	1 week
2013-05-11 19:05:38 +00:00
Gleb Smirnoff
5d81d09598 Rate limit the number of remotely triggered ARP log messages
to 1 log message per second.
2013-05-11 10:51:32 +00:00
Michael Tuexen
3457ccdaea Honor the net.inet6.ip6.v6only sysctl variable and the IPV6_V6ONLY
socket option for SCTP sockets in the same way as for UDP or TCP
sockets.

MFC after: 2 weeks
2013-05-10 18:09:38 +00:00
Andre Oppermann
f89d4c3acf Back out r249318, r249320 and r249327 due to a heisenbug most
likely related to a race condition in the ipi_hash_lock with
the exact cause currently unknown but under investigation.
2013-05-06 16:42:18 +00:00
Hiroki Sato
5df1b6b57e Use FF02:0:0:0:0:2:FF00::/104 prefix for IPv6 Node Information Group
Address.  Although KAME implementation used FF02:0:0:0:0:2::/96 based on
older versions of draft-ietf-ipngwg-icmp-name-lookup, it has been changed
in RFC 4620.

The kernel always joins the /104-prefixed address, and additionally does
/96-prefixed one only when net.inet6.icmp6.nodeinfo_oldmcprefix=1.
The default value of the sysctl is 1.

ping6(8) -N flag now uses /104-prefixed one.  When this flag is specified
twice, it uses /96-prefixed one instead.

Reviewed by:		ume
Based on work by:	Thomas Scheffler
PR:			conf/174957
MFC after:		2 weeks
2013-05-04 19:16:26 +00:00
Colin Percival
76089c9511 Move IPPROTO_IPV6 from #ifdef __BSD_VISIBLE to #if __POSIX_VISIBLE >= 201112
since POSIX 2001 states that it shall be defined.

Reported by:	sbruno
Reviewed by:	jilles
MFC after:	1 week
2013-04-27 23:36:01 +00:00
Gleb Smirnoff
47e8d432d5 Add const qualifier to the dst parameter of the ifnet if_output method. 2013-04-26 12:50:32 +00:00
Gleb Smirnoff
414676ba31 Fix couple of mbuf leaks in incoming ARP processing. 2013-04-25 17:38:04 +00:00
Gleb Smirnoff
4c7a605968 Introduce a pointer to const variable gw, which points either at the
same place as dst, or to the sockaddr in the routing table.

The const constraint of gw makes us safe from modifing routing table
accidentially. And "onstantness" of dst allows us to remove several
bandaids, when we switched it back at &ro->ro_dst, now it always
points there.

Reviewed by:	rrs
2013-04-25 12:42:09 +00:00
Randall Stewart
0be23a54cf This fixes the issue with the "randomly changing" default
route. What it was is there are two places in ip_output.c
where we do a goto again. One place was fine, it
copies out the new address and then resets dst = ro->rt_dst;
But the other place does *not* do that, which means earlier
when we found the gateway, we have dst pointing there
aka dst = ro->rt_gateway is done.. then we do a
goto again.. bam now we clobber the default route.

The fix is just to move the again so we are always
doing dst = &ro->rt_dst; in the again loop.

PR:	 174749,157796
MFC after:	1 week
2013-04-24 18:30:32 +00:00
Andre Oppermann
5628dd0893 When doing RFC3042 limited transmit on the first on second
duplicate ACK make sure we actually have new data to send.
This prevents us from sending unneccessary pure ACKs.

Reported by:	Matt Miller <matt@matthewjmiller.net>
Tested by:	Matt Miller <matt@matthewjmiller.net>
MFC after:	2 weeks
2013-04-23 14:06:32 +00:00
Oleg Bulyzhin
1571132f14 Plug static llentry leak (ipv4 & ipv6 were affected).
PR:		kern/172985
MFC after:	1 month
2013-04-21 21:28:38 +00:00
Gabor Kovesdan
8fb3bbe770 - Corrrect mispellings of word useful
Submitted by:	Christoph Mallon <christoph.mallon@gmx.de> (via private mail)
2013-04-17 11:45:15 +00:00
Xin LI
f2297451fe Fix incomplete printf.
PR:		kern/177889
Submitted by:	Sven-Thorsten Dietrich <sven vyatta com>
MFC after:	1 week
2013-04-16 19:32:12 +00:00
Xin LI
c1031303f0 Don't leak lock when returning.
PR:		kern/177888
Submitted by:	Sven-Thorsten Dietrich <sven vyatta com>
MFC after:	1 week
2013-04-16 19:25:41 +00:00
Andrey V. Elsukov
e3389419ef Reflect removing of the counter_u64_subtract() function in the macro. 2013-04-12 16:29:15 +00:00
Gleb Smirnoff
0e2bc05c47 Fix tcp_output() so that tcpcb is updated in the same manner when an
mbuf allocation fails, as in a case when ip_output() returns error.

To achieve that, move large block of code that updates tcpcb below
the out: label.

This fixes a panic, that requires the following sequence to happen:

1) The SYN was sent to the network, tp->snd_nxt = iss + 1, tp->snd_una = iss
2) The retransmit timeout happened for the SYN we had sent,
   tcp_timer_rexmt() sets tp->snd_nxt = tp->snd_una, and calls tcp_output().
   In tcp_output m_get() fails.
3) Later on the SYN|ACK for the SYN sent in step 1) came,
   tcp_input sets tp->snd_una += 1, which leads to
   tp->snd_una > tp->snd_nxt inconsistency, that later panics in
   socket buffer code.

For reference, this bug fixed in DragonflyBSD repo:

http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/1ff9b7d322dc5a26f7173aa8c38ecb79da80e419

Reviewed by:	andre
Tested by:	pho
Sponsored by:	Nginx, Inc.
PR:		kern/177456
Submitted by:	HouYeFei&XiBoLiu <lglion718 163.com>
2013-04-11 18:23:56 +00:00
Gleb Smirnoff
18ba072a22 Fix build. 2013-04-10 08:09:25 +00:00
Andre Oppermann
e8b3186b6a Change certain heavily used network related mutexes and rwlocks to
reside on their own cache line to prevent false sharing with other
nearby structures, especially for those in the .bss segment.

NB: Those mutexes and rwlocks with variables next to them that get
changed on every invocation do not benefit from their own cache line.
Actually it may be net negative because two cache misses would be
incurred in those cases.
2013-04-09 21:02:20 +00:00
Andre Oppermann
982c1675ff Fix a race condition on tcp listen socket teardown with pending
connections in the accept queue and contiguous new incoming SYNs.

Compared to the original submitters patch I've moved the test
next to the SYN handling to have it together in a logical unit
and reworded the comment explaining the issue.

Submitted by:	Matt Miller <matt@matthewjmiller.net>
Submitted by:	Juan Mojica <jmojica@gmail.com>
Reviewed by:	Matt Miller (changes)
Tested by:	pho
MFC after:	1 week
2013-04-09 20:52:26 +00:00
Gleb Smirnoff
4a21e86ec1 Fix VIMAGE build. 2013-04-09 09:15:26 +00:00
Andrey V. Elsukov
9cb8d207af Use IP6STAT_INC/IP6STAT_DEC macros to update ip6 stats.
MFC after:	1 week
2013-04-09 07:11:22 +00:00
Gleb Smirnoff
5923c29332 Merge from projects/counters: TCP/IP stats.
Convert 'struct ipstat' and 'struct tcpstat' to counter(9).

  This speeds up IP forwarding at extreme packet rates, and
makes accounting more precise.

Sponsored by:	Nginx, Inc.
2013-04-08 19:57:21 +00:00
Michael Tuexen
ebae998767 Add a macro for checking for IPv4 link local addresses.
MFC after: 1 week
2013-03-31 18:27:46 +00:00
Ed Maste
ce7ad6640c Keep fwd_tag around for subsequent pcb lookups
For TIMEWAIT handling tcp_input may have to jump back for an additional
pass through pcblookup.  Prior to this change the fwd_tag had been
discarded after the first lookup, so a new connection attempt delivered
locally via 'ipfw fwd' would fail to find a match.

As of r248886 the tag will be detached and freed when passed to the
socket buffer.
2013-03-29 20:51:44 +00:00
Alexander V. Chernikov
ae01d73c04 Add ipfw support for setting/matching DiffServ codepoints (DSCP).
Setting DSCP support is done via O_SETDSCP which works for both
IPv4 and IPv6 packets. Fast checksum recalculation (RFC 1624) is done for IPv4.
Dscp can be specified by name (AFXY, CSX, BE, EF), by value
(0..63) or via tablearg.

Matching DSCP is done via another opcode (O_DSCP) which accepts several
classes at once (af11,af22,be). Classes are stored in bitmask (2 u32 words).

Many people made their variants of this patch, the ones I'm aware of are
(in alphabetic order):

Dmitrii Tejblum
Marcelo Araujo
Roman Bogorodskiy (novel)
Sergey Matveichuk (sem)
Sergey Ryabin

PR:		kern/102471, kern/121122
MFC after:	2 weeks
2013-03-20 10:35:33 +00:00
Gleb Smirnoff
7525c48111 In m_megapullup() instead of reserving some space at the end of packet,
m_align() it, reserving space to prepend data.

Reviewed by:	mav
2013-03-17 07:37:10 +00:00
Gleb Smirnoff
aa8bd99d99 - Replace compat macros with function calls. 2013-03-16 08:58:28 +00:00
Gleb Smirnoff
3c26f4a9bc We can, and should use M_WAITOK here.
Sponsored by:	Nginx, Inc.
2013-03-15 13:10:06 +00:00
Gleb Smirnoff
dc4ad05ecd Use m_get/m_gethdr instead of compat macros.
Sponsored by:	Nginx, Inc.
2013-03-15 12:55:30 +00:00
Gleb Smirnoff
39f6074e2e - Use m_getcl() instead of hand allocating.
Sponsored by:	Nginx, Inc.
2013-03-15 12:53:53 +00:00
Gleb Smirnoff
41a7572b26 Functions m_getm2() and m_get2() have different order of arguments,
and that can drive someone crazy. While m_get2() is young and not
documented yet, change its order of arguments to match m_getm2().

Sorry for churn, but better now than later.
2013-03-12 13:42:47 +00:00
Gleb Smirnoff
f4562a299c Remove LIBALIAS_LOCK_ASSERT(), including a couple with an uninitialzed
argument, in code that isn't compiled in kernel.

PR:		kern/176667
Sponsored by:	Nginx, Inc.
2013-03-11 12:22:44 +00:00
Lawrence Stewart
1e0e83d760 The hashmask returned by hashinit() is a valid index in the returned hash array.
Fix a siftr(4) potential memory leak and INVARIANTS triggered kernel panic in
hashdestroy() by ensuring the last array index in the flow counter hash table is
flushed of entries.

MFC after:	3 days
2013-03-07 04:42:20 +00:00
Davide Italiano
5b999a6be0 - Make callout(9) tickless, relying on eventtimers(4) as backend for
precise time event generation. This greatly improves granularity of
callouts which are not anymore constrained to wait next tick to be
scheduled.
- Extend the callout KPI introducing a set of callout_reset_sbt* functions,
which take a sbintime_t as timeout argument. The new KPI also offers a
way for consumers to specify precision tolerance they allow, so that
callout can coalesce events and reduce number of interrupts as well as
potentially avoid scheduling a SWI thread.
- Introduce support for dispatching callouts directly from hardware
interrupt context, specifying an additional flag. This feature should be
used carefully, as long as interrupt context has some limitations
(e.g. no sleeping locks can be held).
- Enhance mechanisms to gather informations about callwheel, introducing
a new sysctl to obtain stats.

This change breaks the KBI. struct callout fields has been changed, in
particular 'int ticks' (4 bytes) has been replaced with 'sbintime_t'
(8 bytes) and another 'sbintime_t' field was added for precision.

Together with:	mav
Reviewed by:	attilio, bde, luigi, phk
Sponsored by:	Google Summer of Code 2012, iXsystems inc.
Tested by:	flo (amd64, sparc64), marius (sparc64), ian (arm),
		markj (amd64), mav, Fabian Keil
2013-03-04 11:09:56 +00:00
Michael Tuexen
e045904fdc Fix a potential race in returning setting errno when an
association goes down.
Reported by Mozilla in
https://bugzilla.mozilla.org/show_bug.cgi?id=845513

MFC after: 3 days
2013-02-27 19:51:47 +00:00
Andrew Gallatin
e5ca1ffab5 Fix tcp_lro_rx_ipv4() for drivers that do not set CSUM_IP_CHECKED.
Specifcially, in_cksum_hdr() returns 0 (not 0xffff) when the IPv4
checksum is correct. Without this fix, the tcp_lro code will reject
good IPv4 traffic from drivers that do not implement IPv4 header
harder csum offload.

Sponsored by: Myricom Inc.

MFC after:	7 days
2013-02-21 17:00:35 +00:00
Sergey Kandaurov
46f2df9c13 ip_savecontrol() style fixes. No functional changes.
- fix indentation
- put the operator at the end of the line for long statements
- remove spaces between the type and the variable in a cast
- remove excessive parentheses

Tested by:	md5
2013-02-20 15:44:40 +00:00
Michael Tuexen
2416af26a0 Send the adaptation layer indication only if set by the user.
MFC after: 3 days
Discussed with: rrs
2013-02-11 21:02:49 +00:00
Michael Tuexen
c53f854a17 Don't send kernel provided information in the User Initiated
ABORT cause, since the user can also provide this kind of
information. So the receiver doesn't know who provided the
information.
While there: Fix a bug where the stack would send a malformed
ABORT chunk when using a send() call with SCTP_ABORT|SCT_SENDALL
flags.

MFC after: 3 days
2013-02-11 13:57:03 +00:00
Gleb Smirnoff
24421c1c32 Resolve source address selection in presense of CARP. Add a couple
of helper functions:

- carp_master()   - boolean function which is true if an address
		    is in the MASTER state.
- ifa_preferred() - boolean function that compares two addresses,
		    and is aware of CARP.

  Utilize ifa_preferred() in ifa_ifwithnet().

  The previous version of patch also changed source address selection
logic in jails using carp_master(), but we failed to negotiate this part
with Bjoern. May be we will approach this problem again later.

Reported & tested by:	Anton Yuzhaninov <citrin citrin.ru>
Sponsored by:		Nginx, Inc
2013-02-11 10:58:22 +00:00
Michael Tuexen
f0d44a49a0 Make sure that received packets for removed addresses are handled
consistently. While there, make variable names consistent.

MFC after: 3 days
2013-02-10 19:57:19 +00:00
Michael Tuexen
a1cb341b5d Cleanup the handling of address scopes. Announce in the INIT/INIT-ACK
only the supported address types. While there, do some whitespace
cleanups.

MFC after: 1 week
2013-02-09 17:26:14 +00:00
Michael Tuexen
c39cfa1f7e Fix a bug where HEARTBEATs were still sent in SHUTDOWN_SENT or
SHUTDOWN_ACK_SENT state. While there, make the corresponding
code consistent.

MFC after: 1 week
2013-02-09 08:27:08 +00:00
John Baldwin
0d25fab44d Add placeholder constants to reserve a portion of the socket option
name space for use by downstream vendors to add custom options.

MFC after:	2 weeks
2013-02-01 15:32:20 +00:00
Andre Oppermann
cda3447bb0 uma_zone_set_max() directly returns the rounded effective zone
limit.  Use the return value directly instead of doing a second
uma_zone_set_max() step.

MFC after:	1 week
2013-02-01 14:21:09 +00:00
Gleb Smirnoff
498944374f - Move AUTHORS and ACKNOWLEDGEMENTS to the end of the page.
- Add myself to list of authors.
2013-01-31 10:29:22 +00:00
Gleb Smirnoff
9711a168b9 Retire struct sockaddr_inarp.
Since ARP and routing are separated, "proxy only" entries
don't have any meaning, thus we don't need additional field
in sockaddr to pass SIN_PROXY flag.

New kernel is binary compatible with old tools, since sizes
of sockaddr_inarp and sockaddr_in match, and sa_family are
filled with same value.

The structure declaration is left for compatibility with
third party software, but in tree code no longer use it.

Reviewed by:	ru, andre, net@
2013-01-31 08:55:21 +00:00
Gleb Smirnoff
ea26ed7eea Utilize m_get2() to get mbuf of appropriate size. 2013-01-30 18:40:19 +00:00
Navdeep Parhar
adfaf8f6ad Add checks for SO_NO_OFFLOAD in a couple of places that I missed earlier
in r245915.
2013-01-26 01:41:42 +00:00
Navdeep Parhar
20be068c8a Teach toe_l2_resolve to resolve IPv6 destinations too.
Reviewed by:	bz@
2013-01-26 00:57:29 +00:00
Navdeep Parhar
4364ec0852 Move lle_event to if_llatbl.h
lle_event replaced arp_update_event after the ARP rewrite and ended up
in if_ether.h simply because arp_update_event used to be there too.
IPv6 neighbor discovery is going to grow lle_event support and this is a
good time to move it to if_llatbl.h.

The two in-tree consumers of this event - OFED and toecore - are not
affected.

Reviewed by:	bz@
2013-01-25 23:58:21 +00:00
Navdeep Parhar
460cf046c2 There is no need to call into the TOE driver twice in pru_rcvd (tod_rcvd
and then tod_output right after that).

Reviewed by:	bz@
2013-01-25 22:50:52 +00:00
Navdeep Parhar
464dfeb43f Add TCP_OFFLOAD hook in syncache_respond for IPv6 too, just like the one
that exists for IPv4.

Reviewed by:	bz@
2013-01-25 22:16:35 +00:00
Navdeep Parhar
b218348bc3 Teach toe_4tuple_check() to deal with IPv6 4-tuples too.
Reviewed by:	bz@
2013-01-25 20:45:24 +00:00
Navdeep Parhar
37cc0ecb1b Heed SO_NO_OFFLOAD.
MFC after:	1 week
2013-01-25 20:23:33 +00:00
Navdeep Parhar
5cd3dcaa25 Remove redundant test, we know inp_lport is 0.
MFC after:	1 week
2013-01-25 20:14:27 +00:00
John Baldwin
1d77fa5a26 Use decimal values for UDP and TCP socket options rather than hex to avoid
implying that these constants should be treated as bit masks.

Reviewed by:	net
MFC after:	1 week
2013-01-22 19:45:04 +00:00
Lawrence Stewart
5b648e797b Simplify and fix a bug in cc_ack_received()'s "are we congestion window limited"
logic (refer to [1] for associated discussion). snd_cwnd and snd_wnd are
unsigned long and on 64 bit hosts, min() will truncate them to 32 bits and could
therefore potentially corrupt the result (although under normal operation,
neither variable should legitmately exceed 32 bits).

[1] http://lists.freebsd.org/pipermail/freebsd-net/2013-January/034297.html

Submitted by:	jhb
MFC after:	1 week
2013-01-22 09:44:21 +00:00
John Baldwin
6c0ef8957f Don't drop options from the third retransmitted SYN by default. If the
SYNs (or SYN/ACK replies) are dropped due to network congestion, then the
remote end of the connection may act as if options such as window scaling
are enabled but the local end will think they are not.  This can result in
very slow data transfers in the case of window scaling disagreements.

The old behavior can be obtained by setting the
net.inet.tcp.rexmit_drop_options sysctl to a non-zero value.

Reviewed by:	net@
MFC after:	2 weeks
2013-01-09 20:27:06 +00:00
Peter Wemm
8a1163e82f Temporarily revert rev 244678. This is causing loopback problems with
the lo (loopback) interfaces.
2013-01-03 10:21:28 +00:00
Michael Tuexen
11e03b3200 Some cleanups.
MFC after: 3 days
2012-12-27 08:10:58 +00:00
Michael Tuexen
72c123a8b4 Minor cleanups of debug messages.
MFC after: 3 days
2012-12-27 08:06:58 +00:00
Michael Tuexen
2c2e3218cb Fix a copy and paste error.
MFC after: 3 days
2012-12-27 08:02:58 +00:00
Gleb Smirnoff
c4d0697685 Garbage collect carp_cksum(). 2012-12-25 14:29:38 +00:00
Gleb Smirnoff
7951008b47 Change net.inet.carp.demotion sysctl to add the supplied value
to the current demotion factor instead of assigning it.

  This allows external scripts to control demotion factor together
with kernel in a raceless manner.
2012-12-25 14:08:13 +00:00
Gleb Smirnoff
e8db9937f3 Fix sysctl_handle_int() usage. Either arg1 or arg2 should be supplied,
and arg2 doesn't pass size of arg1.
2012-12-25 13:55:21 +00:00
Gleb Smirnoff
468e45f3bd The SIOCSIFFLAGS ioctl handler runs if_up()/if_down() that notify
all interested parties in case if interface flag IFF_UP has changed.

  However, not only SIOCSIFFLAGS can raise the flag, but SIOCAIFADDR
and SIOCAIFADDR_IN6 can, too. The actual |= is done not in the protocol
code, but in code of interface drivers. To fix this historical layering
violation, we will check whether ifp->if_ioctl(SIOCSIFADDR) raised the
IFF_UP flag, and if it did, run the if_up() handler.

  This fixes configuring an address under CARP control on an interface
that was initially !IFF_UP.

P.S. I intentionally omitted handling the IFF_SMART flag. This flag was
never ever used in any driver since it was introduced, and since it
means another layering violation, it should be garbage collected instead
of pretended to be supported.
2012-12-25 13:01:58 +00:00
Gleb Smirnoff
3e6c8b5366 Minor style(9) changes:
- Remove declaration in initializer.
- Add empty line between logical blocks.
2012-12-24 21:35:48 +00:00
Gleb Smirnoff
b8056fae06 Fix !INET6 build after r244365. 2012-12-18 08:14:16 +00:00
Gleb Smirnoff
dd029d52fa Clear correct flag in INET6 case. 2012-12-18 08:09:44 +00:00
Andrey V. Elsukov
f491274582 Since we use different flags to detect tcp forwarding, and we share the
same code for IPv4 and IPv6 in tcp_input, we should check both
M_IP_NEXTHOP and M_IP6_NEXTHOP flags.

MFC after:	3 days
2012-12-17 20:55:33 +00:00
Gleb Smirnoff
b1ec2940af Fix problem in r238990. The LLE_LINKED flag should be tested prior to
entering llentry_free(), and in case if we lose the race, we should simply
perform LLE_FREE_LOCKED(). Otherwise, if the race is lost by the thread
performing arptimer(), it will remove two references from the lle instead
of one.

Reported by:	Ian FREISLICH <ianf clue.co.za>
2012-12-13 11:11:15 +00:00
Gleb Smirnoff
78a7880f64 Fix a crash in tcp_input(), that happens when mbuf has a fwd_tag on it,
but later after processing and freeing the tag, we need to jump back again
to the findpcb label. Since the fwd_tag pointer wasn't NULL we tried to
process and free the tag for second time.

Reported & tested by:	Pawel Tyll <ptyll nitronet.pl>
MFC after:		3 days
2012-12-12 17:41:21 +00:00
Michael Tuexen
cca6f4a8f3 Get it compiling without INET and INET6 support (mainly userland stack).
MFC after: 2 weeks
2012-12-08 15:11:09 +00:00
Pawel Jakub Dawidek
6acd596efb More warnings for zones that depend on the kern.ipc.maxsockets limit.
Obtained from:	WHEEL Systems
2012-12-08 12:51:06 +00:00
Michael Tuexen
b11f07d86c Use correct padding of the ABORT chunk in case of an user initiated
abort cause is used.

MFC after: 2 weeks
2012-12-08 09:50:38 +00:00
Michael Tuexen
3fb7827628 Ensure that the padding of the last parameter of an INIT chunk
is not included in the chunk length as required by RFC 4960.
While there, cleanup sctp_send_initiate().

MFC after: 2 weeks
2012-12-08 08:22:33 +00:00
Gleb Smirnoff
eb1b1807af Mechanically substitute flags from historic mbuf allocator with
malloc(9) flags within sys.

Exceptions:

- sys/contrib not touched
- sys/mbuf.h edited manually
2012-12-05 08:04:20 +00:00
Andre Oppermann
da2299c5c7 Remove unused and unnecessary CSUM_IP_FRAGS checksumming capability.
Checksumming the IP header of fragments is no different from doing
normal IP headers.

Discussed with:	yongari
MFC after:	1 week
2012-11-27 19:31:49 +00:00
Andre Oppermann
13feab8286 Add DELACK to list of timers.
MFC after:	1 week
2012-11-27 19:07:28 +00:00
Navdeep Parhar
825fd1e437 Make sure that tcp_timer_activate() correctly sees TCP_OFFLOAD (or not). 2012-11-27 06:42:44 +00:00
Alfred Perlstein
08373e0bc4 Auto size the tcbhashsize structure based on max sockets.
While here, also make the code that enforces power-of-two more
forgiving, instead of just resetting to 512, graciously round-down
to the next lower power of two.
2012-11-27 03:04:24 +00:00
Michael Tuexen
a50f0e3152 Add support for sctp_peeloff() also in the front states of the
association.

MFC after: 3 days
2012-11-26 16:44:03 +00:00
Michael Tuexen
e3976bb8d7 Find the endpoint for an incoming packet also if the endpoint
comes from sctp_peeloff().

MFC after: 3 days
2012-11-26 16:43:32 +00:00
Michael Tuexen
440da2d35b Allow shutdown() to be used on fds returned from sctp_peeloff().
MFC after: 3 days
2012-11-26 08:50:00 +00:00
Michael Tuexen
a3158782c2 Remove unused function.
MFC after: 1 week
2012-11-25 14:25:08 +00:00
Michael Tuexen
3a51a2647a Add support for SCTP/UDP/IPV6.
This completes the support of
http://tools.ietf.org/html/draft-ietf-tsvwg-sctp-udp-encaps

MFC after: 1 week
2012-11-17 20:04:04 +00:00
Michael Tuexen
325c8c46b1 Get the accounting working. We now have counters how many
chunks for each SCTP outgoing stream are in the send and
sent queue.
While there, improve the naming of NR-SACK related constants
recently introduced.

MFC after: 1 week
2012-11-16 19:39:10 +00:00
Roman Divacky
8252626fb4 Initialize hdrlen to 0 to avoid clang warning in NOINET case. 2012-11-10 10:41:00 +00:00
Bjoern A. Zeeb
ec89d0398b Cleanup some whitspace in this file to get it out of an upcoming patch.
MFC after:	10 days
2012-11-08 03:29:55 +00:00
Michael Tuexen
a7ad6026e0 Add per outgoing stream accounting for chunks in the send
and sent queue. This provides no functional change, but is
a preparation for an upcoming stream reset improvement.
Done with rrs@.

MFC after: 1 week
2012-11-07 22:11:38 +00:00
Michael Tuexen
2a4985847a Add some missing changes missed in the last commit.
MFC after: 1 week
X-MFC with: 242708
2012-11-07 21:25:32 +00:00
Michael Tuexen
98f2956c11 Improve PR-SCTP if used in combination with NR-SACK.
Based on work done by Mohammad Rajiullah.

MFC after: 1 week
2012-11-07 20:59:00 +00:00
Kevin Lo
0f5e7edc14 Fix typo; s/ouput/output 2012-11-07 07:00:59 +00:00
Mateusz Guzik
8e1e6e5f4a Fix possible spurious sbunlock in sctp_sorecvmsg.
Reviewed by:	tuexen
Approved by:	trasz (mentor)
MFC after:	3 days
2012-11-06 23:04:23 +00:00
Michael Tuexen
f3b05218ea Move from early SSN assignment to late SSN assignment.
This doesn't change functionality, but makes upcoming change
much easier.
Developed with rrs@ at the IETF 85.

MFC after: 1 week
2012-11-05 20:55:17 +00:00
Andre Oppermann
60ee3bb213 Back out r242262. The simplified window change/update logic wasn't
complete and ready for production use.

PR:	kern/173309
2012-11-05 09:13:06 +00:00
Andrey V. Elsukov
ffdbf9da3b Remove the recently added sysctl variable net.pfil.forward.
Instead, add protocol specific mbuf flags M_IP_NEXTHOP and
M_IP6_NEXTHOP. Use them to indicate that the mbuf's chain
contains the PACKET_TAG_IPFORWARD tag. And do a tag lookup
only when this flag is set.

Suggested by:	andre
2012-11-02 01:20:55 +00:00
Michael Tuexen
21f67da7c4 Whitespace changes due to upstream integration of SCTP changes in the
FreeBSD code base.
2012-10-29 20:47:32 +00:00
Michael Tuexen
24d4ce2c87 Add braces (as used elsewhere in the SCTP code). 2012-10-29 20:44:29 +00:00
Michael Tuexen
09c1c8563a Use ntohs() and htons() in correct order. However, this doesn't change
functionality.
2012-10-29 20:42:48 +00:00
Andre Oppermann
78f59b4bfd Forced commit to provide the correct commit message to r242251:
Defer sending an independent window update if a delayed ACK is pending
  saving a packet.  The window update then gets piggy-backed on the next
  already scheduled ACK.

Added grammar fixes as well.

MFC after:	2 weeks
2012-10-29 13:16:33 +00:00
Andre Oppermann
8d045dbdf3 Define the delayed ACK timeout value directly as hz/10 instead of
obfuscating it by going through PR_FASTHZ.  No functional change.

MFC after:	2 weeks
2012-10-29 12:17:02 +00:00
Andre Oppermann
322181c98e If the user has closed the socket then drop a persisting connection
after a much reduced timeout.

Typically web servers close their sockets quickly under the assumption
that the TCP connections goes away as well.  That is not entirely true
however.  If the peer closed the window we're going to wait for a long
time with lots of data in the send buffer.

MFC after:	2 weeks
2012-10-28 19:58:20 +00:00
Andre Oppermann
09440655fe Increase the initial CWND to 10 segments as defined in IETF TCPM
draft-ietf-tcpm-initcwnd-05. It explains why the increased initial
window improves the overall performance of many web services without
risking congestion collapse.

As long as it remains a draft it is placed under a sysctl marking it
as experimental:
 net.inet.tcp.experimental.initcwnd10 = 1
When it becomes an official RFC soon the sysctl will be changed to
the RFC number and moved to net.inet.tcp.

This implementation differs from the RFC draft in that it is a bit
more conservative in the case of packet loss on SYN or SYN|ACK because
we haven't reduced the default RTO to 1 second yet.  Also the restart
window isn't yet increased as allowed.  Both will be adjusted with
upcoming changes.

Is is enabled by default.  In Linux it is enabled since kernel 3.0.

MFC after:	2 weeks
2012-10-28 19:47:46 +00:00
Andre Oppermann
77339e1cdc Update comment to reflect the change made in r242263.
MFC after:	2 weeks
2012-10-28 19:22:18 +00:00
Andre Oppermann
c4ab59c1a1 Add SACK_PERMIT to the list of TCP options that are switched off after
retransmitting a SYN three times.

MFC after:	2 weeks
2012-10-28 19:20:23 +00:00
Andre Oppermann
79ce26a08c Simplify and enhance the window change/update acceptance logic,
especially in the presence of bi-directional data transfers.

snd_wl1 tracks the right edge, including data in the reassembly
queue, of valid incoming data.  This makes it like rcv_nxt plus
reassembly.  It never goes backwards to prevent older, possibly
reordered segments from updating the window.

snd_wl2 tracks the left edge of sent data.  This makes it a duplicate
of snd_una.  However joining them right now is difficult due to
separate update dependencies in different places in the code flow.

snd_wnd tracks the current advertized send window by the peer.  In
tcp_output() the effective window is calculated by subtracting the
already in-flight data, snd_nxt less snd_una, from it.

ACK's become the main clock of window updates and will always update
the window when the left edge of what we sent is advanced.  The ACK
clock is the primary signaling mechanism in ongoing data transfers.
This works reliably even in the presence of reordering, reassembly
and retransmitted segments.  The ACK clock is most important because
it determines how much data we are allowed to inject into the network.

Zero window updates get us out of persistence mode are crucial.  Here
a segment that neither moves ACK nor SEQ but enlarges WND is accepted.

When the ACK clock is not active (that is we're not or no longer
sending any data) any segment that moves the extended right SEQ edge,
including out-of-order segments, updates the window.  This gives us
updates especially during ping-pong transfers where the peer isn't
done consuming the already acknowledged data from the receive buffer
while responding with data.

The SSH protocol is a prime candidate to benefit from the improved
bi-directional window update logic as it has its own windowing
mechanism on top of TCP and is frequently sending back protocol ACK's.

Tcpdump provided by:	darrenr
Tested by:	darrenr
MFC after:	2 weeks
2012-10-28 19:16:22 +00:00
Andre Oppermann
024fd5b6bb For retransmits of SYN|ACK from the syncache use the slightly more
aggressive special tcp_syn_backoff[] retransmit schedule instead of
the normal tcp_backoff[] schedule for established connections.

MFC after:	2 weeks
2012-10-28 19:02:07 +00:00
Andre Oppermann
f4748ef5fb When retransmitting SYN in TCPS_SYN_SENT state use TCPTV_RTOBASE,
the default retransmit timeout, as base to calculate the backoff
time until next try instead of the TCP_REXMTVAL() macro which only
works correctly when we already have measured an actual RTT+RTTVAR.

Before it would cause the first retransmit at RTOBASE, the next
four at the same time (!) about 200ms later, and then another one
again RTOBASE later.

MFC after:	2 weeks
2012-10-28 18:56:57 +00:00
Andre Oppermann
602e8e45ee Remove bogus 'else' in #ifdef that prevented the rttvar from being reset
tcp_timer_rexmt() on retransmit for IPv6 sessions.

MFC after:	2 weeks
2012-10-28 18:45:04 +00:00
Andre Oppermann
4faaea5505 Allow arbitrary MSS sizes and don't mind about the cluster size anymore.
We've got more cluster sizes for quite some time now and the orginally
imposed limits and the previously codified thoughts on efficiency gains
are no longer true.

MFC after:	2 weeks
2012-10-28 18:33:52 +00:00
Andre Oppermann
f3a10d7954 Change the syncache count reporting the current number of entries
from an unprotected u_int that reports garbage on SMP to a function
based sysctl obtaining the current value from UMA.

Also read back the actual cache_limit after page size rounding by UMA.

PR:		kern/165879
MFC after:	2 weeks
2012-10-28 18:07:34 +00:00
Andre Oppermann
aafa0b4164 Simplify implementation of net.inet.tcp.reass.maxsegments and
net.inet.tcp.reass.cursegments.

MFC after:	2 weeks
2012-10-28 17:59:46 +00:00
Andre Oppermann
f62563d33c Prevent a flurry of forced window updates when an application is
doing small reads on a (partially) filled receive socket buffer.

Normally one would a send a window update every time the available
space in the socket buffer increases by two times MSS.  This leads
to a flurry of window updates that do not provide any meaningful
new information to the sender.  There still is available space in
the window and the sender can continue sending data.  All window
updates then get carried by the regular ACKs.  Only when the socket
buffer was (almost) full and the window closed accordingly a window
updates delivery new information and allows the sender to start
sending more data again.

Send window updates only every two MSS when the socket buffer
has less than 1/8 space available, or the available space in the
socket buffer increased by 1/4 its full capacity, or the socket
buffer is very small.  The next regular data ACK will carry and
report the exact window size again.

Reported by:	sbruno
Tested by:	darrenr
Tested by:	Darren Baginski
PR:		kern/116335
MFC after:	2 weeks
2012-10-28 17:40:35 +00:00
Andre Oppermann
4249614cb0 When SYN or SYN/ACK had to be retransmitted RFC5681 requires us to
reduce the initial CWND to one segment.  This reduction got lost
some time ago due to a change in initialization ordering.

Additionally in tcp_timer_rexmt() avoid entering fast recovery when
we're still in TCPS_SYN_SENT state.

MFC after:	2 weeks
2012-10-28 17:30:28 +00:00
Andre Oppermann
cf8f04f4c0 When SYN or SYN/ACK had to be retransmitted RFC5681 requires us to
reduce the initial CWND to one segment.  This reduction got lost
some time ago due to a change in initialization ordering.

Additionally in tcp_timer_rexmt() avoid entering fast recovery when
we're still in TCPS_SYN_SENT state.

MFC after:	2 weeks
2012-10-28 17:25:08 +00:00
Andre Oppermann
22efabd40c Adjust the initial default CWND upon connection establishment to the
new and increased values specified by RFC5681 Section 3.1.

The even larger initial CWND per RFC3390, if enabled, is not affected.

MFC after:	2 weeks
2012-10-28 17:16:09 +00:00
Gleb Smirnoff
078468ede4 o Remove last argument to ip_fragment(), and obtain all needed information
on checksums directly from mbuf flags. This simplifies code.
o Clear CSUM_IP from the mbuf in ip_fragment() if we did checksums in
  hardware. Some driver may not announce CSUM_IP in theur if_hwassist,
  although try to do checksums if CSUM_IP set on mbuf. Example is em(4).
o While here, consistently use CSUM_IP instead of its alias CSUM_DELAY_IP.
  After this change CSUM_DELAY_IP vanishes from the stack.

Submitted by:	Sebastian Kuzminsky <seb lineratesystems.com>
2012-10-26 21:06:33 +00:00
Andrey V. Elsukov
c1de64a495 Remove the IPFIREWALL_FORWARD kernel option and make possible to turn
on the related functionality in the runtime via the sysctl variable
net.pfil.forward. It is turned off by default.

Sponsored by:	Yandex LLC
Discussed with:	net@
MFC after:	2 weeks
2012-10-25 09:39:14 +00:00
Gleb Smirnoff
a7f707cd37 After r241923 the updated ip_len no longer needed. 2012-10-25 09:02:21 +00:00
Gleb Smirnoff
b6fcf6f9f5 Fix error in r241913 that had broken fragment reassembly. 2012-10-25 09:00:57 +00:00
Gleb Smirnoff
9e2a372fd2 Use ip_stripoptions() instead of handrolled version. 2012-10-23 10:30:09 +00:00