Commit Graph

1453 Commits

Author SHA1 Message Date
Hiroki Sato
705bef548a Cancel DAD for an ifa when the ifp has ND6_IFF_IFDISABLED as early as
possible and do not clear IN6_IFF_TENTATIVE.  If IFDISABLED was accidentally
set after a DAD started, TENTATIVE could be cleared because no NA was
received due to IFDISABLED, and as a result it could prevent DAD when
manually clearing IFDISABLED after that.
2014-05-16 15:53:31 +00:00
Alexander V. Chernikov
b980262e63 Pass radix head ptr along with rte to rtexpunge().
Rename rtexpunge to rt_expunge().
2014-05-03 16:28:54 +00:00
Alexander V. Chernikov
cf58751a44 Use "hash" value in rtalloc_mpath_fib() instead of RTF_ANNOUNCE flag.
Hashing method is the same as in in6_src.c. (Probably we need better one).

MFC after:	2 weeks
2014-04-26 16:46:33 +00:00
Alexander V. Chernikov
36d55f0f9d Unify sa_equal() macro usage.
MFC after:	2 weeks
2014-04-26 14:52:03 +00:00
Alan Somers
0cfee0c223 Fix subnet and default routes on different FIBs on the same subnet.
These two bugs are closely related.  The root cause is that ifa_ifwithnet
does not consider FIBs when searching for an interface address.

sys/net/if_var.h
sys/net/if.c
	Add a fib argument to ifa_ifwithnet and ifa_ifwithdstadddr.  Those
	functions will only return an address whose interface fib equals the
	argument.

sys/net/route.c
	Update calls to ifa_ifwithnet and ifa_ifwithdstaddr with fib
	arguments.

sys/netinet/in.c
	Update in_addprefix to consider the interface fib when adding
	prefixes.  This will prevent it from not adding a subnet route when
	one already exists on a different fib.

sys/net/rtsock.c
sys/netinet/in_pcb.c
sys/netinet/ip_output.c
sys/netinet/ip_options.c
sys/netinet6/nd6.c
	Add RT_DEFAULT_FIB arguments to ifa_ifwithdstaddr and ifa_ifwithnet.
	In some cases it there wasn't a clear specific fib number to use.
	In others, I was unable to test those functions so I chose
	RT_DEFAULT_FIB to minimize divergence from current behavior.  I will
	fix some of the latter changes along with PR kern/187553.

tests/sys/netinet/fibs_test.sh
tests/sys/netinet/udp_dontroute.c
tests/sys/netinet/Makefile
	Revert r263738.  The udp_dontroute test was right all along.
	However, bugs kern/187550 and kern/187553 cancelled each other out
	when it came to this test.  Because of kern/187553, ifa_ifwithnet
	searched the default fib instead of the requested one, but because
	of kern/187550, there was an applicable subnet route on the default
	fib.  The new test added in r263738 doesn't work right, however.  I
	can verify with dtrace that ifa_ifwithnet returned the wrong address
	before I applied this commit, but route(8) miraculously found the
	correct interface to use anyway.  I don't know how.

	Clear expected failure messages for kern/187550 and kern/187552.

PR:		kern/187550
PR:		kern/187552
Reviewed by:	melifaro
MFC after:	3 weeks
Sponsored by:	Spectra Logic
2014-04-24 23:56:56 +00:00
Andrey V. Elsukov
52c57247d3 Remove unused variable.
PR:		173521
MFC after:	1 week
Sponsored by:	Yandex LLC
2014-04-17 06:40:11 +00:00
Andrey V. Elsukov
4fd913364f Properly release the in6_multi lock.
MFC after:	1 week
Sponsored by:	Yandex LLC
2014-04-12 02:05:31 +00:00
Kevin Lo
d1b18731d9 Minor style cleanups. 2014-04-07 01:55:53 +00:00
Kevin Lo
e06e816f67 Add support for UDP-Lite protocol (RFC 3828) to IPv4 and IPv6 stacks.
Tested with vlc and a test suite [1].

[1] http://www.erg.abdn.ac.uk/~gerrit/udp-lite/files/udplite_linux.tar.gz

Reviewed by:	jhb, glebius, adrian
2014-04-07 01:53:03 +00:00
Andrey V. Elsukov
cd71804c84 Remove unused label.
MFC after:	1 week
2014-03-31 14:40:35 +00:00
Andrey V. Elsukov
27aa751c90 Don't generate an ICMPv6 error message if packet was consumed by filter.
MFC after:	1 week
Sponsored by:	Yandex LLC
2014-03-31 14:27:22 +00:00
Robert Watson
7527624efa Several years after initial development, merge prototype support for
linking NIC Receive Side Scaling (RSS) to the network stack's
connection-group implementation.  This prototype (and derived patches)
are in use at Juniper and several other FreeBSD-using companies, so
despite some reservations about its maturity, merge the patch to the
base tree so that it can be iteratively refined in collaboration rather
than maintained as a set of gradually diverging patch sets.

(1) Merge a software implementation of the Toeplitz hash specified in
    RSS implemented by David Malone.  This is used to allow suitable
    pcbgroup placement of connections before the first packet is
    received from the NIC.  Software hashing is generally avoided,
    however, due to high cost of the hash on general-purpose CPUs.

(2) In in_rss.c, maintain authoritative versions of RSS state intended
    to be pushed to each NIC, including keying material, hash
    algorithm/ configuration, and buckets.  Provide software-facing
    interfaces to hash 2- and 4-tuples for IPv4 and IPv6 using both
    the RSS standardised Toeplitz and a 'naive' variation with a hash
    efficient in software but with poor distribution properties.
    Implement rss_m2cpuid()to be used by netisr and other load
    balancing code to look up the CPU on which an mbuf should be
    processed.

(3) In the Ethernet link layer, allow netisr distribution using RSS as
    a source of policy as an alternative to source ordering; continue
    to default to direct dispatch (i.e., don't try and requeue packets
    for processing on the 'right' CPU if they arrive in a directly
    dispatchable context).

(4) Allow RSS to control tuning of connection groups in order to align
    groups with RSS buckets.  If a packet arrives on a protocol using
    connection groups, and contains a suitable hardware-generated
    hash, use that hash value to select the connection group for pcb
    lookup for both IPv4 and IPv6.  If no hardware-generated Toeplitz
    hash is available, we fall back on regular PCB lookup risking
    contention rather than pay the cost of Toeplitz in software --
    this is a less scalable but, at my last measurement, faster
    approach.  As core counts go up, we may want to revise this
    strategy despite CPU overhead.

Where device drivers suitably configure NICs, and connection groups /
RSS are enabled, this should avoid both lock and line contention during
connection lookup for TCP.  This commit does not modify any device
drivers to tune device RSS configuration to the global RSS
configuration; patches are in circulation to do this for at least
Chelsio T3 and Intel 1G/10G drivers.  Currently, the KPI for device
drivers is not particularly robust, nor aware of more advanced features
such as runtime reconfiguration/rebalancing.  This will hopefully prove
a useful starting point for refinement.

No MFC is scheduled as we will first want to nail down a more mature
and maintainable KPI/KBI for device drivers.

Sponsored by:   Juniper Networks (original work)
Sponsored by:   EMC/Isilon (patch update and merge)
2014-03-15 00:57:50 +00:00
Gleb Smirnoff
aa69c61235 Since both netinet/ and netinet6/ call into netipsec/ and netpfil/,
the protocol specific mbuf flags are shared between them.

- Move all M_FOO definitions into a single place: netinet/in6.h, to
  avoid future  clashes.
- Resolve clash between M_DECRYPTED and M_SKIP_FIREWALL which resulted
  in a failure of operation of IPSEC and packet filters.

Thanks to Nicolas and Georgios for all the hard work on bisecting,
testing and finally finding the root of the problem.

PR:			kern/186755
PR:			kern/185876
In collaboration with:	Georgios Amanakis <gamanakis gmail.com>
In collaboration with:	Nicolas DEFFAYET <nicolas-ml deffayet.com>
Sponsored by:		Nginx, Inc.
2014-03-12 14:29:08 +00:00
Gleb Smirnoff
e3a7aa6f56 - Remove rt_metrics_lite and simply put its members into rtentry.
- Use counter(9) for rt_pksent (former rt_rmx.rmx_pksent). This
  removes another cache trashing ++ from packet forwarding path.
- Create zini/fini methods for the rtentry UMA zone. Via initialize
  mutex and counter in them.
- Fix reporting of rmx_pksent to routing socket.
- Fix netstat(1) to report "Use" both in kvm(3) and sysctl(3) mode.

The change is mostly targeted for stable/10 merge. For head,
rt_pksent is expected to just disappear.

Discussed with:		melifaro
Sponsored by:		Netflix
Sponsored by:		Nginx, Inc.
2014-03-05 01:17:47 +00:00
John Baldwin
5b26ea5df3 Remove more constants related to static sysctl nodes. The MAXID constants
were primarily used to size the sysctl name list macros that were removed
in r254295.  A few other constants either did not have an associated
sysctl node, or the associated node used OID_AUTO instead.

PR:		ports/184525 (exp-run)
2014-02-25 18:44:33 +00:00
Craig Rodrigues
47a79fadc6 Remove KASSERT from in6p_lookup_mcast_ifp().
When the devel/jenkins port, version 1.551 was started,
the kernel would panic if INVARIANTS was enabled in the kernel config.

Suggested by: bms
2014-02-23 01:27:22 +00:00
Gleb Smirnoff
0ff96b4f55 o Remove at compile time the HASH_ALL code, that was never
tested and is unfinished. However, I've tested my version,
  it works okay. As before it is unfinished: timeout aren't
  driven by TCP session state. To enable the HASH_ALL mode,
  one needs in kernel config:

	options FLOWTABLE_HASH_ALL

o Reduce the alignment on flentry to 64 bytes. Without
  the FLOWTABLE_HASH_ALL option, twice less memory would
  be consumed by flows.
o API to ip_output()/ip6_output() got even more thin: 1 liner.
o Remove unused unions. Simply use fle->f_key[].
o Merge all IPv4 code into flowtable_lookup_ipv4(), and do same
  flowtable_lookup_ipv6(). Stop copying data to on stack
  sockaddr structures, simply use key[] on stack.
o Move code from flowtable_lookup_common() that actually works
  on insertion into flowtable_insert().

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-02-17 11:50:56 +00:00
Alexander V. Chernikov
f6990c4e3e Further simplify nd6_output_lle.
Currently we have 3 usage patterns:
1) nd6_output (most traffic flow, no lle supplied, lle RLOCK sufficient)
2) corner cases for output (no lle, STALE lle, so on). lle WLOCK needed.
3) nd* iunternal machinery (WLOCK'ed lle provided, perform packet queing).

We separate case 1 and implement it inside its only customer - nd6_output.
This leads to some code duplication (especialy SEND stuff, which should be
hooked to output in a different way), but simplifies locking and control
flow logic fir nd6_output_lle.

Reviewed by:	ae
MFC after:	3 weeks
Sponsored by:	Yandex LLC
2014-02-13 19:09:04 +00:00
Andrey V. Elsukov
e4c77ca0c0 Drop packets to multicast address whose scop field contains the
reserved value 0.

MFC after:	1 week
Sponsored by:	Yandex LLC
2014-02-13 14:10:44 +00:00
Christian Brueffer
d37872314f Only count table lookups when we're actually processing packets.
PR:		183462
Submitted by:	Sven-Thorsten Dietrich <thebigcorporation at gmail.com>
Reviewed by:	bms
MFC after:	1 month
2014-02-10 14:47:51 +00:00
Christian Brueffer
1b55364ed9 For IPv6, return the same error code as IPv4 when mrouter is not initialized.
PR:		178472
Submitted by:	Sven-Thorsten Dietrich <sven at vyatta.com>
Reviewed by:	bms
2014-02-10 14:36:51 +00:00
Alexander V. Chernikov
9dffa6a3f3 Simplify nd6_output_lle:
* Check ND6_IFF_IFDISABLED before acquiring any locks
* Assume m is always non-NULL
* remove 'bad' case not used anymore
* Simply if_output conditional

MFC after:	2 weeks
Sponsored by:	Yandex LLC
2014-02-10 12:52:33 +00:00
Gleb Smirnoff
5d6d7e756b o Revamp API between flowtable and netinet, netinet6.
- ip_output() and ip_output6() simply call flowtable_lookup(),
    passing mbuf and address family. That's the only code under
    #ifdef FLOWTABLE in the protocols code now.
o Revamp statistics gathering and export.
  - Remove hand made pcpu stats, and utilize counter(9).
  - Snapshot of statistics is available via 'netstat -rs'.
  - All sysctls are moved into net.flowtable namespace, since
    spreading them over net.inet isn't correct.
o Properly separate at compile time INET and INET6 parts.
o General cleanup.
  - Remove chain of multiple flowtables. We simply have one for
    IPv4 and one for IPv6.
  - Flowtables are allocated in flowtable.c, symbols are static.
  - With proper argument to SYSINIT() we no longer need flowtable_ready.
  - Hash salt doesn't need to be per-VNET.
  - Removed rudimentary debugging, which use quite useless in dtrace era.

The runtime behavior of flowtable shouldn't be changed by this commit.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-02-07 15:18:23 +00:00
Andrey V. Elsukov
74a976fffd Unlock entry before retry.
Submitted by:	melifaro
MFC after:	1 week
2014-02-07 10:58:46 +00:00
Andrey V. Elsukov
51eecdc35a Take exclusive lock only when lle isn't NULL. We don't need write access
to lle in most cases.

MFC after:	1 week
Sponsored by:	Yandex LLC
2014-02-02 07:28:04 +00:00
Alexander V. Chernikov
f6b84910bb Further rework netinet6 address handling code:
* Set ia address/mask values BEFORE attaching to address lists.
Inet6 address assignment is not atomic, so the simplest way to
do this atomically is to fill in ia before attach.
* Validate irfa->ia_addr field before use (we permit ANY sockaddr in old code).
* Do some renamings:
  in6_ifinit -> in6_notify_ifa (interaction with other subsystems is here)
  in6_setup_ifa -> in6_broadcast_ifa (LLE/Multicast/DaD code)
  in6_ifaddloop -> nd6_add_ifa_lle
  in6_ifremloop -> nd6_rem_ifa_lle
* Split working with LLE and route announce code for last two.
Add temporary in6_newaddrmsg() function to mimic current rtsock behaviour.
* Call device SIOCSIFADDR handler IFF we're adding first address.
In IPv4 we have to call it on every address change since ARP record
is installed by arp_ifinit() which is called by given handler.
IPv6 stack, on the opposite is responsible to call nd6_add_ifa_lle() so
there is no reason to call SIOCSIFADDR often.
2014-01-19 16:07:27 +00:00
Alexander V. Chernikov
0c5d4bde90 Use in6_localip() instead of hand-rolled cycle.
MFC after:	2 weeks
2014-01-18 20:54:55 +00:00
Alexander V. Chernikov
9080e7d023 Add in6_prepare_ifra() function to ease preparing in-kernel IPv6
address requests.

MFC after:	2 weeks
2014-01-18 20:32:59 +00:00
Alexander V. Chernikov
b6a16fc853 Do some style(9) not done in r260851 to improve readability.
MFC after:	2 weeks
2014-01-18 15:57:43 +00:00
Alexander V. Chernikov
60d7c722a5 Split in6_update_ifa() into smaller pieces leaving functionality intact.
Discussed with:	ae
MFC after:	2 weeks
2014-01-18 15:52:52 +00:00
Andrey V. Elsukov
e74966f60b Mechanically replace direct accessing to if_xname to using if_name() macro. 2014-01-10 12:33:28 +00:00
John-Mark Gurney
f2effe745c revert part of r260485 which changes how part of the header gets
included..  netstat uses -DKERNEL=1 to get these parts and breaks the
build w/o it...

melifaro@ says that ae@ is probably asleep, and the PR doesn't have
this part of the patch...  Probably a local change got in by accident..

PR:		185148
Pointy hat to:	ae@
2014-01-09 22:41:18 +00:00
Andrey V. Elsukov
78415d1082 Remove extra nesting from X_ip6_mforward() function.
Also remove disabled definitions from ip6_mroute.h.

PR:		185148
Sponsored by:	Yandex LLC
2014-01-09 15:38:28 +00:00
Andrey V. Elsukov
0a6b0ffa54 Add MRT6_DLOG() macro for debugging.
Reduce number of MRT6DEBUG ifdefs and fix some broken format strings.

MFC after:	1 week
Sponsored by:	Yandex LLC
2014-01-09 14:58:06 +00:00
Alexander V. Chernikov
1dc8f6a82c Introduce IN6_MASK_ADDR() macro to unify various hand-rolled code
to do IPv6 addr & mask in different places.

MFC after:	2 weeks
2014-01-08 22:13:32 +00:00
Andrey V. Elsukov
b88aef1dcf Use pointer to struct sockaddr_in6 in lla_lookup() call.
This prevents from triggering KASSERT in in6_lltable_lookup.
2014-01-03 02:40:56 +00:00
Andrey V. Elsukov
e2d14d9317 Add IF_AFDATA_WLOCK_ASSERT() in case lla_lookup() is called with
LLE_CREATE flag.

MFC after:	1 week
2014-01-03 02:32:05 +00:00
Andrey V. Elsukov
ea0c377602 lla_lookup() does modification only when LLE_CREATE is specified.
Thus we can use IF_AFDATA_RLOCK() instead of IF_AFDATA_LOCK() when doing
lla_lookup() without LLE_CREATE flag.

Reviewed by:	glebius, adrian
MFC after:	1 week
Sponsored by:	Yandex LLC
2014-01-02 08:40:37 +00:00
Adrian Chadd
c445d2520d Use an RLOCK here instead of an RWLOCK - matching all the other calls
to lla_lookup().

This drastically reduces the very high lock contention when doing parallel
TCP throughput tests (> 1024 sockets) with IPv6.

Tested:

* parallel IPv6 TCP bulk data exchange, 8192 sockets

MFC after:	1 week
Sponsored by:	Netflix, Inc.
2014-01-01 00:56:26 +00:00
Bjoern A. Zeeb
010c2b8192 Correct warnings comparing unsigned variables < 0 constantly reported
while building kernels.  All instances removed are indeed unsigned so
the expressions could not be true.

MFC after:	1 week
2013-12-25 20:08:44 +00:00
Dimitry Andric
6c5a340e56 In sys/netinet6/in6_mcast.c, in6m_is_ifp_detached() is only used
whenever KTR is defined, so put it between #ifdef KTR guards.  This
avoids a warning about a unused function if KTR is not enabled.

MFC after:	3 days
2013-12-24 20:30:13 +00:00
Andrey V. Elsukov
569aad57d2 Free mbuf in case of error.
MFC after:	1 week
2013-12-17 10:53:17 +00:00
Attilio Rao
54366c0bd7 - For kernel compiled only with KDTRACE_HOOKS and not any lock debugging
option, unbreak the lock tracing release semantic by embedding
  calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined
  version of the releasing functions for mutex, rwlock and sxlock.
  Failing to do so skips the lockstat_probe_func invokation for
  unlocking.
- As part of the LOCKSTAT support is inlined in mutex operation, for
  kernel compiled without lock debugging options, potentially every
  consumer must be compiled including opt_kdtrace.h.
  Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the
  dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
  is linked there and it is only used as a compile-time stub [0].

[0] immediately shows some new bug as DTRACE-derived support for debug
in sfxge is broken and it was never really tested.  As it was not
including correctly opt_kdtrace.h before it was never enabled so it
was kept broken for a while.  Fix this by using a protection stub,
leaving sfxge driver authors the responsibility for fixing it
appropriately [1].

Sponsored by:	EMC / Isilon storage division
Discussed with:	rstone
[0] Reported by:	rstone
[1] Discussed with:	philip
2013-11-25 07:38:45 +00:00
Andrey V. Elsukov
ee674966f4 Fix panic with RADIX_MPATH, when RTFREE_LOCKED() called for already
unlocked route. Use in6_rtalloc() instead of in6_rtalloc1. This helps
simplify the code and remove several now unused variables.

PR:		156283
MFC after:	2 weeks
2013-11-11 12:49:00 +00:00
Gleb Smirnoff
555036b5f6 Remove never used ioctls that originate from KAME. The proof
of their zero usage was exp-run from misc/183538.
2013-11-11 05:39:42 +00:00
Michael Tuexen
b54ddf225f Changes from upstream to improve compilation when INET or INET6
or none of them is defined.

MFC after: 3 days
2013-11-02 20:12:19 +00:00
Gleb Smirnoff
c3322cb91c Include necessary headers that now are available due to pollution
via if_var.h.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-28 07:29:16 +00:00
Gleb Smirnoff
eedc7fd9e8 Provide includes that are needed in these files, and before were read
in implicitly via if.h -> if_var.h pollution.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-26 18:18:50 +00:00
Gleb Smirnoff
76039bc84f The r48589 promised to remove implicit inclusion of if_var.h soon. Prepare
to this event, adding if_var.h to files that do need it. Also, include
all includes that now are included due to implicit pollution via if_var.h

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-26 17:58:36 +00:00
Andrey V. Elsukov
baa09f1891 Initialize inc_fibnum for properly handling ICMP6_PACKET_TOO_BIG errors
in multifib environment.

PR:		183265
MFC after:	1 week
2013-10-25 01:02:25 +00:00
Gleb Smirnoff
7caf4ab7ac - Utilize counter(9) to accumulate statistics on interface addresses. Add
four counters to struct ifaddr. This kills '+=' on a variables shared
  between processors for every packet.
- Nuke struct if_data from struct ifaddr.
- In ip_input() do not put a reference on ifaddr, instead update statistics
  right now in place and do IN_IFADDR_RUNLOCK(). These removes atomic(9)
  for every packet. [1]
- To properly support NET_RT_IFLISTL sysctl used by getifaddrs(3), in
  rtsock.c fill if_data fields using counter_u64_fetch().
- Accidentially fix bug in COMPAT_32 version of NET_RT_IFLISTL, which
  took if_data not from the ifaddr, but from ifaddr's ifnet. [2]

Submitted by:	melifaro [1], pluknet[2]
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-15 11:37:57 +00:00
Gleb Smirnoff
4675896098 Remove ifa_init() and provide ifa_alloc() that will allocate and setup
struct ifaddr internally.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-15 10:31:42 +00:00
Gleb Smirnoff
6ed910fabe Hide 'struct ifaddr' definition from userland. Two tools left that use it,
namely ipftest(1) and ifmcstat(1). These sniff structure definition using
_WANT_IFADDR define.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-15 10:19:24 +00:00
Gleb Smirnoff
3fa98cf9ac Remove unsigned < 0 check. 2013-10-15 10:12:19 +00:00
Gleb Smirnoff
ca695e0807 Remove useless check of ia6 against NULL, right after dereferencing it. 2013-10-15 10:11:23 +00:00
Gleb Smirnoff
0218539652 Now counter_u64_t is known to userland, thus remove hack from r253086.
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-15 10:09:33 +00:00
Hiroki Sato
6378e1f369 Do not try to detach if the interface does not support IPv6.
Tested by:	hselasky
PR:		usb/182820
Approved by:	re (glebius)
2013-10-10 09:43:15 +00:00
Gleb Smirnoff
491b520174 Fix mbuf leak.
Submitted by:	Loganaden Velvindron <logan elandsys.com>
Obtained from:	NetBSD
Approved by:	re (kib)
2013-10-07 12:07:40 +00:00
Bjoern A. Zeeb
fd291ae3ec Update comment from draft to RFC number.
Submitted by:	Loganaden Velvindron (logan elandsys.com)
Approved by:	re (gjb)
MFC after:	6 days
2013-09-22 14:53:07 +00:00
Mikolaj Golub
4d3dfd450a Unregister inet/inet6 pfil hooks on vnet destroy.
Discussed with:	andre
Approved by:	re (rodrigc)
2013-09-13 18:45:10 +00:00
Dag-Erling Smørgrav
1a05c762b9 Fix the length calculation for the final block of a sendfile(2)
transmission which could be tricked into rounding up to the nearest
page size, leaking up to a page of kernel memory.  [13:11]

In IPv6 and NetATM, stop SIOCSIFADDR, SIOCSIFBRDADDR, SIOCSIFDSTADDR
and SIOCSIFNETMASK at the socket layer rather than pass them on to the
link layer without validation or credential checks.  [SA-13:12]

Prevent cross-mount hardlinks between different nullfs mounts of the
same underlying filesystem.  [SA-13:13]

Security:	CVE-2013-5666
Security:	FreeBSD-SA-13:11.sendfile
Security:	CVE-2013-5691
Security:	FreeBSD-SA-13:12.ifioctl
Security:	CVE-2013-5710
Security:	FreeBSD-SA-13:13.nullfs
Approved by:	re
2013-09-10 10:05:59 +00:00
John Baldwin
fa302f207f Use an unsigned long when indexing into mfchashtbl[] and mf6ctable[]. This
matches the types used when computing hash indices and the type of the
maximum size of mfchashtbl[].

PR:		kern/181821
Submitted by:	Sven-Thorsten Dietrich <sven@vyatta.com> (IPv4)
MFC after:	1 week
2013-09-05 14:16:37 +00:00
John Baldwin
fd77bbb967 Remove most of the remaining sysctl name list macros. They were only
ever intended for use in sysctl(8) and it has not used them for many
years.

Reviewed by:	bde
Tested by:	exp-run by bdrewery
2013-08-26 18:16:05 +00:00
Mark Johnston
57f6086735 Implement the ip, tcp, and udp DTrace providers. The probe definitions use
dynamic translation so that their arguments match the definitions for
these providers in Solaris and illumos. Thus, existing scripts for these
providers should work unmodified on FreeBSD.

Tested by:	gnn, hiren
MFC after:	1 month
2013-08-25 21:54:41 +00:00
Michael Tuexen
1a94cdbea7 Provide human readable debug output. 2013-08-25 12:44:03 +00:00
Andre Oppermann
9850f95989 For now limit printf(9) %x of the 64bit pkthdr.csum_flags field to 32bits.
The upper 32bits are not occupied for now.

Sponsored by:	The FreeBSD Foundation
2013-08-25 09:49:00 +00:00
Andre Oppermann
1b4381afbb Restructure the mbuf pkthdr to make it fit for upcoming capabilities and
features.  The changes in particular are:

o Remove rarely used "header" pointer and replace it with a 64bit protocol/
  layer specific union PH_loc for local use.  Protocols can flexibly overlay
  their own 8 to 64 bit fields to store information while the packet is
  worked on.

o Mechanically convert IP reassembly, IGMP/MLD and ATM to use pkthdr.PH_loc
  instead of pkthdr.header.

o Extend csum_flags to 64bits to allow for additional future offload
  information to be carried (e.g. iSCSI, IPsec offload, and others).

o Move the RSS hash type enumerator from abusing m_flags to its own 8bit
  rsstype field.  Adjust accessor macros.

o Add cosqos field to store Class of Service / Quality of Service information
  with the packet.  It is not yet supported in any drivers but allows us to
  get on par with Cisco/Juniper in routing applications (plus MPLS QoS) with
  a modernized ALTQ.

o Add four 8 bit fields l[2-5]hlen to store the relative header offsets
  from the start of the packet.  This is important for various offload
  capabilities and to relieve the drivers from having to parse the packet
  and protocol headers to find out location of checksums and other
  information.  Header parsing in drivers is a lot of copy-paste and
  unhandled corner cases which we want to avoid.

o Add another flexible 64bit union to map various additional persistent
  packet information, like ether_vtag, tso_segsz and csum fields.
  Depending on the csum_flags settings some fields may have different usage
  making it very flexible and adaptable to future capabilities.

o Restructure the CSUM flags to better signify their outbound (down the
  stack) and inbound (up the stack) use.  The CSUM flags used to be a bit
  chaotic and rather poorly documented leading to incorrect use in many
  places.  Bring clarity into their use through better naming.
  Compatibility mappings are provided to preserve the API.  The drivers
  can be corrected one by one and MFC'd without issue.

o The size of pkthdr stays the same at 48/56bytes (32/64bit architectures).

Sponsored by:	The FreeBSD Foundation
2013-08-24 19:51:18 +00:00
Xin LI
acde2476c4 Fix an integer overflow in computing the size of a temporary buffer
can result in a buffer which is too small for the requested
operation.

Security:	CVE-2013-3077
Security:	FreeBSD-SA-13:09.ip_multicast
2013-08-22 00:51:37 +00:00
Andre Oppermann
86bd049144 Add m_clrprotoflags() to clear protocol specific mbuf flags at up and
downwards layer crossings.

Consistently use it within IP, IPv6 and ethernet protocols.

Discussed with:	trociny, glebius
2013-08-19 13:27:32 +00:00
Andre Oppermann
88388bdcbe Move the global M_SKIP_FIREWALL mbuf flags to a protocol layer specific
flag instead.  The flag is only used within the IP and IPv6 layer 3
protocols.

Because some firewall packages treat IPv4 and IPv6 packets the same the
flag should have the same value for both.

Discussed with:	trociny, glebius
2013-08-19 11:08:36 +00:00
Hiroki Sato
5a04191532 Return 0 in nbi->expire when la_expire == 0. Conversion from time_uptime to
time_second should not be performed in this case.
2013-08-17 07:14:45 +00:00
Hiroki Sato
ffa0165ae0 Fix incompatibility in ICMPV6CTL_ND6_PRLIST sysctl, and SIOCGPRLST_IN6,
SIOCGDRLST_IN6, and SIOCGNBRINFO_IN6 ioctl.  These userland interfaces
treat expiration times in time_second, not time_uptime.
2013-08-06 17:10:52 +00:00
Hiroki Sato
7d26db1792 - Use time_uptime instead of time_second in data structures for
PF_INET6 in kernel.  This fixes various malfunction when the wall time
  clock is changed.  Bump __FreeBSD_version to 1000041.

- Use clock_gettime(CLOCK_MONOTONIC_FAST) in userland utilities.

MFC after:	1 month
2013-08-05 20:13:02 +00:00
Hiroki Sato
41541ebf94 Fix a panic in tmpaddrtimer. 2013-08-05 00:36:12 +00:00
Hiroki Sato
0de0dd9be8 Allocate in6_ifextra (ifp->if_afdata[AF_INET6]) only for IPv6-capable
interfaces.  This eliminates unnecessary IPv6 processing for non-IPv6
interfaces.

MFC after:	3 days
2013-07-31 16:24:49 +00:00
Andrey V. Elsukov
6794f46021 Remove the large part of struct ipsecstat. Only few fields of this
structure is used, but they already have equal fields in the struct
newipsecstat, that was introduced with FAST_IPSEC and then was merged
together with old ipsecstat structure.

This fixes kernel stack overflow on some architectures after migration
ipsecstat to PCPU counters.

Reported by:	Taku YAMAMOTO, Maciej Milewski
2013-07-23 14:14:24 +00:00
Mikolaj Golub
f122b319eb A complete duplication of binding should be allowed if on both new and
duplicated sockets a multicast address is bound and either
SO_REUSEPORT or SO_REUSEADDR is set.

But actually it works for the following combinations:

  * SO_REUSEPORT is set for the fist socket and SO_REUSEPORT for the new;
  * SO_REUSEADDR is set for the fist socket and SO_REUSEADDR for the new;
  * SO_REUSEPORT is set for the fist socket and SO_REUSEADDR for the new;

and fails for this:

  * SO_REUSEADDR is set for the fist socket and SO_REUSEPORT for the new.

Fix the last case.

PR:		179901
MFC after:	1 month
2013-07-12 19:08:33 +00:00
Andrey V. Elsukov
9f0f032d10 Correct the size of allocated memory to store array of counters. 2013-07-09 15:20:46 +00:00
Andrey V. Elsukov
2841260cd6 Migrate structs in6_ifstat and icmp6_ifstat to PCPU counters. 2013-07-09 09:59:46 +00:00
Andrey V. Elsukov
a786f67981 Migrate structs ip6stat, icmp6stat and rip6stat to PCPU counters. 2013-07-09 09:54:54 +00:00
Andrey V. Elsukov
c80211e3cf Prepare network statistics structures for migration to PCPU counters.
Use uint64_t as type for all fields of structures.

Changed structures: ahstat, arpstat, espstat, icmp6_ifstat, icmp6stat,
in6_ifstat, ip6stat, ipcompstat, ipipstat, ipsecstat, mrt6stat, mrtstat,
pfkeystat, pim6stat, pimstat, rip6stat, udpstat.

Discussed with:	arch@
2013-07-09 09:32:06 +00:00
Mikolaj Golub
efdf104bca In r227207, to fix the issue with possible NULL inp_socket pointer
dereferencing, when checking for SO_REUSEPORT option (and SO_REUSEADDR
for multicast), INP_REUSEPORT flag was introduced to cache the socket
option.  It was decided then that one flag would be enough to cache
both SO_REUSEPORT and SO_REUSEADDR: when processing SO_REUSEADDR
setsockopt(2), it was checked if it was called for a multicast address
and INP_REUSEPORT was set accordingly.

Unfortunately that approach does not work when setsockopt(2) is called
before binding to a multicast address: the multicast check fails and
INP_REUSEPORT is not set.

Fix this by adding INP_REUSEADDR flag to unconditionally cache
SO_REUSEADDR.

PR:		179901
Submitted by:	Michael Gmelin freebsd grem.de (initial version)
Reviewed by:	rwatson
MFC after:	1 week
2013-07-04 18:38:00 +00:00
Hiroki Sato
af8056441e - Allow ND6_IFF_AUTO_LINKLOCAL for IFT_BRIDGE. An interface with IFT_BRIDGE
is initialized with !ND6_IFF_AUTO_LINKLOCAL && !ND6_IFF_ACCEPT_RTADV
  regardless of net.inet6.ip6.accept_rtadv and net.inet6.ip6.auto_linklocal.
  To configure an autoconfigured link-local address (RFC 4862), the
  following rc.conf(5) configuration can be used:

   ifconfig_bridge0_ipv6="inet6 auto_linklocal"

- if_bridge(4) now removes IPv6 addresses on a member interface to be
  added when the parent interface or one of the existing member
  interfaces has an IPv6 address.  if_bridge(4) merges each link-local
  scope zone which the member interfaces form respectively, so it causes
  address scope violation.  Removal of the IPv6 addresses prevents it.

- if_lagg(4) now removes IPv6 addresses on a member interfaces
  unconditionally.

- Set reasonable flags to non-IPv6-capable interfaces. [*]

Submitted by:	rpaulo [*]
MFC after:	1 week
2013-07-02 16:58:15 +00:00
Qing Li
378aa8d85e Delete the nd6 entries associated with an off-link prefix
if the same prefix cannot be found on an alternative
interface.

Reviewed by:	hrs
MFC after:	1 week
2013-06-24 05:01:13 +00:00
Andrey V. Elsukov
6659296cb0 Use IPSECSTAT_INC() and IPSEC6STAT_INC() macros for ipsec statistics
accounting.

MFC after:	2 weeks
2013-06-20 09:55:53 +00:00
Andrey V. Elsukov
faf146b9e5 Use PIM6STAT_INC() and MRT6STAT_INC() macros for IPv6 multicast
statistics accounting.

MFC after:	2 weeks
2013-06-19 21:50:17 +00:00
Andrey V. Elsukov
f1d7ebfe05 Use RIP6STAT_INC() macro for raw ip6 statistics accounting.
MFC after:	2 weeks
2013-06-19 20:48:34 +00:00
Andrey V. Elsukov
49dc650cea Use ICMP6STAT_INC() macro for ICMPv6 errors accounting.
MFC after:	2 weeks
2013-06-19 15:59:21 +00:00
Alexander V. Chernikov
6bdfdb2c5e Really fix netmask address family this time.
MFC with:	r250813
2013-05-19 19:42:46 +00:00
Alexander V. Chernikov
346e9c9de8 Finish r85740 : Make IPv6 netmask has address family set.
This pleases routing daemons like bird.

MFC after:	2 weeks
2013-05-19 19:19:01 +00:00
Julian Elischer
4871fc4ab5 Finally change the mbuf to have its own fib field instead of stealing
4 flag bits. This was supposed to happen in 8.0, and again in 2012..

MFC after:	never
2013-05-16 16:20:17 +00:00
Michael Tuexen
3457ccdaea Honor the net.inet6.ip6.v6only sysctl variable and the IPV6_V6ONLY
socket option for SCTP sockets in the same way as for UDP or TCP
sockets.

MFC after: 2 weeks
2013-05-10 18:09:38 +00:00
Hiroki Sato
5df1b6b57e Use FF02:0:0:0:0:2:FF00::/104 prefix for IPv6 Node Information Group
Address.  Although KAME implementation used FF02:0:0:0:0:2::/96 based on
older versions of draft-ietf-ipngwg-icmp-name-lookup, it has been changed
in RFC 4620.

The kernel always joins the /104-prefixed address, and additionally does
/96-prefixed one only when net.inet6.icmp6.nodeinfo_oldmcprefix=1.
The default value of the sysctl is 1.

ping6(8) -N flag now uses /104-prefixed one.  When this flag is specified
twice, it uses /96-prefixed one instead.

Reviewed by:		ume
Based on work by:	Thomas Scheffler
PR:			conf/174957
MFC after:		2 weeks
2013-05-04 19:16:26 +00:00
Gleb Smirnoff
47e8d432d5 Add const qualifier to the dst parameter of the ifnet if_output method. 2013-04-26 12:50:32 +00:00
Andrey V. Elsukov
817f395375 Remove unused variable.
MFC after:	1 week
2013-04-24 10:24:01 +00:00
Oleg Bulyzhin
1571132f14 Plug static llentry leak (ipv4 & ipv6 were affected).
PR:		kern/172985
MFC after:	1 month
2013-04-21 21:28:38 +00:00
Tijl Coosemans
6c81895dab Fix build after r249543. 2013-04-16 16:59:29 +00:00
Andrey V. Elsukov
4ff7c740fe Fix accounting after the r249528, also add several another counters to
the statistics.
2013-04-16 11:31:26 +00:00
Andrey V. Elsukov
eca4d72003 Use IP6S_M2MMAX macro. 2013-04-16 11:19:13 +00:00
Andrey V. Elsukov
43851aae9a Replace hardcoded numbers. 2013-04-16 11:12:58 +00:00
Andrey V. Elsukov
e7a87117d3 The source address selection algorithm tries to apply several rules
for the set of IPv6 addresses. Now each attempt goes into IPv6 statistics,
even if given rule did not won. Change this and take into account only
those rules, that won. Also add accounting for cases, when algorithm
fails to select an address.
2013-04-15 21:02:40 +00:00
Andrey V. Elsukov
ecc5c7387b Free memory after deleting an address policy entry.
MFC after:	1 week
2013-04-12 07:59:54 +00:00
Andrey V. Elsukov
9cb8d207af Use IP6STAT_INC/IP6STAT_DEC macros to update ip6 stats.
MFC after:	1 week
2013-04-09 07:11:22 +00:00
Kevin Lo
b3dcd51dde Clean up some unused leftover code.
Pointed out by:	ae
2013-03-22 01:45:54 +00:00
Kevin Lo
dda95c6e59 Remove unused global variables.
Reviewed by:	ae, glebius
2013-03-22 01:40:17 +00:00
Gleb Smirnoff
10e5acc3c6 - Use m_getcl() instead of hand allocating.
- Do not calculate constant length values at run time,
  CTASSERT() their sanity.
- Remove superfluous cleaning of mbuf fields after allocation.
- Replace compat macros with function calls.

Sponsored by:	Nginx, Inc.
2013-03-15 13:48:53 +00:00
Gleb Smirnoff
7b07d1bed0 - Use m_getcl() instead of hand allocating.
- Use m_get()/m_gethdr() instead of macros.
- Remove superfluous cleaning of mbuf fields after allocation.

Sponsored by:	Nginx, Inc.
2013-03-15 12:50:29 +00:00
Gleb Smirnoff
aa48181169 Use m_getcl() instead of hand made allocation.
Sponsored by:	Nginx, Inc.
2013-03-15 12:33:23 +00:00
Andrey V. Elsukov
b7c896c9be Take the inpcb rlock before calculating checksum, it was accidentally
moved in r191672.

Obtained from:	Yandex LLC
MFC after:	1 week
2013-03-12 02:20:20 +00:00
Navdeep Parhar
63a97a4040 Generate lle_event in the IPv6 neighbor discovery code too.
Reviewed by:	bz@
2013-01-26 00:05:22 +00:00
Navdeep Parhar
f31b83e118 Avoid NULL dereference in nd6_storelladdr when no mbuf is provided. It
is called this way from a couple of places in the OFED code.  (toecore
calls it too but that's going to change shortly).

Reviewed by:	bz@
2013-01-25 23:11:13 +00:00
Andrey V. Elsukov
3ea87cb57f Simplify in6_setscope() function to get better performance.
Currently we use interface indeces as zone IDs for link-local and
interface-local scopes, and since we don't have any tool to configure
zone IDs, there is no need to acquire the afdata lock several times per
packet only to read if_index value.
So, now in6_setscope reads zone IDs for interface-local, link-local and
global scopes without a lock.

Sponsored by:	Yandex LLC
MFC after:	2 weeks
2013-01-10 00:10:24 +00:00
Andrey V. Elsukov
6e9a5a5e52 Remove unneeded variable.
MFC after:	1 week
2013-01-09 18:54:58 +00:00
Hajimu UMEMOTO
164051cea5 Add no_prefer_iface option.
It stops treating the address on the interface as special by source
address selection rule even when the interface is outgoing interface.
This is desired in some situation.

Requested by:	hrs
Reviewed by:	IHANet folks including hrs
MFC after:	1 week
2013-01-09 18:18:08 +00:00
Andrey V. Elsukov
78c235c99f The in6_setscope() function determines the scope zone id of an address
and embeds it into address. Inside the kernel we keep addresses with
embedded zone id only for two scopes: link-local and interface-local.

For other scopes this function is nop in most cases. To reduce an
overhead of locking, first check that address is capable for embedding.
Also, handle the loopback address before acquire the lock.

Sponsored by:	Yandex LLC
MFC after:	1 week
2013-01-09 00:36:06 +00:00
Peter Wemm
8a1163e82f Temporarily revert rev 244678. This is causing loopback problems with
the lo (loopback) interfaces.
2013-01-03 10:21:28 +00:00
Gleb Smirnoff
468e45f3bd The SIOCSIFFLAGS ioctl handler runs if_up()/if_down() that notify
all interested parties in case if interface flag IFF_UP has changed.

  However, not only SIOCSIFFLAGS can raise the flag, but SIOCAIFADDR
and SIOCAIFADDR_IN6 can, too. The actual |= is done not in the protocol
code, but in code of interface drivers. To fix this historical layering
violation, we will check whether ifp->if_ioctl(SIOCSIFADDR) raised the
IFF_UP flag, and if it did, run the if_up() handler.

  This fixes configuring an address under CARP control on an interface
that was initially !IFF_UP.

P.S. I intentionally omitted handling the IFF_SMART flag. This flag was
never ever used in any driver since it was introduced, and since it
means another layering violation, it should be garbage collected instead
of pretended to be supported.
2012-12-25 13:01:58 +00:00
Andrey V. Elsukov
f8fe3dc9aa When we have some address to forward (e.g. it was specified with ipfw fwd),
we should pass it as first argument into in6_selectroute_fib function to
initiate new route lookup.

MFC after:	1 week
2012-12-19 17:28:17 +00:00
Andrey V. Elsukov
16607317b5 Make dst_sa initialization only when it is actually needed.
MFC after:	1 week
2012-12-19 17:08:49 +00:00
Andrey V. Elsukov
61d88f3421 The selectroute functions does own account of EHOSTUNREACH errors,
no need to do it twice.

MFC after:	1 week
2012-12-19 17:02:07 +00:00
Andrey V. Elsukov
79672fd277 Use M_PROTO7 flag for M_IP6_NEXTHOP, because M_PROTO2 was used for
M_AUTHIPHDR.

Pointy hat to:	ae
Reported by:	Vadim Goncharov
MFC after:	3 days
2012-12-17 14:36:56 +00:00
Andrey V. Elsukov
68eba526b9 In additional to the tailq of IPv6 addresses add the hash table.
For now use 256 buckets and fnv_hash function. Use xor'ed 32-bit
s6_addr32 parts of in6_addr structure as a hash key. Update
in6_localip and in6_is_addr_deprecated to use hash table for fastest
lookup.

Sponsored by:	Yandex LLC
Discussed with:	dwmalone, glebius, bz
2012-12-15 20:04:24 +00:00
Gleb Smirnoff
b1ec2940af Fix problem in r238990. The LLE_LINKED flag should be tested prior to
entering llentry_free(), and in case if we lose the race, we should simply
perform LLE_FREE_LOCKED(). Otherwise, if the race is lost by the thread
performing arptimer(), it will remove two references from the lle instead
of one.

Reported by:	Ian FREISLICH <ianf clue.co.za>
2012-12-13 11:11:15 +00:00
Hiroki Sato
0bebb5448b - Move definition of V_deembed_scopeid to scope6_var.h.
- Deembed scope id in L3 address in in6_lltable_dump().
- Simplify scope id recovery in rtsock routines.
- Remove embedded scope id handling in ndp(8) and route(8) completely.
2012-12-05 19:45:24 +00:00
Gleb Smirnoff
eb1b1807af Mechanically substitute flags from historic mbuf allocator with
malloc(9) flags within sys.

Exceptions:

- sys/contrib not touched
- sys/mbuf.h edited manually
2012-12-05 08:04:20 +00:00
Andrey V. Elsukov
35e692dc0e Remove opt_inet.h, it isn't required here.
MFC after:	1 week
2012-11-20 14:09:37 +00:00
Hiroki Sato
3829d13564 Check if an extracted zoneid is equal to the non-zero sin6_scope_id only when
it is link-local or MC interface-local.
2012-11-18 16:06:51 +00:00
Michael Tuexen
3a51a2647a Add support for SCTP/UDP/IPV6.
This completes the support of
http://tools.ietf.org/html/draft-ietf-tsvwg-sctp-udp-encaps

MFC after: 1 week
2012-11-17 20:04:04 +00:00
Andrey V. Elsukov
73cb2f38f2 Reduce the overhead of locking, use IF_AFDATA_RLOCK() when we are doing
simple lookups.

Sponsored by:	Yandex LLC
MFC after:	1 week
2012-11-16 12:12:02 +00:00
Andrey V. Elsukov
5cf7ec13f7 if_afdata lock was converted from mutex to rwlock a long ago, so we can
replace IF_AFDATA_LOCK() macro depending to the access type.

Sponsored by:	Yandex LLC
MFC after:	1 week
2012-11-14 17:36:06 +00:00
Andrey V. Elsukov
862a3e1227 SCOPE6_LOCK protects V_sid_default, no need to acquire it without
any access to V_sid_default.

Sponsored by:	Yandex LLC
MFC after:	1 week
2012-11-14 17:23:48 +00:00
Andrey V. Elsukov
f062401329 zoneid has unsigned type.
MFC after:	1 week
2012-11-14 17:14:03 +00:00
David E. O'Brien
f1e0de695c Use consistent style. 2012-11-13 01:48:00 +00:00
Andrey V. Elsukov
ffdbf9da3b Remove the recently added sysctl variable net.pfil.forward.
Instead, add protocol specific mbuf flags M_IP_NEXTHOP and
M_IP6_NEXTHOP. Use them to indicate that the mbuf's chain
contains the PACKET_TAG_IPFORWARD tag. And do a tag lookup
only when this flag is set.

Suggested by:	andre
2012-11-02 01:20:55 +00:00
Michael Tuexen
21f67da7c4 Whitespace changes due to upstream integration of SCTP changes in the
FreeBSD code base.
2012-10-29 20:47:32 +00:00
Andrey V. Elsukov
c1de64a495 Remove the IPFIREWALL_FORWARD kernel option and make possible to turn
on the related functionality in the runtime via the sysctl variable
net.pfil.forward. It is turned off by default.

Sponsored by:	Yandex LLC
Discussed with:	net@
MFC after:	2 weeks
2012-10-25 09:39:14 +00:00
Xin LI
6f56329a25 Remove __P.
Submitted by:	kevlo
Reviewed by:	md5(1)
MFC after:	2 months
2012-10-22 21:49:56 +00:00
Gleb Smirnoff
8f134647ca Switch the entire IPv4 stack to keep the IP packet header
in network byte order. Any host byte order processing is
done in local variables and host byte order values are
never[1] written to a packet.

  After this change a packet processed by the stack isn't
modified at all[2] except for TTL.

  After this change a network stack hacker doesn't need to
scratch his head trying to figure out what is the byte order
at the given place in the stack.

[1] One exception still remains. The raw sockets convert host
byte order before pass a packet to an application. Probably
this would remain for ages for compatibility.

[2] The ip_input() still subtructs header len from ip->ip_len,
but this is planned to be fixed soon.

Reviewed by:	luigi, Maxim Dounin <mdounin mdounin.ru>
Tested by:	ray, Olivier Cochard-Labbe <olivier cochard.me>
2012-10-22 21:09:03 +00:00
Alexander V. Chernikov
7994d24f0c Eliminate code checking if found IPv6 rte is dynamic. IPv6 redirects
are using (different) ND-based approach described in RFC 4861. This change
is similar to r241406 which conditionally skips the same check in IPv4.

This change is part of bigger patch eliminating rte locking.

Sponsored by:	Yandex LLC.
OK'd by:	hrs
MFC after:	2 weeks
2012-10-22 12:54:52 +00:00
Andre Oppermann
c9b652e3e8 Mechanically remove the last stray remains of spl* calls from net*/*.
They have been Noop's for a long time now.
2012-10-18 13:57:24 +00:00
Alexander V. Chernikov
3bff27cd67 Cleanup documentation: cloning route support has been removed in r186119.
MFC after:	2 weeks
2012-10-13 09:31:01 +00:00
Kevin Lo
9823d52705 Revert previous commit...
Pointyhat to:	kevlo (myself)
2012-10-10 08:36:38 +00:00
Kevin Lo
a10cee30c9 Prefer NULL over 0 for pointers 2012-10-09 08:27:40 +00:00
Andriy Gapon
ce8b4f7cb9 ip6_ipsec_output: fix a typo in r241344
Acting as a remote drone of glebius.
2012-10-08 13:45:40 +00:00
Gleb Smirnoff
23e9c6dc1e After r241245 it appeared that in_delayed_cksum(), which still expects
host byte order, was sometimes called with net byte order. Since we are
moving towards net byte order throughout the stack, the function was
converted to expect net byte order, and its consumers fixed appropriately:
  - ip_output(), ipfilter(4) not changed, since already call
    in_delayed_cksum() with header in net byte order.
  - divert(4), ng_nat(4), ipfw_nat(4) now don't need to swap byte order
    there and back.
  - mrouting code and IPv6 ipsec now need to switch byte order there and
    back, but I hope, this is temporary solution.
  - In ipsec(4) shifted switch to net byte order prior to in_delayed_cksum().
  - pf_route() catches up on r241245 changes to ip_output().
2012-10-08 08:03:58 +00:00
Gleb Smirnoff
d6d3f01e0a Merge the projects/pf/head branch, that was worked on for last six months,
into head. The most significant achievements in the new code:

 o Fine grained locking, thus much better performance.
 o Fixes to many problems in pf, that were specific to FreeBSD port.

New code doesn't have that many ifdefs and much less OpenBSDisms, thus
is more attractive to our developers.

  Those interested in details, can browse through SVN log of the
projects/pf/head branch. And for reference, here is exact list of
revisions merged:

r232043, r232044, r232062, r232148, r232149, r232150, r232298, r232330,
r232332, r232340, r232386, r232390, r232391, r232605, r232655, r232656,
r232661, r232662, r232663, r232664, r232673, r232691, r233309, r233782,
r233829, r233830, r233834, r233835, r233836, r233865, r233866, r233868,
r233873, r234056, r234096, r234100, r234108, r234175, r234187, r234223,
r234271, r234272, r234282, r234307, r234309, r234382, r234384, r234456,
r234486, r234606, r234640, r234641, r234642, r234644, r234651, r235505,
r235506, r235535, r235605, r235606, r235826, r235991, r235993, r236168,
r236173, r236179, r236180, r236181, r236186, r236223, r236227, r236230,
r236252, r236254, r236298, r236299, r236300, r236301, r236397, r236398,
r236399, r236499, r236512, r236513, r236525, r236526, r236545, r236548,
r236553, r236554, r236556, r236557, r236561, r236570, r236630, r236672,
r236673, r236679, r236706, r236710, r236718, r237154, r237155, r237169,
r237314, r237363, r237364, r237368, r237369, r237376, r237440, r237442,
r237751, r237783, r237784, r237785, r237788, r237791, r238421, r238522,
r238523, r238524, r238525, r239173, r239186, r239644, r239652, r239661,
r239773, r240125, r240130, r240131, r240136, r240186, r240196, r240212.

I'd like to thank people who participated in early testing:

Tested by:	Florian Smeets <flo freebsd.org>
Tested by:	Chekaluk Vitaly <artemrts ukr.net>
Tested by:	Ben Wilber <ben desync.com>
Tested by:	Ian FREISLICH <ianf cloudseed.co.za>
2012-09-08 06:41:54 +00:00
Mikolaj Golub
ab16a5bd08 In ip6_ctloutput() guard inp_flags modifications with INP_WLOCK.
MFC after:	2 weeks
2012-08-19 08:16:13 +00:00
Gleb Smirnoff
ea53792942 Fix races between in_lltable_prefix_free(), lla_lookup(),
llentry_free() and arptimer():

o Use callout_init_rw() for lle timeout, this allows us safely
  disestablish them.
  - This allows us to simplify the arptimer() and make it
    race safe.
o Consistently use ifp->if_afdata_lock to lock access to
  linked lists in the lle hashes.
o Introduce new lle flag LLE_LINKED, which marks an entry that
  is attached to the hash.
  - Use LLE_LINKED to avoid double unlinking via consequent
    calls to llentry_free().
  - Mark lle with LLE_DELETED via |= operation istead of =,
    so that other flags won't be lost.
o Make LLE_ADDREF(), LLE_REMREF() and LLE_FREE_LOCKED() more
  consistent and provide more informative KASSERTs.

The patch is a collaborative work of all submitters and myself.

PR:		kern/165863
Submitted by:	Andrey Zonov <andrey zonov.org>
Submitted by:	Ryan Stone <rysto32 gmail.com>
Submitted by:	Eric van Gyzen <eric_van_gyzen dell.com>
2012-08-02 13:57:49 +00:00
Gleb Smirnoff
b9aee262e5 Some more whitespace cleanup. 2012-08-01 09:00:26 +00:00
Bjoern A. Zeeb
3b43b78342 In case of IPsec he have to do delayed checksum calculations before
adding any extension header, or rather before calling into IPsec
processing as we may send the packet and not return to IPv6 output
processing here.

PR:		kern/170116
MFC After:	3 days
2012-07-31 23:34:06 +00:00
Gleb Smirnoff
ea50c13ebe Some style(9) and whitespace changes.
Together with:	Andrey Zonov <andrey zonov.org>
2012-07-31 11:31:12 +00:00
Bjoern A. Zeeb
d2ed798615 Properly apply #ifdef INET and leave a comment that we are (will) apply
delayed IPv6 checksum processing in ip6_output.c when doing IPsec.

PR:		kern/170116
MFC after:	3 days
2012-07-31 05:44:03 +00:00
Bjoern A. Zeeb
68c99a6023 Improve the should-never-hit printf to ease debugging in case we'd ever hit
it again when doing the delayed IPv6 checksum calculations.

MFC after:	3 days
2012-07-31 05:34:54 +00:00
Bjoern A. Zeeb
5dbbe4fdd2 For consistency put the IPsec comment iside the #fidef section.
MFC after:	3 days
2012-07-29 00:45:24 +00:00
Bjoern A. Zeeb
c50ffbdf4b Fix a comment that we do not have an SA yet but need to acquire one.
MFC after:	3 days
2012-07-29 00:44:41 +00:00
Michael Tuexen
5e20b91dbe Changes which improve compilation if neither INET nor INET6 is defined.
MFC after: 3 days
2012-07-15 20:16:17 +00:00
Michael Tuexen
e0e00a4d0f #ifdef INET and INET6 consistently. This also fixes a bug, where
it was done wrong.

MFC after: 3 days
2012-07-15 11:04:49 +00:00
Hiroki Sato
f6c336fe66 Remove "prefer_source" address selection option. FreeBSD has had an
implementation of RFC 3484 for this purpose for a long time and "prefer_source"
was never implemented actually.  ND6_IFF_PREFER_SOURCE macro is left intact.
2012-07-09 06:21:46 +00:00
Bjoern A. Zeeb
4018ea9a2b Implement handling of "atomic fragements" as outlined in
draft-gont-6man-ipv6-atomic-fragments to mitigate one class of
possible fragmentation-based attacks.

MFC after:	5 days
2012-07-08 15:30:24 +00:00
Bjoern A. Zeeb
8dcc26febe As mentioned in the commit message of r237571 (copied from a prototype
patch of mine) also check if the 2nd in6_setscope() failed and return
the error in that case.

MFC after:	5 days
2012-07-08 08:49:37 +00:00
Gleb Smirnoff
bf9840512a When ip_output()/ip6_output() is supplied a struct route *ro argument,
it skips FLOWTABLE lookup. However, the non-NULL ro has dual meaning
here: it may be supplied to provide route, and it may be supplied to
store and return to caller the route that ip_output()/ip6_output()
finds. In the latter case skipping FLOWTABLE lookup is pessimisation.

The difference between struct route filled by FLOWTABLE and filled
by rtalloc() family is that the former doesn't hold a reference on
its rtentry. Reference is hold by flow entry, and it is about to
be released in future. Thus, route filled by FLOWTABLE shouldn't
be passed to RTFREE() macro.

- Introduce new flag for struct route/route_in6, that marks route
  not holding a reference on rtentry.
- Introduce new macro RO_RTFREE() that cleans up a struct route
  depending on its kind.
- All callers to ip_output()/ip6_output() that do supply non-NULL
  but empty route should use RO_RTFREE() to free results of
  lookup.
- ip_output()/ip6_output() now do FLOWTABLE lookup always when
  ro->ro_rt == NULL.

Tested by:	tuexen (SCTP part)
2012-07-04 07:37:53 +00:00
Gleb Smirnoff
3df6468a2d Remove route caching from IP multicast routing code. There is no
reason to do that, and also, cached route never got unreferenced,
which meant a reference leak.

Reviewed by:	bms
2012-07-02 19:44:18 +00:00
Michael Tuexen
a8775ad93d Move common code parts to sctp_common_input_processing().
MFC after: 3 days
2012-07-02 16:44:09 +00:00
Bruce M Simpson
df7e35725b Kick the current-state report timer when a V1 group report would
be triggered.

Submitted by:	rpaulo@
MFC after:	3 days
2012-06-28 23:48:40 +00:00
Bruce M Simpson
b7d882304b Fix a typo in MLD query exponent processing.
Submitted by:	rpaulo@
MFC after:	3 days
2012-06-28 23:45:37 +00:00
Bruce M Simpson
289ca95209 In MLDv2 general query processing, do not enforce the strict check
on query origins.

Submitted by:	Gu Yong
MFC after:	3 days
2012-06-28 23:44:47 +00:00
Michael Tuexen
b1754ad17b Pass the src and dst address of a received packet explicitly around.
MFC after: 3 days
2012-06-28 16:01:08 +00:00
Xin LI
5b8f1a8676 Fix a LOR acquiring the if_afdata lock while holding an rtentry lock.
Possibly do some entra work in case we would not get into the
ifa0 != NULL paths later as we already do for the mltaddr before.

XXX We should possibly error in case in6_setscope fails.

Reference: http://lists.freebsd.org/pipermail/freebsd-net/2011-September/029829.html

Submitted by:	bz
MFC after:	1 week
2012-06-25 20:56:32 +00:00
Michael Tuexen
6dc5aabcb7 Unify sctp_input() and sctp6_input().
MFC after: 3 days
2012-06-25 19:13:43 +00:00
Michael Tuexen
39803b8c58 Whitespace cleanup.
MFC after: 3 days
2012-06-25 17:15:09 +00:00
Michael Tuexen
20cc2188f3 Pass the packet length explicitly around.
MFC after: 3 days
2012-06-24 23:12:24 +00:00
Michael Tuexen
f938425253 Do packet logging in a consistent way.
MFC after: 3 days
2012-06-24 21:25:54 +00:00
Bjoern A. Zeeb
23bc7025c4 Just add a comment to further investigate when being closer to that code
again next time.  The condition of the 2nd if() is very unlikely ever met.
2012-06-22 21:26:35 +00:00
Michael Tuexen
f30ac43257 Pass flowid explicitly through the stack instead of taking it from
the mbuf chain at different places.
While there: Fix several bugs related to VRFs.

MFC after: 3 days
2012-06-14 06:54:48 +00:00
Michael Tuexen
b36dcb9a39 Deliver IPV6_TCLASS, IPV6_HOPLIMIT and IPV6_PKTINFO cmsgs (if
requested) on IPV6 sockets, which have been marked to be not IPV6_V6ONLY,
for each received IPV4 packet.

MFC after: 3 days
2012-06-12 13:57:56 +00:00
Bjoern A. Zeeb
15cc25e9c0 Plug two interface address refcount leaks in early error return cases
in the ioctl path.

Reported by:	rpaulo
 Reviewed by:	emax
MFC after:	3 days
2012-06-05 13:27:37 +00:00
Maksim Yevmenkin
3df0e439b0 Plug reference leak.
Interface routes are refcounted as packets move through the stack,
and there's garbage collection tied to it so that route changes can
safely propagate while traffic is flowing. In our setup, we weren't
changing or deleting any routes, but the refcounting logic in
ip6_input() was wrong and caused a reference leak on every inbound
V6 packet. This eventually caused a 32bit overflow, and the resulting
0 value caused the garbage collection to run on the active route.
That then snowballed into the panic.

Reviewed by:	scottl
MFC after:	3 days
2012-06-03 07:36:59 +00:00
Michael Tuexen
a6cff10f2a Seperate SCTP checksum offloading for IPv4 and IPv6.
While there: remove some trainling whitespaces.

MFC after: 3 days
X-MFC with: 236170
2012-05-30 20:56:07 +00:00
Maksim Yevmenkin
c784f9e5eb When we return deprecated addresses, we need to reference them.
Reviewed by:	bz, scottl
MFC after:	3 days
2012-05-30 20:02:39 +00:00
Bjoern A. Zeeb
356ab07e2d It turns out that too many drivers are not only parsing the L2/3/4
headers for TSO but also for generic checksum offloading.  Ideally we
would only have one common function shared amongst all drivers, and
perhaps when updating them for IPv6 we should introduce that.
Eventually we should provide the meta information along with mbufs to
avoid (re-)parsing entirely.

To not break IPv6 (checksums and offload) and to be able to MFC the
changes without risking to hurt 3rd party drivers, duplicate the v4
framework, as other OSes have done as well.

Introduce interface capability flags for TX/RX checksum offload with
IPv6, to allow independent toggling (where possible).  Add CSUM_*_IPV6
flags for UDP/TCP over IPv6, and reserve further for SCTP, and IPv6
fragmentation.  Define CSUM_DELAY_DATA_IPV6 as we do for legacy IP and
add an alias for CSUM_DATA_VALID_IPV6.

This pretty much brings IPv6 handling in line with IPv4.
TSO is still handled in a different way and not via if_hwassist.

Update ifconfig to allow (un)setting of the new capability flags.
Update loopback to announce the new capabilities and if_hwassist flags.

Individual driver updates will have to follow, as will SCTP.

Reported by:	gallatin, dim, ..
Reviewed by:	gallatin (glanced at?)
MFC after:	3 days
X-MFC with:	r235961,235959,235958
2012-05-28 09:30:13 +00:00
Bjoern A. Zeeb
c69baa7e91 Correctly get the payload length in host byte order. While we
already plan to support >64k payload here, the IPv6 header payload
length obviously is only 16 bit and the calculations need to be right.

Reported by:	dim
Tested by:	dim
MFC after:	1 day
X-MFC:		with r235958
2012-05-26 23:58:51 +00:00
Michael Tuexen
8d9638ab33 Get rid of SCTP specific code to avoid CRC32C computations on loopback.
Just just offloading.
MFC after: 3 days
2012-05-26 09:16:33 +00:00
Bjoern A. Zeeb
39e19560d6 MFp4 bz_ipv6_fast:
Use M_ZERO with malloc rather than calling bzero() ourselves.

  Change if () panic() checks to KASSERT()s as they are only
  catching invariants in code flow but not dependent on network
  input/output.

  Move initial assigments indirecting pointers after the lock
  has been aquired.

  Passing layer boundries, reset M_PROTOFLAGS.

  Remove a NULL assignment before free.

  Sponsored by:	The FreeBSD Foundation
  Sponsored by:	iXsystems

Reviewed by:	gnn (as part of the whole)
MFC After:	3 days
2012-05-25 09:27:16 +00:00
Bjoern A. Zeeb
ae14505058 MFp4 bz_ipv6_fast:
Factor out Hop-By-Hop option processing.  It's still not heavily used,
  it reduces the footprint of ip6_input() and makes ip6_input() more
  readable.

  Sponsored by:	The FreeBSD Foundation
  Sponsored by:	iXsystems

Reviewed by:	gnn (as part of the whole)
MFC After:	3 days
2012-05-25 02:58:21 +00:00
Bjoern A. Zeeb
5aa624a803 MFp4 bz_ipv6_fast:
Defer checksum calulations on UDP6 output and respect the mbuf
  flags set by NICs having done checksum validation for us already,
  thus saving the computing time in the input path as well.

  Sponsored by:	The FreeBSD Foundation
  Sponsored by:	iXsystems

Reviewed by:	gnn (as part of the whole)
MFC After:	3 days
2012-05-25 02:19:17 +00:00
Bjoern A. Zeeb
e7b92e2769 MFp4 bz_ipv6_fast:
Add support for delayed checksum calculations in the IPv6
  output path.  We currently cannot offload to the card if we
  add extension headers (which incl. fragmentation).

  Fix two SCTP offload support copy&paste bugs: calculate
  checksums if fragmenting and no need to flag IPv4 header
  checksums in the IPv6 forwarding path.

  Sponsored by:	The FreeBSD Foundation
  Sponsored by:	iXsystems

Reviewed by:	gnn (as part of the whole)
MFC After:	3 days
2012-05-25 02:17:16 +00:00
Bjoern A. Zeeb
d3443481dc MFp4 bz_ipv6_fast:
Hide the ip6aux functions.  The only one referenced outside ip6_input.c
  is not compiled in yet (__notyet__) in route6.c (r235954).  We do have
  accessor functions that should be used.

  Sponsored by:	The FreeBSD Foundation
  Sponsored by:	iXsystems

Reviewed by:	gnn (as part of the whole)
MFC After:	3 days
X-MFC:		KPI?
2012-05-25 01:48:15 +00:00
Bjoern A. Zeeb
f8315b5fd6 MFp4 bz_ipv6_fast:
Simplify the code removing a return from an earlier else case,
  not differing from the default function return called now.

  Sponsored by:	The FreeBSD Foundation
  Sponsored by:	iXsystems

Reviewed by:	gnn (as part of the whole)
MFC After:	3 days
2012-05-25 01:45:05 +00:00
Bjoern A. Zeeb
1b53a49ad9 MFp4 bz_ipv6_fast:
We currently nowhere set IP6A_SWAP making the entire check useless
  with the current code.  Keep around but do not compile in.

  Sponsored by:	The FreeBSD Foundation
  Sponsored by:	iXsystems

Reviewed by:	gnn (as part of the whole)
MFC After:	3 days
2012-05-25 01:43:52 +00:00
Bjoern A. Zeeb
2cf62998da MFp4 bz_ipv6_fast:
No need to hold the (expensive) rt lock over (expensive) logging.

  Sponsored by:	The FreeBSD Foundation
  Sponsored by:	iXsystems

Reviewed by:	gnn (as part of the whole)
MFC After:	3 days
2012-05-25 01:42:48 +00:00
Bjoern A. Zeeb
ecade87edf MFp4 bz_ipv6_fast:
Introduce a (for now copied stripped down) in6_cksum_pseudo()
  function.  We should be able to use this from in6_cksum() but
  we should also ponder possible MD specific improvements.
  It takes an extra csum argument to allow for easy checks as
  will be done by the upper layer protocol input paths.

  Sponsored by:	The FreeBSD Foundation
  Sponsored by:	iXsystems

Reviewed by:	gnn (as part of the whole)
MFC After:	3 days
2012-05-24 18:25:09 +00:00
Bjoern A. Zeeb
2889eb8bdf MFp4 bz_ipv6_fast:
Optimize in6_cksum(), re-ordering work and limiting variable
  initialization, removing a bzero() for mostly re-initialized
  struct values, making use of the newly introduced in6_getscope(),
  as well as converting an if/panic to a KASSERT().

  Sponsored by:	The FreeBSD Foundation
  Sponsored by:	iXsystems

Reviewed by:	gnn (as part of the whole)
MFC After:	3 days
2012-05-24 18:05:10 +00:00
Bjoern A. Zeeb
0edc703a02 MFp4 bz_ipv6_fast:
Introduce in6_getscope() to allow more effective checksum
  computations without the need to copy the address to clear the
  scope.

  Sponsored by:	The FreeBSD Foundation
  Sponsored by:	iXsystems

Reviewed by:	gnn (as part of the whole)
MFC After:	3 days
2012-05-24 16:30:13 +00:00
Michael Tuexen
807aad636f Use consistent text at the begining of the files.
MFC after: 3 days
2012-05-23 11:26:28 +00:00
Marius Strobl
7195094c3c Rewrite nd6_sysctl_{d,p}rlist() to avoid misaligned accesses to char arrays
casted to structs by getting rid of these buffers entirely. In r169832, it
was tried to paper over this issue by 32-bit aligning the buffers. Depending
on compiler optimizations that still was insufficient for 64-bit architectures
with strong alignment requirements though.
While at it, add comments regarding the total lack of locking in this area.

Tested by:	bz
Reviewed by:	bz (slightly earlier version), yongari (earlier version)
MFC after:	1 week
2012-05-20 05:12:31 +00:00
Michael Tuexen
1a5b79010a Missed to commit this in r235414.
MFC after: 3 days
2012-05-13 19:25:21 +00:00
Michael Tuexen
410a3b1ef0 Use ECONNABORTED in cases where the ABORT was sent to the peer.
MFC after: 3 days
2012-05-13 16:56:16 +00:00
Michael Tuexen
a2b42326b5 Provide in the association change notification the received ABORT chunk
if case of SCTP_COMM_LOST or SCTP_CANT_STR_ASSOC as required by RFC 6458.

MFC after: 3 days
2012-05-12 20:11:35 +00:00
Gleb Smirnoff
d0e6c546a2 in6_pcblookup_local() still can return a pcb with NULL
inp_socket. To avoid panic, do not dereference inp_socket,
but obtain reuse port option from inp_flags2, like this
is done after next call to in_pcblookup_local() a few lines
down below.

Submitted by:	rwatson
2012-03-21 08:43:38 +00:00
Michael Tuexen
dea47f3999 Clean up, no functional change.
MFC after: 3 days.
2012-03-15 14:22:05 +00:00