Commit Graph

408 Commits

Author SHA1 Message Date
Andrey V. Elsukov
1d3b268c04 Requeue mbuf via netisr when we use IPSec tunnel mode and IPv6.
ipsec6_common_input_cb() uses partial copy of ip6_input() to parse
headers. But this isn't correct, when we use tunnel mode IPSec.

When we stripped outer IPv6 header from the decrypted packet, it
can become IPv4 packet and should be handled by ip_input. Also when
we use tunnel mode IPSec with IPv6 traffic, we should pass decrypted
packet with inner IPv6 header to ip6_input, it will correctly handle
it and also can decide to forward it.

The "skip" variable points to offset where payload starts. In tunnel
mode we reset it to zero after stripping the outer header. So, when
it is zero, we should requeue mbuf via netisr.

Differential Revision:	https://reviews.freebsd.org/D2306
Reviewed by:	adrian, gnn
Sponsored by:	Yandex LLC
2015-04-18 16:51:24 +00:00
Andrey V. Elsukov
1ae800e7a6 Fix handling of scoped IPv6 addresses in IPSec code.
* in ipsec_encap() embed scope zone ids into link-local addresses
  in the new IPv6 header, this helps ip6_output() disambiguate the
  scope;
* teach key_ismyaddr6() use in6_localip(). in6_localip() is less
  strict than key_sockaddrcmp(). It doesn't compare all fileds of
  struct sockaddr_in6, but it is faster and it should be safe,
  because all SA's data was checked for correctness. Also, since
  IPv6 link-local addresses in the &V_in6_ifaddrhead are stored in
  kernel-internal form, we need to embed scope zone id from SA into
  the address before calling in6_localip.
* in ipsec_common_input() take scope zone id embedded in the address
  and use it to initialize sin6_scope_id, then use this sockaddr
  structure to lookup SA, because we keep addresses in the SADB without
  embedded scope zone id.

Differential Revision:	https://reviews.freebsd.org/D2304
Reviewed by:	gnn
Sponsored by:	Yandex LLC
2015-04-18 16:46:31 +00:00
Andrey V. Elsukov
61f376155d Remove xform_ipip.c and code related to XF_IP4.
The only thing is used from this code is ipip_output() function, that does
IPIP encapsulation. Other parts of XF_IP4 code were removed in r275133.
Also it isn't possible to configure the use of XF_IP4, nor from userland
via setkey(8), nor from the kernel.

Simplify the ipip_output() function and rename it to ipsec_encap().
* move IP_DF handling from ipsec4_process_packet() into ipsec_encap();
* since ipsec_encap() called from ipsec[64]_process_packet(), it
  is safe to assume that mbuf is contiguous at least to IP header
  for used IP version. Remove all unneeded m_pullup(), m_copydata
  and related checks.
* use V_ip_defttl and V_ip6_defhlim for outer headers;
* use V_ip4_ipsec_ecn and V_ip6_ipsec_ecn for outer headers;
* move all diagnostic messages to the ipsec_encap() callers;
* simplify handling of ipsec_encap() results: if it returns non zero
  value, print diagnostic message and free mbuf.
* some style(9) fixes.

Differential Revision:	https://reviews.freebsd.org/D2303
Reviewed by:	glebius
Sponsored by:	Yandex LLC
2015-04-18 16:38:45 +00:00
Gleb Smirnoff
6d947416cc o Use new function ip_fillid() in all places throughout the kernel,
where we want to create a new IP datagram.
o Add support for RFC6864, which allows to set IP ID for atomic IP
  datagrams to any value, to improve performance. The behaviour is
  controlled by net.inet.ip.rfc6864 sysctl knob, which is enabled by
  default.
o In case if we generate IP ID, use counter(9) to improve performance.
o Gather all code related to IP ID into ip_id.c.

Differential Revision:		https://reviews.freebsd.org/D2177
Reviewed by:			adrian, cy, rpaulo
Tested by:			Emeric POUPON <emeric.poupon stormshield.eu>
Sponsored by:			Netflix
Sponsored by:			Nginx, Inc.
Relnotes:			yes
2015-04-01 22:26:39 +00:00
Andrey V. Elsukov
ba76ce40b4 Remove extra '&'. sin6 is already a pointer.
PR:		195011
MFC after:	1 week
2015-03-07 18:44:52 +00:00
Andrey V. Elsukov
47568136c5 Fix possible memory leak and several races in the IPsec policy management
code.

Resurrect the state field in the struct secpolicy, it has
IPSEC_SPSTATE_ALIVE value when security policy linked in the chain,
and IPSEC_SPSTATE_DEAD value in all other cases. This field protects
from trying to unlink one security policy several times from the different
threads.

Take additional reference in the key_flush_spd() to be sure that policy
won't be freed from the different thread while we are sending SPDEXPIRE message.

Add KEY_FREESP() call to the key_unlink() to release additional reference
that we take when use key_getsp*() functions.

Differential Revision:	https://reviews.freebsd.org/D1914
Tested by:		Emeric POUPON <emeric.poupon at stormshield dot eu>
Reviewed by:	hrs
Sponsored by:	Yandex LLC
2015-02-24 10:35:07 +00:00
Andrey V. Elsukov
b489a49fc0 key_spdget uses key_setdumpsp() without SPTREE_RLOCK held (it uses
referenced pointer to sp). Remove SPTREE_RLOCK_ASSERT from
key_setdumpsp() to fix wrong assertion.

Reported by:	Emeric POUPON
Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2015-01-27 17:46:55 +00:00
Robert Watson
2a8c860fe3 In order to reduce use of M_EXT outside of the mbuf allocator and
socket-buffer implementations, introduce a return value for MCLGET()
(and m_cljget() that underlies it) to allow the caller to avoid testing
M_EXT itself.  Update all callers to use the return value.

With this change, very few network device drivers remain aware of
M_EXT; the primary exceptions lie in mbuf-chain pretty printers for
debugging, and in a few cases, custom mbuf and cluster allocation
implementations.

NB: This is a difficult-to-test change as it touches many drivers for
which I don't have physical devices.  Instead we've gone for intensive
review, but further post-commit review would definitely be appreciated
to spot errors where changes could not easily be made mechanically,
but were largely mechanical in nature.

Differential Revision:	https://reviews.freebsd.org/D1440
Reviewed by:	adrian, bz, gnn
Sponsored by:	EMC / Isilon Storage Division
2015-01-06 12:59:37 +00:00
Andrey V. Elsukov
fe07a9d08f Fix VIMAGE build. 2014-12-25 13:38:51 +00:00
Andrey V. Elsukov
93201211e9 Rename ip4_def_policy variable to def_policy. It is used by both IPv4 and
IPv6. Initialize it only once in def_policy_init(). Remove its
initialization from key_init() and make it static.

Remove several fields from struct secpolicy:
* lock - it isn't so useful having mutex in the structure, but the only
  thing we do with it is initialization and destroying.
* state - it has only two values - DEAD and ALIVE. Instead of take a lock
  and change the state to DEAD, then take lock again in GC function and
  delete policy from the chain - keep in the chain only ALIVE policies.
* scangen - it was used in GC function to protect from sending several
  SADB_SPDEXPIRE messages for one SPD entry. Now we don't keep DEAD entries
  in the chain and there is no need to have scangen variable.

Use TAILQ to implement SPD entries chain. Use rmlock to protect access
to SPD entries chain. Protect all SP lookup with RLOCK, and use WLOCK
when we are inserting (or removing) SP entry in the chain.

Instead of using pattern "LOCK(); refcnt++; UNLOCK();", use refcount(9)
API to implement refcounting in SPD. Merge code from key_delsp() and
_key_delsp() into _key_freesp(). And use KEY_FREESP() macro in all cases
when we want to release reference or just delete SP entry.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2014-12-24 18:34:56 +00:00
Andrey V. Elsukov
a91150da31 Treat errors when retrieving security policy as policy violation.
Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2014-12-11 18:46:11 +00:00
Andrey V. Elsukov
e65ada3e3c Initialize error variable.
Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2014-12-11 18:40:56 +00:00
Andrey V. Elsukov
0275b2e369 Remove flag/flags argument from the following functions:
ipsec_getpolicybyaddr()
 ipsec4_checkpolicy()
 ip_ipsec_output()
 ip6_ipsec_output()

The only flag used here was IP_FORWARDING.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2014-12-11 18:35:34 +00:00
Andrey V. Elsukov
619764beab Remove flags and tunalready arguments from ipsec4_process_packet()
and make its prototype similar to ipsec6_process_packet.
The flags argument isn't used here, tunalready is always zero.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2014-12-11 17:34:49 +00:00
Andrey V. Elsukov
f0514a8b8a Remove now unused mtag argument from ipsec*_common_input_cb.
Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2014-12-11 17:14:49 +00:00
Andrey V. Elsukov
08537f4526 Remove code related to PACKET_TAG_IPSEC_IN_CRYPTO_DONE mbuf tag.
It isn't used in FreeBSD.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2014-12-11 17:07:21 +00:00
Andrey V. Elsukov
566cbcc82a Remove unused mtag variable.
Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2014-12-11 17:01:53 +00:00
Andrey V. Elsukov
4268212124 key_getspacq() returns holding the spacq_lock. Unlock it in all cases.
MFC after:	1 week
Sponsored by:	Yandex LLC
2014-12-07 06:47:00 +00:00
Andrey V. Elsukov
3d6aff5615 Fix style(9) and remove m_freem(NULL).
Add XXX comment, it looks incorrect, because m_pkthdr.len is already
incremented by M_PREPEND().

Sponsored by:	Yandex LLC
2014-12-04 05:02:12 +00:00
Andrey V. Elsukov
18961126cb Remove __P() macro.
Suggested by:	kevlo
Sponsored by:	Yandex LLC
2014-12-03 04:08:41 +00:00
Andrey V. Elsukov
2e84e6eac9 ANSIfy function declarations.
Sponsored by:	Yandex LLC
2014-12-03 03:50:54 +00:00
Andrey V. Elsukov
bd766f3425 Remove unneded check. No need to do m_pullup to the size that we prepended.
Sponsored by:	Yandex LLC
2014-12-02 05:28:40 +00:00
Andrey V. Elsukov
2d957916ef Remove route chaching support from ipsec code. It isn't used for some time.
* remove sa_route_union declaration and route_cache member from struct secashead;
* remove key_sa_routechange() call from ICMP and ICMPv6 code;
* simplify ip_ipsec_mtu();
* remove #include <net/route.h>;

Sponsored by:	Yandex LLC
2014-12-02 04:20:50 +00:00
Andrey V. Elsukov
1fea1b0889 Remove unused structure declarations.
Sponsored by:	Yandex LLC
2014-12-02 02:41:44 +00:00
Andrey V. Elsukov
0e23cc372d Remove unused declartations.
Sponsored by:	Yandex LLC
2014-12-02 02:32:28 +00:00
Andrey V. Elsukov
ffbf9cdeb6 Remove ip4_input() declaration. It was removed in r275133.
MFC after:	1 month
2014-11-27 00:27:39 +00:00
Andrey V. Elsukov
b05765d75f Do not use xform_ipip as decapsulation fallback.
xform_ipip was used as fallback with low priority for IPIP
encapsulated packets that were decrypted. In some cases
it can decapsulate packets, that it shouldn't. This leads to situations,
when wrong configurations are magically working. Also it can propagate
wrong ingress interface and this can break security.

Now we redesigned the IPSEC code and IPIP encapsulation is called directly
from ipsec_output, and decapsulation is done in the ipsec_input with m_striphdr.

Differential Revision:	https://reviews.freebsd.org/D1220
MFC after:	1 month
Sponsored by:	Yandex LLC
2014-11-26 17:44:49 +00:00
Andrey V. Elsukov
f9d8f66552 Count statistics for the specific address family.
MFC after:	1 week
Sponsored by:	Yandex LLC
2014-11-13 12:58:33 +00:00
Andrey V. Elsukov
612faae7a2 Strip IP header only when we act in tunnel mode.
MFC after:	1 week
Sponsored by:	Yandex LLC
2014-11-13 10:48:59 +00:00
Andrey V. Elsukov
ab2164e0b5 Remove redundant ip6_plen initialization.
MFC after:	1 week
Sponsored by:	Yandex LLC
2014-11-13 10:47:24 +00:00
Andrey V. Elsukov
67fd172767 ipsec6_process_packet is called before ip6_output fixes ip6_plen.
Update ip6_plen before bpf processing to be able see correct value.

MFC after:	1 week
Sponsored by:	Yandex LLC
2014-11-12 22:51:30 +00:00
Andrey V. Elsukov
f3c93842bf Fix ips_out_nosa errors accounting.
MFC after:	1 week
Sponsored by:	Yandex LLC
2014-11-12 14:00:49 +00:00
Andrey V. Elsukov
b6e1ad3a3a Pass mbuf to pfil processing before stripping outer IP header as it
is described in if_enc(4).

MFC after:	2 week
Sponsored by:	Yandex LLC
2014-11-07 12:05:20 +00:00
Gleb Smirnoff
6df8a71067 Remove SYSCTL_VNET_* macros, and simply put CTLFLAG_VNET where needed.
Sponsored by:	Nginx, Inc.
2014-11-07 09:39:05 +00:00
Andrey V. Elsukov
1f194d8ae1 When mode isn't explicitly specified (wildcard) and inner protocol isn't
IPv4 or IPv6, assume it is the transport mode.

Reported by:	jmg
MFC after:	1 week
Sponsored by:	Yandex LLC
2014-11-06 20:23:57 +00:00
Andrey V. Elsukov
f5196a58a0 Use in_localip() instead of handmade implementation.
MFC after:	1 week
Sponsored by:	Yandex LLC
2014-10-31 12:19:22 +00:00
John Baldwin
a4432e6bf7 Use a static callout to drive key_timehandler() instead of timeout().
While here, make key_timehandler() private to key.c.

Submitted by:	bz (2)
Tested by:	bz
2014-10-23 20:43:16 +00:00
Hans Petter Selasky
f0188618f2 Fix multiple incorrect SYSCTL arguments in the kernel:
- Wrong integer type was specified.

- Wrong or missing "access" specifier. The "access" specifier
sometimes included the SYSCTL type, which it should not, except for
procedural SYSCTL nodes.

- Logical OR where binary OR was expected.

- Properly assert the "access" argument passed to all SYSCTL macros,
using the CTASSERT macro. This applies to both static- and dynamically
created SYSCTLs.

- Properly assert the the data type for both static and dynamic
SYSCTLs. In the case of static SYSCTLs we only assert that the data
pointed to by the SYSCTL data pointer has the correct size, hence
there is no easy way to assert types in the C language outside a
C-function.

- Rewrote some code which doesn't pass a constant "access" specifier
when creating dynamic SYSCTL nodes, which is now a requirement.

- Updated "EXAMPLES" section in SYSCTL manual page.

MFC after:	3 days
Sponsored by:	Mellanox Technologies
2014-10-21 07:31:21 +00:00
Andrey V. Elsukov
a28b277a9f Do not strip outer header when operating in transport mode.
Instead requeue mbuf back to IPv4 protocol handler. If there is one extra IP-IP
encapsulation, it will be handled with tunneling interface. And thus proper
interface will be exposed into mbuf's rcvif. Also, tcpdump that listens on tunneling
interface will see packets in both directions.

Sponsored by:	Yandex LLC
2014-10-02 02:00:21 +00:00
Gleb Smirnoff
6ff8af1ca5 Mechanically convert to if_inc_counter(). 2014-09-19 10:18:14 +00:00
Kevin Lo
73d76e77b6 Change pr_output's prototype to avoid the need for explicit casts.
This is a follow up to r269699.

Phabric:	D564
Reviewed by:	jhb
2014-08-15 02:43:02 +00:00
Kevin Lo
8f5a8818f5 Merge 'struct ip6protosw' and 'struct protosw' into one. Now we have
only one protocol switch structure that is shared between ipv4 and ipv6.

Phabric:	D476
Reviewed by:	jhb
2014-08-08 01:57:15 +00:00
Gleb Smirnoff
fcc34a238c Fix style bug: rename the refcount field of m_ext to ext_cnt, to match
other members.

Sponsored by:	Nginx, Inc.
2014-07-11 14:34:29 +00:00
Marko Zec
b01e3d0802 The assumption in ipsec4_process_packet() that the payload may be
only IPv4 is wrong, so check the IP version before mangling the
payload header.
2014-07-01 08:02:25 +00:00
Bjoern A. Zeeb
6d120f909c Use IPv4 statistics in ipsec4_process_packet() rather than the IPv6
version.  This also unbreaks the NOINET6 builds after r266800.
2014-05-28 23:01:20 +00:00
VANHULLEBUS Yvan
aaf2cfc0d6 Fixed IPv4-in-IPv6 and IPv6-in-IPv4 IPsec tunnels.
For IPv6-in-IPv4, you may need to do the following command
on the tunnel interface if it is configured as IPv4 only:
ifconfig <interface> inet6 -ifdisabled

Code logic inspired from NetBSD.

PR: kern/169438
Submitted by: emeric.poupon@netasq.com
Reviewed by: fabient, ae
Obtained from: NETASQ
2014-05-28 12:45:27 +00:00
Bjoern A. Zeeb
799653be1c Only do a ports check if this is a NAT-T SA. Otherwise other
lookups providing ports may get unexpected results.

MFC After:	2 weeks
2014-05-24 09:29:23 +00:00
Andrey V. Elsukov
cd92c2e1ec Remove _IP_VHL* macros and related ifdefs.
MFC after:	1 week
2014-04-16 05:31:54 +00:00
Andrey V. Elsukov
e064642c9a The check for local address spoofing lacks ifaddr locking.
Remove these loops and use in_localip() and in6_localip()
functions instead.

MFC after:	1 week
Sponsored by:	Yandex LLC
2014-04-04 16:58:32 +00:00
Andrey V. Elsukov
b48f1835a7 Remove unused variable.
MFC after:	1 week
Sponsored by:	Yandex LLC
2014-04-04 15:57:27 +00:00
Andrey V. Elsukov
c91a8e0ebe Remove dead code.
MFC after:	1 week
Sponsored by:	Yandex LLC
2014-04-04 15:55:38 +00:00
John Baldwin
5b26ea5df3 Remove more constants related to static sysctl nodes. The MAXID constants
were primarily used to size the sysctl name list macros that were removed
in r254295.  A few other constants either did not have an associated
sysctl node, or the associated node used OID_AUTO instead.

PR:		ports/184525 (exp-run)
2014-02-25 18:44:33 +00:00
Andrey V. Elsukov
00a689c438 Initialize prot variable.
PR:		177417
MFC after:	1 week
2013-11-11 13:19:55 +00:00
Gleb Smirnoff
eedc7fd9e8 Provide includes that are needed in these files, and before were read
in implicitly via if.h -> if_var.h pollution.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-26 18:18:50 +00:00
Gleb Smirnoff
76039bc84f The r48589 promised to remove implicit inclusion of if_var.h soon. Prepare
to this event, adding if_var.h to files that do need it. Also, include
all includes that now are included due to implicit pollution via if_var.h

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-26 17:58:36 +00:00
John Baldwin
fd77bbb967 Remove most of the remaining sysctl name list macros. They were only
ever intended for use in sysctl(8) and it has not used them for many
years.

Reviewed by:	bde
Tested by:	exp-run by bdrewery
2013-08-26 18:16:05 +00:00
Andrey V. Elsukov
6794f46021 Remove the large part of struct ipsecstat. Only few fields of this
structure is used, but they already have equal fields in the struct
newipsecstat, that was introduced with FAST_IPSEC and then was merged
together with old ipsecstat structure.

This fixes kernel stack overflow on some architectures after migration
ipsecstat to PCPU counters.

Reported by:	Taku YAMAMOTO, Maciej Milewski
2013-07-23 14:14:24 +00:00
Andrey V. Elsukov
db8c087944 Migrate structs ahstat, espstat, ipcompstat, ipipstat, pfkeystat,
ipsec4stat, ipsec6stat to PCPU counters.
2013-07-09 10:08:13 +00:00
Andrey V. Elsukov
c80211e3cf Prepare network statistics structures for migration to PCPU counters.
Use uint64_t as type for all fields of structures.

Changed structures: ahstat, arpstat, espstat, icmp6_ifstat, icmp6stat,
in6_ifstat, ip6stat, ipcompstat, ipipstat, ipsecstat, mrt6stat, mrtstat,
pfkeystat, pim6stat, pimstat, rip6stat, udpstat.

Discussed with:	arch@
2013-07-09 09:32:06 +00:00
Andrey V. Elsukov
a04d64d875 Use corresponding macros to update statistics for AH, ESP, IPIP, IPCOMP,
PFKEY.

MFC after:	2 weeks
2013-06-20 11:44:16 +00:00
Andrey V. Elsukov
6659296cb0 Use IPSECSTAT_INC() and IPSEC6STAT_INC() macros for ipsec statistics
accounting.

MFC after:	2 weeks
2013-06-20 09:55:53 +00:00
Andrey V. Elsukov
9cb8d207af Use IP6STAT_INC/IP6STAT_DEC macros to update ip6 stats.
MFC after:	1 week
2013-04-09 07:11:22 +00:00
Gleb Smirnoff
dcba52a5f1 Use m_get2() + m_align() instead of hand made key_alloc_mbuf(). Code
examination shows, that although key_alloc_mbuf() could return chains,
the callers never use chains, so m_get2() should suffice.

Sponsored by:	Nginx, Inc.
2013-03-15 10:20:15 +00:00
Gleb Smirnoff
eb1b1807af Mechanically substitute flags from historic mbuf allocator with
malloc(9) flags within sys.

Exceptions:

- sys/contrib not touched
- sys/mbuf.h edited manually
2012-12-05 08:04:20 +00:00
Gleb Smirnoff
8ad458a471 Do not reduce ip_len by size of IP header in the ip_input()
before passing a packet to protocol input routines.
  For several protocols this mean that now protocol needs to
do subtraction itself, and for another half this means that
we do not need to add header length back to the packet.

  Make ip_stripoptions() to adjust ip_len, since now we enter
this function with a packet header whose ip_len does represent
length of entire packet, not payload only.
2012-10-23 08:33:13 +00:00
Gleb Smirnoff
d2bffb140e - Fix one more miss from r241913.
- Add XXX comment about necessity of the entire block,
  that "fixes up" the IP header.
2012-10-23 08:22:01 +00:00
Gleb Smirnoff
20472bce45 Couple of changes missed from r241913, which converted
IPv4 stack to network byte order.
2012-10-22 22:42:28 +00:00
Gleb Smirnoff
8f134647ca Switch the entire IPv4 stack to keep the IP packet header
in network byte order. Any host byte order processing is
done in local variables and host byte order values are
never[1] written to a packet.

  After this change a packet processed by the stack isn't
modified at all[2] except for TTL.

  After this change a network stack hacker doesn't need to
scratch his head trying to figure out what is the byte order
at the given place in the stack.

[1] One exception still remains. The raw sockets convert host
byte order before pass a packet to an application. Probably
this would remain for ages for compatibility.

[2] The ip_input() still subtructs header len from ip->ip_len,
but this is planned to be fixed soon.

Reviewed by:	luigi, Maxim Dounin <mdounin mdounin.ru>
Tested by:	ray, Olivier Cochard-Labbe <olivier cochard.me>
2012-10-22 21:09:03 +00:00
Andre Oppermann
c9b652e3e8 Mechanically remove the last stray remains of spl* calls from net*/*.
They have been Noop's for a long time now.
2012-10-18 13:57:24 +00:00
Kevin Lo
9f614af4cf Add missing break 2012-09-18 08:00:43 +00:00
VANHULLEBUS Yvan
d1b835208a In NAT-T transport mode, allow a client to open a new connection just after
closing another.
It worked only in tunnel mode before.

Submitted by:	Andreas Longwitz <longwitz@incore.de>
MFC after: 1M
2012-09-12 12:14:50 +00:00
Gleb Smirnoff
d6d3f01e0a Merge the projects/pf/head branch, that was worked on for last six months,
into head. The most significant achievements in the new code:

 o Fine grained locking, thus much better performance.
 o Fixes to many problems in pf, that were specific to FreeBSD port.

New code doesn't have that many ifdefs and much less OpenBSDisms, thus
is more attractive to our developers.

  Those interested in details, can browse through SVN log of the
projects/pf/head branch. And for reference, here is exact list of
revisions merged:

r232043, r232044, r232062, r232148, r232149, r232150, r232298, r232330,
r232332, r232340, r232386, r232390, r232391, r232605, r232655, r232656,
r232661, r232662, r232663, r232664, r232673, r232691, r233309, r233782,
r233829, r233830, r233834, r233835, r233836, r233865, r233866, r233868,
r233873, r234056, r234096, r234100, r234108, r234175, r234187, r234223,
r234271, r234272, r234282, r234307, r234309, r234382, r234384, r234456,
r234486, r234606, r234640, r234641, r234642, r234644, r234651, r235505,
r235506, r235535, r235605, r235606, r235826, r235991, r235993, r236168,
r236173, r236179, r236180, r236181, r236186, r236223, r236227, r236230,
r236252, r236254, r236298, r236299, r236300, r236301, r236397, r236398,
r236399, r236499, r236512, r236513, r236525, r236526, r236545, r236548,
r236553, r236554, r236556, r236557, r236561, r236570, r236630, r236672,
r236673, r236679, r236706, r236710, r236718, r237154, r237155, r237169,
r237314, r237363, r237364, r237368, r237369, r237376, r237440, r237442,
r237751, r237783, r237784, r237785, r237788, r237791, r238421, r238522,
r238523, r238524, r238525, r239173, r239186, r239644, r239652, r239661,
r239773, r240125, r240130, r240131, r240136, r240186, r240196, r240212.

I'd like to thank people who participated in early testing:

Tested by:	Florian Smeets <flo freebsd.org>
Tested by:	Chekaluk Vitaly <artemrts ukr.net>
Tested by:	Ben Wilber <ben desync.com>
Tested by:	Ian FREISLICH <ianf cloudseed.co.za>
2012-09-08 06:41:54 +00:00
John Baldwin
2541fcd953 Unexpand a couple of TAILQ_FOREACH()s. 2012-08-17 16:01:24 +00:00
Bjoern A. Zeeb
174b0d419b Fix a bug introduced in r221129 that leads to a panic wen using bundled
SAs.  For now allow same address family bundles.  While discovered with
ESP and AH, which does not make a lot of sense, IPcomp could be a possible
problematic candidate.

PR:		kern/164400
MFC after:	3 days
2012-07-22 17:46:05 +00:00
Bjoern A. Zeeb
81d5d46b3c Add multi-FIB IPv6 support to the core network stack supplementing
the original IPv4 implementation from r178888:

- Use RT_DEFAULT_FIB in the IPv4 implementation where noticed.
- Use rt*fib() KPI with explicit RT_DEFAULT_FIB where applicable in
  the NFS code.
- Use the new in6_rt* KPI in TCP, gif(4), and the IPv6 network stack
  where applicable.
- Split in6_rtqtimo() and in6_mtutimo() as done in IPv4 and equally
  prevent multiple initializations of callouts in in6_inithead().
- Use wrapper functions where needed to preserve the current KPI to
  ease MFCs.  Use BURN_BRIDGES to indicate expected future cleanup.
- Fix (related) comments (both technical or style).
- Convert to rtinit() where applicable and only use custom loops where
  currently not possible otherwise.
- Multicast group, most neighbor discovery address actions and faith(4)
  are locked to the default FIB.  Individual IPv6 addresses will only
  appear in the default FIB, however redirect information and prefixes
  of connected subnets are automatically propagated to all FIBs by
  default (mimicking IPv4 behavior as closely as possible).

Sponsored by:	Cisco Systems, Inc.
2012-02-03 13:08:44 +00:00
Bjoern A. Zeeb
83e521ec73 Clean up some #endif comments removing from short sections. Add #endif
comments to longer, also refining strange ones.

Properly use #ifdef rather than #if defined() where possible.  Four
#if defined(PCBGROUP) occurances (netinet and netinet6) were ignored to
avoid conflicts with eventually upcoming changes for RSS.

Reported by:	bde (most)
Reviewed by:	bde
MFC after:	3 days
2012-01-22 02:13:19 +00:00
Pawel Jakub Dawidek
d3e8e66d75 Remove unused 'plen' variable. 2011-11-26 23:57:03 +00:00
Pawel Jakub Dawidek
cdb7ebe38c The esp_max_ivlen global variable is not needed, we can just use
EALG_MAX_BLOCK_LEN.
2011-11-26 23:27:41 +00:00
Pawel Jakub Dawidek
5be4c9b9e6 malloc(M_WAITOK) never fails, so there is no need to check for NULL. 2011-11-26 23:18:19 +00:00
Pawel Jakub Dawidek
0e4fb1db44 Eliminate 'err' variable and just use existing 'error'. 2011-11-26 23:15:28 +00:00
Pawel Jakub Dawidek
0a95a08ecb Simplify code a bit. 2011-11-26 23:13:30 +00:00
Pawel Jakub Dawidek
b6a4c9acdb There is no need to virtualize esp_max_ivlen. 2011-11-26 23:11:41 +00:00
Christian Brueffer
4795003bd2 Add missing va_end() in an error case to clean up after va_start()
(already done in the non-error case).

CID:		4726
Found with:	Coverity Prevent(tm)
MFC after:	1 week
2011-10-07 21:00:26 +00:00
Bjoern A. Zeeb
e0bfbfce79 Update packet filter (pf) code to OpenBSD 4.5.
You need to update userland (world and ports) tools
to be in sync with the kernel.

Submitted by:	mlaier
Submitted by:	eri
2011-06-28 11:57:25 +00:00
VANHULLEBUS Yvan
568fac6f2e Release SP's refcount in key_get_spdbyid().
PR:	156676
Submitted by: Tobias Brunner (tobias@strongswan.org)
MFC after:	1 week
2011-05-09 13:16:21 +00:00
Bjoern A. Zeeb
db178eb816 Make IPsec compile without INET adding appropriate #ifdef checks.
Unfold the IPSEC_COMMON_INPUT_CB() macro in xform_{ah,esp,ipcomp}.c
to not need three different versions depending on INET, INET6 or both.

Mark two places preparing for not yet supported functionality with IPv6.

Reviewed by:	gnn
Sponsored by:	The FreeBSD Foundation
Sponsored by:	iXsystems
MFC after:	4 days
2011-04-27 19:28:42 +00:00
Bjoern A. Zeeb
dc49da9761 Do not allow recursive RFC3173 IPComp payload.
Reviewed by:	Tavis Ormandy (taviso cmpxchg8b.com)
MFC after:	5 days
Security:	CVE-2011-1547
2011-04-01 14:13:49 +00:00
Fabien Thomas
73171433c5 Optimisation in IPSEC(4):
- Remove contention on ISR during the crypto operation by using rwlock(9).
- Remove a second lookup of the SA in the callback.

Gain on 6 cores CPU with SHA1/AES128 can be up to 30%.

Reviewed by:	vanhu
MFC after:	1 month
2011-03-31 15:23:32 +00:00
Fabien Thomas
11d2f4df50 Fix two SA refcount:
- AH does not release the SA like in ESP/IPCOMP when handling EAGAIN
- ipsec_process_done incorrectly release the SA.

Reviewed by:	vanhu
MFC after:	1 week
2011-03-31 13:14:24 +00:00
VANHULLEBUS Yvan
442da28aeb Fixed IPsec's HMAC_SHA256-512 support to be RFC4868 compliant.
This will break interoperability with all older versions of
FreeBSD for those algorithms.

Reviewed by:	bz, gnn
Obtained from:	NETASQ
MFC after:	1w
2011-02-18 09:40:13 +00:00
Dimitry Andric
3e288e6238 After some off-list discussion, revert a number of changes to the
DPCPU_DEFINE and VNET_DEFINE macros, as these cause problems for various
people working on the affected files.  A better long-term solution is
still being considered.  This reversal may give some modules empty
set_pcpu or set_vnet sections, but these are harmless.

Changes reverted:

------------------------------------------------------------------------
r215318 | dim | 2010-11-14 21:40:55 +0100 (Sun, 14 Nov 2010) | 4 lines

Instead of unconditionally emitting .globl's for the __start_set_xxx and
__stop_set_xxx symbols, only emit them when the set_vnet or set_pcpu
sections are actually defined.

------------------------------------------------------------------------
r215317 | dim | 2010-11-14 21:38:11 +0100 (Sun, 14 Nov 2010) | 3 lines

Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout
the tree.

------------------------------------------------------------------------
r215316 | dim | 2010-11-14 21:23:02 +0100 (Sun, 14 Nov 2010) | 2 lines

Add macros to define static instances of VNET_DEFINE and DPCPU_DEFINE.
2010-11-22 19:32:54 +00:00
Dimitry Andric
31c6a0037e Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout
the tree.
2010-11-14 20:38:11 +00:00
Bjoern A. Zeeb
13a6cf24ac Announce both IPsec and UDP Encap (NAT-T) if available for
feature_present(3) checks.

This will help to run-time detect and conditionally handle specific
optionas of either feature in user space (i.e. in libipsec).

Descriptions read by:	rwatson
MFC after:		2 weeks
2010-10-30 18:52:44 +00:00
Thomas Quinot
94294cada5 Fix typo in comment. 2010-10-25 16:11:37 +00:00
Bjoern A. Zeeb
4a85b5e2ea Make the IPsec SADB embedded route cache a union to be able to hold both the
legacy and IPv6 route destination address.
Previously in case of IPv6, there was a memory overwrite due to not enough
space for the IPv6 address.

PR:		kern/122565
MFC After:	2 weeks
2010-10-23 20:35:40 +00:00
Bjoern A. Zeeb
acf456a04a Remove dead code:
assignment to a local variable not used anywhere after that.

MFC after:	3 days
2010-10-14 15:15:22 +00:00
Bjoern A. Zeeb
e046b77ee1 Style: make the asterisk go with the variable name, not the type.
MFC after:	3 days
2010-10-14 14:49:49 +00:00
Bjoern A. Zeeb
3abaa08643 MFp4 @178283:
Improve IPsec flow distribution for better netisr parallelism.
Instead of using the pointer that would have the last bits masked in a %
statement in netisr_select_cpuid() to select the queue, use the SPI.

Reviewed by:	rwatson
MFC after:	4 weeks
2010-05-24 16:27:47 +00:00
VANHULLEBUS Yvan
2e8d55c4e8 Set SA's natt_type before calling key_mature() in key_add(),
as the SA may be used as soon as key_mature() has been done.

Obtained from:	NETASQ
MFC after:	1 week
2010-05-05 08:58:58 +00:00
VANHULLEBUS Yvan
2d2a2083f7 Update SA's NAT-T stuff before calling key_mature() in key_update(),
as SA may be used as soon as key_mature() has been called.

Obtained from:	NETASQ
MFC after: 1 week
2010-05-05 08:55:26 +00:00
Bjoern A. Zeeb
82cea7e6f3 MFP4: @176978-176982, 176984, 176990-176994, 177441
"Whitspace" churn after the VIMAGE/VNET whirls.

Remove the need for some "init" functions within the network
stack, like pim6_init(), icmp_init() or significantly shorten
others like ip6_init() and nd6_init(), using static initialization
again where possible and formerly missed.

Move (most) variables back to the place they used to be before the
container structs and VIMAGE_GLOABLS (before r185088) and try to
reduce the diff to stable/7 and earlier as good as possible,
to help out-of-tree consumers to update from 6.x or 7.x to 8 or 9.

This also removes some header file pollution for putatively
static global variables.

Revert VIMAGE specific changes in ipfilter::ip_auth.c, that are
no longer needed.

Reviewed by:	jhb
Discussed with:	rwatson
Sponsored by:	The FreeBSD Foundation
Sponsored by:	CK Software GmbH
MFC after:	6 days
2010-04-29 11:52:42 +00:00
VANHULLEBUS Yvan
61f73308d4 Locks SPTREE when setting some SP entries to state DEAD.
This can prevent kernel panics when updating SPs while
there is some traffic for them.

Obtained from: NETASQ
MFC after: 1m
2010-04-15 12:40:33 +00:00
Ermal Luçi
87a25418ac Fix a logic error in ipsec code that extracts
information from the packets.

Reviewed by:	bz, mlaier
Approved by:	mlaier(mentor)
MFC after:	1 month
2010-04-02 18:15:23 +00:00
Bjoern A. Zeeb
8b7893b056 When tearing down IPsec as part of a (virtual) network stack,
do not try to free the same list twice but free both the
acquiring list and the security policy acquiring list.

Reviewed by:	anchie
MFC after:	3 days
2010-03-28 06:51:50 +00:00
Pawel Jakub Dawidek
d0d6567d4a Correct typo in comment. 2010-02-18 22:34:29 +00:00
Bjoern A. Zeeb
a77cb332ee Enable IPcomp by default.
PR:		kern/123587
MFC after:	5 days
2009-11-29 20:47:43 +00:00
Bjoern A. Zeeb
90b4c081d0 Add more statistics variables for IPcomp.
Try to version the struct in a backward compatible way.
People asked for the versioning of the stats structs in general before.

MFC after:	5 days
2009-11-29 20:37:30 +00:00
Bjoern A. Zeeb
10229cd109 Assimilate very similar input and output code paths
(no real functional change).

MFC after:	5 days
2009-11-29 17:47:49 +00:00
Bjoern A. Zeeb
afa47e51aa Only add the IPcomp header if crypto reported success and we have a lower
payload size.  Before we had always added the header, no matter if we
actually send out compressed data or not.

With this, after the opencrypto/deflate changes, IPcomp starts to work
apart from edge cases.  Leave it disabled by default until those are
fixed as well.

PR:		kern/123587
MFC after:	5 days
2009-11-29 10:53:34 +00:00
Bjoern A. Zeeb
3d34d241be Remove whitespace.
MFC after:	6 days
2009-11-28 21:42:39 +00:00
Bjoern A. Zeeb
4ff9852103 Directly send data uncompressed if the packet payload size is lower than
the compression algorithm threshold.

MFC after:	6 days
2009-11-28 21:40:57 +00:00
Bjoern A. Zeeb
023795f033 Correct a typo.
MFC after:	6 days
2009-11-28 21:01:26 +00:00
VANHULLEBUS Yvan
3e6265f14d fixed two race conditions when inserting/removing SAs via PFKey,
which can both lead to a kernel panic when adding/removing quickly
a lot of SAs.

Obtained from:	NETASQ
MFC after:	2w (MFC on 8 before 8.0 release ???)
2009-11-17 16:00:41 +00:00
VANHULLEBUS Yvan
a45bff047c Changed an IPSEC_ASSERT to a simple test, as such invalid packets
may come from outside without being discarded before.

Submitted by:	aurelien.ansel@netasq.com
Reviewed by:	bz (secteam)
Obtained from:	NETASQ
MFC after:	1m
2009-10-01 15:33:53 +00:00
VANHULLEBUS Yvan
22c125a1b6 When checking traffic endpoint's adresses families in key_spdadd(),
compare them together instead of comparing each one with respective
tunnel endpoint.

PR:	kern/138439
Submitted by:	aurelien.ansel@netasq.com
Obtained from:	NETASQ
MFC after:	1 m
2009-09-16 11:56:44 +00:00
Pawel Jakub Dawidek
fc79063e66 Silent gcc? Yeah, you wish. What I ment was to silence gcc.
Spotted by:	julian
2009-09-06 19:05:03 +00:00
Pawel Jakub Dawidek
3b02c4a3d3 Initialize state_valid and arraysize variable so gcc won't complain.
Reported by:	bz
2009-09-06 18:09:25 +00:00
Pawel Jakub Dawidek
950ab2f81e Improve code a bit by eliminating goto and having one unlock per lock. 2009-09-06 07:32:16 +00:00
Pawel Jakub Dawidek
cee0fa809b Correct typo in comment. 2009-09-06 07:30:21 +00:00
Robert Watson
77dfcdc445 Rework global locks for interface list and index management, correcting
several critical bugs, including race conditions and lock order issues:

Replace the single rwlock, ifnet_lock, with two locks, an rwlock and an
sxlock.  Either can be held to stablize the lists and indexes, but both
are required to write.  This allows the list to be held stable in both
network interrupt contexts and sleepable user threads across sleeping
memory allocations or device driver interactions.  As before, writes to
the interface list must occur from sleepable contexts.

Reviewed by:	bz, julian
MFC after:	3 days
2009-08-23 20:40:19 +00:00
Robert Watson
530c006014 Merge the remainder of kern_vimage.c and vimage.h into vnet.c and
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks.  Minor cleanups are done in the process,
and comments updated to reflect these changes.

Reviewed by:	bz
Approved by:	re (vimage blanket)
2009-08-01 19:26:27 +00:00
Robert Watson
d0728d7174 Introduce and use a sysinit-based initialization scheme for virtual
network stacks, VNET_SYSINIT:

- Add VNET_SYSINIT and VNET_SYSUNINIT macros to declare events that will
  occur each time a network stack is instantiated and destroyed.  In the
  !VIMAGE case, these are simply mapped into regular SYSINIT/SYSUNINIT.
  For the VIMAGE case, we instead use SYSINIT's to track their order and
  properties on registration, using them for each vnet when created/
  destroyed, or immediately on module load for already-started vnets.
- Remove vnet_modinfo mechanism that existed to serve this purpose
  previously, as well as its dependency scheme: we now just use the
  SYSINIT ordering scheme.
- Implement VNET_DOMAIN_SET() to allow protocol domains to declare that
  they want init functions to be called for each virtual network stack
  rather than just once at boot, compiling down to DOMAIN_SET() in the
  non-VIMAGE case.
- Walk all virtualized kernel subsystems and make use of these instead
  of modinfo or DOMAIN_SET() for init/uninit events.  In some cases,
  convert modular components from using modevent to using sysinit (where
  appropriate).  In some cases, do minor rejuggling of SYSINIT ordering
  to make room for or better manage events.

Portions submitted by:	jhb (VNET_SYSINIT), bz (cleanup)
Discussed with:		jhb, bz, julian, zec
Reviewed by:		bz
Approved by:		re (VIMAGE blanket)
2009-07-23 20:46:49 +00:00
Robert Watson
0a4747d4d0 Garbage collect vnet module registrations that have neither constructors
nor destructors, as there's no actual work to do.

In most cases, the constructors weren't needed because of the existing
protocol initialization functions run by net_init_domain() as part of
VNET_MOD_NET, or they were eliminated when support for static
initialization of virtualized globals was added.

Garbage collect dependency references to modules without constructors or
destructors, notably VNET_MOD_INET and VNET_MOD_INET6.

Reviewed by:	bz
Approved by:	re (vimage blanket)
2009-07-20 13:55:33 +00:00
Robert Watson
5ee847d3ac Reimplement and/or implement vnet list locking by replacing a mostly
unused custom mutex/condvar-based sleep locks with two locks: an
rwlock (for non-sleeping use) and sxlock (for sleeping use).  Either
acquired for read is sufficient to stabilize the vnet list, but both
must be acquired for write to modify the list.

Replace previous no-op read locking macros, used in various places
in the stack, with actual locking to prevent race conditions.  Callers
must declare when they may perform unbounded sleeps or not when
selecting how to lock.

Refactor vnet sysinits so that the vnet list and locks are initialized
before kernel modules are linked, as the kernel linker will use them
for modules loaded by the boot loader.

Update various consumers of these KPIs based on whether they may sleep
or not.

Reviewed by:	bz
Approved by:	re (kib)
2009-07-19 14:20:53 +00:00
Robert Watson
1e77c1056a Remove unused VNET_SET() and related macros; only VNET_GET() is
ever actually used.  Rename VNET_GET() to VNET() to shorten
variable references.

Discussed with:	bz, julian
Reviewed by:	bz
Approved by:	re (kensmith, kib)
2009-07-16 21:13:04 +00:00
Robert Watson
eddfbb763d Build on Jeff Roberson's linker-set based dynamic per-CPU allocator
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator.  Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...).  This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.

Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack.  Virtualized global variables are
tagged at compile-time, placing the in a special linker set, which is
loaded into a contiguous region of kernel memory.  Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of a the kernel linker.

Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy.  Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address.  When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.

This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.

Bump __FreeBSD_version and update UPDATING.

Portions submitted by:  bz
Reviewed by:            bz, zec
Discussed with:         gnn, jamie, jeff, jhb, julian, sam
Suggested by:           peter
Approved by:            re (kensmith)
2009-07-14 22:48:30 +00:00
Robert Watson
d1da0a0672 Add address list locking for in6_ifaddrhead/ia_link: as with locking
for in_ifaddrhead, we stick with an rwlock for the time being, which
we will revisit in the future with a possible move to rmlocks.

Some pieces of code require significant further reworking to be
safe from all classes of writer-writer races.

Reviewed by:	bz
MFC after:	6 weeks
2009-06-25 16:35:28 +00:00
Robert Watson
2d9cfabad4 Add a new global rwlock, in_ifaddr_lock, which will synchronize use of the
in_ifaddrhead and INADDR_HASH address lists.

Previously, these lists were used unsynchronized as they were effectively
never changed in steady state, but we've seen increasing reports of
writer-writer races on very busy VPN servers as core count has gone up
(and similar configurations where address lists change frequently and
concurrently).

For the time being, use rwlocks rather than rmlocks in order to take
advantage of their better lock debugging support.  As a result, we don't
enable ip_input()'s read-locking of INADDR_HASH until an rmlock conversion
is complete and a performance analysis has been done.  This means that one
class of reader-writer races still exists.

MFC after:      6 weeks
Reviewed by:    bz
2009-06-25 11:52:33 +00:00
Robert Watson
80af0152f3 Convert netinet6 to using queue(9) rather than hand-crafted linked lists
for the global IPv6 address list (in6_ifaddr -> in6_ifaddrhead).  Adopt
the code styles and conventions present in netinet where possible.

Reviewed by:	gnn, bz
MFC after:	6 weeks (possibly not MFCable?)
2009-06-24 21:00:25 +00:00
Bjoern A. Zeeb
57700c9e4d Move setting of ports from NAT-T below key_getsah() and actually
below key_setsaval().
Without that, the lookup for the SA had failed as we were looking for
a SA with the new, updated port numbers instead of the old ones and
were comparing the ports in key_cmpsaidx().
This makes updating the remote -> local SA on the initiator work again.

Problem introduced with:	p4 changeset 152114
2009-06-19 21:01:55 +00:00
Bjoern A. Zeeb
7654a365db Add the explicit include of vimage.h to another five .c files still
missing it.

Remove the "hidden" kernel only include of vimage.h from ip_var.h added
with the very first Vimage commit r181803 to avoid further kernel poisoning.
2009-06-17 12:44:11 +00:00
VANHULLEBUS Yvan
7b495c4494 Added support for NAT-Traversal (RFC 3948) in IPsec stack.
Thanks to (no special order) Emmanuel Dreyfus (manu@netbsd.org), Larry
Baird (lab@gta.com), gnn, bz, and other FreeBSD devs, Julien Vanherzeele
(julien.vanherzeele@netasq.com, for years of bug reporting), the PFSense
team, and all people who used / tried the NAT-T patch for years and
reported bugs, patches, etc...

X-MFC: never

Reviewed by:	bz
Approved by:	gnn(mentor)
Obtained from:	NETASQ
2009-06-12 15:44:35 +00:00
Bjoern A. Zeeb
fc228fbf49 Properly hide IPv4 only variables and functions under #ifdef INET. 2009-06-10 19:25:46 +00:00
Bjoern A. Zeeb
8d8bc0182e After r193232 rt_tables in vnet.h are no longer indirectly dependent on
the ROUTETABLES kernel option thus there is no need to include opt_route.h
anymore in all consumers of vnet.h and no longer depend on it for module
builds.

Remove the hidden include in flowtable.h as well and leave the two
explicit #includes in ip_input.c and ip_output.c.
2009-06-08 19:57:35 +00:00
Marko Zec
bc29160df3 Introduce an infrastructure for dismantling vnet instances.
Vnet modules and protocol domains may now register destructor
functions to clean up and release per-module state.  The destructor
mechanisms can be triggered by invoking "vimage -d", or a future
equivalent command which will be provided via the new jail framework.

While this patch introduces numerous placeholder destructor functions,
many of those are currently incomplete, thus leaking memory or (even
worse) failing to stop all running timers.  Many of such issues are
already known and will be incrementaly fixed over the next weeks in
smaller incremental commits.

Apart from introducing new fields in structs ifnet, domain, protosw
and vnet_net, which requires the kernel and modules to be rebuilt, this
change should have no impact on nooptions VIMAGE builds, since vnet
destructors can only be called in VIMAGE kernels.  Moreover,
destructor functions should be in general compiled in only in
options VIMAGE builds, except for kernel modules which can be safely
kldunloaded at run time.

Bump __FreeBSD_version to 800097.
Reviewed by:	bz, julian
Approved by:	rwatson, kib (re), julian (mentor)
2009-06-08 17:15:40 +00:00
Robert Watson
d4b5cae49b Reimplement the netisr framework in order to support parallel netisr
threads:

- Support up to one netisr thread per CPU, each processings its own
  workstream, or set of per-protocol queues.  Threads may be bound
  to specific CPUs, or allowed to migrate, based on a global policy.

  In the future it would be desirable to support topology-centric
  policies, such as "one netisr per package".

- Allow each protocol to advertise an ordering policy, which can
  currently be one of:

  NETISR_POLICY_SOURCE: packets must maintain ordering with respect to
    an implicit or explicit source (such as an interface or socket).

  NETISR_POLICY_FLOW: make use of mbuf flow identifiers to place work,
    as well as allowing protocols to provide a flow generation function
    for mbufs without flow identifers (m2flow).  Falls back on
    NETISR_POLICY_SOURCE if now flow ID is available.

  NETISR_POLICY_CPU: allow protocols to inspect and assign a CPU for
    each packet handled by netisr (m2cpuid).

- Provide utility functions for querying the number of workstreams
  being used, as well as a mapping function from workstream to CPU ID,
  which protocols may use in work placement decisions.

- Add explicit interfaces to get and set per-protocol queue limits, and
  get and clear drop counters, which query data or apply changes across
  all workstreams.

- Add a more extensible netisr registration interface, in which
  protocols declare 'struct netisr_handler' structures for each
  registered NETISR_ type.  These include name, handler function,
  optional mbuf to flow ID function, optional mbuf to CPU ID function,
  queue limit, and ordering policy.  Padding is present to allow these
  to be expanded in the future.  If no queue limit is declared, then
  a default is used.

- Queue limits are now per-workstream, and raised from the previous
  IFQ_MAXLEN default of 50 to 256.

- All protocols are updated to use the new registration interface, and
  with the exception of netnatm, default queue limits.  Most protocols
  register as NETISR_POLICY_SOURCE, except IPv4 and IPv6, which use
  NETISR_POLICY_FLOW, and will therefore take advantage of driver-
  generated flow IDs if present.

- Formalize a non-packet based interface between interface polling and
  the netisr, rather than having polling pretend to be two protocols.
  Provide two explicit hooks in the netisr worker for start and end
  events for runs: netisr_poll() and netisr_pollmore(), as well as a
  function, netisr_sched_poll(), to allow the polling code to schedule
  netisr execution.  DEVICE_POLLING still embeds single-netisr
  assumptions in its implementation, so for now if it is compiled into
  the kernel, a single and un-bound netisr thread is enforced
  regardless of tunable configuration.

In the default configuration, the new netisr implementation maintains
the same basic assumptions as the previous implementation: a single,
un-bound worker thread processes all deferred work, and direct dispatch
is enabled by default wherever possible.

Performance measurement shows a marginal performance improvement over
the old implementation due to the use of batched dequeue.

An rmlock is used to synchronize use and registration/unregistration
using the framework; currently, synchronized use is disabled
(replicating current netisr policy) due to a measurable 3%-6% hit in
ping-pong micro-benchmarking.  It will be enabled once further rmlock
optimization has taken place.  However, in practice, netisrs are
rarely registered or unregistered at runtime.

A new man page for netisr will follow, but since one doesn't currently
exist, it hasn't been updated.

This change is not appropriate for MFC, although the polling shutdown
handler should be merged to 7-STABLE.

Bump __FreeBSD_version.

Reviewed by:	bz
2009-06-01 10:41:38 +00:00
VANHULLEBUS Yvan
aa1faa5fc6 Lock SPTREE before parsing it in key_spddump()
Approved by:	gnn(mentor)
Obtained from:	NETASQ
MFC after:	2 weeks
2009-05-27 09:44:14 +00:00
VANHULLEBUS Yvan
cff5821a61 Only decrease refcnt once when flushing SPD entries, to
avoid flushing entries which are still used.

Approved by:	gnn(mentor)
Obtained from:	NETASQ
MFC after:	1 month
2009-05-27 09:31:50 +00:00
Bjoern A. Zeeb
db2e47925e Add sysctls to toggle the behaviour of the (former) IPSEC_FILTERTUNNEL
kernel option.
This also permits tuning of the option per virtual network stack, as
well as separately per inet, inet6.

The kernel option is left for a transition period, marked deprecated,
and will be removed soon.

Initially requested by:	phk (1 year 1 day ago)
MFC after:		4 weeks
2009-05-23 16:42:38 +00:00
Marko Zec
21ca7b57bd Change the curvnet variable from a global const struct vnet *,
previously always pointing to the default vnet context, to a
dynamically changing thread-local one.  The currvnet context
should be set on entry to networking code via CURVNET_SET() macros,
and reverted to previous state via CURVNET_RESTORE().  Recursions
on curvnet are permitted, though strongly discuouraged.

This change should have no functional impact on nooptions VIMAGE
kernel builds, where CURVNET_* macros expand to whitespace.

The curthread->td_vnet (aka curvnet) variable's purpose is to be an
indicator of the vnet context in which the current network-related
operation takes place, in case we cannot deduce the current vnet
context from any other source, such as by looking at mbuf's
m->m_pkthdr.rcvif->if_vnet, sockets's so->so_vnet etc.  Moreover, so
far curvnet has turned out to be an invaluable consistency checking
aid: it helps to catch cases when sockets, ifnets or any other
vnet-aware structures may have leaked from one vnet to another.

The exact placement of the CURVNET_SET() / CURVNET_RESTORE() macros
was a result of an empirical iterative process, whith an aim to
reduce recursions on CURVNET_SET() to a minimum, while still reducing
the scope of CURVNET_SET() to networking only operations - the
alternative would be calling CURVNET_SET() on each system call entry.
In general, curvnet has to be set in three typicall cases: when
processing socket-related requests from userspace or from within the
kernel; when processing inbound traffic flowing from device drivers
to upper layers of the networking stack, and when executing
timer-driven networking functions.

This change also introduces a DDB subcommand to show the list of all
vnet instances.

Approved by:	julian (mentor)
2009-05-05 10:56:12 +00:00
Marko Zec
5f416f8e84 Make indentation more uniform accross vnet container structs.
This is a purely cosmetic / NOP change.

Reviewed by:	bz
Approved by:	julian (mentor)
Verified by:	svn diff -x -w producing no output
2009-05-02 08:16:26 +00:00
Marko Zec
f6dfe47a14 Permit buiding kernels with options VIMAGE, restricted to only a single
active network stack instance.  Turning on options VIMAGE at compile
time yields the following changes relative to default kernel build:

1) V_ accessor macros for virtualized variables resolve to structure
fields via base pointers, instead of being resolved as fields in global
structs or plain global variables.  As an example, V_ifnet becomes:

    options VIMAGE:          ((struct vnet_net *) vnet_net)->_ifnet
    default build:           vnet_net_0._ifnet
    options VIMAGE_GLOBALS:  ifnet

2) INIT_VNET_* macros will declare and set up base pointers to be used
by V_ accessor macros, instead of resolving to whitespace:

    INIT_VNET_NET(ifp->if_vnet); becomes

    struct vnet_net *vnet_net = (ifp->if_vnet)->mod_data[VNET_MOD_NET];

3) Memory for vnet modules registered via vnet_mod_register() is now
allocated at run time in sys/kern/kern_vimage.c, instead of per vnet
module structs being declared as globals.  If required, vnet modules
can now request the framework to provide them with allocated bzeroed
memory by filling in the vmi_size field in their vmi_modinfo structures.

4) structs socket, ifnet, inpcbinfo, tcpcb and syncache_head are
extended to hold a pointer to the parent vnet.  options VIMAGE builds
will fill in those fields as required.

5) curvnet is introduced as a new global variable in options VIMAGE
builds, always pointing to the default and only struct vnet.

6) struct sysctl_oid has been extended with additional two fields to
store major and minor virtualization module identifiers, oid_v_subs and
oid_v_mod.  SYSCTL_V_* family of macros will fill in those fields
accordingly, and store the offset in the appropriate vnet container
struct in oid_arg1.
In sysctl handlers dealing with virtualized sysctls, the
SYSCTL_RESOLVE_V_ARG1() macro will compute the address of the target
variable and make it available in arg1 variable for further processing.

Unused fields in structs vnet_inet, vnet_inet6 and vnet_ipfw have
been deleted.

Reviewed by:	bz, rwatson
Approved by:	julian (mentor)
2009-04-30 13:36:26 +00:00
Bruce M Simpson
d9cc2ca298 Stub out IN6_LOOKUP_MULTI() for GETSPI requests, for now.
This has the effect that IPv6 multicast traffic won't trigger
an SPI allocation when IPSEC is in use, however, this obviously
needs to stomp on locks, and IN6_LOOKUP_MULTI() is about to go away.

This definitely needs to be revisited before 8.x is branched as
a release branch.
2009-04-29 11:15:58 +00:00
Bjoern A. Zeeb
f4ad3139bb key_gettunnel() has been unsued with FAST_IPSEC (now IPSEC).
KAME had explicit checks at one point using it, so just hide it behind
#if 0 for now until we are sure if we can completely dump it or not.

MFC after:	1 month
2009-04-27 21:04:16 +00:00
Marko Zec
bfe1aba468 Introduce vnet module registration / initialization framework with
dependency tracking and ordering enforcement.

With this change, per-vnet initialization functions introduced with
r190787 are no longer directly called from traditional initialization
functions (which cc in most cases inlined to pre-r190787 code), but are
instead registered via the vnet framework first, and are invoked only
after all prerequisite modules have been initialized.  In the long run,
this framework should allow us to both initialize and dismantle
multiple vnet instances in a correct order.

The problem this change aims to solve is how to replay the
initialization sequence of various network stack components, which
have been traditionally triggered via different mechanisms (SYSINIT,
protosw).  Note that this initialization sequence was and still can be
subtly different depending on whether certain pieces of code have been
statically compiled into the kernel, loaded as modules by boot
loader, or kldloaded at run time.

The approach is simple - we record the initialization sequence
established by the traditional mechanisms whenever vnet_mod_register()
is called for a particular vnet module.  The vnet_mod_register_multi()
variant allows a single initializer function to be registered multiple
times but with different arguments - currently this is only used in
kern/uipc_domain.c by net_add_domain() with different struct domain *
as arguments, which allows for protosw-registered initialization
routines to be invoked in a correct order by the new vnet
initialization framework.

For the purpose of identifying vnet modules, each vnet module has to
have a unique ID, which is statically assigned in sys/vimage.h.
Dynamic assignment of vnet module IDs is not supported yet.

A vnet module may specify a single prerequisite module at registration
time by filling in the vmi_dependson field of its vnet_modinfo struct
with the ID of the module it depends on.  Unless specified otherwise,
all vnet modules depend on VNET_MOD_NET (container for ifnet list head,
rt_tables etc.), which thus has to and will always be initialized
first.  The framework will panic if it detects any unresolved
dependencies before completing system initialization.  Detection of
unresolved dependencies for vnet modules registered after boot
(kldloaded modules) is not provided.

Note that the fact that each module can specify only a single
prerequisite may become problematic in the long run.  In particular,
INET6 depends on INET being already instantiated, due to TCP / UDP
structures residing in INET container.  IPSEC also depends on INET,
which will in turn additionally complicate making INET6-only kernel
configs a reality.

The entire registration framework can be compiled out by turning on the
VIMAGE_GLOBALS kernel config option.

Reviewed by:	bz
Approved by:	julian (mentor)
2009-04-11 05:58:58 +00:00
Marko Zec
1ed81b739e First pass at separating per-vnet initializer functions
from existing functions for initializing global state.

        At this stage, the new per-vnet initializer functions are
	directly called from the existing global initialization code,
	which should in most cases result in compiler inlining those
	new functions, hence yielding a near-zero functional change.

        Modify the existing initializer functions which are invoked via
        protosw, like ip_init() et. al., to allow them to be invoked
	multiple times, i.e. per each vnet.  Global state, if any,
	is initialized only if such functions are called within the
	context of vnet0, which will be determined via the
	IS_DEFAULT_VNET(curvnet) check (currently always true).

        While here, V_irtualize a few remaining global UMA zones
        used by net/netinet/netipsec networking code.  While it is
        not yet clear to me or anybody else whether this is the right
        thing to do, at this stage this makes the code more readable,
        and makes it easier to track uncollected UMA-zone-backed
        objects on vnet removal.  In the long run, it's quite possible
        that some form of shared use of UMA zone pools among multiple
        vnets should be considered.

	Bump __FreeBSD_version due to changes in layout of structs
	vnet_ipfw, vnet_inet and vnet_net.

Approved by:	julian (mentor)
2009-04-06 22:29:41 +00:00
VANHULLEBUS Yvan
485cf78203 Fixed comments so it stays in 80 chars by line
with hard tabs of 8 chars....

Approved by:	gnn(mentor)
2009-03-23 16:20:39 +00:00
VANHULLEBUS Yvan
039b1e5daa Spelling fix in a comment
Approved by:	gnn(mentor)
2009-03-20 09:12:01 +00:00
VANHULLEBUS Yvan
d802e23c60 Fixed style for some comments
Approved by:	gnn(mentor)
2009-03-19 15:50:45 +00:00
VANHULLEBUS Yvan
bf777da6b1 Fixed style for some comments
Approved by:	gnn(mentor)
2009-03-19 15:44:13 +00:00