/*-
 * Copyright (c) 1982, 1986, 1988, 1990, 1993, 1995
 *	The Regents of the University of California.
 * Copyright (c) 2008 Robert N. M. Watson
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
 * Copyright (c) 2010-2011 Juniper Networks, Inc.
 * Copyright (c) 2014 Kevin Lo
 * All rights reserved.
 *
 * Portions of this software were developed by Robert N. M. Watson under
 * contract to Juniper Networks, Inc.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 * 4. Neither the name of the University nor the names of its contributors
 *    may be used to endorse or promote products derived from this software
 *    without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
 * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
 * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 *
 *	@(#)udp_usrreq.c	8.6 (Berkeley) 5/23/95
 */

#include <sys/cdefs.h>
__FBSDID("$FreeBSD$");

#include "opt_ipfw.h"
#include "opt_inet.h"
#include "opt_inet6.h"
#include "opt_ipsec.h"

#include <sys/param.h>
#include <sys/domain.h>
#include <sys/eventhandler.h>
#include <sys/jail.h>
#include <sys/kernel.h>
#include <sys/lock.h>
#include <sys/malloc.h>
#include <sys/mbuf.h>
#include <sys/priv.h>
#include <sys/proc.h>
#include <sys/protosw.h>
#include <sys/sdt.h>
#include <sys/signalvar.h>
#include <sys/socket.h>
#include <sys/socketvar.h>
#include <sys/sx.h>
#include <sys/sysctl.h>
#include <sys/syslog.h>
#include <sys/systm.h>

#include <vm/uma.h>

#include <net/if.h>
#include <net/if_var.h>
#include <net/route.h>

#include <netinet/in.h>
#include <netinet/in_kdtrace.h>
#include <netinet/in_pcb.h>
#include <netinet/in_systm.h>
#include <netinet/in_var.h>
#include <netinet/ip.h>
#ifdef INET6
#include <netinet/ip6.h>
#endif
#include <netinet/ip_icmp.h>
#include <netinet/icmp_var.h>
#include <netinet/ip_var.h>
#include <netinet/ip_options.h>
#ifdef INET6
#include <netinet6/ip6_var.h>
#endif
#include <netinet/udp.h>
#include <netinet/udp_var.h>
#include <netinet/udplite.h>

#ifdef IPSEC
#include <netipsec/ipsec.h>
Added support for NAT-Traversal (RFC 3948) in IPsec stack.
Thanks to (no special order) Emmanuel Dreyfus (manu@netbsd.org), Larry
Baird (lab@gta.com), gnn, bz, and other FreeBSD devs, Julien Vanherzeele
(julien.vanherzeele@netasq.com, for years of bug reporting), the PFSense
team, and all people who used / tried the NAT-T patch for years and
reported bugs, patches, etc...
X-MFC: never
Reviewed by: bz
Approved by: gnn(mentor)
Obtained from: NETASQ
2009-06-12 15:44:35 +00:00
#include <netipsec/esp.h>
#endif

#include <machine/in_cksum.h>

#include <security/mac/mac_framework.h>

/*
 * UDP and UDP-Lite protocols implementation.
 * Per RFC 768, August, 1980.
 * Per RFC 3828, July, 2004.
 */

/*
 * BSD 4.2 defaulted the udp checksum to be off. Turning off udp checksums
 * removes the only data integrity mechanism for packets and malformed
 * packets that would otherwise be discarded due to bad checksums, and may
 * cause problems (especially for NFS data blocks).
 */
VNET_DEFINE(int, udp_cksum) = 1;
SYSCTL_VNET_INT(_net_inet_udp, UDPCTL_CHECKSUM, checksum, CTLFLAG_RW,
    &VNET_NAME(udp_cksum), 0, "compute udp checksum");

int	udp_log_in_vain = 0;
SYSCTL_INT(_net_inet_udp, OID_AUTO, log_in_vain, CTLFLAG_RW,
    &udp_log_in_vain, 0, "Log all incoming UDP packets");

VNET_DEFINE(int, udp_blackhole) = 0;
Build on Jeff Roberson's linker-set based dynamic per-CPU allocator
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator. Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...). This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.
Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack. Virtualized global variables are
tagged at compile-time, placing them in a special linker set, which is
loaded into a contiguous region of kernel memory. Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of the kernel linker.
Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy. Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address. When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.
This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.
Bump __FreeBSD_version and update UPDATING.
Portions submitted by: bz
Reviewed by: bz, zec
Discussed with: gnn, jamie, jeff, jhb, julian, sam
Suggested by: peter
Approved by: re (kensmith)
2009-07-14 22:48:30 +00:00
SYSCTL_VNET_INT(_net_inet_udp, OID_AUTO, blackhole, CTLFLAG_RW,
    &VNET_NAME(udp_blackhole), 0,
    "Do not send port unreachables for refused connects");

u_long	udp_sendspace = 9216;		/* really max datagram size */
					/* 40 1K datagrams */
SYSCTL_ULONG(_net_inet_udp, UDPCTL_MAXDGRAM, maxdgram, CTLFLAG_RW,
    &udp_sendspace, 0, "Maximum outgoing UDP datagram size");

u_long	udp_recvspace = 40 * (1024 +
#ifdef INET6
				      sizeof(struct sockaddr_in6)
#else
				      sizeof(struct sockaddr_in)
#endif
				      );

SYSCTL_ULONG(_net_inet_udp, UDPCTL_RECVSPACE, recvspace, CTLFLAG_RW,
    &udp_recvspace, 0, "Maximum space for incoming UDP datagrams");
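
As a rough illustration of the sizing above: the default receive space budgets 40 datagrams of 1 KB each, plus one socket address per datagram. The user-space sketch below reproduces that arithmetic for the INET6 and INET-only cases. It is not kernel code, and default_udp_recvspace() is a hypothetical helper name introduced only for this example.

```c
#include <stddef.h>
#include <netinet/in.h>

/*
 * Hypothetical helper (not part of the kernel): mirrors the expression
 * 40 * (1024 + sizeof(socket address)) used for udp_recvspace.
 */
static unsigned long
default_udp_recvspace(int with_inet6)
{
	size_t addrlen;

	/* With INET6 the larger sockaddr_in6 is budgeted per datagram. */
	addrlen = with_inet6 ? sizeof(struct sockaddr_in6) :
	    sizeof(struct sockaddr_in);
	return (40UL * (1024UL + addrlen));
}
```

Calling default_udp_recvspace(1) gives the value a kernel built with INET6 would start from, and default_udp_recvspace(0) gives the INET-only value. The exact numbers depend on the platform's sockaddr sizes.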
VNET_DEFINE(struct inpcbhead, udb);		/* from udp_var.h */
VNET_DEFINE(struct inpcbinfo, udbinfo);
VNET_DEFINE(struct inpcbhead, ulitecb);
VNET_DEFINE(struct inpcbinfo, ulitecbinfo);
static VNET_DEFINE(uma_zone_t, udpcb_zone);
#define	V_udpcb_zone			VNET(udpcb_zone)

#ifndef UDBHASHSIZE
#define	UDBHASHSIZE	128
#endif

VNET_PCPUSTAT_DEFINE(struct udpstat, udpstat);		/* from udp_var.h */
VNET_PCPUSTAT_SYSINIT(udpstat);
SYSCTL_VNET_PCPUSTAT(_net_inet_udp, UDPCTL_STATS, stats, struct udpstat,
    udpstat, "UDP statistics (struct udpstat, netinet/udp_var.h)");

#ifdef VIMAGE
VNET_PCPUSTAT_SYSUNINIT(udpstat);
#endif /* VIMAGE */
#ifdef INET
static void	udp_detach(struct socket *so);
static int	udp_output(struct inpcb *, struct mbuf *, struct sockaddr *,
		    struct mbuf *, struct thread *);
#endif

#ifdef IPSEC
#ifdef IPSEC_NAT_T
#define	UF_ESPINUDP_ALL	(UF_ESPINUDP_NON_IKE|UF_ESPINUDP)
#ifdef INET
static struct mbuf *udp4_espdecap(struct inpcb *, struct mbuf *, int);
#endif
#endif /* IPSEC_NAT_T */
#endif /* IPSEC */

static void
udp_zone_change(void *tag)
{

Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).
This is the first in a series of commits over the course
of the next few weeks.
Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.
We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.
Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch
2008-08-17 23:27:27 +00:00
	uma_zone_set_max(V_udbinfo.ipi_zone, maxsockets);
	uma_zone_set_max(V_udpcb_zone, maxsockets);
}

static int
udp_inpcb_init(void *mem, int size, int flags)
{
	struct inpcb *inp;

	inp = mem;
	INP_LOCK_INIT(inp, "inp", "udpinp");
	return (0);
}

static int
udplite_inpcb_init(void *mem, int size, int flags)
{
	struct inpcb *inp;

	inp = mem;
	INP_LOCK_INIT(inp, "inp", "udpliteinp");
	return (0);
}

void
udp_init(void)
{

	in_pcbinfo_init(&V_udbinfo, "udp", &V_udb, UDBHASHSIZE, UDBHASHSIZE,
Implement a CPU-affine TCP and UDP connection lookup data structure,
struct inpcbgroup. pcbgroups, or "connection groups", supplement the
existing inpcbinfo connection hash table, which when pcbgroups are
enabled, might now be thought of more usefully as a per-protocol
4-tuple reservation table.
Connections are assigned to connection groups based on a hash of their
4-tuple; wildcard sockets require special handling, and are members
of all connection groups. During a connection lookup, a
per-connection group lock is employed rather than the global pcbinfo
lock. By aligning connection groups with input path processing,
connection groups take on an effective CPU affinity, especially when
aligned with RSS work placement (see a forthcoming commit for
details). This eliminates cache line migration associated with
global, protocol-layer data structures in steady state TCP and UDP
processing (with the exception of protocol-layer statistics; further
commit to follow).
Elements of this approach were inspired by Willman, Rixner, and Cox's
2006 USENIX paper, "An Evaluation of Network Stack Parallelization
Strategies in Modern Operating Systems". However, there are also
significant differences: we maintain the inpcb lock, rather than using
the connection group lock for per-connection state.
Likewise, the focus of this implementation is alignment with NIC
packet distribution strategies such as RSS, rather than pure software
strategies. Despite that focus, software distribution is supported
through the parallel netisr implementation, and works well in
configurations where the number of hardware threads is greater than
the number of NIC input queues, such as in the RMI XLR threaded MIPS
architecture.
Another important difference is the continued maintenance of existing
hash tables as "reservation tables" -- these are useful both to
distinguish the resource allocation aspect of protocol name management
and the more common-case lookup aspect. In configurations where
connection tables are aligned with hardware hashes, it is desirable to
use the traditional lookup tables for loopback or encapsulated traffic
rather than take the expense of hardware hashes that are hard to
implement efficiently in software (such as RSS Toeplitz).
Connection group support is enabled by compiling "options PCBGROUP"
into your kernel configuration; for the time being, this is an
experimental feature, and hence is not enabled by default.
Subject to the limited MFCability of change dependencies in inpcb,
and its change to the inpcbinfo init function signature, this change
in principle could be merged to FreeBSD 8.x.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-06-06 12:55:02 +00:00
	    "udp_inpcb", udp_inpcb_init, NULL, UMA_ZONE_NOFREE,
	    IPI_HASHFIELDS_2TUPLE);
	V_udpcb_zone = uma_zcreate("udpcb", sizeof(struct udpcb),
	    NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, UMA_ZONE_NOFREE);
	uma_zone_set_max(V_udpcb_zone, maxsockets);
	uma_zone_set_warning(V_udpcb_zone, "kern.ipc.maxsockets limit reached");
	EVENTHANDLER_REGISTER(maxsockets_change, udp_zone_change, NULL,
	    EVENTHANDLER_PRI_ANY);
}

void
udplite_init(void)
{

	in_pcbinfo_init(&V_ulitecbinfo, "udplite", &V_ulitecb, UDBHASHSIZE,
	    UDBHASHSIZE, "udplite_inpcb", udplite_inpcb_init, NULL,
	    UMA_ZONE_NOFREE, IPI_HASHFIELDS_2TUPLE);
}

/*
 * Kernel module interface for updating udpstat. The argument is an index
 * into udpstat treated as an array of u_long. While this encodes the
 * general layout of udpstat into the caller, it doesn't encode its location,
 * so that future changes to add, for example, per-CPU stats support won't
 * cause binary compatibility problems for kernel modules.
 */
void
kmod_udpstat_inc(int statnum)
{

	counter_u64_add(VNET(udpstat)[statnum], 1);
}
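
The indexing scheme described in the comment above can be pictured with a small user-space sketch. It is an analogue under stated assumptions, not the kernel implementation: struct demo_stat, DEMO_STAT_INDEX, DEMO_STAT_COUNT and demo_stat_inc() are hypothetical names, and plain uint64_t fields stand in for the per-CPU counters used by the real code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stats structure: every field has the same counter type. */
struct demo_stat {
	uint64_t ds_ipackets;	/* total input packets */
	uint64_t ds_hdrops;	/* packets shorter than header */
	uint64_t ds_badsum;	/* checksum errors */
};

/* Turn a field name into an array index, as a caller would. */
#define	DEMO_STAT_INDEX(field) \
	(offsetof(struct demo_stat, field) / sizeof(uint64_t))
#define	DEMO_STAT_COUNT	(sizeof(struct demo_stat) / sizeof(uint64_t))

static struct demo_stat demo_stats;

/*
 * Increment one counter by index. The caller knows the layout (which
 * index means what) but never touches the storage itself.
 */
static void
demo_stat_inc(size_t statnum)
{
	uint64_t *counter;

	assert(statnum < DEMO_STAT_COUNT);
	counter = (uint64_t *)(void *)((char *)&demo_stats +
	    statnum * sizeof(uint64_t));
	(*counter)++;
}
```

A caller would write demo_stat_inc(DEMO_STAT_INDEX(ds_badsum)). Because only the index crosses the interface, the storage behind demo_stats could later move (for example to per-CPU memory) without changing callers.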

int
udp_newudpcb(struct inpcb *inp)
{
	struct udpcb *up;

	up = uma_zalloc(V_udpcb_zone, M_NOWAIT | M_ZERO);
	if (up == NULL)
		return (ENOBUFS);
	inp->inp_ppcb = up;
	return (0);
}

void
udp_discardcb(struct udpcb *up)
{

	uma_zfree(V_udpcb_zone, up);
}
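
The allocate-and-attach pattern in udp_newudpcb() and udp_discardcb() can be mimicked in user space. This is only a sketch: calloc() and free() stand in for uma_zalloc() with M_NOWAIT | M_ZERO and for uma_zfree(), and the demo_ types and functions are hypothetical names that do not exist in the kernel.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for struct udpcb and struct inpcb. */
struct demo_udpcb {
	int	u_flags;	/* zeroed on allocation */
};

struct demo_inpcb {
	void	*inp_ppcb;	/* per-protocol control block */
};

/* Allocate a zeroed control block and attach it to the pcb. */
static int
demo_newudpcb(struct demo_inpcb *inp)
{
	struct demo_udpcb *up;

	up = calloc(1, sizeof(*up));
	if (up == NULL)
		return (ENOBUFS);	/* same error the kernel path returns */
	inp->inp_ppcb = up;
	return (0);
}

/* Release a control block; the caller clears its own pointer. */
static void
demo_discardcb(struct demo_udpcb *up)
{

	free(up);
}
```

As in the original, failure is reported through the return value rather than by blocking, and the discard routine does not reset inp_ppcb, so that remains the caller's job.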
Introduce an infrastructure for dismantling vnet instances.
Vnet modules and protocol domains may now register destructor
functions to clean up and release per-module state. The destructor
mechanisms can be triggered by invoking "vimage -d", or a future
equivalent command which will be provided via the new jail framework.
While this patch introduces numerous placeholder destructor functions,
many of those are currently incomplete, thus leaking memory or (even
worse) failing to stop all running timers. Many of such issues are
already known and will be incrementally fixed over the next weeks in
smaller incremental commits.
Apart from introducing new fields in structs ifnet, domain, protosw
and vnet_net, which requires the kernel and modules to be rebuilt, this
change should have no impact on nooptions VIMAGE builds, since vnet
destructors can only be called in VIMAGE kernels. Moreover,
destructor functions should be in general compiled in only in
options VIMAGE builds, except for kernel modules which can be safely
kldunloaded at run time.
Bump __FreeBSD_version to 800097.
Reviewed by: bz, julian
Approved by: rwatson, kib (re), julian (mentor)
2009-06-08 17:15:40 +00:00
#ifdef VIMAGE
void
udp_destroy(void)
{

	in_pcbinfo_destroy(&V_udbinfo);
	uma_zdestroy(V_udpcb_zone);
}

void
udplite_destroy(void)
{

	in_pcbinfo_destroy(&V_ulitecbinfo);
}
#endif
|
|
|
|
|
2011-04-30 11:17:00 +00:00
|
|
|
#ifdef INET
|
2007-07-10 09:30:46 +00:00
|
|
|
/*
|
|
|
|
* Subroutine of udp_input(), which appends the provided mbuf chain to the
|
|
|
|
* passed pcb/socket. The caller must provide a sockaddr_in via udp_in that
|
|
|
|
* contains the source address. If the socket ends up being an IPv6 socket,
|
|
|
|
* udp_append() will convert to a sockaddr_in6 before passing the address
|
|
|
|
* into the socket code.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
udp_append(struct inpcb *inp, struct ip *ip, struct mbuf *n, int off,
|
|
|
|
struct sockaddr_in *udp_in)
|
|
|
|
{
|
|
|
|
struct sockaddr *append_sa;
|
|
|
|
struct socket *so;
|
|
|
|
struct mbuf *opts = 0;
|
|
|
|
#ifdef INET6
|
|
|
|
struct sockaddr_in6 udp_in6;
|
Added support for NAT-Traversal (RFC 3948) in IPsec stack.
Thanks to (no special order) Emmanuel Dreyfus (manu@netbsd.org), Larry
Baird (lab@gta.com), gnn, bz, and other FreeBSD devs, Julien Vanherzeele
(julien.vanherzeele@netasq.com, for years of bug reporting), the PFSense
team, and all people who used / tried the NAT-T patch for years and
reported bugs, patches, etc...
X-MFC: never
Reviewed by: bz
Approved by: gnn(mentor)
Obtained from: NETASQ
2009-06-12 15:44:35 +00:00
|
|
|
#endif
|
|
|
|
struct udpcb *up;
|
2007-07-10 09:30:46 +00:00
|
|
|
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need to acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. The new lookup flags,
which supplement the existing INPLOOKUP_WILDCARD flag, are:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
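The commit message above states that callers of in_pcblookup() must pass exactly one of the two lock flags. A minimal userland sketch of such a check follows. The SK_* names and their numeric values are hypothetical and chosen only for illustration; they are not the kernel's INPLOOKUP_* definitions, and the kernel's own validation may look different.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values for this sketch only; not the kernel's. */
#define	SK_LOOKUP_WILDCARD	0x01
#define	SK_LOOKUP_RLOCKPCB	0x02
#define	SK_LOOKUP_WLOCKPCB	0x04
#define	SK_LOOKUP_LOCKMASK	(SK_LOOKUP_RLOCKPCB | SK_LOOKUP_WLOCKPCB)

/*
 * Return true when exactly one of the read-lock and write-lock flags
 * is set.  Other flags, such as the wildcard flag, are ignored.
 */
static bool
sk_lookup_flags_valid(int flags)
{
	int lock = flags & SK_LOOKUP_LOCKMASK;

	/* Non-zero with a single bit set means exactly one lock flag. */
	return (lock != 0 && (lock & (lock - 1)) == 0);
}
```

A caller asking for a wildcard lookup with a write-locked result would pass both the wildcard flag and the write-lock flag, which this check accepts; passing neither or both lock flags is rejected.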
|
|
|
INP_LOCK_ASSERT(inp);
|
2007-07-10 09:30:46 +00:00
|
|
|
|
2011-04-14 10:40:57 +00:00
|
|
|
/*
|
|
|
|
* Engage the tunneling protocol.
|
|
|
|
*/
|
|
|
|
up = intoudpcb(inp);
|
|
|
|
if (up->u_tun_func != NULL) {
|
|
|
|
(*up->u_tun_func)(n, off, inp);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (n == NULL)
|
|
|
|
return;
|
|
|
|
|
|
|
|
off += sizeof(struct udphdr);
|
|
|
|
|
2007-07-10 09:30:46 +00:00
|
|
|
#ifdef IPSEC
|
|
|
|
/* Check AH/ESP integrity. */
|
|
|
|
if (ipsec4_in_reject(n, inp)) {
|
|
|
|
m_freem(n);
|
2013-07-23 14:14:24 +00:00
|
|
|
IPSECSTAT_INC(ips_in_polvio);
|
2007-07-10 09:30:46 +00:00
|
|
|
return;
|
|
|
|
}
|
2009-06-12 15:44:35 +00:00
|
|
|
#ifdef IPSEC_NAT_T
|
|
|
|
up = intoudpcb(inp);
|
|
|
|
KASSERT(up != NULL, ("%s: udpcb NULL", __func__));
|
|
|
|
if (up->u_flags & UF_ESPINUDP_ALL) { /* IPSec UDP encaps. */
|
|
|
|
n = udp4_espdecap(inp, n, off);
|
|
|
|
if (n == NULL) /* Consumed. */
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
#endif /* IPSEC_NAT_T */
|
2007-07-10 09:30:46 +00:00
|
|
|
#endif /* IPSEC */
|
|
|
|
#ifdef MAC
|
2007-10-24 19:04:04 +00:00
|
|
|
if (mac_inpcb_check_deliver(inp, n) != 0) {
|
2007-07-10 09:30:46 +00:00
|
|
|
m_freem(n);
|
|
|
|
return;
|
|
|
|
}
|
2011-04-30 11:17:00 +00:00
|
|
|
#endif /* MAC */
|
2007-07-10 09:30:46 +00:00
|
|
|
if (inp->inp_flags & INP_CONTROLOPTS ||
|
|
|
|
inp->inp_socket->so_options & (SO_TIMESTAMP | SO_BINTIME)) {
|
|
|
|
#ifdef INET6
|
2008-05-24 15:20:48 +00:00
|
|
|
if (inp->inp_vflag & INP_IPV6)
|
2008-08-16 06:39:18 +00:00
|
|
|
(void)ip6_savecontrol_v4(inp, n, &opts, NULL);
|
2008-05-24 15:20:48 +00:00
|
|
|
else
|
2011-04-30 11:17:00 +00:00
|
|
|
#endif /* INET6 */
|
2007-07-10 09:30:46 +00:00
|
|
|
ip_savecontrol(inp, &opts, ip, n);
|
|
|
|
}
|
|
|
|
#ifdef INET6
|
|
|
|
if (inp->inp_vflag & INP_IPV6) {
|
|
|
|
bzero(&udp_in6, sizeof(udp_in6));
|
|
|
|
udp_in6.sin6_len = sizeof(udp_in6);
|
|
|
|
udp_in6.sin6_family = AF_INET6;
|
|
|
|
in6_sin_2_v4mapsin6(udp_in, &udp_in6);
|
|
|
|
append_sa = (struct sockaddr *)&udp_in6;
|
|
|
|
} else
|
2011-04-30 11:17:00 +00:00
|
|
|
#endif /* INET6 */
|
2007-07-10 09:30:46 +00:00
|
|
|
append_sa = (struct sockaddr *)udp_in;
|
|
|
|
m_adj(n, off);
|
|
|
|
|
|
|
|
so = inp->inp_socket;
|
|
|
|
SOCKBUF_LOCK(&so->so_rcv);
|
|
|
|
if (sbappendaddr_locked(&so->so_rcv, append_sa, n, opts) == 0) {
|
|
|
|
SOCKBUF_UNLOCK(&so->so_rcv);
|
|
|
|
m_freem(n);
|
|
|
|
if (opts)
|
|
|
|
m_freem(opts);
|
2009-04-12 11:42:40 +00:00
|
|
|
UDPSTAT_INC(udps_fullsock);
|
2007-07-10 09:30:46 +00:00
|
|
|
} else
|
|
|
|
sorwakeup_locked(so);
|
|
|
|
}
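The comment above udp_append() says that an IPv4 source address is converted to a sockaddr_in6 when the receiving socket is an IPv6 socket. The sketch below shows what a v4-mapped conversion of that kind can look like in userland. sk_sin_to_v4mapped() is a hypothetical helper written for illustration; it is not the kernel's in6_sin_2_v4mapsin6(), and it leaves out the BSD-specific sin6_len field so that it also builds on systems without it.

```c
#include <assert.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>

/*
 * Hypothetical helper: build the v4-mapped IPv6 form (::ffff:a.b.c.d)
 * of an IPv4 socket address.  Port and address stay in network order.
 */
static void
sk_sin_to_v4mapped(const struct sockaddr_in *sin, struct sockaddr_in6 *sin6)
{
	memset(sin6, 0, sizeof(*sin6));
	sin6->sin6_family = AF_INET6;
	sin6->sin6_port = sin->sin_port;
	/* Bytes 0..9 stay zero, bytes 10..11 are 0xff. */
	sin6->sin6_addr.s6_addr[10] = 0xff;
	sin6->sin6_addr.s6_addr[11] = 0xff;
	/* Bytes 12..15 carry the IPv4 address unchanged. */
	memcpy(&sin6->sin6_addr.s6_addr[12], &sin->sin_addr.s_addr, 4);
}
```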
|
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
void
|
2007-02-20 10:13:11 +00:00
|
|
|
udp_input(struct mbuf *m, int off)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
1999-12-07 17:39:16 +00:00
|
|
|
int iphlen = off;
|
2007-02-20 10:13:11 +00:00
|
|
|
struct ip *ip;
|
|
|
|
struct udphdr *uh;
|
Import rewrite of IPv4 socket multicast layer to support source-specific
and protocol-independent host mode multicast. The code is written to
accommodate IPv6, IGMPv3 and MLDv2 with only a little additional work.
This change only pertains to FreeBSD's use as a multicast end-station and
does not concern multicast routing; for an IGMPv3/MLDv2 router
implementation, consider the XORP project.
The work is based on Wilbert de Graaf's IGMPv3 code drop for FreeBSD 4.6,
which is available at: http://www.kloosterhof.com/wilbert/igmpv3.html
Summary
* IPv4 multicast socket processing is now moved out of ip_output.c
into a new module, in_mcast.c.
* The in_mcast.c module implements the IPv4 legacy any-source API in
terms of the protocol-independent source-specific API.
* Source filters are lazily allocated, as the common case does not use them.
They are part of per inpcb state and are covered by the inpcb lock.
* struct ip_mreqn is now supported to allow applications to specify
multicast joins by interface index in the legacy IPv4 any-source API.
* In UDP, an incoming multicast datagram only requires that the source
port matches the 4-tuple if the socket was already bound by source port.
An unbound socket SHOULD be able to receive multicasts sent from an
ephemeral source port.
* The UDP socket multicast filter mode defaults to exclusive, that is,
sources present in the per-socket list will be blocked from delivery.
* The RFC 3678 userland functions have been added to libc: setsourcefilter,
getsourcefilter, setipv4sourcefilter, getipv4sourcefilter.
* Definitions for IGMPv3 are merged but not yet used.
* struct sockaddr_storage is now referenced from <netinet/in.h>. It
is therefore defined there if not already declared in the same way
as for the C99 types.
* The RFC 1724 hack (specify 0.0.0.0/8 addresses to IP_MULTICAST_IF
which are then interpreted as interface indexes) is now deprecated.
* A patch for the Rhyolite.com routed in the FreeBSD base system
is available in the -net archives. This only affects individuals
running RIPv1 or RIPv2 via point-to-point and/or unnumbered interfaces.
* Make IPv6 detach path similar to IPv4's in code flow; functionally same.
* Bump __FreeBSD_version to 700048; see UPDATING.
This work was financially supported by another FreeBSD committer.
Obtained from: p4://bms_netdev
Submitted by: Wilbert de Graaf (original work)
Reviewed by: rwatson (locking), silence from fenner,
net@ (but with encouragement)
2007-06-12 16:24:56 +00:00
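One bullet in the commit message above says that a multicast datagram's source port only has to match when the socket is already bound to a specific remote port. A rough sketch of such a matching predicate follows. struct sk_pcb and sk_mcast_match() are hypothetical names for this illustration, zero is assumed to mean "wildcard" in every field except the local port, and applying the same optional rule to the remote address is an assumption of the sketch rather than something the commit message states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the addressing part of an inpcb. */
struct sk_pcb {
	uint32_t laddr;		/* local address, 0 = any */
	uint32_t faddr;		/* remote address, 0 = any (assumption) */
	uint16_t lport;		/* local port, always compared */
	uint16_t fport;		/* remote port, 0 = not bound */
};

/*
 * Decide whether a multicast or broadcast datagram from src:sport to
 * dst:dport should be delivered to the given pcb.
 */
static bool
sk_mcast_match(const struct sk_pcb *p, uint32_t src, uint16_t sport,
    uint32_t dst, uint16_t dport)
{
	if (p->lport != dport)
		return (false);
	if (p->laddr != 0 && p->laddr != dst)
		return (false);
	if (p->faddr != 0 && p->faddr != src)
		return (false);
	/* The source port is only checked when a remote port is bound. */
	if (p->fport != 0 && p->fport != sport)
		return (false);
	return (true);
}
```

With this rule, a socket that has only a local port set receives datagrams sent from any ephemeral source port, while a socket with a remote port set still filters on it.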
|
|
|
struct ifnet *ifp;
|
2007-02-20 10:13:11 +00:00
|
|
|
struct inpcb *inp;
|
2012-10-22 21:09:03 +00:00
|
|
|
uint16_t len, ip_len;
|
2014-04-07 01:53:03 +00:00
|
|
|
struct inpcbinfo *pcbinfo;
|
1994-05-24 10:09:53 +00:00
|
|
|
struct ip save_ip;
|
Until this change, the UDP input code used global variables udp_in,
udp_in6, and udp_ip6 to pass socket address state between udp_input(),
udp_append(), and sbappendaddr_locked(). While fine in the default
configuration, when running with multiple netisrs or direct ithread
dispatch, this can result in races wherein user processes using
recvmsg() get back the wrong source IP/port. To correct this and
related races:
- Eliminate udp_ip6, which is believed to be generated but then never
used. Eliminate ip_2_ip6_hdr() as it is now unneeded.
- Eliminate setting, testing, and existence of 'init' status fields
for the IPv6 structures. While with multiple UDP delivery this
could lead to amortization of IPv4 -> IPv6 conversion when
delivering an IPv4 UDP packet to an IPv6 socket, it added
substantial complexity and side effects.
- Move global structures into the stack, declaring udp_in in
udp_input(), and udp_in6 in udp_append() to be used if a conversion
is required. Pass &udp_in into udp_append().
- Re-annotate comments to reflect updates.
With this change, UDP appears to operate correctly in the presence of
substantial inbound processing parallelism. This solution avoids
introducing additional synchronization, but does increase the
potential stack depth.
Discovered by: kris (Bug Magnet)
MFC after: 3 weeks
2004-11-04 01:25:23 +00:00
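The commit message above replaces shared global address state with a structure on the caller's stack that is passed down by pointer. The userland sketch below illustrates that pattern under simplifying assumptions. sk_input() and sk_append() are hypothetical stand-ins, not the kernel's udp_input() and udp_append(), and the BSD-specific sin_len field is omitted for portability.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>

/*
 * Hypothetical consumer: reads the source port it was handed.  It only
 * sees the caller's private copy, so concurrent callers cannot
 * overwrite each other's source address.
 */
static uint16_t
sk_append(const struct sockaddr_in *src)
{
	return (ntohs(src->sin_port));
}

/*
 * Hypothetical input path: builds the source address in a local
 * variable instead of a file-scope global, then passes its address.
 */
static uint16_t
sk_input(uint32_t src_addr_net, uint16_t src_port_net)
{
	struct sockaddr_in src;

	memset(&src, 0, sizeof(src));
	src.sin_family = AF_INET;
	src.sin_port = src_port_net;
	src.sin_addr.s_addr = src_addr_net;
	return (sk_append(&src));
}
```

The cost noted in the commit message still applies to the sketch: each call carries its own copy of the structure on the stack, in exchange for needing no extra synchronization.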
|
|
|
struct sockaddr_in udp_in;
|
2006-01-24 09:08:54 +00:00
|
|
|
struct m_tag *fwd_tag;
|
2014-04-07 01:53:03 +00:00
|
|
|
int cscov_partial;
|
|
|
|
uint8_t pr;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2007-06-12 16:24:56 +00:00
|
|
|
ifp = m->m_pkthdr.rcvif;
|
2009-04-12 11:42:40 +00:00
|
|
|
UDPSTAT_INC(udps_ipackets);
|
1994-05-24 10:09:53 +00:00
|
|
|
|
|
|
|
/*
|
2007-02-20 10:13:11 +00:00
|
|
|
* Strip IP options, if any; should skip this, make available to
|
|
|
|
* user, and use on returned packets, but we don't yet have a way to
|
|
|
|
* check the checksum with options still present.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
|
|
|
if (iphlen > sizeof (struct ip)) {
|
2012-10-12 09:24:24 +00:00
|
|
|
ip_stripoptions(m);
|
1994-05-24 10:09:53 +00:00
|
|
|
iphlen = sizeof(struct ip);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Get IP and UDP header together in first mbuf.
|
|
|
|
*/
|
|
|
|
ip = mtod(m, struct ip *);
|
|
|
|
if (m->m_len < iphlen + sizeof(struct udphdr)) {
|
2014-04-07 01:55:53 +00:00
|
|
|
if ((m = m_pullup(m, iphlen + sizeof(struct udphdr))) == NULL) {
|
2009-04-12 11:42:40 +00:00
|
|
|
UDPSTAT_INC(udps_hdrops);
|
1994-05-24 10:09:53 +00:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
ip = mtod(m, struct ip *);
|
|
|
|
}
|
|
|
|
uh = (struct udphdr *)((caddr_t)ip + iphlen);
|
2014-04-07 01:53:03 +00:00
|
|
|
pr = ip->ip_p;
|
|
|
|
cscov_partial = (pr == IPPROTO_UDPLITE) ? 1 : 0;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2007-02-20 10:13:11 +00:00
|
|
|
/*
|
|
|
|
* Destination port of 0 is illegal, based on RFC768.
|
|
|
|
*/
|
2000-07-04 16:35:15 +00:00
|
|
|
if (uh->uh_dport == 0)
|
2002-06-10 20:05:46 +00:00
|
|
|
goto badunlocked;
|
2000-07-04 16:35:15 +00:00
|
|
|
|
2002-10-16 02:25:05 +00:00
|
|
|
/*
|
2007-02-20 10:13:11 +00:00
|
|
|
* Construct sockaddr format source address. Stuff source address
|
|
|
|
* and datagram in user buffer.
|
2002-10-16 02:25:05 +00:00
|
|
|
*/
|
2004-11-04 01:25:23 +00:00
|
|
|
bzero(&udp_in, sizeof(udp_in));
|
|
|
|
udp_in.sin_len = sizeof(udp_in);
|
|
|
|
udp_in.sin_family = AF_INET;
|
2002-10-16 02:25:05 +00:00
|
|
|
udp_in.sin_port = uh->uh_sport;
|
|
|
|
udp_in.sin_addr = ip->ip_src;
|
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
2007-05-07 13:47:39 +00:00
|
|
|
* Make mbuf data length reflect UDP length. If not enough data to
|
|
|
|
* reflect UDP length, drop.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
|
|
|
len = ntohs((u_short)uh->uh_ulen);
|
2012-10-23 08:33:13 +00:00
|
|
|
ip_len = ntohs(ip->ip_len) - iphlen;
|
2014-04-07 01:53:03 +00:00
|
|
|
if (pr == IPPROTO_UDPLITE && len == 0) {
|
|
|
|
/* Zero means checksum over the complete packet. */
|
|
|
|
len = ip_len;
|
|
|
|
cscov_partial = 0;
|
|
|
|
}
|
2012-10-22 21:09:03 +00:00
|
|
|
if (ip_len != len) {
|
|
|
|
if (len > ip_len || len < sizeof(struct udphdr)) {
|
2009-04-12 11:42:40 +00:00
|
|
|
UDPSTAT_INC(udps_badlen);
|
2002-06-10 20:05:46 +00:00
|
|
|
goto badunlocked;
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
2014-04-07 01:53:03 +00:00
|
|
|
if (pr == IPPROTO_UDP)
|
|
|
|
m_adj(m, len - ip_len);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
2007-02-20 10:13:11 +00:00
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
2007-02-20 10:13:11 +00:00
|
|
|
* Save a copy of the IP header in case we want restore it for
|
|
|
|
* sending an ICMP error message in response.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).
This is the first in a series of commits over the course
of the next few weeks.
Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.
We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.
Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch
2008-08-17 23:27:27 +00:00
|
|
|
if (!V_udp_blackhole)
|
2000-10-31 09:13:02 +00:00
|
|
|
save_ip = *ip;
|
2007-06-17 04:07:11 +00:00
|
|
|
else
|
|
|
|
memset(&save_ip, 0, sizeof(save_ip));
|
1994-05-24 10:09:53 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Checksum extended UDP header and data.
|
|
|
|
*/
|
1995-09-22 19:56:26 +00:00
|
|
|
if (uh->uh_sum) {
|
2007-05-16 09:12:16 +00:00
|
|
|
u_short uh_sum;
|
|
|
|
|
2014-04-07 01:53:03 +00:00
|
|
|
if ((m->m_pkthdr.csum_flags & CSUM_DATA_VALID) &&
|
|
|
|
!cscov_partial) {
|
2000-03-27 19:14:27 +00:00
|
|
|
if (m->m_pkthdr.csum_flags & CSUM_PSEUDO_HDR)
|
2007-05-16 09:12:16 +00:00
|
|
|
uh_sum = m->m_pkthdr.csum_data;
|
2000-03-27 19:14:27 +00:00
|
|
|
else
|
2007-05-16 09:12:16 +00:00
|
|
|
uh_sum = in_pseudo(ip->ip_src.s_addr,
|
2000-11-01 16:56:33 +00:00
|
|
|
ip->ip_dst.s_addr, htonl((u_short)len +
|
2014-04-07 01:53:03 +00:00
|
|
|
m->m_pkthdr.csum_data + pr));
|
2007-05-16 09:12:16 +00:00
|
|
|
uh_sum ^= 0xffff;
|
2000-03-27 19:14:27 +00:00
|
|
|
} else {
|
2001-10-22 12:43:30 +00:00
|
|
|
char b[9];
|
2007-05-07 13:47:39 +00:00
|
|
|
|
2001-10-22 12:43:30 +00:00
|
|
|
bcopy(((struct ipovly *)ip)->ih_x1, b, 9);
|
2000-03-27 19:14:27 +00:00
|
|
|
bzero(((struct ipovly *)ip)->ih_x1, 9);
|
2014-04-07 01:53:03 +00:00
|
|
|
((struct ipovly *)ip)->ih_len = (pr == IPPROTO_UDP) ?
|
|
|
|
uh->uh_ulen : htons(ip_len);
|
2007-05-16 09:12:16 +00:00
|
|
|
uh_sum = in_cksum(m, len + sizeof (struct ip));
|
2001-10-22 12:43:30 +00:00
|
|
|
bcopy(b, ((struct ipovly *)ip)->ih_x1, 9);
|
2000-03-27 19:14:27 +00:00
|
|
|
}
|
2007-05-16 09:12:16 +00:00
|
|
|
if (uh_sum) {
|
2009-04-12 11:42:40 +00:00
|
|
|
UDPSTAT_INC(udps_badsum);
|
1994-05-24 10:09:53 +00:00
|
|
|
m_freem(m);
|
|
|
|
return;
|
|
|
|
}
|
2001-03-13 13:26:06 +00:00
|
|
|
} else
|
2009-04-12 11:42:40 +00:00
|
|
|
UDPSTAT_INC(udps_nosum);
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2014-04-07 01:53:03 +00:00
|
|
|
pcbinfo = get_inpcbinfo(pr);
|
1994-05-24 10:09:53 +00:00
|
|
|
if (IN_MULTICAST(ntohl(ip->ip_dst.s_addr)) ||
|
2007-06-12 16:24:56 +00:00
|
|
|
in_broadcast(ip->ip_dst, ifp)) {
|
1996-11-11 04:56:32 +00:00
|
|
|
struct inpcb *last;
|
2014-04-07 01:53:03 +00:00
|
|
|
struct inpcbhead *pcblist;
|
2007-06-12 16:24:56 +00:00
|
|
|
struct ip_moptions *imo;
|
2007-02-20 10:13:11 +00:00
|
|
|
|
2014-04-07 01:53:03 +00:00
|
|
|
INP_INFO_RLOCK(pcbinfo);
|
|
|
|
pcblist = get_pcblist(pr);
|
1994-05-24 10:09:53 +00:00
|
|
|
last = NULL;
|
2014-04-07 01:53:03 +00:00
|
|
|
LIST_FOREACH(inp, pcblist, inp_list) {
|
2004-08-06 02:08:31 +00:00
|
|
|
if (inp->inp_lport != uh->uh_dport)
|
2002-06-10 20:05:46 +00:00
|
|
|
continue;
|
1999-12-07 17:39:16 +00:00
|
|
|
#ifdef INET6
|
1999-12-21 11:14:12 +00:00
|
|
|
if ((inp->inp_vflag & INP_IPV4) == 0)
|
2004-08-06 02:08:31 +00:00
|
|
|
continue;
|
1999-12-07 17:39:16 +00:00
|
|
|
#endif
|
2007-06-12 16:24:56 +00:00
|
|
|
if (inp->inp_laddr.s_addr != INADDR_ANY &&
|
|
|
|
inp->inp_laddr.s_addr != ip->ip_dst.s_addr)
|
2007-07-10 09:30:46 +00:00
|
|
|
continue;
|
2007-06-12 16:24:56 +00:00
|
|
|
if (inp->inp_faddr.s_addr != INADDR_ANY &&
|
|
|
|
inp->inp_faddr.s_addr != ip->ip_src.s_addr)
|
2007-07-10 09:30:46 +00:00
|
|
|
continue;
|
2007-06-12 16:24:56 +00:00
|
|
|
if (inp->inp_fport != 0 &&
|
|
|
|
inp->inp_fport != uh->uh_sport)
|
2007-07-10 09:30:46 +00:00
|
|
|
continue;
|
2007-06-12 16:24:56 +00:00
|
|
|
|
In udp_append() and udp_input(), make use of read locking on inpcbs
rather than write locking: while we need to maintain a valid reference
to the inpcb and fix its state, no protocol layer state is modified
during an IPv4 UDP receive -- there are only changes at the socket
layer, which is separately protected by socket locking.
While parallel concurrent receive on a single UDP socket is currently
relatively unusual, introducing read locking in the transmit path,
allowing concurrent receive and transmit, will significantly improve
performance for loads such as BIND, memcached, etc.
MFC after: 2 months
Tested by: gnn, kris, ps
2008-06-30 18:26:43 +00:00
|
|
|
INP_RLOCK(inp);
|
2007-06-12 16:24:56 +00:00
|
|
|
|
2011-05-30 09:43:55 +00:00
|
|
|
/*
|
|
|
|
* XXXRW: Because we weren't holding either the inpcb
|
|
|
|
* or the hash lock when we checked for a match
|
|
|
|
* before, we should probably recheck now that the
|
|
|
|
* inpcb lock is held.
|
|
|
|
*/
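The XXXRW note above suggests rechecking the match once the inpcb lock is held. A minimal userland sketch of that check, lock, recheck pattern follows, assuming a POSIX mutex in place of the kernel's inpcb lock. The names fake_pcb, port_matches() and lock_if_match() are hypothetical and are not FreeBSD APIs.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userland stand-in for an inpcb; not the kernel structure. */
struct fake_pcb {
	pthread_mutex_t lock;
	uint16_t lport;		/* local port; may change while unlocked */
};

static bool
port_matches(const struct fake_pcb *p, uint16_t dport)
{
	return (p->lport == dport);
}

/*
 * Returns true with p->lock held if 'p' still matches after locking.
 * Returns false with the lock released otherwise.
 */
static bool
lock_if_match(struct fake_pcb *p, uint16_t dport)
{
	assert(p != NULL);
	/* Cheap pre-check without the lock; the answer may be stale. */
	if (!port_matches(p, dport))
		return (false);
	pthread_mutex_lock(&p->lock);
	/* Recheck: the field cannot change while the lock is held. */
	if (!port_matches(p, dport)) {
		pthread_mutex_unlock(&p->lock);
		return (false);
	}
	return (true);
}
```

The unlocked pre-check keeps the common non-matching case from taking the lock at all, and the second check is what makes the result trustworthy.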
|
|
|
|
|
2007-06-12 16:24:56 +00:00
|
|
|
/*
|
|
|
|
* Handle socket delivery policy for any-source
|
|
|
|
* and source-specific multicast. [RFC3678]
|
|
|
|
*/
|
|
|
|
imo = inp->inp_moptions;
|
2011-01-19 19:07:16 +00:00
|
|
|
if (IN_MULTICAST(ntohl(ip->ip_dst.s_addr))) {
|
2009-03-09 17:53:05 +00:00
|
|
|
struct sockaddr_in group;
|
|
|
|
int blocked;
|
2011-01-19 20:57:08 +00:00
|
|
|
if (imo == NULL) {
|
2011-01-19 19:07:16 +00:00
|
|
|
INP_RUNLOCK(inp);
|
|
|
|
continue;
|
|
|
|
}
|
2009-03-09 17:53:05 +00:00
|
|
|
bzero(&group, sizeof(struct sockaddr_in));
|
|
|
|
group.sin_len = sizeof(struct sockaddr_in);
|
|
|
|
group.sin_family = AF_INET;
|
|
|
|
group.sin_addr = ip->ip_dst;
|
|
|
|
|
|
|
|
blocked = imo_multi_filter(imo, ifp,
|
|
|
|
(struct sockaddr *)&group,
|
|
|
|
(struct sockaddr *)&udp_in);
|
|
|
|
if (blocked != MCAST_PASS) {
|
|
|
|
if (blocked == MCAST_NOTGMEMBER)
|
2009-04-11 23:35:20 +00:00
|
|
|
IPSTAT_INC(ips_notmember);
|
2009-03-09 17:53:05 +00:00
|
|
|
if (blocked == MCAST_NOTSMEMBER ||
|
|
|
|
blocked == MCAST_MUTED)
|
2009-04-12 11:42:40 +00:00
|
|
|
UDPSTAT_INC(udps_filtermcast);
|
2008-06-30 18:26:43 +00:00
|
|
|
INP_RUNLOCK(inp);
|
2004-08-06 02:08:31 +00:00
|
|
|
continue;
|
|
|
|
}
|
2003-11-12 20:17:11 +00:00
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
if (last != NULL) {
|
|
|
|
struct mbuf *n;
|
|
|
|
|
2002-11-20 19:00:54 +00:00
|
|
|
n = m_copy(m, 0, M_COPYALL);
|
2011-04-14 10:40:57 +00:00
|
|
|
udp_append(last, ip, n, iphlen, &udp_in);
|
2009-05-23 16:51:13 +00:00
|
|
|
INP_RUNLOCK(last);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
1996-11-11 04:56:32 +00:00
|
|
|
last = inp;
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
|
|
|
* Don't look for additional matches if this one does
|
|
|
|
* not have either the SO_REUSEPORT or SO_REUSEADDR
|
2007-02-20 10:13:11 +00:00
|
|
|
* socket options set. This heuristic avoids
|
|
|
|
* searching through all pcbs in the common case of a
|
|
|
|
* non-shared port. It assumes that an application
|
|
|
|
* will never clear these options after setting them.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
2007-02-20 10:13:11 +00:00
|
|
|
if ((last->inp_socket->so_options &
|
|
|
|
(SO_REUSEPORT|SO_REUSEADDR)) == 0)
|
1994-05-24 10:09:53 +00:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (last == NULL) {
|
|
|
|
/*
|
2007-02-20 10:13:11 +00:00
|
|
|
* No matching pcb found; discard datagram. (No need
|
|
|
|
* to send an ICMP Port Unreachable for a broadcast
|
|
|
|
* or multicast datagram.)
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
2009-04-12 11:42:40 +00:00
|
|
|
UDPSTAT_INC(udps_noportbcast);
|
2011-05-30 09:43:55 +00:00
|
|
|
if (inp)
|
|
|
|
INP_RUNLOCK(inp);
|
2014-04-07 01:53:03 +00:00
|
|
|
INP_INFO_RUNLOCK(pcbinfo);
|
2011-05-30 09:43:55 +00:00
|
|
|
goto badunlocked;
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
2011-04-14 10:40:57 +00:00
|
|
|
udp_append(last, ip, m, iphlen, &udp_in);
|
2009-05-23 16:51:13 +00:00
|
|
|
INP_RUNLOCK(last);
|
2014-04-07 01:53:03 +00:00
|
|
|
INP_INFO_RUNLOCK(pcbinfo);
|
1994-05-24 10:09:53 +00:00
|
|
|
return;
|
|
|
|
}
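The broadcast/multicast branch above hands a copy of the datagram to each matching pcb except the final one, which receives the original, and it stops searching at the first match without SO_REUSEPORT or SO_REUSEADDR. A simplified sketch of that "last" idiom follows, assuming an array in place of the pcb list and counters in place of mbufs. The names fake_pcb and deliver_all() are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an inpcb; field names are illustrative only. */
struct fake_pcb {
	bool matches;	/* result of the address/port checks */
	bool reuse;	/* SO_REUSEPORT or SO_REUSEADDR set */
	int copies;	/* datagram copies delivered */
	int originals;	/* original datagram delivered */
};

/*
 * Deliver one datagram to every matching pcb. Each match except the
 * final one receives a copy; the final match receives the original.
 * Stop early when a match does not share its port.
 * Returns the number of pcbs that received the datagram.
 */
static int
deliver_all(struct fake_pcb *pcbs, size_t n)
{
	struct fake_pcb *last = NULL;
	int delivered = 0;

	assert(pcbs != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (!pcbs[i].matches)
			continue;
		if (last != NULL) {
			last->copies++;	/* earlier match gets a copy */
			delivered++;
		}
		last = &pcbs[i];
		if (!last->reuse)
			break;		/* non-shared port: stop searching */
	}
	if (last == NULL)
		return (0);		/* no match: datagram is discarded */
	last->originals++;		/* final match gets the original */
	return (delivered + 1);
}
```

Deferring delivery by one match means the common single-receiver case never pays for a copy.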
|
2007-02-20 10:13:11 +00:00
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
1996-10-07 19:06:12 +00:00
|
|
|
* Locate pcb for datagram.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
2012-10-25 09:39:14 +00:00
|
|
|
|
2011-08-20 17:05:11 +00:00
|
|
|
/*
|
|
|
|
* Grab info from PACKET_TAG_IPFORWARD tag prepended to the chain.
|
|
|
|
*/
|
2012-11-02 01:20:55 +00:00
|
|
|
if ((m->m_flags & M_IP_NEXTHOP) &&
|
2012-10-25 09:39:14 +00:00
|
|
|
(fwd_tag = m_tag_find(m, PACKET_TAG_IPFORWARD, NULL)) != NULL) {
|
2011-08-20 17:05:11 +00:00
|
|
|
struct sockaddr_in *next_hop;
|
|
|
|
|
|
|
|
next_hop = (struct sockaddr_in *)(fwd_tag + 1);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Transparently forwarded. Pretend to be the destination.
|
|
|
|
* Already got one like this?
|
|
|
|
*/
|
2014-04-07 01:53:03 +00:00
|
|
|
inp = in_pcblookup_mbuf(pcbinfo, ip->ip_src, uh->uh_sport,
|
2011-08-20 17:05:11 +00:00
|
|
|
ip->ip_dst, uh->uh_dport, INPLOOKUP_RLOCKPCB, ifp, m);
|
|
|
|
if (!inp) {
|
|
|
|
/*
|
|
|
|
* It's new. Try to find the ambushing socket.
|
|
|
|
* Because we've rewritten the destination address,
|
|
|
|
* any hardware-generated hash is ignored.
|
|
|
|
*/
|
2014-04-07 01:53:03 +00:00
|
|
|
inp = in_pcblookup(pcbinfo, ip->ip_src,
|
2011-08-20 17:05:11 +00:00
|
|
|
uh->uh_sport, next_hop->sin_addr,
|
|
|
|
next_hop->sin_port ? htons(next_hop->sin_port) :
|
|
|
|
uh->uh_dport, INPLOOKUP_WILDCARD |
|
|
|
|
INPLOOKUP_RLOCKPCB, ifp);
|
|
|
|
}
|
|
|
|
/* Remove the tag from the packet. We don't need it anymore. */
|
|
|
|
m_tag_delete(m, fwd_tag);
|
2012-11-02 01:20:55 +00:00
|
|
|
m->m_flags &= ~M_IP_NEXTHOP;
|
2011-08-20 17:05:11 +00:00
|
|
|
} else
|
2014-04-07 01:53:03 +00:00
|
|
|
inp = in_pcblookup_mbuf(pcbinfo, ip->ip_src, uh->uh_sport,
|
2011-08-20 17:05:11 +00:00
|
|
|
ip->ip_dst, uh->uh_dport, INPLOOKUP_WILDCARD |
|
|
|
|
INPLOOKUP_RLOCKPCB, ifp, m);
|
1995-04-09 01:29:31 +00:00
|
|
|
if (inp == NULL) {
|
2007-02-20 10:20:03 +00:00
|
|
|
if (udp_log_in_vain) {
|
1996-05-02 05:54:14 +00:00
|
|
|
char buf[4*sizeof "123"];
|
1996-04-27 18:19:12 +00:00
|
|
|
|
|
|
|
strcpy(buf, inet_ntoa(ip->ip_dst));
|
1997-12-19 23:46:21 +00:00
|
|
|
log(LOG_INFO,
|
|
|
|
"Connection attempt to UDP %s:%d from %s:%d\n",
|
|
|
|
buf, ntohs(uh->uh_dport), inet_ntoa(ip->ip_src),
|
|
|
|
ntohs(uh->uh_sport));
|
1996-04-27 18:19:12 +00:00
|
|
|
}
|
2009-04-12 11:42:40 +00:00
|
|
|
UDPSTAT_INC(udps_noport);
|
1994-05-24 10:09:53 +00:00
|
|
|
if (m->m_flags & (M_BCAST | M_MCAST)) {
|
2009-04-12 11:42:40 +00:00
|
|
|
UDPSTAT_INC(udps_noportbcast);
|
2011-05-30 09:43:55 +00:00
|
|
|
goto badunlocked;
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).
This is the first in a series of commits over the course
of the next few weeks.
Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.
We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.
Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch
2008-08-17 23:27:27 +00:00
|
|
|
if (V_udp_blackhole)
|
2011-05-30 09:43:55 +00:00
|
|
|
goto badunlocked;
|
2002-08-04 20:50:13 +00:00
|
|
|
if (badport_bandlim(BANDLIM_ICMP_UNREACH) < 0)
|
2011-05-30 09:43:55 +00:00
|
|
|
goto badunlocked;
|
Fixed broken ICMP error generation, unified conversion of IP header
fields between host and network byte order. The details:
o icmp_error() now does not add IP header length. This fixes the problem
when icmp_error() is called from ip_forward(). In this case the ip_len
of the original IP datagram returned with ICMP error was wrong.
o icmp_error() expects all three fields, ip_len, ip_id and ip_off in host
byte order, so DTRT and convert these fields back to network byte order
before sending a message. This fixes the problem described in PR 16240
and PR 20877 (ip_id field was returned in host byte order).
o ip_ttl decrement operation in ip_forward() was moved down to make sure
that it does not corrupt the copy of original IP datagram passed later
to icmp_error().
o A copy of original IP datagram in ip_forward() was made a read-write,
independent copy. This fixes the problem I first reported to Garrett
Wollman and Bill Fenner and later put in audit trail of PR 16240:
ip_output() (not always) converts fields of original datagram to network
byte order, but because copy (mcopy) and its original (m) most likely
share the same mbuf cluster, ip_output()'s manipulations on original
also corrupted the copy.
o ip_output() now expects all three fields, ip_len, ip_off and (what is
significant) ip_id in host byte order. It was a headache for years that
ip_id was handled differently. The only compatibility issue here is the
raw IP socket interface with IP_HDRINCL socket option set and a non-zero
ip_id field, but ip.4 manual page was unclear on whether in this case
ip_id field should be in host or network byte order.
2000-09-01 12:33:03 +00:00
|
|
|
*ip = save_ip;
|
2000-05-24 12:57:52 +00:00
|
|
|
icmp_error(m, ICMP_UNREACH, ICMP_UNREACH_PORT, 0, 0);
|
1994-05-24 10:09:53 +00:00
|
|
|
return;
|
|
|
|
}
|
2007-02-20 10:13:11 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Check the minimum TTL for socket.
|
|
|
|
*/
|
2011-05-30 09:43:55 +00:00
|
|
|
INP_RLOCK_ASSERT(inp);
|
2008-07-07 09:26:52 +00:00
|
|
|
if (inp->inp_ip_minttl && inp->inp_ip_minttl > ip->ip_ttl) {
|
|
|
|
INP_RUNLOCK(inp);
|
2011-05-30 09:43:55 +00:00
|
|
|
m_freem(m);
|
|
|
|
return;
|
2008-07-07 09:26:52 +00:00
|
|
|
}
|
2014-04-07 01:53:03 +00:00
|
|
|
if (cscov_partial) {
|
|
|
|
struct udpcb *up;
|
|
|
|
|
|
|
|
up = intoudpcb(inp);
|
|
|
|
if (up->u_rxcslen > len) {
|
|
|
|
INP_RUNLOCK(inp);
|
|
|
|
m_freem(m);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
}
|
2013-08-25 21:54:41 +00:00
|
|
|
|
2013-08-26 00:28:57 +00:00
|
|
|
UDP_PROBE(receive, NULL, inp, ip, inp, uh);
|
2011-04-14 10:40:57 +00:00
|
|
|
udp_append(inp, ip, m, iphlen, &udp_in);
|
In udp_append() and udp_input(), make use of read locking on inpcbs
rather than write locking: while we need to maintain a valid reference
to the inpcb and fix its state, no protocol layer state is modified
during an IPv4 UDP receive -- there are only changes at the socket
layer, which is separately protected by socket locking.
While parallel concurrent receive on a single UDP socket is currently
relatively unusual, introducing read locking in the transmit path,
allowing concurrent receive and transmit, will significantly improve
performance for loads such as BIND, memcached, etc.
MFC after: 2 months
Tested by: gnn, kris, ps
2008-06-30 18:26:43 +00:00
|
|
|
INP_RUNLOCK(inp);
|
1994-05-24 10:09:53 +00:00
|
|
|
return;
|
2002-06-12 15:21:41 +00:00
|
|
|
|
2002-06-10 20:05:46 +00:00
|
|
|
badunlocked:
|
1994-05-24 10:09:53 +00:00
|
|
|
m_freem(m);
|
1999-12-07 17:39:16 +00:00
|
|
|
}
|
2011-04-30 11:17:00 +00:00
|
|
|
#endif /* INET */
|
1999-12-07 17:39:16 +00:00
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
2007-02-20 10:13:11 +00:00
|
|
|
* Notify a udp user of an asynchronous error; just wake up so that they can
|
|
|
|
* collect error status.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
2002-06-14 08:35:21 +00:00
|
|
|
struct inpcb *
|
2007-02-20 10:13:11 +00:00
|
|
|
udp_notify(struct inpcb *inp, int errno)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
2007-02-20 10:13:11 +00:00
|
|
|
|
2008-07-07 12:27:55 +00:00
|
|
|
/*
|
|
|
|
* While udp_ctlinput() always calls udp_notify() with a read lock
|
|
|
|
* when invoking it directly, in_pcbnotifyall() currently uses write
|
|
|
|
* locks due to sharing code with TCP. For now, accept either a read
|
|
|
|
* or a write lock, but a read lock is sufficient.
|
|
|
|
*/
|
|
|
|
INP_LOCK_ASSERT(inp);
|
2008-04-17 21:38:18 +00:00
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
inp->inp_socket->so_error = errno;
|
|
|
|
sorwakeup(inp->inp_socket);
|
|
|
|
sowwakeup(inp->inp_socket);
|
2007-02-20 10:13:11 +00:00
|
|
|
return (inp);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
|
2011-04-30 11:17:00 +00:00
|
|
|
#ifdef INET
|
2014-04-07 01:53:03 +00:00
|
|
|
static void
|
|
|
|
udp_common_ctlinput(int cmd, struct sockaddr *sa, void *vip,
|
|
|
|
struct inpcbinfo *pcbinfo)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
2001-02-26 21:19:47 +00:00
|
|
|
struct ip *ip = vip;
|
|
|
|
struct udphdr *uh;
|
2004-08-16 18:32:07 +00:00
|
|
|
struct in_addr faddr;
|
2001-02-26 21:19:47 +00:00
|
|
|
struct inpcb *inp;
|
|
|
|
|
|
|
|
faddr = ((struct sockaddr_in *)sa)->sin_addr;
|
|
|
|
if (sa->sa_family != AF_INET || faddr.s_addr == INADDR_ANY)
|
2004-08-16 18:32:07 +00:00
|
|
|
return;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2003-11-20 20:07:39 +00:00
|
|
|
/*
|
|
|
|
* Redirects don't need to be handled up here.
|
|
|
|
*/
|
|
|
|
if (PRC_IS_REDIRECT(cmd))
|
|
|
|
return;
|
2007-02-20 10:13:11 +00:00
|
|
|
|
2003-11-20 20:07:39 +00:00
|
|
|
/*
|
|
|
|
* Hostdead is ugly because it goes linearly through all PCBs.
|
2007-02-20 10:13:11 +00:00
|
|
|
*
|
|
|
|
* XXX: We never get this from ICMP, otherwise it makes an excellent
|
|
|
|
* DoS attack on machines with many connections.
|
2003-11-20 20:07:39 +00:00
|
|
|
*/
|
|
|
|
if (cmd == PRC_HOSTDEAD)
|
2007-05-07 13:47:39 +00:00
|
|
|
ip = NULL;
|
2001-02-22 21:23:45 +00:00
|
|
|
else if ((unsigned)cmd >= PRC_NCMDS || inetctlerrmap[cmd] == 0)
|
1994-05-24 10:09:53 +00:00
|
|
|
return;
|
2007-05-07 13:47:39 +00:00
|
|
|
if (ip != NULL) {
|
1994-05-24 10:09:53 +00:00
|
|
|
uh = (struct udphdr *)((caddr_t)ip + (ip->ip_hl << 2));
|
2014-04-07 01:53:03 +00:00
|
|
|
inp = in_pcblookup(pcbinfo, faddr, uh->uh_dport,
|
2011-05-30 09:43:55 +00:00
|
|
|
ip->ip_src, uh->uh_sport, INPLOOKUP_RLOCKPCB, NULL);
|
2002-06-10 20:05:46 +00:00
|
|
|
if (inp != NULL) {
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
|
|
|
INP_RLOCK_ASSERT(inp);
|
2003-02-15 02:37:57 +00:00
|
|
|
if (inp->inp_socket != NULL) {
|
2007-09-10 14:22:15 +00:00
|
|
|
udp_notify(inp, inetctlerrmap[cmd]);
|
2002-06-10 20:05:46 +00:00
|
|
|
}
|
2008-07-07 12:27:55 +00:00
|
|
|
INP_RUNLOCK(inp);
|
2002-06-10 20:05:46 +00:00
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
} else
|
2014-04-07 01:53:03 +00:00
|
|
|
in_pcbnotifyall(pcbinfo, faddr, inetctlerrmap[cmd],
|
2007-09-10 14:22:15 +00:00
|
|
|
udp_notify);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
2014-04-07 01:53:03 +00:00
|
|
|
void
|
|
|
|
udp_ctlinput(int cmd, struct sockaddr *sa, void *vip)
|
|
|
|
{
|
|
|
|
|
|
|
|
return (udp_common_ctlinput(cmd, sa, vip, &V_udbinfo));
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
udplite_ctlinput(int cmd, struct sockaddr *sa, void *vip)
|
|
|
|
{
|
|
|
|
|
|
|
|
return (udp_common_ctlinput(cmd, sa, vip, &V_ulitecbinfo));
|
|
|
|
}
|
2011-04-30 11:17:00 +00:00
|
|
|
#endif /* INET */
|
1994-05-24 10:09:53 +00:00
|
|
|
|
1998-05-15 20:11:40 +00:00
|
|
|
static int
|
2000-07-04 11:25:35 +00:00
|
|
|
udp_pcblist(SYSCTL_HANDLER_ARGS)
|
1998-05-15 20:11:40 +00:00
|
|
|
{
|
2005-06-01 11:24:00 +00:00
|
|
|
int error, i, n;
|
1998-05-15 20:11:40 +00:00
|
|
|
struct inpcb *inp, **inp_list;
|
|
|
|
inp_gen_t gencnt;
|
|
|
|
struct xinpgen xig;
|
|
|
|
|
|
|
|
/*
|
2007-09-10 14:22:15 +00:00
|
|
|
* The process of preparing the PCB list is too time-consuming and
|
1998-05-15 20:11:40 +00:00
|
|
|
* resource-intensive to repeat twice on every request.
|
|
|
|
*/
|
|
|
|
if (req->oldptr == 0) {
|
Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).
This is the first in a series of commits over the course
of the next few weeks.
Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.
We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.
Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch
2008-08-17 23:27:27 +00:00
|
|
|
n = V_udbinfo.ipi_count;
|
2010-08-17 16:41:16 +00:00
|
|
|
n += imax(n / 8, 10);
|
|
|
|
req->oldidx = 2 * (sizeof xig) + n * sizeof(struct xinpcb);
|
2007-02-20 10:13:11 +00:00
|
|
|
return (0);
|
1998-05-15 20:11:40 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
if (req->newptr != 0)
|
2007-02-20 10:13:11 +00:00
|
|
|
return (EPERM);
|
1998-05-15 20:11:40 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* OK, now we're committed to doing something.
|
|
|
|
*/
|
2008-08-17 23:27:27 +00:00
|
|
|
INP_INFO_RLOCK(&V_udbinfo);
|
|
|
|
gencnt = V_udbinfo.ipi_gencnt;
|
|
|
|
n = V_udbinfo.ipi_count;
|
|
|
|
INP_INFO_RUNLOCK(&V_udbinfo);
|
1998-05-15 20:11:40 +00:00
|
|
|
|
2004-02-26 00:27:04 +00:00
|
|
|
error = sysctl_wire_old_buffer(req, 2 * (sizeof xig)
|
2002-07-28 19:59:31 +00:00
|
|
|
+ n * sizeof(struct xinpcb));
|
2004-02-26 00:27:04 +00:00
|
|
|
if (error != 0)
|
|
|
|
return (error);
|
2002-07-28 19:59:31 +00:00
|
|
|
|
1998-05-15 20:11:40 +00:00
|
|
|
xig.xig_len = sizeof xig;
|
|
|
|
xig.xig_count = n;
|
|
|
|
xig.xig_gen = gencnt;
|
|
|
|
xig.xig_sogen = so_gencnt;
|
|
|
|
error = SYSCTL_OUT(req, &xig, sizeof xig);
|
|
|
|
if (error)
|
2007-02-20 10:13:11 +00:00
|
|
|
return (error);
|
1998-05-15 20:11:40 +00:00
|
|
|
|
2003-02-19 05:47:46 +00:00
|
|
|
inp_list = malloc(n * sizeof *inp_list, M_TEMP, M_WAITOK);
|
1998-05-15 20:11:40 +00:00
|
|
|
if (inp_list == 0)
|
2007-02-20 10:13:11 +00:00
|
|
|
return (ENOMEM);
|
2004-08-16 18:32:07 +00:00
|
|
|
|
2008-08-17 23:27:27 +00:00
|
|
|
INP_INFO_RLOCK(&V_udbinfo);
|
|
|
|
for (inp = LIST_FIRST(V_udbinfo.ipi_listhead), i = 0; inp && i < n;
|
2001-02-04 13:13:25 +00:00
|
|
|
inp = LIST_NEXT(inp, inp_list)) {
|
2010-03-17 18:28:27 +00:00
|
|
|
INP_WLOCK(inp);
|
2002-06-21 22:54:16 +00:00
|
|
|
if (inp->inp_gencnt <= gencnt &&
|
2010-03-17 18:28:27 +00:00
|
|
|
cr_canseeinpcb(req->td->td_ucred, inp) == 0) {
|
|
|
|
in_pcbref(inp);
|
1998-05-15 20:11:40 +00:00
|
|
|
inp_list[i++] = inp;
|
2010-03-17 18:28:27 +00:00
|
|
|
}
|
|
|
|
INP_WUNLOCK(inp);
|
1998-05-15 20:11:40 +00:00
|
|
|
}
|
2008-08-17 23:27:27 +00:00
|
|
|
INP_INFO_RUNLOCK(&V_udbinfo);
|
1998-05-15 20:11:40 +00:00
|
|
|
n = i;
|
|
|
|
|
|
|
|
error = 0;
|
|
|
|
for (i = 0; i < n; i++) {
|
|
|
|
inp = inp_list[i];
|
2008-05-29 08:27:14 +00:00
|
|
|
INP_RLOCK(inp);
|
1998-05-15 20:11:40 +00:00
|
|
|
if (inp->inp_gencnt <= gencnt) {
|
|
|
|
struct xinpcb xi;
|
2010-03-17 18:28:27 +00:00
|
|
|
|
2005-05-06 02:50:00 +00:00
|
|
|
bzero(&xi, sizeof(xi));
|
1998-05-15 20:11:40 +00:00
|
|
|
xi.xi_len = sizeof xi;
|
|
|
|
/* XXX should avoid extra copy */
|
|
|
|
bcopy(inp, &xi.xi_inp, sizeof *inp);
|
|
|
|
if (inp->inp_socket)
|
|
|
|
sotoxsocket(inp->inp_socket, &xi.xi_socket);
|
2003-02-15 02:37:57 +00:00
|
|
|
xi.xi_inp.inp_gencnt = inp->inp_gencnt;
|
2008-05-29 08:27:14 +00:00
|
|
|
INP_RUNLOCK(inp);
|
1998-05-15 20:11:40 +00:00
|
|
|
error = SYSCTL_OUT(req, &xi, sizeof xi);
|
2006-07-18 22:34:27 +00:00
|
|
|
} else
|
2008-05-29 08:27:14 +00:00
|
|
|
INP_RUNLOCK(inp);
|
1998-05-15 20:11:40 +00:00
|
|
|
}
|
2010-03-17 18:28:27 +00:00
|
|
|
INP_INFO_WLOCK(&V_udbinfo);
|
|
|
|
for (i = 0; i < n; i++) {
|
|
|
|
inp = inp_list[i];
|
2011-05-30 09:43:55 +00:00
|
|
|
INP_RLOCK(inp);
|
|
|
|
if (!in_pcbrele_rlocked(inp))
|
|
|
|
INP_RUNLOCK(inp);
|
2010-03-17 18:28:27 +00:00
|
|
|
}
|
|
|
|
INP_INFO_WUNLOCK(&V_udbinfo);
|
|
|
|
|
1998-05-15 20:11:40 +00:00
|
|
|
if (!error) {
|
|
|
|
/*
|
2007-02-20 10:13:11 +00:00
|
|
|
* Give the user an updated idea of our state. If the
|
|
|
|
* generation differs from what we told her before, she knows
|
|
|
|
* that something happened while we were processing this
|
|
|
|
* request, and it might be necessary to retry.
|
1998-05-15 20:11:40 +00:00
|
|
|
*/
|
2008-08-17 23:27:27 +00:00
|
|
|
INP_INFO_RLOCK(&V_udbinfo);
|
|
|
|
xig.xig_gen = V_udbinfo.ipi_gencnt;
|
1998-05-15 20:11:40 +00:00
|
|
|
xig.xig_sogen = so_gencnt;
|
2008-08-17 23:27:27 +00:00
|
|
|
xig.xig_count = V_udbinfo.ipi_count;
|
|
|
|
INP_INFO_RUNLOCK(&V_udbinfo);
|
1998-05-15 20:11:40 +00:00
|
|
|
error = SYSCTL_OUT(req, &xig, sizeof xig);
|
|
|
|
}
|
|
|
|
free(inp_list, M_TEMP);
|
2007-02-20 10:13:11 +00:00
|
|
|
return (error);
|
1998-05-15 20:11:40 +00:00
|
|
|
}
|
|
|
|
|
2011-01-18 21:14:13 +00:00
|
|
|
SYSCTL_PROC(_net_inet_udp, UDPCTL_PCBLIST, pcblist,
|
|
|
|
CTLTYPE_OPAQUE | CTLFLAG_RD, NULL, 0,
|
2007-02-20 10:13:11 +00:00
|
|
|
udp_pcblist, "S,xinpcb", "List of active UDP sockets");
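The sizing and filtering arithmetic in udp_pcblist() above can be tried outside the kernel. The following is a minimal userland sketch, not the kernel code: the sketch_* names are hypothetical, and the header and entry sizes are passed in as plain parameters rather than taken from struct xinpgen and struct xinpcb.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's imax(): larger of two ints. */
static int
sketch_imax(int a, int b)
{
	return (a > b ? a : b);
}

/*
 * Size estimate reported when the caller supplies no output buffer:
 * the current entry count plus slack of max(count / 8, 10) entries,
 * bracketed by one header record and one trailer record.
 */
static size_t
sketch_pcblist_estimate(int count, size_t hdr_size, size_t entry_size)
{
	assert(count >= 0);
	count += sketch_imax(count / 8, 10);
	return (2 * hdr_size + (size_t)count * entry_size);
}

/*
 * Generation filter: an entry is exported only if its generation is
 * not newer than the generation captured at the start of the request.
 */
static int
sketch_entry_in_snapshot(unsigned long long entry_gen,
    unsigned long long snapshot_gen)
{
	return (entry_gen <= snapshot_gen);
}
```

The slack term leaves room for sockets created between the size probe and the real request. The generation test is applied twice in udp_pcblist(), once while collecting references and once while copying out, so an entry modified in between is skipped rather than exported in a newer state.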
|
1998-05-15 20:11:40 +00:00
|
|
|
|
2011-04-30 11:17:00 +00:00
|
|
|
#ifdef INET
|
1999-07-11 18:32:46 +00:00
|
|
|
static int
|
2000-07-04 11:25:35 +00:00
|
|
|
udp_getcred(SYSCTL_HANDLER_ARGS)
|
1999-07-11 18:32:46 +00:00
|
|
|
{
|
2001-02-18 13:30:20 +00:00
|
|
|
struct xucred xuc;
|
1999-07-11 18:32:46 +00:00
|
|
|
struct sockaddr_in addrs[2];
|
|
|
|
struct inpcb *inp;
|
2005-06-01 11:24:00 +00:00
|
|
|
int error;
|
1999-07-11 18:32:46 +00:00
|
|
|
|
2007-06-12 00:12:01 +00:00
|
|
|
error = priv_check(req->td, PRIV_NETINET_GETCRED);
|
1999-07-11 18:32:46 +00:00
|
|
|
if (error)
|
|
|
|
return (error);
|
|
|
|
error = SYSCTL_IN(req, addrs, sizeof(addrs));
|
|
|
|
if (error)
|
|
|
|
return (error);
|
2011-05-30 09:43:55 +00:00
|
|
|
inp = in_pcblookup(&V_udbinfo, addrs[1].sin_addr, addrs[1].sin_port,
|
|
|
|
addrs[0].sin_addr, addrs[0].sin_port,
|
|
|
|
INPLOOKUP_WILDCARD | INPLOOKUP_RLOCKPCB, NULL);
|
2008-05-29 08:27:14 +00:00
|
|
|
if (inp != NULL) {
|
2011-05-30 09:43:55 +00:00
|
|
|
INP_RLOCK_ASSERT(inp);
|
2008-05-29 08:27:14 +00:00
|
|
|
if (inp->inp_socket == NULL)
|
|
|
|
error = ENOENT;
|
|
|
|
if (error == 0)
|
2008-10-17 16:26:16 +00:00
|
|
|
error = cr_canseeinpcb(req->td->td_ucred, inp);
|
2008-05-29 08:27:14 +00:00
|
|
|
if (error == 0)
|
2008-10-04 15:06:34 +00:00
|
|
|
cru2x(inp->inp_cred, &xuc);
|
2008-05-29 08:27:14 +00:00
|
|
|
INP_RUNLOCK(inp);
|
2011-05-30 09:43:55 +00:00
|
|
|
} else
|
1999-07-11 18:32:46 +00:00
|
|
|
error = ENOENT;
|
2002-07-11 23:18:43 +00:00
|
|
|
if (error == 0)
|
|
|
|
error = SYSCTL_OUT(req, &xuc, sizeof(struct xucred));
|
1999-07-11 18:32:46 +00:00
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
2001-06-24 12:18:27 +00:00
|
|
|
SYSCTL_PROC(_net_inet_udp, OID_AUTO, getcred,
|
|
|
|
CTLTYPE_OPAQUE|CTLFLAG_RW|CTLFLAG_PRISON, 0, 0,
|
|
|
|
udp_getcred, "S,xucred", "Get the xucred of a UDP connection");
|
2011-04-30 11:17:00 +00:00
|
|
|
#endif /* INET */
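udp_getcred() above calls in_pcblookup() with INPLOOKUP_WILDCARD | INPLOOKUP_RLOCKPCB, that is, the optional wildcard flag plus exactly one of the two lock flags. A minimal userland sketch of that "exactly one lock flag" rule follows. The SK_* bit values and the sk_* helpers are assumptions made for illustration and are not taken from the kernel headers.

```c
#include <assert.h>

/* Hypothetical bit values, chosen for this sketch only. */
#define SK_INPLOOKUP_WILDCARD	0x1
#define SK_INPLOOKUP_RLOCKPCB	0x2
#define SK_INPLOOKUP_WLOCKPCB	0x4
#define SK_INPLOOKUP_MASK	(SK_INPLOOKUP_WILDCARD | \
    SK_INPLOOKUP_RLOCKPCB | SK_INPLOOKUP_WLOCKPCB)

/* Returns 1 if flags hold only known bits and exactly one lock flag. */
static int
sk_lookup_flags_valid(int flags)
{
	int lockflags;

	if ((flags & ~SK_INPLOOKUP_MASK) != 0)
		return (0);
	lockflags = flags & (SK_INPLOOKUP_RLOCKPCB | SK_INPLOOKUP_WLOCKPCB);
	return (lockflags == SK_INPLOOKUP_RLOCKPCB ||
	    lockflags == SK_INPLOOKUP_WLOCKPCB);
}

/* Returns 'r' or 'w' for the lock the caller would hold on return. */
static int
sk_lock_mode(int flags)
{
	assert(sk_lookup_flags_valid(flags));
	return ((flags & SK_INPLOOKUP_WLOCKPCB) ? 'w' : 'r');
}
```

A caller such as udp_getcred() that asks for the read lock must pair the lookup with INP_RUNLOCK() on every path where the returned inpcb is non-NULL, which is what the code above does.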
|
1999-07-11 18:32:46 +00:00
|
|
|
|
Added support for NAT-Traversal (RFC 3948) in IPsec stack.
Thanks to (no special order) Emmanuel Dreyfus (manu@netbsd.org), Larry
Baird (lab@gta.com), gnn, bz, and other FreeBSD devs, Julien Vanherzeele
(julien.vanherzeele@netasq.com, for years of bug reporting), the PFSense
team, and all people who used / tried the NAT-T patch for years and
reported bugs, patches, etc...
X-MFC: never
Reviewed by: bz
Approved by: gnn(mentor)
Obtained from: NETASQ
2009-06-12 15:44:35 +00:00
|
|
|
int
|
|
|
|
udp_ctloutput(struct socket *so, struct sockopt *sopt)
|
|
|
|
{
|
|
|
|
struct inpcb *inp;
|
|
|
|
struct udpcb *up;
|
2014-04-07 01:53:03 +00:00
|
|
|
int isudplite, error, optval;
|
2009-06-12 15:44:35 +00:00
|
|
|
|
2014-04-07 01:53:03 +00:00
|
|
|
error = 0;
|
|
|
|
isudplite = (so->so_proto->pr_protocol == IPPROTO_UDPLITE) ? 1 : 0;
|
2009-06-12 15:44:35 +00:00
|
|
|
inp = sotoinpcb(so);
|
|
|
|
KASSERT(inp != NULL, ("%s: inp == NULL", __func__));
|
|
|
|
INP_WLOCK(inp);
|
2014-04-07 01:53:03 +00:00
|
|
|
if (sopt->sopt_level != so->so_proto->pr_protocol) {
|
2009-06-12 15:44:35 +00:00
|
|
|
#ifdef INET6
|
|
|
|
if (INP_CHECK_SOCKAF(so, AF_INET6)) {
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
error = ip6_ctloutput(so, sopt);
|
2011-04-30 11:17:00 +00:00
|
|
|
}
|
|
|
|
#endif
|
|
|
|
#if defined(INET) && defined(INET6)
|
|
|
|
else
|
2009-06-12 15:44:35 +00:00
|
|
|
#endif
|
2011-04-30 11:17:00 +00:00
|
|
|
#ifdef INET
|
|
|
|
{
|
2009-06-12 15:44:35 +00:00
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
error = ip_ctloutput(so, sopt);
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
|
|
|
switch (sopt->sopt_dir) {
|
|
|
|
case SOPT_SET:
|
|
|
|
switch (sopt->sopt_name) {
|
|
|
|
case UDP_ENCAP:
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
error = sooptcopyin(sopt, &optval, sizeof optval,
|
|
|
|
sizeof optval);
|
|
|
|
if (error)
|
|
|
|
break;
|
|
|
|
inp = sotoinpcb(so);
|
|
|
|
KASSERT(inp != NULL, ("%s: inp == NULL", __func__));
|
|
|
|
INP_WLOCK(inp);
|
|
|
|
#ifdef IPSEC_NAT_T
|
|
|
|
up = intoudpcb(inp);
|
|
|
|
KASSERT(up != NULL, ("%s: up == NULL", __func__));
|
|
|
|
#endif
|
|
|
|
switch (optval) {
|
|
|
|
case 0:
|
|
|
|
/* Clear all UDP encap. */
|
|
|
|
#ifdef IPSEC_NAT_T
|
|
|
|
up->u_flags &= ~UF_ESPINUDP_ALL;
|
|
|
|
#endif
|
|
|
|
break;
|
|
|
|
#ifdef IPSEC_NAT_T
|
|
|
|
case UDP_ENCAP_ESPINUDP:
|
|
|
|
case UDP_ENCAP_ESPINUDP_NON_IKE:
|
|
|
|
up->u_flags &= ~UF_ESPINUDP_ALL;
|
|
|
|
if (optval == UDP_ENCAP_ESPINUDP)
|
|
|
|
up->u_flags |= UF_ESPINUDP;
|
|
|
|
else if (optval == UDP_ENCAP_ESPINUDP_NON_IKE)
|
|
|
|
up->u_flags |= UF_ESPINUDP_NON_IKE;
|
|
|
|
break;
|
|
|
|
#endif
|
|
|
|
default:
|
|
|
|
error = EINVAL;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
break;
|
2014-04-07 01:53:03 +00:00
|
|
|
case UDPLITE_SEND_CSCOV:
|
|
|
|
case UDPLITE_RECV_CSCOV:
|
|
|
|
if (!isudplite) {
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
error = ENOPROTOOPT;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
error = sooptcopyin(sopt, &optval, sizeof(optval),
|
|
|
|
sizeof(optval));
|
|
|
|
if (error != 0)
|
|
|
|
break;
|
|
|
|
inp = sotoinpcb(so);
|
|
|
|
KASSERT(inp != NULL, ("%s: inp == NULL", __func__));
|
|
|
|
INP_WLOCK(inp);
|
|
|
|
up = intoudpcb(inp);
|
|
|
|
KASSERT(up != NULL, ("%s: up == NULL", __func__));
|
|
|
|
if ((optval != 0 && optval < 8) || (optval > 65535)) {
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
error = EINVAL;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
if (sopt->sopt_name == UDPLITE_SEND_CSCOV)
|
|
|
|
up->u_txcslen = optval;
|
|
|
|
else
|
|
|
|
up->u_rxcslen = optval;
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
break;
|
2009-06-12 15:44:35 +00:00
|
|
|
default:
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
error = ENOPROTOOPT;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
case SOPT_GET:
|
|
|
|
switch (sopt->sopt_name) {
|
|
|
|
#ifdef IPSEC_NAT_T
|
|
|
|
case UDP_ENCAP:
|
|
|
|
up = intoudpcb(inp);
|
|
|
|
KASSERT(up != NULL, ("%s: up == NULL", __func__));
|
|
|
|
optval = up->u_flags & UF_ESPINUDP_ALL;
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
error = sooptcopyout(sopt, &optval, sizeof optval);
|
|
|
|
break;
|
|
|
|
#endif
|
2014-04-07 01:53:03 +00:00
|
|
|
case UDPLITE_SEND_CSCOV:
|
|
|
|
case UDPLITE_RECV_CSCOV:
|
|
|
|
if (!isudplite) {
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
error = ENOPROTOOPT;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
up = intoudpcb(inp);
|
|
|
|
KASSERT(up != NULL, ("%s: up == NULL", __func__));
|
|
|
|
if (sopt->sopt_name == UDPLITE_SEND_CSCOV)
|
|
|
|
optval = up->u_txcslen;
|
|
|
|
else
|
|
|
|
optval = up->u_rxcslen;
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
error = sooptcopyout(sopt, &optval, sizeof(optval));
|
|
|
|
break;
|
2009-06-12 15:44:35 +00:00
|
|
|
default:
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
error = ENOPROTOOPT;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
return (error);
|
|
|
|
}
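The UDPLITE_SEND_CSCOV and UDPLITE_RECV_CSCOV cases in udp_ctloutput() accept a checksum coverage of 0 or of at least 8 and store it in the send or receive field of the udpcb. The following is a minimal userland sketch of that rule, not the kernel code: struct cscov_state, enum cscov_dir and cscov_set() are hypothetical names, and the 65535 upper bound is an assumption that reflects a 16-bit coverage field.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two coverage fields kept in the udpcb. */
struct cscov_state {
	uint16_t txcslen;	/* send-side checksum coverage */
	uint16_t rxcslen;	/* receive-side checksum coverage */
};

enum cscov_dir { CSCOV_SEND, CSCOV_RECV };

/*
 * Validate and store a checksum coverage value.
 * 0 selects full coverage; 1..7 are rejected because they would cover
 * less than the 8-byte header.  The upper bound is an assumption.
 * Returns 0 on success or EINVAL, leaving the state untouched on error.
 */
int
cscov_set(struct cscov_state *st, enum cscov_dir dir, int optval)
{
	assert(st != NULL);
	if ((optval != 0 && optval < 8) || optval > UINT16_MAX)
		return (EINVAL);
	if (dir == CSCOV_SEND)
		st->txcslen = (uint16_t)optval;
	else
		st->rxcslen = (uint16_t)optval;
	return (0);
}
```

A caller would check the return value before relying on the stored coverage, much as the setsockopt() path reports EINVAL back to the application.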
|
|
|
|
|
2011-04-30 11:17:00 +00:00
|
|
|
#ifdef INET
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
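The commit message above requires callers of in_pcblookup() to pass exactly one of INPLOOKUP_RLOCKPCB and INPLOOKUP_WLOCKPCB. As an illustration only, that rule could be checked as sketched below; the SK_* names, their numeric values and both functions are hypothetical and are not taken from the kernel headers.

```c
#include <assert.h>

/* Hypothetical flag values, chosen only for this sketch. */
#define SK_LOOKUP_WILDCARD	0x01
#define SK_LOOKUP_RLOCKPCB	0x02
#define SK_LOOKUP_WLOCKPCB	0x04
#define SK_LOOKUP_LOCKMASK	(SK_LOOKUP_RLOCKPCB | SK_LOOKUP_WLOCKPCB)

enum sk_lockmode { SK_LOCK_INVALID = 0, SK_LOCK_READ, SK_LOCK_WRITE };

/*
 * Map lookup flags to the lock mode for the returned pcb.
 * Passing neither lock flag, or both, is invalid.
 */
enum sk_lockmode
sk_lookup_lockmode(int flags)
{
	switch (flags & SK_LOOKUP_LOCKMASK) {
	case SK_LOOKUP_RLOCKPCB:
		return (SK_LOCK_READ);
	case SK_LOOKUP_WLOCKPCB:
		return (SK_LOCK_WRITE);
	default:
		return (SK_LOCK_INVALID);
	}
}

/* Assertion-style check a lookup routine might run on entry. */
void
sk_lookup_assert_flags(int flags)
{
	assert(sk_lookup_lockmode(flags) != SK_LOCK_INVALID);
}
```

The wildcard flag is independent of the lock flags, so it may be combined with either of them without changing the result.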
|
|
|
#define UH_WLOCKED 2
|
|
|
|
#define UH_RLOCKED 1
|
|
|
|
#define UH_UNLOCKED 0
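The UH_* values above record which pcbinfo hash lock udp_output() holds while it sends. As a rough sketch of how that choice follows from the socket's binding state, the function below reproduces only the decision and takes no locks. struct sk_send_req, its fields and sk_hashlock_needed() are hypothetical names, and the unlocked fallback is taken from the "no further locking is required" case in the rwlock commit message rather than from code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SK_INADDR_ANY		0x00000000u
#define SK_INADDR_BROADCAST	0xffffffffu

enum sk_hashlock { SK_UNLOCKED = 0, SK_RLOCKED = 1, SK_WLOCKED = 2 };

/* Hypothetical summary of the inputs the decision looks at. */
struct sk_send_req {
	bool has_dest;		/* caller passed a destination address */
	uint32_t dest_addr;	/* destination IPv4 address */
	uint32_t local_addr;	/* local address bound to the pcb */
	uint16_t local_port;	/* local port bound to the pcb */
	bool has_src;		/* explicit source address was supplied */
};

/*
 * Decide which hash lock a send needs:
 * - write lock when a destination is given and the pcb is fully unbound;
 * - read lock when a destination is given and it or the binding is
 *   partial (wildcard or broadcast destination, or a missing local
 *   address or port), or when an explicit source address is requested;
 * - no hash lock otherwise (the common, connected case).
 */
enum sk_hashlock
sk_hashlock_needed(const struct sk_send_req *r)
{
	assert(r != NULL);
	if (r->has_dest &&
	    r->local_addr == SK_INADDR_ANY && r->local_port == 0)
		return (SK_WLOCKED);
	if ((r->has_dest &&
	    (r->dest_addr == SK_INADDR_ANY ||
	    r->dest_addr == SK_INADDR_BROADCAST ||
	    r->local_addr == SK_INADDR_ANY ||
	    r->local_port == 0)) ||
	    r->has_src)
		return (SK_RLOCKED);
	return (SK_UNLOCKED);
}
```

Keeping the decision separate from the locking makes it easier to see that a connected socket sending without an explicit source falls through to the unlocked case.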
|
1995-11-14 20:34:56 +00:00
|
|
|
static int
|
2007-02-20 10:13:11 +00:00
|
|
|
udp_output(struct inpcb *inp, struct mbuf *m, struct sockaddr *addr,
|
|
|
|
struct mbuf *control, struct thread *td)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
2007-02-20 10:13:11 +00:00
|
|
|
struct udpiphdr *ui;
|
|
|
|
int len = m->m_pkthdr.len;
|
2002-10-21 20:10:05 +00:00
|
|
|
struct in_addr faddr, laddr;
|
2002-10-21 20:40:02 +00:00
|
|
|
struct cmsghdr *cm;
|
2014-04-07 01:53:03 +00:00
|
|
|
struct inpcbinfo *pcbinfo;
|
2002-10-21 20:40:02 +00:00
|
|
|
struct sockaddr_in *sin, src;
|
2014-04-07 01:53:03 +00:00
|
|
|
int cscov_partial = 0;
|
2002-10-21 20:10:05 +00:00
|
|
|
int error = 0;
|
2003-08-20 14:46:40 +00:00
|
|
|
int ipflags;
|
2002-10-21 20:10:05 +00:00
|
|
|
u_short fport, lport;
|
2004-08-19 01:13:10 +00:00
|
|
|
int unlock_udbinfo;
|
2012-06-12 14:56:08 +00:00
|
|
|
u_char tos;
|
2014-04-07 01:53:03 +00:00
|
|
|
uint8_t pr;
|
|
|
|
uint16_t cscov = 0;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2004-08-19 01:13:10 +00:00
|
|
|
/*
|
|
|
|
* udp_output() may need to temporarily bind or connect the current
|
2007-09-10 14:22:15 +00:00
|
|
|
* inpcb. As such, we don't know up front whether we will need the
|
|
|
|
* pcbinfo lock or not. Do any work to decide what is needed up
|
|
|
|
* front before acquiring any locks.
|
2004-08-19 01:13:10 +00:00
|
|
|
*/
|
1996-10-25 17:57:53 +00:00
|
|
|
if (len + sizeof(struct udpiphdr) > IP_MAXPACKET) {
|
2002-10-21 20:40:02 +00:00
|
|
|
if (control)
|
|
|
|
m_freem(control);
|
2004-08-19 01:13:10 +00:00
|
|
|
m_freem(m);
|
2007-02-20 10:13:11 +00:00
|
|
|
return (EMSGSIZE);
|
1996-10-25 17:57:53 +00:00
|
|
|
}
|
|
|
|
|
2007-03-08 15:26:54 +00:00
|
|
|
src.sin_family = 0;
|
2012-05-25 09:24:45 +00:00
|
|
|
INP_RLOCK(inp);
|
2012-06-12 14:56:08 +00:00
|
|
|
tos = inp->inp_ip_tos;
|
2002-10-21 20:40:02 +00:00
|
|
|
if (control != NULL) {
|
|
|
|
/*
|
2007-02-20 10:13:11 +00:00
|
|
|
* XXX: Currently, we assume all the optional information is
|
|
|
|
* stored in a single mbuf.
|
2002-10-21 20:40:02 +00:00
|
|
|
*/
|
|
|
|
if (control->m_next) {
|
2012-05-25 09:24:45 +00:00
|
|
|
INP_RUNLOCK(inp);
|
2002-10-21 20:40:02 +00:00
|
|
|
m_freem(control);
|
2004-08-19 01:13:10 +00:00
|
|
|
m_freem(m);
|
2007-02-20 10:13:11 +00:00
|
|
|
return (EINVAL);
|
2002-10-21 20:40:02 +00:00
|
|
|
}
|
|
|
|
for (; control->m_len > 0;
|
|
|
|
control->m_data += CMSG_ALIGN(cm->cmsg_len),
|
|
|
|
control->m_len -= CMSG_ALIGN(cm->cmsg_len)) {
|
|
|
|
cm = mtod(control, struct cmsghdr *);
|
2007-05-07 13:47:39 +00:00
|
|
|
if (control->m_len < sizeof(*cm) || cm->cmsg_len == 0
|
|
|
|
|| cm->cmsg_len > control->m_len) {
|
2002-10-21 20:40:02 +00:00
|
|
|
error = EINVAL;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
if (cm->cmsg_level != IPPROTO_IP)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
switch (cm->cmsg_type) {
|
|
|
|
case IP_SENDSRCADDR:
|
|
|
|
if (cm->cmsg_len !=
|
|
|
|
CMSG_LEN(sizeof(struct in_addr))) {
|
|
|
|
error = EINVAL;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
bzero(&src, sizeof(src));
|
|
|
|
src.sin_family = AF_INET;
|
|
|
|
src.sin_len = sizeof(src);
|
|
|
|
src.sin_port = inp->inp_lport;
|
2007-05-07 13:47:39 +00:00
|
|
|
src.sin_addr =
|
|
|
|
*(struct in_addr *)CMSG_DATA(cm);
|
2002-10-21 20:40:02 +00:00
|
|
|
break;
|
2007-05-07 13:47:39 +00:00
|
|
|
|
2012-06-12 14:56:08 +00:00
|
|
|
case IP_TOS:
|
|
|
|
if (cm->cmsg_len != CMSG_LEN(sizeof(u_char))) {
|
|
|
|
error = EINVAL;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
tos = *(u_char *)CMSG_DATA(cm);
|
|
|
|
break;
|
|
|
|
|
2002-10-21 20:40:02 +00:00
|
|
|
default:
|
|
|
|
error = ENOPROTOOPT;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
if (error)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
m_freem(control);
|
|
|
|
}
|
2004-08-19 01:13:10 +00:00
|
|
|
if (error) {
|
2012-05-25 09:24:45 +00:00
|
|
|
INP_RUNLOCK(inp);
|
2004-08-19 01:13:10 +00:00
|
|
|
m_freem(m);
|
2007-02-20 10:13:11 +00:00
|
|
|
return (error);
|
2004-08-19 01:13:10 +00:00
|
|
|
}
|
|
|
|
|
Merge last of a series of rwlock conversion changes to UDP, which
completes the move to a fully parallel UDP transmit path by using
global read, rather than write, locking of inpcbinfo in further
semi-connected cases:
- Add macros to allow try-locking of inpcb and inpcbinfo.
- Always acquire an inpcb read lock in udp_output(), which stabilizes the
local inpcb address and port bindings in order to determine what further
locking is required:
- If the inpcb is currently not bound (at all) and we are implicitly
connecting, we require inpcbinfo and inpcb write locks, so drop the
read lock and re-acquire.
- If the inpcb is bound for at least one of the port or address, but an
explicit source or destination is requested, trylock the inpcbinfo
lock, and if that fails, drop the inpcb lock, lock the global lock,
and relock the inpcb lock.
- Otherwise, no further locking is required (common case).
- Update comments.
In practice, this means that the vast majority of consumers of UDP sockets
will not acquire any exclusive locks at the socket or UDP levels of the
network stack. This leads to a marked performance improvement in several
important workloads, including BIND, nsd, and memcached over UDP, as well
as significant improvements in pps microbenchmarks.
The plan is to MFC all of the rwlock changes to RELENG_7 once they have
settled for a few weeks in the tree.
Tested by: ps, kris (older revision), bde
MFC after: 3 weeks
2008-07-15 15:38:47 +00:00
|
|
|
/*
|
|
|
|
* Depending on whether or not the application has bound or connected
|
2008-07-16 10:55:50 +00:00
|
|
|
* the socket, we may have to do varying levels of work. The optimal
|
|
|
|
* case is for a connected UDP socket, as a global lock isn't
|
|
|
|
* required at all.
|
2008-07-15 15:38:47 +00:00
|
|
|
*
|
|
|
|
* In order to decide which we need, we require stability of the
|
|
|
|
* inpcb binding, which we ensure by acquiring a read lock on the
|
|
|
|
* inpcb. This doesn't strictly follow the lock order, so we play
|
|
|
|
* the trylock and retry game; note that we may end up with more
|
|
|
|
* conservative locks than required the second time around, so later
|
|
|
|
* assertions have to accept that. Further analysis of the number of
|
|
|
|
* misses under contention is required.
|
2011-05-30 09:43:55 +00:00
|
|
|
*
|
|
|
|
* XXXRW: Check that hash locking update here is correct.
|
2008-07-15 15:38:47 +00:00
|
|
|
*/
|
2014-04-07 01:53:03 +00:00
|
|
|
pr = inp->inp_socket->so_proto->pr_protocol;
|
|
|
|
pcbinfo = get_inpcbinfo(pr);
|
2008-07-15 15:38:47 +00:00
|
|
|
sin = (struct sockaddr_in *)addr;
|
|
|
|
if (sin != NULL &&
|
|
|
|
(inp->inp_laddr.s_addr == INADDR_ANY && inp->inp_lport == 0)) {
|
|
|
|
INP_RUNLOCK(inp);
|
2008-07-07 10:56:55 +00:00
|
|
|
INP_WLOCK(inp);
|
2014-04-07 01:53:03 +00:00
|
|
|
INP_HASH_WLOCK(pcbinfo);
|
2011-05-30 09:43:55 +00:00
|
|
|
unlock_udbinfo = UH_WLOCKED;
|
2008-07-15 15:38:47 +00:00
|
|
|
} else if ((sin != NULL && (
|
|
|
|
(sin->sin_addr.s_addr == INADDR_ANY) ||
|
|
|
|
(sin->sin_addr.s_addr == INADDR_BROADCAST) ||
|
|
|
|
(inp->inp_laddr.s_addr == INADDR_ANY) ||
|
|
|
|
(inp->inp_lport == 0))) ||
|
|
|
|
(src.sin_family == AF_INET)) {
|
2014-04-07 01:53:03 +00:00
|
|
|
INP_HASH_RLOCK(pcbinfo);
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags,
supplementing the existing INPLOOKUP_WILDCARD flag, are:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
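The "exactly one of these flags" rule described above can be illustrated with a small userland check. This is a sketch only: the EX_LOOKUP_* names and their values are hypothetical stand-ins chosen for the example, not the kernel's INPLOOKUP_* definitions, and ex_lookup_flags_valid() and ex_lookup_wants_write() are not functions in the tree.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values for illustration; not the kernel's definitions. */
#define EX_LOOKUP_WILDCARD	0x01
#define EX_LOOKUP_RLOCKPCB	0x02
#define EX_LOOKUP_WLOCKPCB	0x04
#define EX_LOOKUP_LOCKMASK	(EX_LOOKUP_RLOCKPCB | EX_LOOKUP_WLOCKPCB)

/* Return true when exactly one of the two lock flags is present. */
static bool
ex_lookup_flags_valid(int flags)
{
	int lock = flags & EX_LOOKUP_LOCKMASK;

	/* Nonzero, and clearing the lowest set bit leaves nothing behind. */
	return (lock != 0 && (lock & (lock - 1)) == 0);
}

/* Return true when the caller asked for a write lock on the result. */
static bool
ex_lookup_wants_write(int flags)
{
	assert(ex_lookup_flags_valid(flags));
	return ((flags & EX_LOOKUP_WLOCKPCB) != 0);
}
```

The wildcard flag is deliberately left out of the mask, so it can be combined with either lock flag without affecting the check.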
2011-05-30 09:43:55 +00:00
|
|
|
unlock_udbinfo = UH_RLOCKED;
|
Merge last of a series of rwlock conversion changes to UDP, which
completes the move to a fully parallel UDP transmit path by using
global read, rather than write, locking of inpcbinfo in further
semi-connected cases:
- Add macros to allow try-locking of inpcb and inpcbinfo.
- Always acquire an inpcb read lock in udp_output(), which stabilizes the
local inpcb address and port bindings in order to determine what further
locking is required:
- If the inpcb is currently not bound (at all) and we are implicitly
connecting, we require inpcbinfo and inpcb write locks, so drop the
read lock and re-acquire.
- If the inpcb is bound for at least one of the port or address, but an
explicit source or destination is requested, trylock the inpcbinfo
lock, and if that fails, drop the inpcb lock, lock the global lock,
and relock the inpcb lock.
- Otherwise, no further locking is required (common case).
- Update comments.
In practice, this means that the vast majority of consumers of UDP sockets
will not acquire any exclusive locks at the socket or UDP levels of the
network stack. This leads to a marked performance improvement in several
important workloads, including BIND, nsd, and memcached over UDP, as well
as significant improvements in pps microbenchmarks.
The plan is to MFC all of the rwlock changes to RELENG_7 once they have
settled for a few weeks in the tree.
Tested by: ps, kris (older revision), bde
MFC after: 3 weeks
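The three locking cases listed in the message above can be sketched as a small userland classifier. This is an illustrative sketch only, not the code in udp_output(): the names ex_udp_send, ex_lock_need and ex_classify_lock_need are hypothetical, and combinations the message does not mention fall through to the common case here, which may differ from what the kernel does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of one send request; not a kernel structure. */
struct ex_udp_send {
	bool has_laddr;		/* local address already bound */
	bool has_lport;		/* local port already bound */
	bool explicit_dst;	/* caller passed a destination address */
	bool explicit_src;	/* caller passed a source address */
};

enum ex_lock_need {
	EX_LOCK_NONE,		/* inpcb read lock suffices (common case) */
	EX_LOCK_TRY_GLOBAL,	/* try-lock the global lock, else relock */
	EX_LOCK_WRITE_BOTH	/* drop the read lock, take write locks */
};

static enum ex_lock_need
ex_classify_lock_need(const struct ex_udp_send *s)
{
	assert(s != NULL);

	/* Not bound at all and implicitly connecting. */
	if (!s->has_laddr && !s->has_lport && s->explicit_dst)
		return (EX_LOCK_WRITE_BOTH);

	/* Bound for address or port, with an explicit source or destination. */
	if ((s->has_laddr || s->has_lport) &&
	    (s->explicit_src || s->explicit_dst))
		return (EX_LOCK_TRY_GLOBAL);

	/* Everything else is treated as the common, lock-free case. */
	return (EX_LOCK_NONE);
}
```

Keeping the decision in a pure function separates "which locks are needed" from "how to acquire them", which is where the try-lock and fallback ordering would live.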
2008-07-15 15:38:47 +00:00
|
|
|
} else
|
2011-05-30 09:43:55 +00:00
|
|
|
unlock_udbinfo = UH_UNLOCKED;
|
2004-08-19 01:13:10 +00:00
|
|
|
|
2007-03-08 15:26:54 +00:00
|
|
|
/*
|
|
|
|
* If the IP_SENDSRCADDR control message was specified, override the
|
2007-09-10 14:22:15 +00:00
|
|
|
* source address for this datagram. Its use is invalidated if the
|
2007-03-08 15:26:54 +00:00
|
|
|
* address thus specified is incomplete or clobbers other inpcbs.
|
|
|
|
*/
|
2002-10-21 20:10:05 +00:00
|
|
|
laddr = inp->inp_laddr;
|
|
|
|
lport = inp->inp_lport;
|
2007-03-08 15:26:54 +00:00
|
|
|
if (src.sin_family == AF_INET) {
|
2014-04-07 01:53:03 +00:00
|
|
|
INP_HASH_LOCK_ASSERT(pcbinfo);
|
2007-03-08 15:26:54 +00:00
|
|
|
if ((lport == 0) ||
|
|
|
|
(laddr.s_addr == INADDR_ANY &&
|
|
|
|
src.sin_addr.s_addr == INADDR_ANY)) {
|
2002-10-21 20:40:02 +00:00
|
|
|
error = EINVAL;
|
|
|
|
goto release;
|
|
|
|
}
|
|
|
|
error = in_pcbbind_setup(inp, (struct sockaddr *)&src,
|
2004-03-27 21:05:46 +00:00
|
|
|
&laddr.s_addr, &lport, td->td_ucred);
|
2002-10-21 20:40:02 +00:00
|
|
|
if (error)
|
|
|
|
goto release;
|
|
|
|
}
|
|
|
|
|
2008-07-10 16:20:18 +00:00
|
|
|
/*
|
|
|
|
* If a UDP socket has been connected, then a local address/port will
|
|
|
|
* have been selected and bound.
|
|
|
|
*
|
2008-07-15 15:38:47 +00:00
|
|
|
* If a UDP socket has not been connected to, then an explicit
|
2008-07-10 16:20:18 +00:00
|
|
|
* destination address must be used, in which case a local
|
|
|
|
* address/port may not have been selected and bound.
|
|
|
|
*/
|
2008-07-15 15:38:47 +00:00
|
|
|
if (sin != NULL) {
|
2008-07-07 12:14:10 +00:00
|
|
|
INP_LOCK_ASSERT(inp);
|
1994-05-24 10:09:53 +00:00
|
|
|
if (inp->inp_faddr.s_addr != INADDR_ANY) {
|
|
|
|
error = EISCONN;
|
|
|
|
goto release;
|
|
|
|
}
|
2008-07-10 16:20:18 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Jail may rewrite the destination address, so let it do
|
|
|
|
* that before we use it.
|
|
|
|
*/
|
2009-02-05 14:06:09 +00:00
|
|
|
error = prison_remote_ip4(td->td_ucred, &sin->sin_addr);
|
|
|
|
if (error)
|
MFp4:
Bring in updated jail support from bz_jail branch.
This enhances the current jail implementation to permit multiple
addresses per jail. In addition to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view and no networking.
SCTP support was updated and supports IPv6 in jails as well.
Cpuset support permits jails to be bound to specific processor
sets after creation.
Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.
DDB 'show jails' command was added to aid debugging.
Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.
Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.
Bump __FreeBSD_version for the aforementioned and in-kernel changes.
Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.
Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
|
|
|
goto release;
|
2008-07-10 16:20:18 +00:00
|
|
|
|
|
|
|
/*
|
2008-07-15 15:38:47 +00:00
|
|
|
* If a local address or port hasn't yet been selected, or if
|
|
|
|
* the destination address needs to be rewritten due to using
|
|
|
|
* a special INADDR_ constant, invoke in_pcbconnect_setup()
|
|
|
|
* to do the heavy lifting. Once a port is selected, we
|
|
|
|
* commit the binding back to the socket; we also commit the
|
|
|
|
* binding of the address if in jail.
|
|
|
|
*
|
|
|
|
* If we already have a valid binding and we're not
|
|
|
|
* requesting a destination address rewrite, use a fast path.
|
2008-07-10 16:20:18 +00:00
|
|
|
*/
|
2008-07-15 15:38:47 +00:00
|
|
|
if (inp->inp_laddr.s_addr == INADDR_ANY ||
|
|
|
|
inp->inp_lport == 0 ||
|
|
|
|
sin->sin_addr.s_addr == INADDR_ANY ||
|
|
|
|
sin->sin_addr.s_addr == INADDR_BROADCAST) {
|
2014-04-07 01:53:03 +00:00
|
|
|
INP_HASH_LOCK_ASSERT(pcbinfo);
|
2008-07-15 15:38:47 +00:00
|
|
|
error = in_pcbconnect_setup(inp, addr, &laddr.s_addr,
|
|
|
|
&lport, &faddr.s_addr, &fport, NULL,
|
|
|
|
td->td_ucred);
|
|
|
|
if (error)
|
|
|
|
goto release;
|
2002-10-21 20:10:05 +00:00
|
|
|
|
2005-02-22 07:50:02 +00:00
|
|
|
/*
|
2008-07-15 15:38:47 +00:00
|
|
|
* XXXRW: Why not commit the port if the address is
|
|
|
|
* !INADDR_ANY?
|
2005-02-22 07:50:02 +00:00
|
|
|
*/
|
2008-07-15 15:38:47 +00:00
|
|
|
/* Commit the local port if newly assigned. */
|
|
|
|
if (inp->inp_laddr.s_addr == INADDR_ANY &&
|
|
|
|
inp->inp_lport == 0) {
|
|
|
|
INP_WLOCK_ASSERT(inp);
|
2014-04-07 01:53:03 +00:00
|
|
|
INP_HASH_WLOCK_ASSERT(pcbinfo);
|
2008-07-15 15:38:47 +00:00
|
|
|
/*
|
|
|
|
* Remember addr if jailed, to prevent
|
|
|
|
* rebinding.
|
|
|
|
*/
|
2009-05-27 14:11:23 +00:00
|
|
|
if (prison_flag(td->td_ucred, PR_IP4))
|
2008-07-15 15:38:47 +00:00
|
|
|
inp->inp_laddr = laddr;
|
|
|
|
inp->inp_lport = lport;
|
|
|
|
if (in_pcbinshash(inp) != 0) {
|
|
|
|
inp->inp_lport = 0;
|
|
|
|
error = EAGAIN;
|
|
|
|
goto release;
|
|
|
|
}
|
|
|
|
inp->inp_flags |= INP_ANONPORT;
|
2002-10-21 20:10:05 +00:00
|
|
|
}
|
2008-07-15 15:38:47 +00:00
|
|
|
} else {
|
|
|
|
faddr = sin->sin_addr;
|
|
|
|
fport = sin->sin_port;
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
} else {
|
2008-07-07 12:14:10 +00:00
|
|
|
INP_LOCK_ASSERT(inp);
|
2002-10-21 20:10:05 +00:00
|
|
|
faddr = inp->inp_faddr;
|
|
|
|
fport = inp->inp_fport;
|
|
|
|
if (faddr.s_addr == INADDR_ANY) {
|
1994-05-24 10:09:53 +00:00
|
|
|
error = ENOTCONN;
|
|
|
|
goto release;
|
|
|
|
}
|
|
|
|
}
|
2004-08-21 16:14:04 +00:00
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
2004-08-21 16:14:04 +00:00
|
|
|
* Calculate data length and get a mbuf for UDP, IP, and possible
|
2004-08-22 01:32:48 +00:00
|
|
|
* link-layer headers. Immediately slide the data pointer back forward
|
|
|
|
* since we won't use that space at this layer.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
2012-12-05 08:04:20 +00:00
|
|
|
M_PREPEND(m, sizeof(struct udpiphdr) + max_linkhdr, M_NOWAIT);
|
2004-08-21 16:14:04 +00:00
|
|
|
if (m == NULL) {
|
1994-05-24 10:09:53 +00:00
|
|
|
error = ENOBUFS;
|
2004-06-16 08:50:14 +00:00
|
|
|
goto release;
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
2004-08-21 16:14:04 +00:00
|
|
|
m->m_data += max_linkhdr;
|
|
|
|
m->m_len -= max_linkhdr;
|
2004-08-22 01:32:48 +00:00
|
|
|
m->m_pkthdr.len -= max_linkhdr;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
|
|
|
/*
|
2007-02-20 10:13:11 +00:00
|
|
|
* Fill in mbuf with extended UDP header and addresses and length put
|
|
|
|
* into network format.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
|
|
|
ui = mtod(m, struct udpiphdr *);
|
2000-03-27 19:14:27 +00:00
|
|
|
bzero(ui->ui_x1, sizeof(ui->ui_x1)); /* XXX still needed? */
|
2014-04-07 01:53:03 +00:00
|
|
|
ui->ui_pr = pr;
|
2002-10-21 20:10:05 +00:00
|
|
|
ui->ui_src = laddr;
|
|
|
|
ui->ui_dst = faddr;
|
|
|
|
ui->ui_sport = lport;
|
|
|
|
ui->ui_dport = fport;
|
2000-03-27 19:14:27 +00:00
|
|
|
ui->ui_ulen = htons((u_short)len + sizeof(struct udphdr));
|
2014-04-07 01:53:03 +00:00
|
|
|
if (pr == IPPROTO_UDPLITE) {
|
|
|
|
struct udpcb *up;
|
|
|
|
uint16_t plen;
|
|
|
|
|
|
|
|
up = intoudpcb(inp);
|
|
|
|
cscov = up->u_txcslen;
|
|
|
|
plen = (u_short)len + sizeof(struct udphdr);
|
|
|
|
if (cscov >= plen)
|
|
|
|
cscov = 0;
|
|
|
|
ui->ui_len = htons(plen);
|
|
|
|
ui->ui_ulen = htons(cscov);
|
|
|
|
/*
|
|
|
|
* For UDP-Lite, checksum coverage length of zero means
|
|
|
|
* the entire UDPLite packet is covered by the checksum.
|
|
|
|
*/
|
2014-05-10 08:48:04 +00:00
|
|
|
cscov_partial = (cscov == 0) ? 0 : 1;
|
2014-04-07 01:53:03 +00:00
|
|
|
} else
|
|
|
|
ui->ui_v = IPVERSION << 4;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2005-09-26 20:25:16 +00:00
|
|
|
/*
|
|
|
|
* Set the Don't Fragment bit in the IP header.
|
|
|
|
*/
|
|
|
|
if (inp->inp_flags & INP_DONTFRAG) {
|
|
|
|
struct ip *ip;
|
2007-02-20 10:13:11 +00:00
|
|
|
|
2005-09-26 20:25:16 +00:00
|
|
|
ip = (struct ip *)&ui->ui_i;
|
2012-10-22 21:09:03 +00:00
|
|
|
ip->ip_off |= htons(IP_DF);
|
2005-09-26 20:25:16 +00:00
|
|
|
}
|
|
|
|
|
2004-09-05 02:34:12 +00:00
|
|
|
ipflags = 0;
|
|
|
|
if (inp->inp_socket->so_options & SO_DONTROUTE)
|
|
|
|
ipflags |= IP_ROUTETOIF;
|
|
|
|
if (inp->inp_socket->so_options & SO_BROADCAST)
|
|
|
|
ipflags |= IP_ALLOWBROADCAST;
|
2006-09-06 19:04:36 +00:00
|
|
|
if (inp->inp_flags & INP_ONESBCAST)
|
2003-08-20 14:46:40 +00:00
|
|
|
ipflags |= IP_SENDONES;
|
|
|
|
|
2008-07-10 09:45:28 +00:00
|
|
|
#ifdef MAC
|
|
|
|
mac_inpcb_create_mbuf(inp, m);
|
|
|
|
#endif
|
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
2000-03-27 19:14:27 +00:00
|
|
|
* Set up checksum and output datagram.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
2014-04-07 01:53:03 +00:00
|
|
|
ui->ui_sum = 0;
|
2014-05-12 09:46:48 +00:00
|
|
|
if (pr == IPPROTO_UDPLITE) {
|
2014-04-07 01:53:03 +00:00
|
|
|
if (inp->inp_flags & INP_ONESBCAST)
|
|
|
|
faddr.s_addr = INADDR_BROADCAST;
|
2014-05-12 09:46:48 +00:00
|
|
|
if (cscov_partial) {
|
|
|
|
if ((ui->ui_sum = in_cksum(m, sizeof(struct ip) + cscov)) == 0)
|
|
|
|
ui->ui_sum = 0xffff;
|
|
|
|
} else {
|
|
|
|
if ((ui->ui_sum = in_cksum(m, sizeof(struct udpiphdr) + len)) == 0)
|
|
|
|
ui->ui_sum = 0xffff;
|
|
|
|
}
|
|
|
|
} else if (V_udp_cksum) {
|
2006-09-06 19:04:36 +00:00
|
|
|
if (inp->inp_flags & INP_ONESBCAST)
|
2003-09-03 02:19:29 +00:00
|
|
|
faddr.s_addr = INADDR_BROADCAST;
|
|
|
|
ui->ui_sum = in_pseudo(ui->ui_src.s_addr, faddr.s_addr,
|
2014-04-07 01:53:03 +00:00
|
|
|
htons((u_short)len + sizeof(struct udphdr) + pr));
|
2000-03-27 19:14:27 +00:00
|
|
|
m->m_pkthdr.csum_flags = CSUM_UDP;
|
|
|
|
m->m_pkthdr.csum_data = offsetof(struct udphdr, uh_sum);
|
2014-04-07 01:53:03 +00:00
|
|
|
}
|
2012-10-22 21:09:03 +00:00
|
|
|
((struct ip *)ui)->ip_len = htons(sizeof(struct udpiphdr) + len);
|
1997-04-03 05:14:45 +00:00
|
|
|
((struct ip *)ui)->ip_ttl = inp->inp_ip_ttl; /* XXX */
|
2012-06-12 14:56:08 +00:00
|
|
|
((struct ip *)ui)->ip_tos = tos; /* XXX */
|
2009-04-12 11:42:40 +00:00
|
|
|
UDPSTAT_INC(udps_opackets);
|
1999-12-07 17:39:16 +00:00
|
|
|
|
2011-05-30 09:43:55 +00:00
|
|
|
if (unlock_udbinfo == UH_WLOCKED)
|
2014-04-07 01:53:03 +00:00
|
|
|
INP_HASH_WUNLOCK(pcbinfo);
|
2011-05-30 09:43:55 +00:00
|
|
|
else if (unlock_udbinfo == UH_RLOCKED)
|
2014-04-07 01:53:03 +00:00
|
|
|
INP_HASH_RUNLOCK(pcbinfo);
|
2013-08-25 21:54:41 +00:00
|
|
|
UDP_PROBE(send, NULL, inp, &ui->ui_i, inp, &ui->ui_u);
|
2003-11-20 20:07:39 +00:00
|
|
|
error = ip_output(m, inp->inp_options, NULL, ipflags,
|
2002-10-16 01:54:46 +00:00
|
|
|
inp->inp_moptions, inp);
|
2011-05-30 09:43:55 +00:00
|
|
|
if (unlock_udbinfo == UH_WLOCKED)
|
2008-07-07 10:56:55 +00:00
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
else
|
|
|
|
INP_RUNLOCK(inp);
|
1994-05-24 10:09:53 +00:00
|
|
|
return (error);
|
|
|
|
|
|
|
|
release:
|
2011-05-30 09:43:55 +00:00
|
|
|
if (unlock_udbinfo == UH_WLOCKED) {
|
2014-04-07 01:53:03 +00:00
|
|
|
INP_HASH_WUNLOCK(pcbinfo);
|
2008-07-07 10:56:55 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2011-05-30 09:43:55 +00:00
|
|
|
} else if (unlock_udbinfo == UH_RLOCKED) {
|
2014-04-07 01:53:03 +00:00
|
|
|
INP_HASH_RUNLOCK(pcbinfo);
|
Merge last of a series of rwlock conversion changes to UDP, which
completes the move to a fully parallel UDP transmit path by using
global read, rather than write, locking of inpcbinfo in further
semi-connected cases:
- Add macros to allow try-locking of inpcb and inpcbinfo.
- Always acquire an inpcb read lock in udp_output(), which stabilizes the
local inpcb address and port bindings in order to determine what further
locking is required:
- If the inpcb is currently not bound (at all) and we are implicitly
connecting, we require inpcbinfo and inpcb write locks, so drop the
read lock and re-acquire.
- If the inpcb is bound for at least one of the port or address, but an
explicit source or destination is requested, trylock the inpcbinfo
lock, and if that fails, drop the inpcb lock, lock the global lock,
and relock the inpcb lock.
- Otherwise, no further locking is required (common case).
- Update comments.
In practice, this means that the vast majority of consumers of UDP sockets
will not acquire any exclusive locks at the socket or UDP levels of the
network stack. This leads to a marked performance improvement in several
important workloads, including BIND, nsd, and memcached over UDP, as well
as significant improvements in pps microbenchmarks.
The plan is to MFC all of the rwlock changes to RELENG_7 once they have
settled for a few weeks in the tree.
Tested by: ps, kris (older revision), bde
MFC after: 3 weeks
2008-07-15 15:38:47 +00:00
|
|
|
INP_RUNLOCK(inp);
|
2008-07-07 10:56:55 +00:00
|
|
|
} else
|
|
|
|
INP_RUNLOCK(inp);
|
1994-05-24 10:09:53 +00:00
|
|
|
m_freem(m);
|
|
|
|
return (error);
|
|
|
|
}
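The UDP-Lite coverage handling and the "zero checksum becomes 0xffff" rule in the function above can be sketched in userspace as follows. This is a minimal illustration, not the kernel's code: the helper names `udplite_normalize_cscov`, `inet_cksum`, and `udp_cksum_field` are hypothetical, the checksum runs over a flat byte buffer rather than an mbuf chain, and the pseudo-header is left out. The checksum routine follows the RFC 1071 one's-complement algorithm and assumes datagram-sized inputs (at most 64 KiB).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: normalize a requested UDP-Lite checksum coverage.
 * A request that covers the whole packet (cscov >= plen) is stored as 0,
 * which means "entire packet covered". Returns nonzero when the
 * resulting coverage is partial.
 */
static int
udplite_normalize_cscov(uint16_t *cscov, uint16_t plen)
{
	if (*cscov >= plen)
		*cscov = 0;
	return (*cscov != 0);
}

/*
 * RFC 1071 style one's-complement checksum over a flat buffer, reading
 * the data as big-endian 16-bit words. Assumes len <= 64 KiB so the
 * 32-bit accumulator cannot overflow before folding.
 */
static uint16_t
inet_cksum(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	assert(buf != NULL || len == 0);
	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
	if (i < len)			/* odd trailing byte, zero padded */
		sum += (uint32_t)buf[i] << 8;
	while (sum >> 16)		/* fold carries back in */
		sum = (sum & 0xffff) + (sum >> 16);
	return ((uint16_t)~sum);
}

/*
 * Value to place in the checksum field: a computed checksum of 0 is
 * sent as 0xffff, mirroring the substitution in the listing above.
 */
static uint16_t
udp_cksum_field(const uint8_t *buf, size_t len)
{
	uint16_t sum = inet_cksum(buf, len);

	return (sum == 0 ? 0xffff : sum);
}
```

A caller would first normalize the coverage, then checksum either the covered prefix (partial coverage) or the whole datagram, and store the result of `udp_cksum_field()` in the header.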
|
|
|
|
|
Added support for NAT-Traversal (RFC 3948) in IPsec stack.
Thanks to (no special order) Emmanuel Dreyfus (manu@netbsd.org), Larry
Baird (lab@gta.com), gnn, bz, and other FreeBSD devs, Julien Vanherzeele
(julien.vanherzeele@netasq.com, for years of bug reporting), the PFSense
team, and all people who used / tried the NAT-T patch for years and
reported bugs, patches, etc...
X-MFC: never
Reviewed by: bz
Approved by: gnn(mentor)
Obtained from: NETASQ
2009-06-12 15:44:35 +00:00
|
|
|
|
|
|
|
#if defined(IPSEC) && defined(IPSEC_NAT_T)
|
|
|
|
/*
|
|
|
|
* Potentially decap ESP in UDP frame. Check for an ESP header
|
|
|
|
* and optional marker; if present, strip the UDP header and
|
|
|
|
* push the result through IPSec.
|
|
|
|
*
|
|
|
|
* Returns mbuf to be processed (potentially re-allocated) or
|
|
|
|
* NULL if consumed and/or processed.
|
|
|
|
*/
|
|
|
|
static struct mbuf *
|
|
|
|
udp4_espdecap(struct inpcb *inp, struct mbuf *m, int off)
|
|
|
|
{
|
|
|
|
size_t minlen, payload, skip, iphlen;
|
|
|
|
caddr_t data;
|
|
|
|
struct udpcb *up;
|
|
|
|
struct m_tag *tag;
|
|
|
|
struct udphdr *udphdr;
|
|
|
|
struct ip *ip;
|
|
|
|
|
|
|
|
INP_RLOCK_ASSERT(inp);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Pull up data so the longest case is contiguous:
|
|
|
|
* IP/UDP hdr + non ESP marker + ESP hdr.
|
|
|
|
*/
|
|
|
|
minlen = off + sizeof(uint64_t) + sizeof(struct esp);
|
|
|
|
if (minlen > m->m_pkthdr.len)
|
|
|
|
minlen = m->m_pkthdr.len;
|
|
|
|
if ((m = m_pullup(m, minlen)) == NULL) {
|
2013-07-23 14:14:24 +00:00
|
|
|
IPSECSTAT_INC(ips_in_inval);
|
2009-06-12 15:44:35 +00:00
|
|
|
return (NULL); /* Bypass caller processing. */
|
|
|
|
}
|
|
|
|
data = mtod(m, caddr_t); /* Points to ip header. */
|
|
|
|
payload = m->m_len - off; /* Size of payload. */
|
|
|
|
|
|
|
|
if (payload == 1 && data[off] == '\xff')
|
|
|
|
return (m); /* NB: keepalive packet, no decap. */
|
|
|
|
|
|
|
|
up = intoudpcb(inp);
|
|
|
|
KASSERT(up != NULL, ("%s: udpcb NULL", __func__));
|
|
|
|
KASSERT((up->u_flags & UF_ESPINUDP_ALL) != 0,
|
|
|
|
("u_flags 0x%x", up->u_flags));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Check that the payload is large enough to hold an
|
|
|
|
* ESP header and compute the amount of data to remove.
|
|
|
|
*
|
|
|
|
* NB: the caller has already done a pullup for us.
|
|
|
|
* XXX can we assume alignment and eliminate bcopys?
|
|
|
|
*/
|
|
|
|
if (up->u_flags & UF_ESPINUDP_NON_IKE) {
|
|
|
|
/*
|
|
|
|
* draft-ietf-ipsec-nat-t-ike-0[01].txt and
|
|
|
|
* draft-ietf-ipsec-udp-encaps-(00/)01.txt, ignoring
|
|
|
|
* possible AH mode non-IKE marker+non-ESP marker
|
|
|
|
* from draft-ietf-ipsec-udp-encaps-00.txt.
|
|
|
|
*/
|
|
|
|
uint64_t marker;
|
|
|
|
|
|
|
|
if (payload <= sizeof(uint64_t) + sizeof(struct esp))
|
|
|
|
return (m); /* NB: no decap. */
|
|
|
|
bcopy(data + off, &marker, sizeof(uint64_t));
|
|
|
|
if (marker != 0) /* Non-IKE marker. */
|
|
|
|
return (m); /* NB: no decap. */
|
|
|
|
skip = sizeof(uint64_t) + sizeof(struct udphdr);
|
|
|
|
} else {
|
|
|
|
uint32_t spi;
|
|
|
|
|
|
|
|
if (payload <= sizeof(struct esp)) {
|
2013-07-23 14:14:24 +00:00
|
|
|
IPSECSTAT_INC(ips_in_inval);
|
2009-06-12 15:44:35 +00:00
|
|
|
m_freem(m);
|
|
|
|
return (NULL); /* Discard. */
|
|
|
|
}
|
|
|
|
bcopy(data + off, &spi, sizeof(uint32_t));
|
|
|
|
if (spi == 0) /* Non-ESP marker. */
|
|
|
|
return (m); /* NB: no decap. */
|
|
|
|
skip = sizeof(struct udphdr);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Set up a PACKET_TAG_IPSEC_NAT_T_PORTS tag to remember
|
|
|
|
* the UDP ports. This is required if we want to select
|
|
|
|
* the right SPD for multiple hosts behind the same NAT.
|
|
|
|
*
|
|
|
|
* NB: ports are maintained in network byte order everywhere
|
|
|
|
* in the NAT-T code.
|
|
|
|
*/
|
|
|
|
tag = m_tag_get(PACKET_TAG_IPSEC_NAT_T_PORTS,
|
|
|
|
2 * sizeof(uint16_t), M_NOWAIT);
|
|
|
|
if (tag == NULL) {
|
2013-07-23 14:14:24 +00:00
|
|
|
IPSECSTAT_INC(ips_in_nomem);
|
2009-06-12 15:44:35 +00:00
|
|
|
m_freem(m);
|
|
|
|
return (NULL); /* Discard. */
|
|
|
|
}
|
|
|
|
iphlen = off - sizeof(struct udphdr);
|
|
|
|
udphdr = (struct udphdr *)(data + iphlen);
|
|
|
|
((uint16_t *)(tag + 1))[0] = udphdr->uh_sport;
|
|
|
|
((uint16_t *)(tag + 1))[1] = udphdr->uh_dport;
|
|
|
|
m_tag_prepend(m, tag);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Remove the UDP header (and possibly the non ESP marker)
|
|
|
|
* IP header length is iphlen
|
|
|
|
* Before:
|
|
|
|
* <--- off --->
|
|
|
|
* +----+------+-----+
|
|
|
|
* | IP | UDP | ESP |
|
|
|
|
* +----+------+-----+
|
|
|
|
* <-skip->
|
|
|
|
* After:
|
|
|
|
* +----+-----+
|
|
|
|
* | IP | ESP |
|
|
|
|
* +----+-----+
|
|
|
|
* <-skip->
|
|
|
|
*/
|
|
|
|
ovbcopy(data, data + skip, iphlen);
|
|
|
|
m_adj(m, skip);
|
|
|
|
|
|
|
|
ip = mtod(m, struct ip *);
|
2012-10-22 21:09:03 +00:00
|
|
|
ip->ip_len = htons(ntohs(ip->ip_len) - skip);
|
2009-06-12 15:44:35 +00:00
|
|
|
ip->ip_p = IPPROTO_ESP;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We cannot yet update the cksums so clear any
|
|
|
|
* h/w cksum flags as they are no longer valid.
|
|
|
|
*/
|
|
|
|
if (m->m_pkthdr.csum_flags & CSUM_DATA_VALID)
|
|
|
|
m->m_pkthdr.csum_flags &= ~(CSUM_DATA_VALID|CSUM_PSEUDO_HDR);
|
|
|
|
|
|
|
|
(void) ipsec4_common_input(m, iphlen, ip->ip_p);
|
|
|
|
return (NULL); /* NB: consumed, bypass processing. */
|
|
|
|
}
|
|
|
|
#endif /* defined(IPSEC) && defined(IPSEC_NAT_T) */
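The classification and header-stripping steps of the ESP-in-UDP path above can be sketched on a flat buffer as follows. This is an illustration under stated assumptions, not the kernel routine: `natt_classify`, `natt_strip_udp`, and the `NATT_*` verdicts are hypothetical names, only the branch without the non-IKE marker is modelled, and `ESP_HDR_LEN` is a stand-in value (SPI plus sequence number) rather than the kernel's `sizeof(struct esp)`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define UDP_HDR_LEN	8	/* size of a UDP header */
#define ESP_HDR_LEN	8	/* assumed for this sketch: SPI + sequence */

enum natt_verdict {
	NATT_KEEPALIVE,		/* single 0xff byte, hand back untouched */
	NATT_TOO_SHORT,		/* cannot hold an ESP header (kernel drops) */
	NATT_NOT_ESP,		/* SPI of zero is the non-ESP marker */
	NATT_DECAP		/* looks like ESP, strip the UDP header */
};

/* pkt points at the IP header; off is the offset of the UDP payload. */
static enum natt_verdict
natt_classify(const uint8_t *pkt, size_t len, size_t off)
{
	size_t payload;
	uint32_t spi;

	assert(len >= off);
	payload = len - off;
	if (payload == 1 && pkt[off] == 0xff)
		return (NATT_KEEPALIVE);
	if (payload <= ESP_HDR_LEN)
		return (NATT_TOO_SHORT);
	/* Copy rather than cast: the payload may be unaligned. */
	memcpy(&spi, pkt + off, sizeof(spi));
	if (spi == 0)
		return (NATT_NOT_ESP);
	return (NATT_DECAP);
}

/*
 * Slide the IP header forward over the UDP header so it sits directly
 * in front of the ESP data. Returns the number of leading bytes the
 * caller should then trim from the packet.
 */
static size_t
natt_strip_udp(uint8_t *pkt, size_t iphlen)
{
	memmove(pkt + UDP_HDR_LEN, pkt, iphlen);	/* regions overlap */
	return (UDP_HDR_LEN);
}
```

After `natt_strip_udp()` the packet logically begins `UDP_HDR_LEN` bytes later, which corresponds to the overlapping copy followed by the front trim in the listing. A real caller would also update the IP length and protocol fields.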
|
|
|
|
|
2006-04-01 15:15:05 +00:00
|
|
|
static void
|
1997-02-14 18:15:53 +00:00
|
|
|
udp_abort(struct socket *so)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
1997-02-14 18:15:53 +00:00
|
|
|
struct inpcb *inp;
|
2014-04-07 01:53:03 +00:00
|
|
|
struct inpcbinfo *pcbinfo;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2014-04-07 01:53:03 +00:00
|
|
|
pcbinfo = get_inpcbinfo(so->so_proto->pr_protocol);
|
1997-02-14 18:15:53 +00:00
|
|
|
inp = sotoinpcb(so);
|
Update in_pcb-derived basic socket types following changes to
pru_abort(), pru_detach(), and in_pcbdetach():
- Universally support and enforce the invariant that so_pcb is
never NULL, converting dozens of unnecessary NULL checks into
assertions, and eliminating dozens of unnecessary error handling
cases in protocol code.
- In some cases, eliminate unnecessary pcbinfo locking, as it is no
longer required to ensure so_pcb != NULL. For example, in protocol
shutdown methods, and in raw IP send.
- Abort and detach protocol switch methods no longer return failures,
nor attempt to free sockets, as the socket layer does this.
- Invoke in_pcbfree() after in_pcbdetach() in order to free the
detached in_pcb structure for a socket.
MFC after: 3 months
2006-04-01 16:20:54 +00:00
|
|
|
KASSERT(inp != NULL, ("udp_abort: inp == NULL"));
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WLOCK(inp);
|
2006-07-21 17:11:15 +00:00
|
|
|
if (inp->inp_faddr.s_addr != INADDR_ANY) {
|
2014-04-07 01:53:03 +00:00
|
|
|
INP_HASH_WLOCK(pcbinfo);
|
2006-07-21 17:11:15 +00:00
|
|
|
in_pcbdisconnect(inp);
|
|
|
|
inp->inp_laddr.s_addr = INADDR_ANY;
|
2014-04-07 01:53:03 +00:00
|
|
|
INP_HASH_WUNLOCK(pcbinfo);
|
2006-07-21 17:11:15 +00:00
|
|
|
soisdisconnected(so);
|
|
|
|
}
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WUNLOCK(inp);
|
1997-02-14 18:15:53 +00:00
|
|
|
}
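The lock order used by udp_abort() above (per-connection lock first, hash lock second, released in reverse) can be sketched in isolation. This is a single-threaded toy model, not kernel code: `toy_lock`, `toy_pcb`, `toy_pcbinfo` and `toy_abort` are hypothetical names, and the "locks" are plain flags whose only job is to assert the ordering.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag-based lock: tracks ownership only, no real exclusion. */
struct toy_lock {
	bool held;
};

void
toy_lock_acquire(struct toy_lock *l)
{
	assert(!l->held);
	l->held = true;
}

void
toy_lock_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = false;
}

/* Stand-ins for struct inpcbinfo and struct inpcb. */
struct toy_pcbinfo {
	struct toy_lock hash_lock;	/* plays the role of the hash lock */
};

struct toy_pcb {
	struct toy_lock lock;		/* plays the role of the inpcb lock */
	struct toy_pcbinfo *info;
	uint32_t laddr;			/* 0 models INADDR_ANY */
	uint32_t faddr;			/* 0 models INADDR_ANY */
};

/* Returns 1 if a connected pcb was disconnected, 0 if nothing was done. */
int
toy_abort(struct toy_pcb *pcb)
{
	int disconnected = 0;

	toy_lock_acquire(&pcb->lock);		/* connection lock first */
	if (pcb->faddr != 0) {
		/* The hash lock is only taken with the pcb lock held. */
		assert(pcb->lock.held);
		toy_lock_acquire(&pcb->info->hash_lock);
		pcb->faddr = 0;
		pcb->laddr = 0;
		toy_lock_release(&pcb->info->hash_lock);
		disconnected = 1;
	}
	toy_lock_release(&pcb->lock);
	return (disconnected);
}
```

The hash lock is only taken on the path that changes the address tuple, so an abort on an unconnected socket touches the connection lock alone.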
|
1994-05-24 10:09:53 +00:00
|
|
|
|
1997-02-14 18:15:53 +00:00
|
|
|
static int
|
2001-09-12 08:38:13 +00:00
|
|
|
udp_attach(struct socket *so, int proto, struct thread *td)
|
1997-02-14 18:15:53 +00:00
|
|
|
{
|
|
|
|
struct inpcb *inp;
|
2014-04-07 01:53:03 +00:00
|
|
|
struct inpcbinfo *pcbinfo;
|
2005-06-01 11:24:00 +00:00
|
|
|
int error;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2014-04-07 01:53:03 +00:00
|
|
|
pcbinfo = get_inpcbinfo(so->so_proto->pr_protocol);
|
1997-02-14 18:15:53 +00:00
|
|
|
inp = sotoinpcb(so);
|
2006-04-01 16:20:54 +00:00
|
|
|
KASSERT(inp == NULL, ("udp_attach: inp != NULL"));
|
1999-12-07 17:39:16 +00:00
|
|
|
error = soreserve(so, udp_sendspace, udp_recvspace);
|
2006-06-03 19:29:26 +00:00
|
|
|
if (error)
|
2007-02-20 10:13:11 +00:00
|
|
|
return (error);
|
2014-04-07 01:53:03 +00:00
|
|
|
INP_INFO_WLOCK(pcbinfo);
|
|
|
|
error = in_pcballoc(so, pcbinfo);
|
2003-08-19 17:11:46 +00:00
|
|
|
if (error) {
|
2014-04-07 01:53:03 +00:00
|
|
|
INP_INFO_WUNLOCK(pcbinfo);
|
2007-02-20 10:13:11 +00:00
|
|
|
return (error);
|
2003-08-19 17:11:46 +00:00
|
|
|
}
|
1999-12-07 17:39:16 +00:00
|
|
|
|
2010-03-07 10:47:47 +00:00
|
|
|
inp = sotoinpcb(so);
|
1999-12-07 17:39:16 +00:00
|
|
|
inp->inp_vflag |= INP_IPV4;
|
Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).
This is the first in a series of commits over the course
of the next few weeks.
Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.
We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.
Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch
2008-08-17 23:27:27 +00:00
|
|
|
inp->inp_ip_ttl = V_ip_defttl;
|
2009-05-23 16:51:13 +00:00
|
|
|
|
|
|
|
error = udp_newudpcb(inp);
|
|
|
|
if (error) {
|
|
|
|
in_pcbdetach(inp);
|
|
|
|
in_pcbfree(inp);
|
2014-04-07 01:53:03 +00:00
|
|
|
INP_INFO_WUNLOCK(pcbinfo);
|
2009-05-23 16:51:13 +00:00
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
2009-01-06 12:13:40 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2014-04-07 01:53:03 +00:00
|
|
|
INP_INFO_WUNLOCK(pcbinfo);
|
2009-01-06 12:13:40 +00:00
|
|
|
return (0);
|
|
|
|
}
|
2011-04-30 11:17:00 +00:00
|
|
|
#endif /* INET */
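The vimage step-1 commit message quoted in udp_attach() says that V_-prefixed names are mapped back to their global names by macros, making the change a no-op. A minimal sketch of that idea follows. The value 64 and the `toy_pcb` type are assumptions made for illustration; the macro shown here is not claimed to be the exact definition used in the tree.

```c
#include <assert.h>

/* Global default TTL; 64 is an assumed value for this sketch. */
int ip_defttl = 64;

/*
 * Step-1 style mapping: the V_ name is only an alias for the global,
 * so code using V_ip_defttl compiles to the same access as before.
 */
#define	V_ip_defttl	ip_defttl

/* Hypothetical stand-in for the relevant part of struct inpcb. */
struct toy_pcb {
	unsigned char ip_ttl;
};

void
toy_pcb_init(struct toy_pcb *pcb)
{
	pcb->ip_ttl = (unsigned char)V_ip_defttl;
}
```

Because the macro expands to the plain global, a write through `V_ip_defttl` is visible through `ip_defttl` and vice versa; a later step could redefine the macro to select a per-instance copy without touching the call sites.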
|
2009-01-06 12:13:40 +00:00
|
|
|
|
|
|
|
int
|
|
|
|
udp_set_kernel_tunneling(struct socket *so, udp_tun_func_t f)
|
|
|
|
{
|
|
|
|
struct inpcb *inp;
|
2009-05-23 16:51:13 +00:00
|
|
|
struct udpcb *up;
|
2009-01-06 12:13:40 +00:00
|
|
|
|
2010-03-07 10:47:47 +00:00
|
|
|
KASSERT(so->so_type == SOCK_DGRAM,
|
|
|
|
("udp_set_kernel_tunneling: !dgram"));
|
|
|
|
inp = sotoinpcb(so);
|
|
|
|
KASSERT(inp != NULL, ("udp_set_kernel_tunneling: inp == NULL"));
|
2009-01-06 12:13:40 +00:00
|
|
|
INP_WLOCK(inp);
|
2009-05-23 16:51:13 +00:00
|
|
|
up = intoudpcb(inp);
|
|
|
|
if (up->u_tun_func != NULL) {
|
2009-01-06 13:27:56 +00:00
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
return (EBUSY);
|
|
|
|
}
|
2009-05-23 16:51:13 +00:00
|
|
|
up->u_tun_func = f;
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2007-02-20 10:13:11 +00:00
|
|
|
return (0);
|
1997-02-14 18:15:53 +00:00
|
|
|
}
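udp_set_kernel_tunneling() above installs a callback only if none is set yet and reports EBUSY otherwise. The same set-once pattern, stripped of locking and kernel types, might look like the sketch below; `toy_udpcb`, `toy_tun_func_t`, `toy_set_tunneling` and `toy_handler` are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical callback type standing in for udp_tun_func_t. */
typedef void (*toy_tun_func_t)(void *pkt);

/* Stand-in for the part of struct udpcb that holds the callback. */
struct toy_udpcb {
	toy_tun_func_t tun_func;
};

/* Install f if no callback is registered; otherwise refuse with EBUSY. */
int
toy_set_tunneling(struct toy_udpcb *up, toy_tun_func_t f)
{
	if (up->tun_func != NULL)
		return (EBUSY);
	up->tun_func = f;
	return (0);
}

/* Example callback that ignores its packet. */
void
toy_handler(void *pkt)
{
	(void)pkt;
}
```

In the kernel version the check and the store happen under the inpcb write lock, which is what makes the test-and-set atomic with respect to other callers; the sketch omits that and is only safe single-threaded.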
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2011-04-30 11:17:00 +00:00
|
|
|
#ifdef INET
|
1997-02-14 18:15:53 +00:00
|
|
|
static int
|
2001-09-12 08:38:13 +00:00
|
|
|
udp_bind(struct socket *so, struct sockaddr *nam, struct thread *td)
|
1997-02-14 18:15:53 +00:00
|
|
|
{
|
|
|
|
struct inpcb *inp;
|
2014-04-07 01:53:03 +00:00
|
|
|
struct inpcbinfo *pcbinfo;
|
2005-06-01 11:24:00 +00:00
|
|
|
int error;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2014-04-07 01:53:03 +00:00
|
|
|
pcbinfo = get_inpcbinfo(so->so_proto->pr_protocol);
|
1997-02-14 18:15:53 +00:00
|
|
|
inp = sotoinpcb(so);
|
2006-04-01 16:20:54 +00:00
|
|
|
KASSERT(inp != NULL, ("udp_bind: inp == NULL"));
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WLOCK(inp);
|
2014-04-07 01:53:03 +00:00
|
|
|
INP_HASH_WLOCK(pcbinfo);
|
2004-03-27 21:05:46 +00:00
|
|
|
error = in_pcbbind(inp, nam, td->td_ucred);
|
2014-04-07 01:53:03 +00:00
|
|
|
INP_HASH_WUNLOCK(pcbinfo);
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2007-02-20 10:13:11 +00:00
|
|
|
return (error);
|
1997-02-14 18:15:53 +00:00
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2006-07-21 17:11:15 +00:00
|
|
|
static void
|
|
|
|
udp_close(struct socket *so)
|
|
|
|
{
|
|
|
|
struct inpcb *inp;
|
2014-04-07 01:53:03 +00:00
|
|
|
struct inpcbinfo *pcbinfo;
|
2006-07-21 17:11:15 +00:00
|
|
|
|
2014-04-07 01:53:03 +00:00
|
|
|
pcbinfo = get_inpcbinfo(so->so_proto->pr_protocol);
|
2006-07-21 17:11:15 +00:00
|
|
|
inp = sotoinpcb(so);
|
|
|
|
KASSERT(inp != NULL, ("udp_close: inp == NULL"));
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WLOCK(inp);
|
2006-07-21 17:11:15 +00:00
|
|
|
if (inp->inp_faddr.s_addr != INADDR_ANY) {
|
2014-04-07 01:53:03 +00:00
|
|
|
INP_HASH_WLOCK(pcbinfo);
|
2006-07-21 17:11:15 +00:00
|
|
|
in_pcbdisconnect(inp);
|
|
|
|
inp->inp_laddr.s_addr = INADDR_ANY;
|
2014-04-07 01:53:03 +00:00
|
|
|
INP_HASH_WUNLOCK(pcbinfo);
|
2006-07-21 17:11:15 +00:00
|
|
|
soisdisconnected(so);
|
|
|
|
}
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2006-07-21 17:11:15 +00:00
|
|
|
}
|
|
|
|
|
1997-02-14 18:15:53 +00:00
|
|
|
static int
|
2001-09-12 08:38:13 +00:00
|
|
|
udp_connect(struct socket *so, struct sockaddr *nam, struct thread *td)
|
1997-02-14 18:15:53 +00:00
|
|
|
{
|
|
|
|
struct inpcb *inp;
|
2014-04-07 01:53:03 +00:00
|
|
|
struct inpcbinfo *pcbinfo;
|
This Implements the mumbled about "Jail" feature.
This is a seriously beefed up chroot kind of thing. The process
is jailed along the same lines as a chroot does it, but with
additional tough restrictions imposed on what the superuser can do.
For all I know, it is safe to hand over the root bit inside a
prison to the customer living in that prison, this is what
it was developed for in fact: "real virtual servers".
Each prison has an ip number associated with it, which all IP
communications will be coerced to use and each prison has its own
hostname.
Needless to say, you need more RAM this way, but the advantage is
that each customer can run their own particular version of apache
and not stomp on the toes of their neighbors.
It generally does what one would expect, but setting up a jail
still takes a little knowledge.
A few notes:
I have no scripts for setting up a jail, don't ask me for them.
The IP number should be an alias on one of the interfaces.
mount a /proc in each jail, it will make ps more useable.
/proc/<pid>/status tells the hostname of the prison for
jailed processes.
Quotas are only sensible if you have a mountpoint per prison.
There are no provisions for stopping resource-hogging.
Some "#ifdef INET" and similar may be missing (send patches!)
If somebody wants to take it from here and develop it into
more of a "virtual machine" they should be most welcome!
Tools, comments, patches & documentation most welcome.
Have fun...
Sponsored by: http://www.rndassociates.com/
Run for almost a year by: http://www.servetheweb.com/
1999-04-28 11:38:52 +00:00
|
|
|
struct sockaddr_in *sin;
|
2014-04-07 01:53:03 +00:00
|
|
|
int error;
|
1997-02-14 18:15:53 +00:00
|
|
|
|
2014-04-07 01:53:03 +00:00
|
|
|
pcbinfo = get_inpcbinfo(so->so_proto->pr_protocol);
|
1997-02-14 18:15:53 +00:00
|
|
|
inp = sotoinpcb(so);
|
2006-04-01 16:20:54 +00:00
|
|
|
KASSERT(inp != NULL, ("udp_connect: inp == NULL"));
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WLOCK(inp);
|
2002-06-10 20:05:46 +00:00
|
|
|
if (inp->inp_faddr.s_addr != INADDR_ANY) {
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2007-02-20 10:13:11 +00:00
|
|
|
return (EISCONN);
|
2002-06-10 20:05:46 +00:00
|
|
|
}
|
2000-09-17 13:34:18 +00:00
|
|
|
sin = (struct sockaddr_in *)nam;
|
2009-02-05 14:06:09 +00:00
|
|
|
error = prison_remote_ip4(td->td_ucred, &sin->sin_addr);
|
|
|
|
if (error != 0) {
|
MFp4:
Bring in updated jail support from bz_jail branch.
This enhances the current jail implementation to permit multiple
addresses per jail. In addition to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..
SCTP support was updated and supports IPv6 in jails as well.
Cpuset support permits jails to be bound to specific processor
sets after creation.
Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.
DDB 'show jails' command was added to aid debugging.
Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.
Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.
Bump __FreeBSD_version for the aforementioned and in-kernel changes.
Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.
Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2009-02-05 14:06:09 +00:00
|
|
|
return (error);
|
2008-11-29 14:32:14 +00:00
|
|
|
}
|
2014-04-07 01:53:03 +00:00
|
|
|
INP_HASH_WLOCK(pcbinfo);
|
2004-03-27 21:05:46 +00:00
|
|
|
error = in_pcbconnect(inp, nam, td->td_ucred);
|
2014-04-07 01:53:03 +00:00
|
|
|
INP_HASH_WUNLOCK(pcbinfo);
|
2002-05-31 11:52:35 +00:00
|
|
|
if (error == 0)
|
1997-02-14 18:15:53 +00:00
|
|
|
soisconnected(so);
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2007-02-20 10:13:11 +00:00
|
|
|
return (error);
|
1997-02-14 18:15:53 +00:00
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
|
Change protocol switch method pru_detach() so that it returns void
rather than an error. Detaches do not "fail", they either occur or
the protocol flags SS_PROTOREF to take ownership of the socket.
soclose() no longer looks at so_pcb to see if it's NULL, relying
entirely on the protocol to decide whether it's time to free the
socket or not using SS_PROTOREF. so_pcb is now entirely owned and
managed by the protocol code. Likewise, no longer test so_pcb in
other socket functions, such as soreceive(), which have no business
digging into protocol internals.
Protocol detach routines no longer try to free the socket on detach,
this is performed in the socket code if the protocol permits it.
In rts_detach(), no longer test for rp != NULL in detach, and
likewise in other protocols that don't permit a NULL so_pcb, reduce
the incidence of testing for it during detach.
netinet and netinet6 are not fully updated to this change, which
will be in an upcoming commit. In their current state they may leak
memory or panic.
MFC after: 3 months
2006-04-01 15:42:02 +00:00
|
|
|
static void
|
1997-02-14 18:15:53 +00:00
|
|
|
udp_detach(struct socket *so)
|
|
|
|
{
|
|
|
|
struct inpcb *inp;
|
2014-04-07 01:53:03 +00:00
|
|
|
struct inpcbinfo *pcbinfo;
|
2009-05-23 16:51:13 +00:00
|
|
|
struct udpcb *up;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2014-04-07 01:53:03 +00:00
|
|
|
pcbinfo = get_inpcbinfo(so->so_proto->pr_protocol);
|
1997-02-14 18:15:53 +00:00
|
|
|
inp = sotoinpcb(so);
|
2006-04-01 16:20:54 +00:00
|
|
|
KASSERT(inp != NULL, ("udp_detach: inp == NULL"));
|
2006-07-21 17:11:15 +00:00
|
|
|
KASSERT(inp->inp_faddr.s_addr == INADDR_ANY,
|
|
|
|
("udp_detach: not disconnected"));
|
2014-04-07 01:53:03 +00:00
|
|
|
INP_INFO_WLOCK(pcbinfo);
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WLOCK(inp);
|
2009-05-23 16:51:13 +00:00
|
|
|
up = intoudpcb(inp);
|
|
|
|
KASSERT(up != NULL, ("%s: up == NULL", __func__));
|
|
|
|
inp->inp_ppcb = NULL;
|
1997-02-14 18:15:53 +00:00
|
|
|
in_pcbdetach(inp);
|
2006-04-01 16:20:54 +00:00
|
|
|
in_pcbfree(inp);
|
2014-04-07 01:53:03 +00:00
|
|
|
INP_INFO_WUNLOCK(pcbinfo);
|
2009-05-23 16:51:13 +00:00
|
|
|
udp_discardcb(up);
|
1997-02-14 18:15:53 +00:00
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
|
1997-02-14 18:15:53 +00:00
|
|
|
static int
|
|
|
|
udp_disconnect(struct socket *so)
|
|
|
|
{
|
|
|
|
struct inpcb *inp;
|
2014-04-07 01:53:03 +00:00
|
|
|
struct inpcbinfo *pcbinfo;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2014-04-07 01:53:03 +00:00
|
|
|
pcbinfo = get_inpcbinfo(so->so_proto->pr_protocol);
|
1997-02-14 18:15:53 +00:00
|
|
|
inp = sotoinpcb(so);
|
2006-04-01 16:20:54 +00:00
|
|
|
KASSERT(inp != NULL, ("udp_disconnect: inp == NULL"));
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WLOCK(inp);
|
2002-06-10 20:05:46 +00:00
|
|
|
if (inp->inp_faddr.s_addr == INADDR_ANY) {
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2007-02-20 10:13:11 +00:00
|
|
|
return (ENOTCONN);
|
2002-06-10 20:05:46 +00:00
|
|
|
}
|
2014-04-07 01:53:03 +00:00
|
|
|
INP_HASH_WLOCK(pcbinfo);
|
1997-02-14 18:15:53 +00:00
|
|
|
in_pcbdisconnect(inp);
|
|
|
|
inp->inp_laddr.s_addr = INADDR_ANY;
|
2014-04-07 01:53:03 +00:00
|
|
|
INP_HASH_WUNLOCK(pcbinfo);
|
2006-05-21 19:28:46 +00:00
|
|
|
SOCK_LOCK(so);
|
|
|
|
so->so_state &= ~SS_ISCONNECTED; /* XXX */
|
|
|
|
SOCK_UNLOCK(so);
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2007-02-20 10:13:11 +00:00
|
|
|
return (0);
|
1997-02-14 18:15:53 +00:00
|
|
|
}
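udp_connect() and udp_disconnect() above both decide what to do from a single piece of state: whether the foreign address is INADDR_ANY. A simplified sketch of that state check follows. The names `toy_pcb`, `toy_connect`, `toy_disconnect` and `TOY_ADDR_ANY` are hypothetical, and the sketch leaves out locking, jail checks and local address selection.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define	TOY_ADDR_ANY	0u	/* models INADDR_ANY */

/* Stand-in for the address fields of struct inpcb. */
struct toy_pcb {
	uint32_t laddr;
	uint32_t faddr;
};

/* Connecting an already connected pcb fails with EISCONN. */
int
toy_connect(struct toy_pcb *pcb, uint32_t faddr)
{
	if (pcb->faddr != TOY_ADDR_ANY)
		return (EISCONN);
	pcb->faddr = faddr;
	return (0);
}

/* Disconnecting an unconnected pcb fails with ENOTCONN. */
int
toy_disconnect(struct toy_pcb *pcb)
{
	if (pcb->faddr == TOY_ADDR_ANY)
		return (ENOTCONN);
	pcb->faddr = TOY_ADDR_ANY;
	pcb->laddr = TOY_ADDR_ANY;	/* mirrors the laddr reset above */
	return (0);
}
```

Treating "foreign address is wildcard" as the definition of "not connected" is also what lets udp_detach() assert that the socket was disconnected before it is torn down.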
|
|
|
|
|
|
|
|
static int
|
1997-08-16 19:16:27 +00:00
|
|
|
udp_send(struct socket *so, int flags, struct mbuf *m, struct sockaddr *addr,
|
2007-02-20 10:13:11 +00:00
|
|
|
struct mbuf *control, struct thread *td)
|
1997-02-14 18:15:53 +00:00
|
|
|
{
|
|
|
|
struct inpcb *inp;
|
|
|
|
|
|
|
|
inp = sotoinpcb(so);
|
2006-04-01 16:20:54 +00:00
|
|
|
KASSERT(inp != NULL, ("udp_send: inp == NULL"));
|
2007-02-20 10:13:11 +00:00
|
|
|
return (udp_output(inp, m, addr, control, td));
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
2011-04-30 11:17:00 +00:00
|
|
|
#endif /* INET */
|
1994-05-24 10:09:53 +00:00
|
|
|
|
1999-11-05 14:41:39 +00:00
|
|
|
int
|
1997-02-14 18:15:53 +00:00
|
|
|
udp_shutdown(struct socket *so)
|
|
|
|
{
|
1994-05-24 10:09:53 +00:00
|
|
|
struct inpcb *inp;
|
1997-02-14 18:15:53 +00:00
|
|
|
|
|
|
|
inp = sotoinpcb(so);
|
2006-04-01 16:20:54 +00:00
|
|
|
KASSERT(inp != NULL, ("udp_shutdown: inp == NULL"));
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WLOCK(inp);
|
1997-02-14 18:15:53 +00:00
|
|
|
socantsendmore(so);
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2007-02-20 10:13:11 +00:00
|
|
|
return (0);
|
1997-02-14 18:15:53 +00:00
|
|
|
}
|
|
|
|
|
2011-04-30 11:17:00 +00:00
|
|
|
#ifdef INET
|
1997-02-14 18:15:53 +00:00
|
|
|
struct pr_usrreqs udp_usrreqs = {
|
2004-11-08 14:44:54 +00:00
|
|
|
.pru_abort = udp_abort,
|
|
|
|
.pru_attach = udp_attach,
|
|
|
|
.pru_bind = udp_bind,
|
|
|
|
.pru_connect = udp_connect,
|
|
|
|
.pru_control = in_control,
|
|
|
|
.pru_detach = udp_detach,
|
|
|
|
.pru_disconnect = udp_disconnect,
|
2007-05-11 10:20:51 +00:00
|
|
|
.pru_peeraddr = in_getpeeraddr,
|
2004-11-08 14:44:54 +00:00
|
|
|
.pru_send = udp_send,
|
2008-07-02 23:23:27 +00:00
|
|
|
.pru_soreceive = soreceive_dgram,
|
2006-05-06 11:24:59 +00:00
|
|
|
.pru_sosend = sosend_dgram,
|
2004-11-08 14:44:54 +00:00
|
|
|
.pru_shutdown = udp_shutdown,
|
2007-05-11 10:20:51 +00:00
|
|
|
.pru_sockaddr = in_getsockaddr,
|
2006-07-21 17:11:15 +00:00
|
|
|
.pru_sosetlabel = in_pcbsosetlabel,
|
|
|
|
.pru_close = udp_close,
|
1997-02-14 18:15:53 +00:00
|
|
|
};
|
2011-04-30 11:17:00 +00:00
|
|
|
#endif /* INET */
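The udp_usrreqs table above is filled in with C99 designated initializers, so each handler is bound by field name and any field not mentioned is zero-initialized. A small sketch of the same technique, using hypothetical `toy_sock`, `toy_usrreqs` and `toy_shutdown` names rather than the real struct pr_usrreqs:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical socket with one bit of state. */
struct toy_sock {
	int cant_send;
};

/* Hypothetical method table, much smaller than struct pr_usrreqs. */
struct toy_usrreqs {
	int (*shutdown)(struct toy_sock *);
	int (*disconnect)(struct toy_sock *);
};

/* Marks the socket as unable to send, loosely like socantsendmore(). */
int
toy_shutdown(struct toy_sock *so)
{
	so->cant_send = 1;
	return (0);
}

/* Only .shutdown is named; .disconnect is implicitly NULL. */
struct toy_usrreqs toy_udp_usrreqs = {
	.shutdown =	toy_shutdown,
};
```

A caller dispatching through such a table has to either check a slot for NULL or rely on some other mechanism to supply defaults for unset entries; the sketch makes no claim about how the socket layer handles that.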
|