inp_socket. To avoid panic, do not dereference inp_socket,
but obtain reuse port option from inp_flags2, like this
is done after next call to in_pcblookup_local() a few lines
down below.
Submitted by: rwatson
inp_socket->so_options dereference when we may not acquire the lock on
the inpcb.
This fixes the crash due to NULL pointer dereference in
in_pcbbind_setup() when inp_socket->so_options in a pcb returned by
in_pcblookup_local() was checked.
Reported by: dave jones <s.dave.jones@gmail.com>, Arnaud Lacombe <lacombar@gmail.com>
Suggested by: rwatson
Glanced by: rwatson
Tested by: dave jones <s.dave.jones@gmail.com>
struct inpcbgroup. pcbgroups, or "connection groups", supplement the
existing inpcbinfo connection hash table, which when pcbgroups are
enabled, might now be thought of more usefully as a per-protocol
4-tuple reservation table.
Connections are assigned to connection groups base on a hash of their
4-tuple; wildcard sockets require special handling, and are members
of all connection groups. During a connection lookup, a
per-connection group lock is employed rather than the global pcbinfo
lock. By aligning connection groups with input path processing,
connection groups take on an effective CPU affinity, especially when
aligned with RSS work placement (see a forthcoming commit for
details). This eliminates cache line migration associated with
global, protocol-layer data structures in steady state TCP and UDP
processing (with the exception of protocol-layer statistics; further
commit to follow).
Elements of this approach were inspired by Willman, Rixner, and Cox's
2006 USENIX paper, "An Evaluation of Network Stack Parallelization
Strategies in Modern Operating Systems". However, there are also
significant differences: we maintain the inpcb lock, rather than using
the connection group lock for per-connection state.
Likewise, the focus of this implementation is alignment with NIC
packet distribution strategies such as RSS, rather than pure software
strategies. Despite that focus, software distribution is supported
through the parallel netisr implementation, and works well in
configurations where the number of hardware threads is greater than
the number of NIC input queues, such as in the RMI XLR threaded MIPS
architecture.
Another important difference is the continued maintenance of existing
hash tables as "reservation tables" -- these are useful both to
distinguish the resource allocation aspect of protocol name management
and the more common-case lookup aspect. In configurations where
connection tables are aligned with hardware hashes, it is desirable to
use the traditional lookup tables for loopback or encapsulated traffic
rather than take the expense of hardware hashes that are hard to
implement efficiently in software (such as RSS Toeplitz).
Connection group support is enabled by compiling "options PCBGROUP"
into your kernel configuration; for the time being, this is an
experimental feature, and hence is not enabled by default.
Subject to the limited MFCability of change dependencies in inpcb,
and its change to the inpcbinfo init function signature, this change
in principle could be merged to FreeBSD 8.x.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
hash install, etc. For now, these are arguments are unused, but as we add
RSS support, we will want to use hashes extracted from mbufs, rather than
manually calculated hashes of header fields, due to the expensive of the
software version of Toeplitz (and similar hashes).
Add notes that it would be nice to be able to pass mbufs into lookup
routines in pf(4), optimising firewall lookup in the same way, but the
code structure there doesn't facilitate that currently.
(In principle there is no reason this couldn't be MFCed -- the change
extends rather than modifies the KBI. However, it won't be useful without
other previous possibly less MFCable changes.)
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
in_pcb_lport(), in_pcblookup_local(), and in_pcblookup_hash(), and similarly
for IPv6 functions. In the future, we would like to support other flags
relating to locking strategy.
This change doesn't appear to modify the KBI in practice, as callers already
passed in INPLOOKUP_WILDCARD rather than a simple boolean.
MFC after: 3 weeks
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
callers. This also fixes a problem when the prison call could set
the inp->in6p_laddr (laddr) and a following priv_check_cred() call
would return an error and will allow us to merge the IPv4 and IPv6
implementation.
MFC after: 2 weeks
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks. Minor cleanups are done in the process,
and comments updated to reflect these changes.
Reviewed by: bz
Approved by: re (vimage blanket)
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator. Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...). This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.
Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack. Virtualized global variables are
tagged at compile-time, placing the in a special linker set, which is
loaded into a contiguous region of kernel memory. Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of a the kernel linker.
Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy. Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address. When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.
This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.
Bump __FreeBSD_version and update UPDATING.
Portions submitted by: bz
Reviewed by: bz, zec
Discussed with: gnn, jamie, jeff, jhb, julian, sam
Suggested by: peter
Approved by: re (kensmith)
for the global IPv6 address list (in6_ifaddr -> in6_ifaddrhead). Adopt
the code styles and conventions present in netinet where possible.
Reviewed by: gnn, bz
MFC after: 6 weeks (possibly not MFCable?)
to save the selected source address rather than returning an
unreferenced copy to a pointer that might long be gone by the
time we use the pointer for anything meaningful.
Asked for by: rwatson
Reviewed by: rwatson
rather than pointers, requiring callers to properly dispose of those
references. The following routines now return references:
ifaddr_byindex
ifa_ifwithaddr
ifa_ifwithbroadaddr
ifa_ifwithdstaddr
ifa_ifwithnet
ifaof_ifpforaddr
ifa_ifwithroute
ifa_ifwithroute_fib
rt_getifa
rt_getifa_fib
IFP_TO_IA
ip_rtaddr
in6_ifawithifp
in6ifa_ifpforlinklocal
in6ifa_ifpwithaddr
in6_ifadd
carp_iamatch6
ip6_getdstifaddr
Remove unused macro which didn't have required referencing:
IFP_TO_IA6
This closes many small races in which changes to interface
or address lists while an ifaddr was in use could lead to use of freed
memory (etc). In a few cases, add missing if_addr_list locking
required to safely acquire references.
Because of a lack of deep copying support, we accept a race in which
an in6_ifaddr pointed to by mbuf tags and extracted with
ip6_getdstifaddr() doesn't hold a reference while in transmit. Once
we have mbuf tag deep copy support, this can be fixed.
Reviewed by: bz
Obtained from: Apple, Inc. (portions)
MFC after: 6 weeks (portions)
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.
Discussed with: pjd
with OpenBSD (and BSD/OS originally). We can't easly do it SOL_SOCKET option
as there is no more space for more SOL_SOCKET options, but this option also
fits better as an IP socket option, it seems.
- Implement this functionality also for IPv6 and RAW IP sockets.
- Always compile it in (don't use additional kernel options).
- Remove sysctl to turn this functionality on and off.
- Introduce new privilege - PRIV_NETINET_BINDANY, which allows to use this
functionality (currently only unjail root can use it).
Discussed with: julian, adrian, jhb, rwatson, kmacy
by creating a child jail, which is visible to that jail and to any
parent jails. Child jails may be restricted more than their parents,
but never less. Jail names reflect this hierarchy, being MIB-style
dot-separated strings.
Every thread now points to a jail, the default being prison0, which
contains information about the physical system. Prison0's root
directory is the same as rootvnode; its hostname is the same as the
global hostname, and its securelevel replaces the global securelevel.
Note that the variable "securelevel" has actually gone away, which
should not cause any problems for code that properly uses
securelevel_gt() and securelevel_ge().
Some jail-related permissions that were kept in global variables and
set via sysctls are now per-jail settings. The sysctls still exist for
backward compatibility, used only by the now-deprecated jail(2) system
call.
Approved by: bz (mentor)
import from p4 bms_netdev. Summary of changes:
* Connect netinet6/in6_mcast.c to build.
The legacy KAME KPIs are mostly preserved.
* Eliminate now dead code from ip6_output.c.
Don't do mbuf bingo, we are not going to do RFC 2292 style
CMSG tricks for multicast options as they are not required
by any current IPv6 normative reference.
* Refactor transports (UDP, raw_ip6) to do own mcast filtering.
SCTP, TCP unaffected by this change.
* Add ip6_msource, in6_msource structs to in6_var.h.
* Hookup mld_ifinfo state to in6_ifextra, allocate from
domifattach path.
* Eliminate IN6_LOOKUP_MULTI(), it is no longer referenced.
Kernel consumers which need this should use in6m_lookup().
* Refactor IPv6 socket group memberships to use a vector (like IPv4).
* Update ifmcstat(8) for IPv6 SSM.
* Add witness lock order for IN6_MULTI_LOCK.
* Move IN6_MULTI_LOCK out of lower ip6_output()/ip6_input() paths.
* Introduce IP6STAT_ADD/SUB/INC/DEC as per rwatson's IPv4 cleanup.
* Update carp(4) for new IPv6 SSM KPIs.
* Virtualize ip6_mrouter socket.
Changes mostly localized to IPv6 MROUTING.
* Don't do a local group lookup in MROUTING.
* Kill unused KAME prototypes in6_purgemkludge(), in6_restoremkludge().
* Preserve KAME DAD timer jitter behaviour in MLDv1 compatibility mode.
* Bump __FreeBSD_version to 800084.
* Update UPDATING.
NOTE WELL:
* This code hasn't been tested against real MLDv2 queriers
(yet), although the on-wire protocol has been verified in Wireshark.
* There are a few unresolved issues in the socket layer APIs to
do with scope ID propagation.
* There is a LOR present in ip6_output()'s use of
in6_setscope() which needs to be resolved. See comments in mld6.c.
This is believed to be benign and can't be avoided for the moment
without re-introducing an indirect netisr.
This work was mostly derived from the IGMPv3 implementation, and
has been sponsored by a third party.
certain flags that should have been in inp_flags ended up in inp_vflag,
meaning that they were inconsistently locked, and in one case,
interpreted. Move the following flags from inp_vflag to gaps in the
inp_flags space (and clean up the inp_flags constants to make gaps
more obvious to future takers):
INP_TIMEWAIT
INP_SOCKREF
INP_ONESBCAST
INP_DROPPED
Some aspects of this change have no effect on kernel ABI at all, as these
are UDP/TCP/IP-internal uses; however, netstat and sockstat detect
INP_TIMEWAIT when listing TCP sockets, so any MFC will need to take this
into account.
MFC after: 1 week (or after dependencies are MFC'd)
Reviewed by: bz
return zero on success and an error code otherwise. The possible errors
are EADDRNOTAVAIL if an address being checked for doesn't match the
prison, and EAFNOSUPPORT if the prison doesn't have any addresses in
that address family. For most callers of these functions, use the
returned error code instead of e.g. a hard-coded EADDRNOTAVAIL or
EINVAL.
Always include a jailed() check in these functions, where a non-jailed
cred always returns success (and makes no changes). Remove the explicit
jailed() checks that preceded many of the function calls.
Approved by: bz (mentor)
the inpcb names rather than the following IPv6 compat macros:
in6pcb,in6p_sp, in6p_ip6_nxt,in6p_flowinfo,in6p_vflag,
in6p_flags,in6p_socket,in6p_lport,in6p_fport,in6p_ppcb and
sotoin6pcb().
Apart from removing duplicate code in netipsec, this is a pure
whitespace, not a functional change.
Discussed with: rwatson
Reviewed by: rwatson (version before review requested changes)
MFC after: 4 weeks (set the timer and see then)
directly include only the header files needed. This reduces the
unneeded spamming of various headers into lots of files.
For now, this leaves us with very few modules including vnet.h
and thus needing to depend on opt_route.h.
Reviewed by: brooks, gnn, des, zec, imp
Sponsored by: The FreeBSD Foundation
Bring in updated jail support from bz_jail branch.
This enhances the current jail implementation to permit multiple
addresses per jail. In addtion to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..
SCTP support was updated and supports IPv6 in jails as well.
Cpuset support permits jails to be bound to specific processor
sets after creation.
Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.
DDB 'show jails' command was added to aid debugging.
Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.
Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.
Bump __FreeBSD_version for the afore mentioned and in kernel changes.
Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.
Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible
IPsec change in r185366 only differed in two additonal IPv6 lines.
Rather than splattering conditional code everywhere add the v6
check centrally at this single place.
Reviewed by: rwatson (as part of a larger changset)
MFC after: 6 weeks (*)
(*) possibly need to leave a stub wrapper in 7 to keep the symbol.
Ignoring different names because of macros (in6pcb, in6p_sp) and
inp vs. in6p variable name both functions were entirely identical.
Reviewed by: rwatson (as part of a larger changeset)
MFC after: 6 weeks (*)
(*) possibly need to leave a stub wrappers in 7 to keep the symbols.
as in_pcbdetach() and we don't need the code twice.
Reviewed by: rwatson
MFC after: 6 weeks (*)
(*) possibly need to leave a stub wrapper in 7 to keep the symbol.
This means that inp_cred is always there, even after the socket
has gone away. It also means that it is constant for the lifetime
of the inp.
Both facts lead to simpler code and possibly less locking.
Suggested by: rwatson
Reviewed by: rwatson
MFC after: 6 weeks
X-MFC Note: use a inp_pspare for inp_cred
from the vimage project, as per plan established at devsummit 08/08:
http://wiki.freebsd.org/Image/Notes200808DevSummit
Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator
macros, and CURVNET_SET() context setting macros, all currently
resolving to NOPs.
Prepare for virtualization of selected SYSCTL objects by introducing a
family of SYSCTL_V_*() macros, currently resolving to their global
counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT().
Move selected #defines from sys/sys/vimage.h to newly introduced header
files specific to virtualized subsystems (sys/net/vnet.h,
sys/netinet/vinet.h etc.).
All the changes are verified to have zero functional impact at this
point in time by doing MD5 comparision between pre- and post-change
object files(*).
(*) netipsec/keysock.c did not validate depending on compile time options.
Implemented by: julian, bz, brooks, zec
Reviewed by: julian, bz, brooks, kris, rwatson, ...
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation
virtualization work done by Marko Zec (zec@).
This is the first in a series of commits over the course
of the next few weeks.
Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.
We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.
Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch
explicitly select write locking for all use of the inpcb mutex.
Update some pcbinfo lock assertions to assert locked rather than
write-locked, although in practice almost all uses of the pcbinfo
rwlock main exclusive, and all instances of inpcb lock acquisition
are exclusive.
This change should introduce (ideally) little functional change.
However, it lays the groundwork for significantly increased
parallelism in the TCP/IP code.
MFC after: 3 months
Tested by: kris (superset of committered patch)
destroy call; this transpired because the inpcb alloc path for IPv4/IPv6
is the same code, but IPv6 has a separate free path. The results was
that as new IPv6 TCP connections were created, kernel memory would
gradually leak.
MFC after: 3 days
Reported by: tanyong <tanyong at ercist dot iscas dot ac dot cn>,
zhouzhouyi