freebsd-skq/sys
Robert Watson fa046d8774 Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
  inpcb counter.  This lock is now relegated to a small number of
  allocation and free operations, and occasional operations that walk
  all connections (including, awkwardly, certain UDP multicast receive
  operations -- something to revisit).

- A new ipi_hash_lock protects the two inpcbinfo hash tables for
  looking up connections and bound sockets, manipulated using new
  INP_HASH_*() macros.  This lock, combined with inpcb locks, protects
  the 4-tuple address space.

Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required.  As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
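
The lock order described above (inpcb lock first, hash lock second) and the
resulting lookup pattern can be sketched in userspace C.  This is only an
illustration under stated assumptions: struct pcb, hash_lock, pcb_insert(),
pcb_lookup(), pcb_rebind() and pcb_drop() are hypothetical stand-ins built on
pthreads and C11 atomics, not the kernel's inpcb code, and a single port
number stands in for the 4-tuple.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an inpcb; not the kernel structure. */
struct pcb {
    pthread_mutex_t lock;     /* per-connection lock (ordered first) */
    atomic_int      refcount; /* pins the pcb between lookup and locking */
    int             dropped;  /* protected by lock; set once unhashed */
    uint16_t        lport;    /* protected by hash_lock; stands in for the 4-tuple */
    struct pcb     *next;     /* protected by hash_lock */
};

/* Hypothetical stand-in for ipi_hash_lock (ordered after the pcb lock). */
static pthread_mutex_t hash_lock = PTHREAD_MUTEX_INITIALIZER;
static struct pcb *hash_head;

void pcb_insert(struct pcb *p)
{
    pthread_mutex_lock(&hash_lock);
    p->next = hash_head;
    hash_head = p;
    pthread_mutex_unlock(&hash_lock);
}

/*
 * Lookup: take a reference while holding the hash lock, drop the hash
 * lock, and only then acquire the pcb lock.  Taking the pcb lock while
 * still holding the hash lock would violate the lock order.
 * Returns the pcb locked, or NULL.
 */
struct pcb *pcb_lookup(uint16_t lport)
{
    struct pcb *p;

    pthread_mutex_lock(&hash_lock);
    for (p = hash_head; p != NULL; p = p->next)
        if (p->lport == lport)
            break;
    if (p != NULL)
        atomic_fetch_add(&p->refcount, 1);
    pthread_mutex_unlock(&hash_lock);
    if (p == NULL)
        return NULL;

    pthread_mutex_lock(&p->lock);
    assert(atomic_load(&p->refcount) > 0);
    atomic_fetch_sub(&p->refcount, 1);  /* the pcb lock now keeps it stable */
    if (p->dropped) {                   /* removed while we were unlocked */
        pthread_mutex_unlock(&p->lock);
        return NULL;
    }
    return p;
}

/* Binding change: caller already holds p->lock, then takes the hash lock. */
void pcb_rebind(struct pcb *p, uint16_t new_lport)
{
    pthread_mutex_lock(&hash_lock);
    p->lport = new_lport;
    pthread_mutex_unlock(&hash_lock);
}

/* Removal: caller holds p->lock; memory reclamation is omitted here. */
void pcb_drop(struct pcb *p)
{
    struct pcb **pp;

    pthread_mutex_lock(&hash_lock);
    for (pp = &hash_head; *pp != NULL; pp = &(*pp)->next)
        if (*pp == p) {
            *pp = p->next;
            break;
        }
    pthread_mutex_unlock(&hash_lock);
    p->dropped = 1;
}
```

The sketch never frees a pcb; real code would release the memory only when
the last reference goes away.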

A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb.  Due to lock order changes, callers
no longer need to acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed.  In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup.  The new lookup flags,
which supplement the existing INPLOOKUP_WILDCARD flag, are:

  INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
  INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb

Callers must pass exactly one of these flags (for the time being).
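
The "exactly one of these flags" rule could be checked along the following
lines.  This is a minimal sketch: the numeric flag values are illustrative
assumptions, and the mask macros and inplookup_flags_valid() are hypothetical
names introduced here, not part of the change.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag names come from the commit message; the values are assumed. */
#define INPLOOKUP_WILDCARD 0x1
#define INPLOOKUP_RLOCKPCB 0x2
#define INPLOOKUP_WLOCKPCB 0x4

/* Hypothetical helper masks for this sketch. */
#define INPLOOKUP_LOCKPCB_MASK (INPLOOKUP_RLOCKPCB | INPLOOKUP_WLOCKPCB)
#define INPLOOKUP_ALL_MASK     (INPLOOKUP_WILDCARD | INPLOOKUP_LOCKPCB_MASK)

/* True if flags holds only known bits and exactly one lock flag. */
bool inplookup_flags_valid(int flags)
{
    int lockflags = flags & INPLOOKUP_LOCKPCB_MASK;

    if ((flags & ~INPLOOKUP_ALL_MASK) != 0)
        return false;
    return lockflags == INPLOOKUP_RLOCKPCB ||
        lockflags == INPLOOKUP_WLOCKPCB;
}
```

A lookup routine might assert such a predicate on entry, so that a caller
passing neither or both lock flags fails early.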

Some notes:

- All protocols are updated to work within the new regime, in particular
  TCP, UDPv4, and UDPv6.  pcbinfo ipi_lock acquisitions are largely
  eliminated, and global hash lock hold times are dramatically reduced
  compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
  may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
  is no longer available -- hash lookup locks are now held only very
  briefly during inpcb lookup, rather than for potentially extended
  periods.  However, the pcbinfo ipi_lock will still be acquired if a
  connection state might change such that a connection is added or
  removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
  due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
  callers to acquire hash locks and perform one or more lookups atomically
  with 4-tuple allocation: this is required only for TCPv6, as there is no
  in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
  locking, which relates to source address selection.  This needs
  attention, as it likely significantly reduces parallelism in this code
  for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
  somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which
  is no longer sufficient.  A second check once the inpcb lock is held
  should do the trick, keeping the general case from requiring the inpcb
  lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
  which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
  undesirable, and probably another argument is required to take care of
  this (or a char array name field in the pcbinfo?).
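
The "second check once the inpcb lock is held" idea from the multicast note
above might look roughly like this.  It is a userspace sketch with
hypothetical names (struct mpcb, mpcb_matches(), mcast_deliver()), not the
UDP input path.  It assumes the caller keeps the list itself stable while
walking, as ipi_lock does for the global inpcb list, and it treats the
unlocked comparison purely as a hint; kernel code would need suitable
primitives for that unlocked read.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an inpcb on the global list. */
struct mpcb {
    pthread_mutex_t lock;      /* per-connection lock */
    uint32_t        laddr;     /* bound address, may change under lock */
    uint16_t        lport;     /* bound port, may change under lock */
    int             delivered; /* counts datagrams handed to this pcb */
    struct mpcb    *next;
};

static int mpcb_matches(const struct mpcb *p, uint32_t laddr, uint16_t lport)
{
    return p->laddr == laddr && p->lport == lport;
}

/*
 * Walk the list and deliver to every matching pcb.  The first match is
 * done without the pcb lock and may be stale; the match is repeated
 * once the lock is held, so only pcbs that look like candidates pay
 * for a lock acquisition.  Returns the number of deliveries.
 */
int mcast_deliver(struct mpcb *head, uint32_t laddr, uint16_t lport)
{
    struct mpcb *p;
    int n = 0;

    for (p = head; p != NULL; p = p->next) {
        if (!mpcb_matches(p, laddr, lport))
            continue;                       /* cheap filter, unlocked */
        pthread_mutex_lock(&p->lock);
        if (mpcb_matches(p, laddr, lport)) { /* authoritative recheck */
            p->delivered++;
            n++;
        }
        pthread_mutex_unlock(&p->lock);
    }
    return n;
}
```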

This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics.  It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.

Reviewed by:    bz
Sponsored by:   Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
..
amd64 Bring back r222275. runfw(4) will statically link in rt2870.fw.uu 2011-05-25 10:04:13 +00:00
arm Move the ZERO_REGION_SIZE to a machine-dependent file, as on many 2011-05-13 19:35:01 +00:00
boot Include forgotten framework changes to get some of the new menu files installed correctly on non x86/amd systems. 2011-05-30 04:23:33 +00:00
bsm Add ECAPMODE, "Not permitted in capability mode", a new kernel errno 2011-03-01 13:14:28 +00:00
cam Change new constant names to ones used by OpenSolaris. 2011-05-27 03:44:47 +00:00
cddl Silence warnings about unsupported value types. 2011-05-27 08:34:31 +00:00
compat Commit the missing linux_videdev2_compat.h (lost somewhere between 2011-05-04 13:09:20 +00:00
conf Add a new driver, the ad7417, to read temperatures and voltages on some 2011-05-29 14:25:42 +00:00
contrib Decompose the current single inpcbinfo lock into two locks: 2011-05-30 09:43:55 +00:00
crypto Fix a bug in the result of manual assembly. 2011-03-02 14:56:58 +00:00
ddb Trim some additional unnecessary <linker_set.h> includes. 2011-04-28 17:59:33 +00:00
dev Fix read_ivar implementation for MMC and SD. 2011-05-30 06:23:51 +00:00
fs Fix the new NFS client so that it handles NFSv4 state 2011-05-27 22:05:10 +00:00
gdb Modify kdb_trap() so that it re-calls the dbbe_trap function as long as 2011-02-18 22:25:11 +00:00
geom Some partitioning tools may have a different opinion about disk 2011-05-27 06:37:42 +00:00
gnu Fix typo in unused function name 2011-05-22 09:58:48 +00:00
i386 Bring back r222275. runfw(4) will statically link in rt2870.fw.uu 2011-05-25 10:04:13 +00:00
ia64 Prefer switching the memory stack from user to kernel *before* switching 2011-05-14 14:55:15 +00:00
isa Move VT switching hack for suspend/resume from bus drivers to syscons.c 2011-05-09 18:46:49 +00:00
kern In soreceive_generic(), if MSG_WAITALL is set but the request is 2011-05-29 18:00:50 +00:00
kgssapi
libkern Fix typos - remove duplicate "is". 2011-02-23 09:22:33 +00:00
mips Merge r221846 from largeSMP project branch: 2011-05-23 23:35:50 +00:00
modules Introduce AR9287 support to the FreeBSD HAL. 2011-05-26 20:31:08 +00:00
net Rework netisr policy mechanism so that per-protocol dispatch policies can 2011-05-24 12:34:19 +00:00
net80211 Fix typo, it is MPDU not MDPU. 2011-05-21 16:41:41 +00:00
netatalk
netgraph Assume the link to be dead if bit error rate (BER) parameter is set to 1. 2011-05-24 14:36:32 +00:00
netinet Decompose the current single inpcbinfo lock into two locks: 2011-05-30 09:43:55 +00:00
netinet6 Decompose the current single inpcbinfo lock into two locks: 2011-05-30 09:43:55 +00:00
netipsec Release SP's refcount in key_get_spdbyid(). 2011-05-09 13:16:21 +00:00
netipx
netnatm
netncp
netsmb Change some variables from int to size_t. This is more accurate since 2011-01-08 23:06:54 +00:00
nfs Change the sysctl naming for the old and new NFS clients 2011-05-15 20:52:43 +00:00
nfsclient Add a check for MNTK_UNMOUNTF at the beginning of nfs_sync() 2011-05-29 20:55:23 +00:00
nfsserver Add a lock flags argument to the VFS_FHTOVP() file system 2011-05-22 01:07:54 +00:00
nlm Add a lock flags argument to the VFS_FHTOVP() file system 2011-05-22 01:07:54 +00:00
ofed In ipoib_cm_handle_rx_wc(): Count incoming packets and 2011-05-26 22:29:43 +00:00
opencrypto After the r219999 is merged to stable/8, rename fallocf(9) to falloc(9) 2011-04-01 13:28:34 +00:00
pc98 Move VT switching hack for suspend/resume from bus drivers to syscons.c 2011-05-09 18:46:49 +00:00
pci Do a sweep of the tree replacing calls to pci_find_extcap() with calls to 2011-03-23 13:10:15 +00:00
powerpc Use kproc_exit() instead of returning from the management function on 2011-05-29 22:37:23 +00:00
rpc This patch is believed to fix a problem in the kernel rpc for 2011-04-27 18:19:26 +00:00
security - Add a FEATURE for capsicum (security_capabilities). 2011-03-04 09:03:54 +00:00
sparc64 Recognize the eeprom device found in Fujitsu PRIMEPOWER650 and 900. 2011-05-15 13:25:26 +00:00
sys Remove definitions for RACCT_FSIZE and RACCT_SBSIZE - these two are rather 2011-05-27 19:57:58 +00:00
teken Add proper build infrastructure for teken. 2011-05-09 16:27:39 +00:00
tools GNU awk does not output escaped newlines in multi-line printc statements. This 2011-03-31 21:33:33 +00:00
ufs Due to a lag in updating the fs_pendinginodes count, we cannot depend 2011-05-28 15:07:29 +00:00
vm Correct an error in r222163. Unless UMA_MD_SMALL_ALLOC is defined, 2011-05-22 17:46:16 +00:00
x86 Implement boot-time TSC synchronization test for SMP. This test is executed 2011-05-09 17:34:00 +00:00
xdr
xen Fix a few more SYSCTL_PROC() that were missing a CTLFLAG type specifier. 2011-01-19 00:57:58 +00:00
Makefile Disconnect sun4v architecture from the tree. 2011-05-14 01:53:38 +00:00