freebsd-skq

Author	SHA1	Message	Date
mmacy	482d0f575f	inpcb: revert deferred inpcb free pending further review	2018-05-21 16:13:43 +00:00
mmacy	7110d4d796	inpcb: defer destruction of inpcb until after a grace period has elapsed in_pcbfree will remove the incpb from the list and release the rtentry while the vnet is set, but the actual destruction will be deferred until any threads in a (not yet used) epoch section, no longer potentially have references.	2018-05-20 04:38:04 +00:00
mmacy	99ec59840b	in_pcb: add helper for deferring inpcb rele calls from list functions	2018-05-20 02:17:30 +00:00
sbruno	257e6e5563	Revert r332894 at the request of the submitter. Submitted by: Johannes Lundberg <johalun0_gmail.com> Sponsored by: Limelight Networks	2018-04-24 19:55:12 +00:00
sbruno	bbf7d4dd03	Load balance sockets with new SO_REUSEPORT_LB option This patch adds a new socket option, SO_REUSEPORT_LB, which allow multiple programs or threads to bind to the same port and incoming connections will be load balanced using a hash function. Most of the code was copied from a similar patch for DragonflyBSD. However, in DragonflyBSD, load balancing is a global on/off setting and can not be set per socket. This patch allows for simultaneous use of both the current SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system. Required changes to structures Globally change so_options from 16 to 32 bit value to allow for more options. Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets. Limitations As DragonflyBSD, a load balance group is limited to 256 pcbs (256 programs or threads sharing the same socket). Submitted by: Johannes Lundberg <johanlun0@gmail.com> Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D11003	2018-04-23 19:51:00 +00:00
rrs	863f90dbfa	This commit brings in the TCP high precision timer system (tcp_hpts). It is the forerunner/foundational work of bringing in both Rack and BBR which use hpts for pacing out packets. The feature is optional and requires the TCPHPTS option to be enabled before the feature will be active. TCP modules that use it must assure that the base component is compile in the kernel in which they are loaded. MFC after: Never Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D15020	2018-04-19 13:37:59 +00:00
pfg	4736ccfd9c	sys: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 3-Clause license. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, superceed or replace the license texts. Special thanks to Wind River for providing access to "The Duke of Highlander" tool: an older (2014) run over FreeBSD tree was useful as a starting point.	2017-11-20 19:43:44 +00:00
hselasky	7c2ab1d9f6	Add support for generic backpressure indicator for ratelimited transmit queues aswell as non-ratelimited ones. Add the required structure bits in order to support a backpressure indication with ratelimited connections aswell as non-ratelimited ones. The backpressure indicator is a value between zero and 65535 inclusivly, indicating if the destination transmit queue is empty or full respectivly. Applications can use this value as a decision point for when to stop transmitting data to avoid endless ENOBUFS error codes upon transmitting an mbuf. This indicator is also useful to reduce the latency for ratelimited queues. Reviewed by: gallatin, kib, gnn Differential Revision: https://reviews.freebsd.org/D11518 Sponsored by: Mellanox Technologies	2017-09-06 13:56:18 +00:00
glebius	7845c5b75c	o Rearrange struct inpcb fields to optimize the TCP output code path considering cache line hits and misses. Put the lock and hash list glue into the first cache line, put inp_refcount inp_flags inp_socket into the second cache line. o On allocation zero out entire structure except the lock and list entries, including inp_route inp_lle inp_gencnt. When inp_route and inp_lle were introduced, they were added below inp_zero_size, resulting on not being cleared after free/alloc. This definitely was a source of bugs with route caching. Could be that r315956 has just fixed one of them. The inp_gencnt is reinitialized on every alloc, so it is safe to clear it. This has been proved to improve TCP performance at Netflix. Obtained from: rrs Differential Revision: D10686	2017-05-24 17:47:16 +00:00
glebius	19fea9c3e0	Reduce in_pcbinfo_init() by two params. No users supply any flags to this function (they used to say UMA_ZONE_NOFREE), so flag parameter goes away. The zone_fini parameter also goes away. Previously no protocols (except divert) supplied zone_fini function, so inpcb locks were leaked with slabs. This was okay while zones were allocated with UMA_ZONE_NOFREE flag, but now this is a leak. Fix that by suppling inpcb_fini() function as fini method for all inpcb zones.	2017-05-15 21:58:36 +00:00
glebius	a58d0019c1	Force same alignment on struct xinpgen as we have on struct xinpcb. This fixes 32-bit builds.	2017-03-21 16:23:44 +00:00
glebius	3a5c9aaf2b	Hide struct inpcb, struct tcpcb from the userland. This is a painful change, but it is needed. On the one hand, we avoid modifying them, and this slows down some ideas, on the other hand we still eventually modify them and tools like netstat(1) never work on next version of FreeBSD. We maintain a ton of spares in them, and we already got some ifdef hell at the end of tcpcb. Details: - Hide struct inpcb, struct tcpcb under _KERNEL \|\| _WANT_FOO. - Make struct xinpcb, struct xtcpcb pure API structures, not including kernel structures inpcb and tcpcb inside. Export into these structures the fields from inpcb and tcpcb that are known to be used, and put there a ton of spare space. - Make kernel and userland utilities compilable after these changes. - Bump __FreeBSD_version. Reviewed by: rrs, gnn Differential Revision: D10018	2017-03-21 06:39:49 +00:00
glebius	65907cccb6	Make inp_lock_assert() depend on INVARIANT_SUPPORT, not INVARIANTS. This will make INVARIANT-enabled modules, that use this function to load successfully on a kernel that has INVARIANT_SUPPORT only.	2017-03-09 00:55:19 +00:00
eri	f2a480c25c	The patch provides the same socket option as Linux IP_ORIGDSTADDR. Unfortunately they will have different integer value due to Linux value being already assigned in FreeBSD. The patch is similar to IP_RECVDSTADDR but also provides the destination port value to the application. This allows/improves implementation of transparent proxies on UDP sockets due to having the whole information on forwarded packets. Reviewed by: adrian, aw Approved by: ae (mentor) Sponsored by: rsync.net Differential Revision: D9235	2017-03-06 04:01:58 +00:00
imp	7e6cabd06e	Renumber copyright clause 4 Renumber cluase 4 to 3, per what everybody else did when BSD granted them permission to remove clause 3. My insistance on keeping the same numbering for legal reasons is too pedantic, so give up on that point. Submitted by: Jan Schaumann <jschauma@stevens.edu> Pull Request: https://github.com/freebsd/freebsd/pull/96	2017-02-28 23:42:47 +00:00
eri	bb348ba959	Committed without approval from mentor. Reported by: gnn	2017-02-12 06:56:33 +00:00
eri	6898c4334b	Revert r313527 Heh svn is not git	2017-02-10 05:58:16 +00:00
eri	b429db62bc	Correct missed variable name. Reported-by: ohartmann@walstatt.org	2017-02-10 05:51:39 +00:00
eri	ed45b31494	The patch provides the same socket option as Linux IP_ORIGDSTADDR. Unfortunately they will have different integer value due to Linux value being already assigned in FreeBSD. The patch is similar to IP_RECVDSTADDR but also provides the destination port value to the application. This allows/improves implementation of transparent proxies on UDP sockets due to having the whole information on forwarded packets. Sponsored-by: rsync.net Differential Revision: D9235 Reviewed-by: adrian	2017-02-10 05:16:14 +00:00
hselasky	efa6326974	Implement kernel support for hardware rate limited sockets. - Add RATELIMIT kernel configuration keyword which must be set to enable the new functionality. - Add support for hardware driven, Receive Side Scaling, RSS aware, rate limited sendqueues and expose the functionality through the already established SO_MAX_PACING_RATE setsockopt(). The API support rates in the range from 1 to 4Gbytes/s which are suitable for regular TCP and UDP streams. The setsockopt(2) manual page has been updated. - Add rate limit function callback API to "struct ifnet" which supports the following operations: if_snd_tag_alloc(), if_snd_tag_modify(), if_snd_tag_query() and if_snd_tag_free(). - Add support to ifconfig to view, set and clear the IFCAP_TXRTLMT flag, which tells if a network driver supports rate limiting or not. - This patch also adds support for rate limiting through VLAN and LAGG intermediate network devices. - How rate limiting works: 1) The userspace application calls setsockopt() after accepting or making a new connection to set the rate which is then stored in the socket structure in the kernel. Later on when packets are transmitted a check is made in the transmit path for rate changes. A rate change implies a non-blocking ifp->if_snd_tag_alloc() call will be made to the destination network interface, which then sets up a custom sendqueue with the given rate limitation parameter. A "struct m_snd_tag" pointer is returned which serves as a "snd_tag" hint in the m_pkthdr for the subsequently transmitted mbufs. 2) When the network driver sees the "m->m_pkthdr.snd_tag" different from NULL, it will move the packets into a designated rate limited sendqueue given by the snd_tag pointer. It is up to the individual drivers how the rate limited traffic will be rate limited. 3) Route changes are detected by the NIC drivers in the ifp->if_transmit() routine when the ifnet pointer in the incoming snd_tag mismatches the one of the network interface. The network adapter frees the mbuf and returns EAGAIN which causes the ip_output() to release and clear the send tag. Upon next ip_output() a new "snd_tag" will be tried allocated. 4) When the PCB is detached the custom sendqueue will be released by a non-blocking ifp->if_snd_tag_free() call to the currently bound network interface. Reviewed by: wblock (manpages), adrian, gallatin, scottl (network) Differential Revision: https://reviews.freebsd.org/D3687 Sponsored by: Mellanox Technologies MFC after: 3 months	2017-01-18 13:31:17 +00:00
np	af533198e3	Add spares to struct ifnet and socket for packet pacing and/or general use. Update comments regarding the spare fields in struct inpcb. Bump __FreeBSD_version for the changes to the size of the structures. Reviewed by: gnn@ Approved by: re@ (gjb@) Sponsored by: Chelsio Communications	2016-06-23 21:07:15 +00:00
pfg	d9c9113377	sys/net*: minor spelling fixes. No functional change.	2016-05-03 18:05:43 +00:00
gnn	c3d5404bbe	FreeBSD previously provided route caching for TCP (and UDP). Re-add route caching for TCP, with some improvements. In particular, invalidate the route cache if a new route is added, which might be a better match. The cache is automatically invalidated if the old route is deleted. Submitted by: Mike Karels Reviewed by: gnn Differential Revision: https://reviews.freebsd.org/D4306	2016-03-24 07:54:56 +00:00
glebius	790dc6f94a	Use Jenkins hash for TCP syncache. o Unlike xor, in Jenkins hash every bit of input affects virtually every bit of output, thus salting the hash actually works. With xor salting only provides a false sense of security, since if hash(x) collides with hash(y), then of course, hash(x) ^ salt would also collide with hash(y) ^ salt. [1] o Jenkins provides much better distribution than xor, very close to ideal. TCP connection setup/teardown benchmark has shown a 10% increase with default hash size, and with bigger hashes that still provide possibility for collisions. With enormous hash size, when dataset is by an order of magnitude smaller than hash size, the benchmark has shown 4% decrease in performance decrease, which is expected and acceptable. Noticed by: Jeffrey Knockel <jeffk cs.unm.edu> [1] Benchmarks by: jch Reviewed by: jch, pkelsey, delphij Security: strengthens protection against hash collision DoS Sponsored by: Nginx, Inc.	2015-09-05 10:15:19 +00:00
jch	349429fe82	Fix a kernel assertion issue introduced with r286227: Avoid too strict INP_INFO_RLOCK_ASSERT checks due to tcp_notify() being called from in6_pcbnotify(). Reported by: Larry Rosenman <ler@lerctr.org> Submitted by: markj, jch	2015-08-08 08:40:36 +00:00
jch	67927a7a7c	Decompose TCP INP_INFO lock to increase short-lived TCP connections scalability: - The existing TCP INP_INFO lock continues to protect the global inpcb list stability during full list traversal (e.g. tcp_pcblist()). - A new INP_LIST lock protects inpcb list actual modifications (inp allocation and free) and inpcb global counters. It allows to use TCP INP_INFO_RLOCK lock in critical paths (e.g. tcp_input()) and INP_INFO_WLOCK only in occasional operations that walk all connections. PR: 183659 Differential Revision: https://reviews.freebsd.org/D2599 Reviewed by: jhb, adrian Tested by: adrian, nitroboost-gmail.com Sponsored by: Verisign, Inc.	2015-08-03 12:13:54 +00:00
hselasky	12fec3618b	Start process of removing the use of the deprecated "M_FLOWID" flag from the FreeBSD network code. The flag is still kept around in the "sys/mbuf.h" header file, but does no longer have any users. Instead the "m_pkthdr.rsstype" field in the mbuf structure is now used to decide the meaning of the "m_pkthdr.flowid" field. To modify the "m_pkthdr.rsstype" field please use the existing "M_HASHTYPE_XXX" macros as defined in the "sys/mbuf.h" header file. This patch introduces new behaviour in the transmit direction. Previously network drivers checked if "M_FLOWID" was set in "m_flags" before using the "m_pkthdr.flowid" field. This check has now now been replaced by checking if "M_HASHTYPE_GET(m)" is different from "M_HASHTYPE_NONE". In the future more hashtypes will be added, for example hashtypes for hardware dedicated flows. "M_HASHTYPE_OPAQUE" indicates that the "m_pkthdr.flowid" value is valid and has no particular type. This change removes the need for an "if" statement in TCP transmit code checking for the presence of a valid flowid value. The "if" statement mentioned above is now a direct variable assignment which is then later checked by the respective network drivers like before. Additional notes: - The SCTP code changes will be committed as a separate patch. - Removal of the "M_FLOWID" flag will also be done separately. - The FreeBSD version has been bumped. MFC after: 1 month Sponsored by: Mellanox Technologies	2014-12-01 11:45:24 +00:00
melifaro	b5d711d3a6	Renove faith(4) and faithd(8) from base. It looks like industry have chosen different (and more traditional) stateless/statuful NAT64 as translation mechanism. Last non-trivial commits to both faith(4) and faithd(8) happened more than 12 years ago, so I assume it is time to drop RFC3142 in FreeBSD. No objections from: net@	2014-11-09 21:33:01 +00:00
ae	1576b695b6	Add scope zone id to the in_endpoints and hc_metrics structures. A non-global IPv6 address can be used in more than one zone of the same scope. This zone index is used to identify to which zone a non-global address belongs. Also we can have many foreign hosts with equal non-global addresses, but from different zones. So, they can have different metrics in the host cache. Obtained from: Yandex LLC Sponsored by: Yandex LLC	2014-09-10 16:26:18 +00:00
ae	82d0b71937	Introduce INP6_PCBHASHKEY macro. Replace usage of hardcoded part of IPv6 address as hash key in all places. Obtained from: Yandex LLC	2014-09-10 12:35:42 +00:00
adrian	e623d51cd5	Add support for receiving and setting flowtype, flowid and RSS bucket information as part of recvmsg(). This is primarily used for debugging/verification of the various processing paths in the IP, PCB and driver layers. Unfortunately the current implementation of the control message path results in a ~10% or so drop in UDP frame throughput when it's used. Differential Revision: https://reviews.freebsd.org/D527 Reviewed by: grehan	2014-09-09 01:45:39 +00:00
adrian	01670bf027	Expose in_pcbbind_check_bindmulti() so the upcoming IPv6 RSS changes can be made to use it.	2014-07-12 05:40:13 +00:00
adrian	627c6869c3	Implement the first stage of multi-bind listen sockets and RSS socket awareness. * Introduce IP_BINDMULTI - indicating that it's okay to bind multiple sockets on the same bind details. Although the PCB code has been taught about this (see below) this patch doesn't introduce the rest of the PCB changes necessary to distribute lookups among multiple PCB entries in the global wildcard table. * Introduce IP_RSS_LISTEN_BUCKET - placing an listen socket into the given RSS bucket (and thus a single PCBGROUP hash.) * Modify the PCB add path to be aware of IP_BINDMULTI: + Only allow further PCB entries to be added if the owner credentials and IP_BINDMULTI has been specified. Ie, only allow further IP_BINDMULTI sockets to appear if the first bind() was IP_BINDMULTI. * Teach the PCBGROUP code about IP_RSS_LISTE_BUCKET marked PCB entries. Instead of using the wildcard logic and hashing, these sockets are simply placed into the PCBGROUP and _not_ in the wildcard hash. * When doing a PCBGROUP lookup, also do a wildcard match as well. This allows for an RSS bucket PCB entry to appear in a PCBGROUP rather than having to exist in the wildcard list. Tested: * TCP IPv4 server testing with igb(4) * TCP IPv4 server testing with ix(4) TODO: * The pcbgroup lookup code duplicated the wildcard and wildcard-PCB logic. This could be refactored into a single function. * This doesn't yet work for IPv6 (The PCBGROUP code in netinet6/ doesn't yet know about this); nor does it yet fully work for UDP.	2014-07-10 03:10:56 +00:00
adrian	f91e4baca7	Add the flowtype to the inpcb. The flowid isn't enough to use as part of any RSS related CPU affinity lookups - the RSS code would like to know what kind of hash it is.	2014-05-18 22:30:12 +00:00
smh	977eea5c3a	Fix jailed raw sockets not setting the correct source address by calling in_pcbladdr instead of prison_get_ip4 MFC after: 1 month	2014-04-24 12:52:31 +00:00
trociny	9b554dcd02	In r227207, to fix the issue with possible NULL inp_socket pointer dereferencing, when checking for SO_REUSEPORT option (and SO_REUSEADDR for multicast), INP_REUSEPORT flag was introduced to cache the socket option. It was decided then that one flag would be enough to cache both SO_REUSEPORT and SO_REUSEADDR: when processing SO_REUSEADDR setsockopt(2), it was checked if it was called for a multicast address and INP_REUSEPORT was set accordingly. Unfortunately that approach does not work when setsockopt(2) is called before binding to a multicast address: the multicast check fails and INP_REUSEPORT is not set. Fix this by adding INP_REUSEADDR flag to unconditionally cache SO_REUSEADDR. PR: 179901 Submitted by: Michael Gmelin freebsd grem.de (initial version) Reviewed by: rwatson MFC after: 1 week	2013-07-04 18:38:00 +00:00
andre	cc8c6e4d01	Back out r249318, r249320 and r249327 due to a heisenbug most likely related to a race condition in the ipi_hash_lock with the exact cause currently unknown but under investigation.	2013-05-06 16:42:18 +00:00
andre	306fddaf78	Change certain heavily used network related mutexes and rwlocks to reside on their own cache line to prevent false sharing with other nearby structures, especially for those in the .bss segment. NB: Those mutexes and rwlocks with variables next to them that get changed on every invocation do not benefit from their own cache line. Actually it may be net negative because two cache misses would be incurred in those cases.	2013-04-09 21:02:20 +00:00
glebius	a73c365c3b	There is a complex race in in_pcblookup_hash() and in_pcblookup_group(). Both functions need to obtain lock on the found PCB, and they can't do classic inter-lock with the PCB hash lock, due to lock order reversal. To keep the PCB stable, these functions put a reference on it and after PCB lock is acquired drop it. If the reference was the last one, this means we've raced with in_pcbfree() and the PCB is no longer valid. This approach works okay only if we are acquiring writer-lock on the PCB. In case of reader-lock, the following scenario can happen: - 2 threads locate pcb, and do in_pcbref() on it. - These 2 threads drop the inp hash lock. - Another thread comes to delete pcb via in_pcbfree(), it obtains hash lock, does in_pcbremlists(), drops hash lock, and runs in_pcbrele_wlocked(), which doesn't free the pcb due to two references on it. Then it unlocks the pcb. - 2 aforementioned threads acquire reader lock on the pcb and run in_pcbrele_rlocked(). One gets 1 from in_pcbrele_rlocked() and continues, second gets 0 and considers pcb freed, returns. - The thread that got 1 continutes working with detached pcb, which later leads to panic in the underlying protocol level. To plumb that problem an additional INPCB flag introduced - INP_FREED. We check for that flag in the in_pcbrele_rlocked() and if it is set, we pretend that that was the last reference. Discussed with: rwatson, jhb Reported by: Vladimir Medvedkin <medved rambler-co.ru>	2012-10-02 12:03:02 +00:00
tuexen	32041f44ed	Add a IP_RECVTOS socket option to receive for received UDP/IPv4 packets a cmsg of type IP_RECVTOS which contains the TOS byte. Much like IP_RECVTTL does for TTL. This allows to implement a protocol on top of UDP and implementing ECN. MFC after: 3 days	2012-06-12 14:02:38 +00:00
rmh	26b44e3c9d	Hide a few declarations from userland (including `struct inpcbgroup'). This removes the dependency on <machine/param.h> which was introduced with SVN rev 222748 (due to CACHE_LINE_SIZE). Reviewed by: bde MFC after: 10 days	2012-03-17 21:51:39 +00:00
trociny	f9135967f2	Cache SO_REUSEPORT socket option in inpcb-layer in order to avoid inp_socket->so_options dereference when we may not acquire the lock on the inpcb. This fixes the crash due to NULL pointer dereference in in_pcbbind_setup() when inp_socket->so_options in a pcb returned by in_pcblookup_local() was checked. Reported by: dave jones <s.dave.jones@gmail.com>, Arnaud Lacombe <lacombar@gmail.com> Suggested by: rwatson Glanced by: rwatson Tested by: dave jones <s.dave.jones@gmail.com>	2011-11-06 10:47:20 +00:00
bz	352be4e985	Add spares to the network stack for FreeBSD-9: - TCP keep* timers - TCP UTO (adjust from what was there already) - netmap - route caching - user cookie (temporary to allow for the real fix) Slightly re-shuffle struct ifnet moving fields out of the middle of spares and to better align. Discussed with: rwatson (slightly earlier version)	2011-07-17 21:15:20 +00:00
bz	45c14b9c11	Unbreak kernels with non-default PCBGROUP included but no WITNESS. Rather than including lock.h in in_pcbgroup.c in right order, fix it for all consumers of in_pcb.h by further header file pollution under #ifdef KERNEL. Reported by: Pan Tsu (inyaoo gmail.com)	2011-06-06 21:45:32 +00:00
rwatson	6e29aea1db	Implement a CPU-affine TCP and UDP connection lookup data structure, struct inpcbgroup. pcbgroups, or "connection groups", supplement the existing inpcbinfo connection hash table, which when pcbgroups are enabled, might now be thought of more usefully as a per-protocol 4-tuple reservation table. Connections are assigned to connection groups base on a hash of their 4-tuple; wildcard sockets require special handling, and are members of all connection groups. During a connection lookup, a per-connection group lock is employed rather than the global pcbinfo lock. By aligning connection groups with input path processing, connection groups take on an effective CPU affinity, especially when aligned with RSS work placement (see a forthcoming commit for details). This eliminates cache line migration associated with global, protocol-layer data structures in steady state TCP and UDP processing (with the exception of protocol-layer statistics; further commit to follow). Elements of this approach were inspired by Willman, Rixner, and Cox's 2006 USENIX paper, "An Evaluation of Network Stack Parallelization Strategies in Modern Operating Systems". However, there are also significant differences: we maintain the inpcb lock, rather than using the connection group lock for per-connection state. Likewise, the focus of this implementation is alignment with NIC packet distribution strategies such as RSS, rather than pure software strategies. Despite that focus, software distribution is supported through the parallel netisr implementation, and works well in configurations where the number of hardware threads is greater than the number of NIC input queues, such as in the RMI XLR threaded MIPS architecture. Another important difference is the continued maintenance of existing hash tables as "reservation tables" -- these are useful both to distinguish the resource allocation aspect of protocol name management and the more common-case lookup aspect. In configurations where connection tables are aligned with hardware hashes, it is desirable to use the traditional lookup tables for loopback or encapsulated traffic rather than take the expense of hardware hashes that are hard to implement efficiently in software (such as RSS Toeplitz). Connection group support is enabled by compiling "options PCBGROUP" into your kernel configuration; for the time being, this is an experimental feature, and hence is not enabled by default. Subject to the limited MFCability of change dependencies in inpcb, and its change to the inpcbinfo init function signature, this change in principle could be merged to FreeBSD 8.x. Reviewed by: bz Sponsored by: Juniper Networks, Inc.	2011-06-06 12:55:02 +00:00
rwatson	e9eb5d3b9c	Add _mbuf() variants of various inpcb-related interfaces, including lookup, hash install, etc. For now, these are arguments are unused, but as we add RSS support, we will want to use hashes extracted from mbufs, rather than manually calculated hashes of header fields, due to the expensive of the software version of Toeplitz (and similar hashes). Add notes that it would be nice to be able to pass mbufs into lookup routines in pf(4), optimising firewall lookup in the same way, but the code structure there doesn't facilitate that currently. (In principle there is no reason this couldn't be MFCed -- the change extends rather than modifies the KBI. However, it won't be useful without other previous possibly less MFCable changes.) Reviewed by: bz Sponsored by: Juniper Networks, Inc.	2011-06-04 16:33:06 +00:00
rwatson	fdfdadb612	Decompose the current single inpcbinfo lock into two locks: - The existing ipi_lock continues to protect the global inpcb list and inpcb counter. This lock is now relegated to a small number of allocation and free operations, and occasional operations that walk all connections (including, awkwardly, certain UDP multicast receive operations -- something to revisit). - A new ipi_hash_lock protects the two inpcbinfo hash tables for looking up connections and bound sockets, manipulated using new INP_HASH_*() macros. This lock, combined with inpcb locks, protects the 4-tuple address space. Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb connection locks, so may be acquired while manipulating a connection on which a lock is already held, avoiding the need to acquire the inpcbinfo lock preemptively when a binding change might later be required. As a result, however, lookup operations necessarily go through a reference acquire while holding the lookup lock, later acquiring an inpcb lock -- if required. A new function in_pcblookup() looks up connections, and accepts flags indicating how to return the inpcb. Due to lock order changes, callers no longer need acquire locks before performing a lookup: the lookup routine will acquire the ipi_hash_lock as needed. In the future, it will also be able to use alternative lookup and locking strategies transparently to callers, such as pcbgroup lookup. New lookup flags are, supplementing the existing INPLOOKUP_WILDCARD flag: INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb Callers must pass exactly one of these flags (for the time being). Some notes: - All protocols are updated to work within the new regime; especially, TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely eliminated, and global hash lock hold times are dramatically reduced compared to previous locking. - The TCP syncache still relies on the pcbinfo lock, something that we may want to revisit. - Support for reverting to the FreeBSD 7.x locking strategy in TCP input is no longer available -- hash lookup locks are now held only very briefly during inpcb lookup, rather than for potentially extended periods. However, the pcbinfo ipi_lock will still be acquired if a connection state might change such that a connection is added or removed. - Raw IP sockets continue to use the pcbinfo ipi_lock for protection, due to maintaining their own hash tables. - The interface in6_pcblookup_hash_locked() is maintained, which allows callers to acquire hash locks and perform one or more lookups atomically with 4-tuple allocation: this is required only for TCPv6, as there is no in6_pcbconnect_setup(), which there should be. - UDPv6 locking remains significantly more conservative than UDPv4 locking, which relates to source address selection. This needs attention, as it likely significantly reduces parallelism in this code for multithreaded socket use (such as in BIND). - In the UDPv4 and UDPv6 multicast cases, we need to revisit locking somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which is no longer sufficient. A second check once the inpcb lock is held should do the trick, keeping the general case from requiring the inpcb lock for every inpcb visited. - This work reminds us that we need to revisit locking of the v4/v6 flags, which may be accessed lock-free both before and after this change. - Right now, a single lock name is used for the pcbhash lock -- this is undesirable, and probably another argument is required to take care of this (or a char array name field in the pcbinfo?). This is not an MFC candidate for 8.x due to its impact on lookup and locking semantics. It's possible some of these issues could be worked around with compatibility wrappers, if necessary. Reviewed by: bz Sponsored by: Juniper Networks, Inc.	2011-05-30 09:43:55 +00:00
rwatson	95a805600b	Continue to refine inpcb reference counting and locking, in preparation for reworking of inpcbinfo locking: (1) Convert inpcb reference counting from manually manipulated integers to the refcount(9) KPI. This allows the refcount to be managed atomically with an inpcb read lock rather than write lock, or even with no inpcb lock at all. As a result, in_pcbref() also no longer requires an inpcb lock, so can be performed solely using the lock used to look up an inpcb. (2) Shift more inpcb freeing activity from the in_pcbrele() context (via in_pcbfree_internal) to the explicit in_pcbfree() context. This means that the inpcb refcount is increasingly used only to maintain memory stability, not actually defer the clean up of inpcb protocol parts. This is desirable as many of those protocol parts required the pcbinfo lock, which we'd like not to acquire in in_pcbrele() contexts. Document this in comments better. (3) Introduce new read-locked and write-locked in_pcbrele() variations, in_pcbrele_rlocked() and in_pcbrele_wlocked(), which allow the inpcb to be properly unlocked as needed. in_pcbrele() is a wrapper around the latter, and should probably go away at some point. This makes it easier to use this weak reference model when holding only a read lock, as will happen in the future. This may well be safe to MFC, but some more KBI analysis is required. Reviewed by: bz MFC after: 3 weeks Sponsored by: Juniper Networks, Inc.	2011-05-23 19:32:02 +00:00
rwatson	56d91395ad	A number of quite incremental refinements to struct inpcbinfo's definition: (1) Add a locking guide for inpcbinfo. (2) Annotate inpcbinfo fields with synchronisation information; not all annotations are 100% satisfactory. (3) Reorder inpcbinfo fields so that the lock is at the head of the structure, and close to fields it protects. (4) Sort fields that will eventually be hashlock/pcbgroup-related together even though they remain locked by ipi_lock for now. Reviewed by: bz Sponsored by: Juniper Networks X-MFC after: KBI analysis required	2011-05-23 13:51:57 +00:00
bz	25cccaf492	MFp4 CH=191470: Move the ipport_tick_callout and related functions from ip_input.c to in_pcb.c. The random source port allocation code has been merged and is now local to in_pcb.c only. Use a SYSINIT to get the callout started and no longer depend on initialization from the inet code, which would not work in an IPv6 only setup. Reviewed by: gnn Sponsored by: The FreeBSD Foundation Sponsored by: iXsystems MFC after: 4 days	2011-04-20 08:00:29 +00:00

1 2 3 4

195 Commits