freebsd-dev

Author	SHA1	Message	Date
Gleb Smirnoff	1e4f4e56b9	ip6_output() has a complex set of gotos, and some can jump out of the epoch section towards return statement. Since entering epoch is cheap, it is easier to cover the whole function with epoch, rather than try to properly maintain its state.	2019-10-09 17:02:28 +00:00
Mark Johnston	cb49ec5431	Improve locking in the IPV6_V6ONLY socket option handler. Acquire the inp lock before checking whether the socket is already bound, and around updates to the inp_vflag field. MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21867	2019-10-07 23:35:23 +00:00
Gleb Smirnoff	b8a6e03fac	Widen NET_EPOCH coverage. When epoch(9) was introduced to network stack, it was basically dropped in place of existing locking, which was mutexes and rwlocks. For the sake of performance mutex covered areas were as small as possible, so became epoch covered areas. However, epoch doesn't introduce any contention, it just delays memory reclaim. So, there is no point to minimise epoch covered areas in sense of performance. Meanwhile entering/exiting epoch also has non-zero CPU usage, so doing this less often is a win. Not the least is also code maintainability. In the new paradigm we can assume that at any stage of processing a packet, we are inside network epoch. This makes coding both input and output path way easier. On output path we already enter epoch quite early - in the ip_output(), in the ip6_output(). This patch does the same for the input path. All ISR processing, network related callouts, other ways of packet injection to the network stack shall be performed in net_epoch. Any leaf function that walks network configuration now asserts epoch. Tricky part is configuration code paths - ioctls, sysctls. They also call into leaf functions, so some need to be changed. This patch would introduce more epoch recursions (see EPOCH_TRACE) than we had before. They will be cleaned up separately, as several of them aren't trivial. Note, that unlike a lock recursion the epoch recursion is safe and just wastes a bit of resources. Reviewed by: gallatin, hselasky, cy, adrian, kristof Differential Revision: https://reviews.freebsd.org/D19111	2019-10-07 22:40:05 +00:00
John Baldwin	b2e60773c6	Add kernel-side support for in-kernel TLS. KTLS adds support for in-kernel framing and encryption of Transport Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports offload of TLS for transmitted data. Key negotation must still be performed in userland. Once completed, transmit session keys for a connection are provided to the kernel via a new TCP_TXTLS_ENABLE socket option. All subsequent data transmitted on the socket is placed into TLS frames and encrypted using the supplied keys. Any data written to a KTLS-enabled socket via write(2), aio_write(2), or sendfile(2) is assumed to be application data and is encoded in TLS frames with an application data type. Individual records can be sent with a custom type (e.g. handshake messages) via sendmsg(2) with a new control message (TLS_SET_RECORD_TYPE) specifying the record type. At present, rekeying is not supported though the in-kernel framework should support rekeying. KTLS makes use of the recently added unmapped mbufs to store TLS frames in the socket buffer. Each TLS frame is described by a single ext_pgs mbuf. The ext_pgs structure contains the header of the TLS record (and trailer for encrypted records) as well as references to the associated TLS session. KTLS supports two primary methods of encrypting TLS frames: software TLS and ifnet TLS. Software TLS marks mbufs holding socket data as not ready via M_NOTREADY similar to sendfile(2) when TLS framing information is added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then called to schedule TLS frames for encryption. In the case of sendfile_iodone() calls ktls_enqueue() instead of pru_ready() leaving the mbufs marked M_NOTREADY until encryption is completed. For other writes (vn_sendfile when pages are available, write(2), etc.), the PRUS_NOTREADY is set when invoking pru_send() along with invoking ktls_enqueue(). A pool of worker threads (the "KTLS" kernel process) encrypts TLS frames queued via ktls_enqueue(). Each TLS frame is temporarily mapped using the direct map and passed to a software encryption backend to perform the actual encryption. (Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if someone wished to make this work on architectures without a direct map.) KTLS supports pluggable software encryption backends. Internally, Netflix uses proprietary pure-software backends. This commit includes a simple backend in a new ktls_ocf.ko module that uses the kernel's OpenCrypto framework to provide AES-GCM encryption of TLS frames. As a result, software TLS is now a bit of a misnomer as it can make use of hardware crypto accelerators. Once software encryption has finished, the TLS frame mbufs are marked ready via pru_ready(). At this point, the encrypted data appears as regular payload to the TCP stack stored in unmapped mbufs. ifnet TLS permits a NIC to offload the TLS encryption and TCP segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS) is allocated on the interface a socket is routed over and associated with a TLS session. TLS records for a TLS session using ifnet TLS are not marked M_NOTREADY but are passed down the stack unencrypted. The ip_output_send() and ip6_output_send() helper functions that apply send tags to outbound IP packets verify that the send tag of the TLS record matches the outbound interface. If so, the packet is tagged with the TLS send tag and sent to the interface. The NIC device driver must recognize packets with the TLS send tag and schedule them for TLS encryption and TCP segmentation. If the the outbound interface does not match the interface in the TLS send tag, the packet is dropped. In addition, a task is scheduled to refresh the TLS send tag for the TLS session. If a new TLS send tag cannot be allocated, the connection is dropped. If a new TLS send tag is allocated, however, subsequent packets will be tagged with the correct TLS send tag. (This latter case has been tested by configuring both ports of a Chelsio T6 in a lagg and failing over from one port to another. As the connections migrated to the new port, new TLS send tags were allocated for the new port and connections resumed without being dropped.) ifnet TLS can be enabled and disabled on supported network interfaces via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported across both vlan devices and lagg interfaces using failover, lacp with flowid enabled, or lacp with flowid enabled. Applications may request the current KTLS mode of a connection via a new TCP_TXTLS_MODE socket option. They can also use this socket option to toggle between software and ifnet TLS modes. In addition, a testing tool is available in tools/tools/switch_tls. This is modeled on tcpdrop and uses similar syntax. However, instead of dropping connections, -s is used to force KTLS connections to switch to software TLS and -i is used to switch to ifnet TLS. Various sysctls and counters are available under the kern.ipc.tls sysctl node. The kern.ipc.tls.enable node must be set to true to enable KTLS (it is off by default). The use of unmapped mbufs must also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS. KTLS is enabled via the KERN_TLS kernel option. This patch is the culmination of years of work by several folks including Scott Long and Randall Stewart for the original design and implementation; Drew Gallatin for several optimizations including the use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records awaiting software encryption, and pluggable software crypto backends; and John Baldwin for modifications to support hardware TLS offload. Reviewed by: gallatin, hselasky, rrs Obtained from: Netflix Sponsored by: Netflix, Chelsio Communications Differential Revision: https://reviews.freebsd.org/D21277	2019-08-27 00:01:56 +00:00
Bjoern A. Zeeb	0ecd976e80	IPv6 cleanup: kernel Finish what was started a few years ago and harmonize IPv6 and IPv4 kernel names. We are down to very few places now that it is feasible to do the change for everything remaining with causing too much disturbance. Remove "aliases" for IPv6 names which confusingly could indicate that we are talking about a different data structure or field or have two fields, one for each address family. Try to follow common conventions used in FreeBSD. * Rename sin6p to sin6 as that is how it is spelt in most places. * Remove "aliases" (#defines) for: - in6pcb which really is an inpcb and nothing separate - sotoin6pcb which is sotoinpcb (as per above) - in6p_sp which is inp_sp - in6p_flowinfo which is inp_flow * Try to use ia6 for in6_addr rather than in6p. * With all these gone also rename the in6p variables to inp as that is what we call it in most of the network stack including parts of netinet6. The reasons behind this cleanup are that we try to further unify netinet and netinet6 code where possible and that people will less ignore one or the other protocol family when doing code changes as they may not have spotted places due to different names for the same thing. No functional changes. Discussed with: tuexen (SCTP changes) MFC after: 3 months Sponsored by: Netflix	2019-08-02 07:41:36 +00:00
John Baldwin	82334850ea	Add an external mbuf buffer type that holds multiple unmapped pages. Unmapped mbufs allow sendfile to carry multiple pages of data in a single mbuf, without mapping those pages. It is a requirement for Netflix's in-kernel TLS, and provides a 5-10% CPU savings on heavy web serving workloads when used by sendfile, due to effectively compressing socket buffers by an order of magnitude, and hence reducing cache misses. For this new external mbuf buffer type (EXT_PGS), the ext_buf pointer now points to a struct mbuf_ext_pgs structure instead of a data buffer. This structure contains an array of physical addresses (this reduces cache misses compared to an earlier version that stored an array of vm_page_t pointers). It also stores additional fields needed for in-kernel TLS such as the TLS header and trailer data that are currently unused. To more easily detect these mbufs, the M_NOMAP flag is set in m_flags in addition to M_EXT. Various functions like m_copydata() have been updated to safely access packet contents (using uiomove_fromphys()), to make things like BPF safe. NIC drivers advertise support for unmapped mbufs on transmit via a new IFCAP_NOMAP capability. This capability can be toggled via the new 'nomap' and '-nomap' ifconfig(8) commands. For NIC drivers that only transmit packet contents via DMA and use bus_dma, adding the capability to if_capabilities and if_capenable should be all that is required. If a NIC does not support unmapped mbufs, they are converted to a chain of mapped mbufs (using sf_bufs to provide the mapping) in ip_output or ip6_output. If an unmapped mbuf requires software checksums, it is also converted to a chain of mapped mbufs before computing the checksum. Submitted by: gallatin (earlier version) Reviewed by: gallatin, hselasky, rrs Discussed with: ae, kp (firewalls) Relnotes: yes Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20616	2019-06-29 00:48:33 +00:00
John Baldwin	77a0144145	Sort opt_foo.h #includes and add a missing blank line in ip_output().	2019-06-11 22:07:39 +00:00
John Baldwin	fb3bc59600	Restructure mbuf send tags to provide stronger guarantees. - Perform ifp mismatch checks (to determine if a send tag is allocated for a different ifp than the one the packet is being output on), in ip_output() and ip6_output(). This avoids sending packets with send tags to ifnet drivers that don't support send tags. Since we are now checking for ifp mismatches before invoking if_output, we can now try to allocate a new tag before invoking if_output sending the original packet on the new tag if allocation succeeds. To avoid code duplication for the fragment and unfragmented cases, add ip_output_send() and ip6_output_send() as wrappers around if_output and nd6_output_ifp, respectively. All of the logic for setting send tags and dealing with send tag-related errors is done in these wrapper functions. For pseudo interfaces that wrap other network interfaces (vlan and lagg), wrapper send tags are now allocated so that ip*_output see the wrapper ifp as the ifp in the send tag. The if_transmit routines rewrite the send tags after performing an ifp mismatch check. If an ifp mismatch is detected, the transmit routines fail with EAGAIN. - To provide clearer life cycle management of send tags, especially in the presence of vlan and lagg wrapper tags, add a reference count to send tags managed via m_snd_tag_ref() and m_snd_tag_rele(). Provide a helper function (m_snd_tag_init()) for use by drivers supporting send tags. m_snd_tag_init() takes care of the if_ref on the ifp meaning that code alloating send tags via if_snd_tag_alloc no longer has to manage that manually. Similarly, m_snd_tag_rele drops the refcount on the ifp after invoking if_snd_tag_free when the last reference to a send tag is dropped. This also closes use after free races if there are pending packets in driver tx rings after the socket is closed (e.g. from tcpdrop). In order for m_free to work reliably, add a new CSUM_SND_TAG flag in csum_flags to indicate 'snd_tag' is set (rather than 'rcvif'). Drivers now also check this flag instead of checking snd_tag against NULL. This avoids false positive matches when a forwarded packet has a non-NULL rcvif that was treated as a send tag. - cxgbe was relying on snd_tag_free being called when the inp was detached so that it could kick the firmware to flush any pending work on the flow. This is because the driver doesn't require ACK messages from the firmware for every request, but instead does a kind of manual interrupt coalescing by only setting a flag to request a completion on a subset of requests. If all of the in-flight requests don't have the flag when the tag is detached from the inp, the flow might never return the credits. The current snd_tag_free command issues a flush command to force the credits to return. However, the credit return is what also frees the mbufs, and since those mbufs now hold references on the tag, this meant that snd_tag_free would never be called. To fix, explicitly drop the mbuf's reference on the snd tag when the mbuf is queued in the firmware work queue. This means that once the inp's reference on the tag goes away and all in-flight mbufs have been queued to the firmware, tag's refcount will drop to zero and snd_tag_free will kick in and send the flush request. Note that we need to avoid doing this in the middle of ethofld_tx(), so the driver grabs a temporary reference on the tag around that loop to defer the free to the end of the function in case it sends the last mbuf to the queue after the inp has dropped its reference on the tag. - mlx5 preallocates send tags and was using the ifp pointer even when the send tag wasn't in use. Explicitly use the ifp from other data structures instead. - Sprinkle some assertions in various places to assert that received packets don't have a send tag, and that other places that overwrite rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer. Reviewed by: gallatin, hselasky, rgrimes, ae Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20117	2019-05-24 22:30:40 +00:00
John Baldwin	c9d337083f	Apply r280991 to ip6_fragment. This uses m_dup_pkthdr() to copy all of the metadata about a packet to each of its fragments including VLAN tags, mbuf tags, etc. instead of hand-copying a few fields. Reviewed by: bz MFC after: 1 month Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20117	2019-05-10 20:15:40 +00:00
Andrew Gallatin	50575ce11c	Track TCP connection's NUMA domain in the inpcb Drivers can now pass up numa domain information via the mbuf numa domain field. This information is then used by TCP syncache_socket() to associate that information with the inpcb. The domain information is then fed back into transmitted mbufs in ip{6}_output(). This mechanism is nearly identical to what is done to track RSS hash values in the inp_flowid. Follow on changes will use this information for lacp egress port selection, binding TCP pacers to the appropriate NUMA domain, etc. Reviewed by: markj, kib, slavash, bz, scottl, jtl, tuexen Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20028	2019-04-25 15:37:28 +00:00
Michael Tuexen	2f041b74b9	Improve input validation for the socket option IPV6_CHECKSUM. When using the IPPROTO_IPV6 level socket option IPV6_CHECKSUM on a raw IPv6 socket, ensure that the value is either -1 or a non-negative even number. Reviewed by: bz@, thj@ MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D19966	2019-04-19 17:17:41 +00:00
Gleb Smirnoff	b252313f0b	New pfil(9) KPI together with newborn pfil API and control utility. The KPI have been reviewed and cleansed of features that were planned back 20 years ago and never implemented. The pfil(9) internals have been made opaque to protocols with only returned types and function declarations exposed. The KPI is made more strict, but at the same time more extensible, as kernel uses same command structures that userland ioctl uses. In nutshell [KA]PI is about declaring filtering points, declaring filters and linking and unlinking them together. New [KA]PI makes it possible to reconfigure pfil(9) configuration: change order of hooks, rehook filter from one filtering point to a different one, disconnect a hook on output leaving it on input only, prepend/append a filter to existing list of filters. Now it possible for a single packet filter to provide multiple rulesets that may be linked to different points. Think of per-interface ACLs in Cisco or Juniper. None of existing packet filters yet support that, however limited usage is already possible, e.g. default ruleset can be moved to single interface, as soon as interface would pride their filtering points. Another future feature is possiblity to create pfil heads, that provide not an mbuf pointer but just a memory pointer with length. That would allow filtering at very early stages of a packet lifecycle, e.g. when packet has just been received by a NIC and no mbuf was yet allocated. Differential Revision: https://reviews.freebsd.org/D18951	2019-01-31 23:01:03 +00:00
Hans Petter Selasky	ef0111fdf3	Fix loopback traffic when using non-lo0 link local IPv6 addresses. The loopback interface can only receive packets with a single scope ID, namely the scope ID of the loopback interface itself. To mitigate this packets which use the scope ID are appearing as received by the real network interface, see "origifp" in the patch. The current code would drop packets which are designated for loopback which use a link-local scope ID in the destination address or source address, because they won't match the lo0's scope ID. To fix this restore the network interface pointer from the scope ID in the destination address for the problematic cases. See comments added in patch for a more detailed description. This issue was introduced with route caching (ae@). Reviewed by: bz (network) Differential Revision: https://reviews.freebsd.org/D18769 MFC after: 1 week Sponsored by: Mellanox Technologies	2019-01-09 14:28:08 +00:00
Mateusz Guzik	cc426dd319	Remove unused argument to priv_check_cred. Patch mostly generated with cocinnelle: @@ expression E1,E2; @@ - priv_check_cred(E1,E2,0) + priv_check_cred(E1,E2) Sponsored by: The FreeBSD Foundation	2018-12-11 19:32:16 +00:00
Bjoern A. Zeeb	ec86402ecd	Replicate r328271 from legacy IP to IPv6 using a single macro to clear L2 and L3 route caches. Also mark one function argument as __unused. Reviewed by: karels, ae Approved by: re (rgrimes) Differential Revision: https://reviews.freebsd.org/D17007	2018-09-03 22:27:27 +00:00
Matt Macy	56713d16a0	acquire inp lock around ip6_pcbopt to fix IPV6_TCLASS panic Simple fix to address panics relating to setting IPV6_TCLASS with setsockopt(). The premise of this change is that it is ok to call malloc with M_NOWAIT while holding a lock on the in6p. If it later turns out that it is not ok, then major surgery will be required, as ip6_setpktopt() will have to be fixed (as it also calls malloc with M_NOWAIT) which pulls in the ip6_pcbopts(), ip6_setpktopts(), ip6_setpktopt() call chain. Submitted by: Jason Eggnet Reviewed by: rrs, transport, sbruno Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D16201	2018-07-15 00:47:06 +00:00
Sean Bruno	1a43cff92a	Load balance sockets with new SO_REUSEPORT_LB option. This patch adds a new socket option, SO_REUSEPORT_LB, which allow multiple programs or threads to bind to the same port and incoming connections will be load balanced using a hash function. Most of the code was copied from a similar patch for DragonflyBSD. However, in DragonflyBSD, load balancing is a global on/off setting and can not be set per socket. This patch allows for simultaneous use of both the current SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system. Required changes to structures: Globally change so_options from 16 to 32 bit value to allow for more options. Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets. Limitations: As DragonflyBSD, a load balance group is limited to 256 pcbs (256 programs or threads sharing the same socket). This is a substantially different contribution as compared to its original incarnation at svn r332894 and reverted at svn r332967. Thanks to rwatson@ for the substantive feedback that is included in this commit. Submitted by: Johannes Lundberg <johalun0@gmail.com> Obtained from: DragonflyBSD Relnotes: Yes Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D11003	2018-06-06 15:45:57 +00:00
Andrey V. Elsukov	4a089e6bc5	Use m_copyback() function to write delayed checksum when it isn't located in the first mbuf of the chain. MFC after: 1 week	2018-06-06 10:46:24 +00:00
Sean Bruno	7875017ca9	Revert r332894 at the request of the submitter. Submitted by: Johannes Lundberg <johalun0_gmail.com> Sponsored by: Limelight Networks	2018-04-24 19:55:12 +00:00
Sean Bruno	7b7796eea5	Load balance sockets with new SO_REUSEPORT_LB option This patch adds a new socket option, SO_REUSEPORT_LB, which allow multiple programs or threads to bind to the same port and incoming connections will be load balanced using a hash function. Most of the code was copied from a similar patch for DragonflyBSD. However, in DragonflyBSD, load balancing is a global on/off setting and can not be set per socket. This patch allows for simultaneous use of both the current SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system. Required changes to structures Globally change so_options from 16 to 32 bit value to allow for more options. Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets. Limitations As DragonflyBSD, a load balance group is limited to 256 pcbs (256 programs or threads sharing the same socket). Submitted by: Johannes Lundberg <johanlun0@gmail.com> Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D11003	2018-04-23 19:51:00 +00:00
Jonathan T. Looney	c187c03466	Remove some unneccessary variable sets in IPv6 code, as detected by clang's static analyzer. Reviewed by: bz MFC after: 2 weeks Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D10940	2018-03-24 12:43:34 +00:00
Sean Bruno	72bfa0bf63	Revert r331379 as the "simple" lock changes have revealed a deeper problem and need for a rethink. Submitted by: Jason Eggleston <jason@eggnet.com> Sponsored by: Limelight Networks	2018-03-23 18:34:38 +00:00
Kristof Provost	effaab8861	netpfil: Introduce PFIL_FWD flag Forwarded packets passed through PFIL_OUT, which made it difficult for firewalls to figure out if they were forwarding or producing packets. This in turn is an issue for pf for IPv6 fragment handling: it needs to call ip6_output() or ip6_forward() to handle the fragments. Figuring out which was difficult (and until now, incorrect). Having pfil distinguish the two removes an ugly piece of code from pf. Introduce a new variant of the netpfil callbacks with a flags variable, which has PFIL_FWD set for forwarded packets. This allows pf to reliably work out if a packet is forwarded. Reviewed by: ae, kevans Differential Revision: https://reviews.freebsd.org/D13715	2018-03-23 16:56:44 +00:00
Sean Bruno	06b479a6a7	Refactor ip6_getpcbopt() for better locking and memory management Created GET_PKTOPT_EXT_HDR() and GET_PKTOPT_SOCKADDR() macros to handle safely fetching options from in6p_outputopts, including properly dealing with in6p locking and preparing memory for sooptcopyout(). Changed the function signature of ip6_getpcbopt() to allow the function to acquire and release locks on in6p as needed. Submitted by: Jason Eggleston <jason@eggnet.com> Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D14619	2018-03-22 23:34:48 +00:00
Sean Bruno	2a499acf59	Simple locking fixes in ip_ctloutput, ip6_ctloutput, rip_ctloutput. Submitted by: Jason Eggleston <jason@eggnet.com> Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D14624	2018-03-22 22:29:32 +00:00
Sean Bruno	5cbeca4497	Handle locking and memory safety for IPV6_PATHMTU in ip6_ctloutput(). Submitted by: Jason Eggleston <jason@eggnet.com> Reviewed by: ae Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D14622	2018-03-22 21:18:34 +00:00
Sean Bruno	37d4fc1e70	Improve write locking in ip6_ctloutput() with macros. Submitted by: Jason Eggleston <jason@eggnet.com> Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D14620	2018-03-22 20:21:05 +00:00
Pedro F. Giffuni	51369649b0	sys: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 3-Clause license. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, superceed or replace the license texts. Special thanks to Wind River for providing access to "The Duke of Highlander" tool: an older (2014) run over FreeBSD tree was useful as a starting point.	2017-11-20 19:43:44 +00:00
Bjoern A. Zeeb	ae69ad884d	After inpcb route caching was put back in place there is no need for flowtable anymore (as flowtable was never considered to be useful in the forwarding path). Reviewed by: np Differential Revision: https://reviews.freebsd.org/D11448	2017-07-27 13:03:36 +00:00
Jonathan T. Looney	8b07e00e99	Fix an unnecessary/incorrect check in the PKTOPT_EXTHDRCPY macro. This macro allocates memory and, if malloc does not return NULL, copies data into the new memory. However, it doesn't just check whether malloc returns NULL. It also checks whether we called malloc with M_NOWAIT. That is not necessary. While it may be that malloc() will only return NULL when the M_NOWAIT flag is set, we don't need to check for this when checking malloc's return value. Further, in this case, the check was not completely accurate, because it checked for flags == M_NOWAIT, rather than treating it as a bit field and checking for (flags & M_NOWAIT). Reviewed by: ae MFC after: 2 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D10942	2017-05-30 14:50:28 +00:00
Navdeep Parhar	ce9ac139d4	ip6_output runs with the inp lock held, just like ip_output.	2017-05-10 00:14:55 +00:00
Kristof Provost	d78c0804fb	Rename variable for clarity Rename the mtu variable in ip6_fragment(), because mtu is misleading. The variable actually holds the fragment length. No functional change. Suggested by: ae	2017-04-22 13:04:36 +00:00
Kristof Provost	00eab743ab	pf: Fix possible incorrect IPv6 fragmentation When forwarding pf tracks the size of the largest fragment in a fragmented packet, and refragments based on this size. It failed to ensure that this size was a multiple of 8 (as is required for all but the last fragment), so it could end up generating incorrect fragments. For example, if we received an 8 byte and 12 byte fragment pf would emit a first fragment with 12 bytes of payload and the final fragment would claim to be at offset 8 (not 12). We now assert that the fragment size is a multiple of 8 in ip6_fragment(), so other users won't make the same mistake. Reported by: Antonios Atlasis <aatlasis at secfu net> MFC after: 3 days	2017-04-20 09:05:53 +00:00
Mike Karels	8c1960d506	Fix reference count leak with L2 caching. ip_forward, TCP/IPv6, and probably SCTP leaked references to L2 cache entry because they used their own routes on the stack, not in_pcb routes. The original model for route caching was callers that provided a route structure to ip{,6}input() would keep the route, and this model was used for L2 caching as well. Instead, change L2 caching to be done by default only when using a route structure in the in_pcb; the pcb deallocation code frees L2 as well as L3 cacches. A separate change will add route caching to TCP/IPv6. Another suggestion was to have the transport protocols indicate willingness to use L2 caching, but this approach keeps the changes in the network level Reviewed by: ae gnn MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D10059	2017-03-25 15:06:28 +00:00
Ermal Luçi	dce33a45c9	The patch provides the same socket option as Linux IP_ORIGDSTADDR. Unfortunately they will have different integer value due to Linux value being already assigned in FreeBSD. The patch is similar to IP_RECVDSTADDR but also provides the destination port value to the application. This allows/improves implementation of transparent proxies on UDP sockets due to having the whole information on forwarded packets. Reviewed by: adrian, aw Approved by: ae (mentor) Sponsored by: rsync.net Differential Revision: D9235	2017-03-06 04:01:58 +00:00
Warner Losh	fbbd9655e5	Renumber copyright clause 4 Renumber cluase 4 to 3, per what everybody else did when BSD granted them permission to remove clause 3. My insistance on keeping the same numbering for legal reasons is too pedantic, so give up on that point. Submitted by: Jan Schaumann <jschauma@stevens.edu> Pull Request: https://github.com/freebsd/freebsd/pull/96	2017-02-28 23:42:47 +00:00
Ermal Luçi	c10c5b1eba	Committed without approval from mentor. Reported by: gnn	2017-02-12 06:56:33 +00:00
Ermal Luçi	ed55edceef	The patch provides the same socket option as Linux IP_ORIGDSTADDR. Unfortunately they will have different integer value due to Linux value being already assigned in FreeBSD. The patch is similar to IP_RECVDSTADDR but also provides the destination port value to the application. This allows/improves implementation of transparent proxies on UDP sockets due to having the whole information on forwarded packets. Sponsored-by: rsync.net Differential Revision: D9235 Reviewed-by: adrian	2017-02-10 05:16:14 +00:00
Andrey V. Elsukov	fcf596178b	Merge projects/ipsec into head/. Small summary ------------- o Almost all IPsec releated code was moved into sys/netipsec. o New kernel modules added: ipsec.ko and tcpmd5.ko. New kernel option IPSEC_SUPPORT added. It enables support for loading and unloading of ipsec.ko and tcpmd5.ko kernel modules. o IPSEC_NAT_T option was removed. Now NAT-T support is enabled by default. The UDP_ENCAP_ESPINUDP_NON_IKE encapsulation type support was removed. Added TCP/UDP checksum handling for inbound packets that were decapsulated by transport mode SAs. setkey(8) modified to show run-time NAT-T configuration of SA. o New network pseudo interface if_ipsec(4) added. For now it is build as part of ipsec.ko module (or with IPSEC kernel). It implements IPsec virtual tunnels to create route-based VPNs. o The network stack now invokes IPsec functions using special methods. The only one header file <netipsec/ipsec_support.h> should be included to declare all the needed things to work with IPsec. o All IPsec protocols handlers (ESP/AH/IPCOMP protosw) were removed. Now these protocols are handled directly via IPsec methods. o TCP_SIGNATURE support was reworked to be more close to RFC. o PF_KEY SADB was reworked: - now all security associations stored in the single SPI namespace, and all SAs MUST have unique SPI. - several hash tables added to speed up lookups in SADB. - SADB now uses rmlock to protect access, and concurrent threads can do SA lookups in the same time. - many PF_KEY message handlers were reworked to reflect changes in SADB. - SADB_UPDATE message was extended to support new PF_KEY headers: SADB_X_EXT_NEW_ADDRESS_SRC and SADB_X_EXT_NEW_ADDRESS_DST. They can be used by IKE daemon to change SA addresses. o ipsecrequest and secpolicy structures were cardinally changed to avoid locking protection for ipsecrequest. Now we support only limited number (4) of bundled SAs, but they are supported for both INET and INET6. o INPCB security policy cache was introduced. Each PCB now caches used security policies to avoid SP lookup for each packet. o For inbound security policies added the mode, when the kernel does check for full history of applied IPsec transforms. o References counting rules for security policies and security associations were changed. The proper SA locking added into xform code. o xform code was also changed. Now it is possible to unregister xforms. tdb_xxx structures were changed and renamed to reflect changes in SADB/SPDB, and changed rules for locking and refcounting. Reviewed by: gnn, wblock Obtained from: Yandex LLC Relnotes: yes Sponsored by: Yandex LLC Differential Revision: https://reviews.freebsd.org/D9352	2017-02-06 08:49:57 +00:00
Hans Petter Selasky	f3e7afe2d7	Implement kernel support for hardware rate limited sockets. - Add RATELIMIT kernel configuration keyword which must be set to enable the new functionality. - Add support for hardware driven, Receive Side Scaling, RSS aware, rate limited sendqueues and expose the functionality through the already established SO_MAX_PACING_RATE setsockopt(). The API support rates in the range from 1 to 4Gbytes/s which are suitable for regular TCP and UDP streams. The setsockopt(2) manual page has been updated. - Add rate limit function callback API to "struct ifnet" which supports the following operations: if_snd_tag_alloc(), if_snd_tag_modify(), if_snd_tag_query() and if_snd_tag_free(). - Add support to ifconfig to view, set and clear the IFCAP_TXRTLMT flag, which tells if a network driver supports rate limiting or not. - This patch also adds support for rate limiting through VLAN and LAGG intermediate network devices. - How rate limiting works: 1) The userspace application calls setsockopt() after accepting or making a new connection to set the rate which is then stored in the socket structure in the kernel. Later on when packets are transmitted a check is made in the transmit path for rate changes. A rate change implies a non-blocking ifp->if_snd_tag_alloc() call will be made to the destination network interface, which then sets up a custom sendqueue with the given rate limitation parameter. A "struct m_snd_tag" pointer is returned which serves as a "snd_tag" hint in the m_pkthdr for the subsequently transmitted mbufs. 2) When the network driver sees the "m->m_pkthdr.snd_tag" different from NULL, it will move the packets into a designated rate limited sendqueue given by the snd_tag pointer. It is up to the individual drivers how the rate limited traffic will be rate limited. 3) Route changes are detected by the NIC drivers in the ifp->if_transmit() routine when the ifnet pointer in the incoming snd_tag mismatches the one of the network interface. The network adapter frees the mbuf and returns EAGAIN which causes the ip_output() to release and clear the send tag. Upon next ip_output() a new "snd_tag" will be tried allocated. 4) When the PCB is detached the custom sendqueue will be released by a non-blocking ifp->if_snd_tag_free() call to the currently bound network interface. Reviewed by: wblock (manpages), adrian, gallatin, scottl (network) Differential Revision: https://reviews.freebsd.org/D3687 Sponsored by: Mellanox Technologies MFC after: 3 months	2017-01-18 13:31:17 +00:00
George V. Neville-Neil	aec9c8d5a5	Limit the number of mbufs that can be allocated for IPV6_2292PKTOPTIONS (and IPV6_PKTOPTIONS). PR: 100219 Submitted by: Joseph Kong MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D5157	2016-10-17 23:25:31 +00:00
Gleb Smirnoff	cc94f0c2d7	- Revert r300854, r303657 which tried to fix regression from r297225. - Fix the regression proper way using RO_RTFREE(). Submitted by: ae	2016-10-13 20:15:47 +00:00
Kevin Lo	c3bef61e58	Remove the 4.3BSD compatible macro m_copy(), use m_copym() instead. Reviewed by: gnn Differential Revision: https://reviews.freebsd.org/D7878	2016-09-15 07:41:48 +00:00
Mike Karels	0f5687f2ae	Fix L2 caching for UDP over IPv6 ip6_output() was missing cache invalidation code analougous to ip_output.c. r304545 disabled L2 caching for UDP/IPv6 as a workaround. This change adds the missing cache invalidation code and reverts r304545. Reviewed by: gnn Approved by: gnn (mentor) Tested by: peter@, Mike Andrews MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D7591	2016-08-24 00:52:30 +00:00
Andrey V. Elsukov	723758b7ce	Fix NULL pointer dereference. ro pointer can be NULL when IPSec consumes mbuf. PR: 211486 MFC after: 3 days	2016-08-02 12:18:06 +00:00
Andrew Gallatin	d4c22202e6	Rework IPV6 TCP path MTU discovery to match IPv4 - Re-write tcp_ctlinput6() to closely mimic the IPv4 tcp_ctlinput() - Now that tcp_ctlinput6() updates t_maxseg, we can allow ip6_output() to send TCP packets without looking at the tcp host cache for every single transmit. - Make the icmp6 code mimic the IPv4 code & avoid returning PRC_HOSTDEAD because it is so expensive. Without these changes in place, every TCP6 pmtu discovery or host unreachable ICMP resulted in a call to in6_pcbnotify() which walks the tcbinfo table with the write lock held. Because the tcbinfo table is shared between IPv4 and IPv6, this causes huge scalabilty issues on servers with lots of (~100K) TCP connections, to the point where even a small percent of IPv6 traffic had a disproportionate impact on overall throughput. Reviewed by: bz, rrs, ae (all earlier versions), lstewart (in Netflix's tree) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D7272	2016-08-01 17:02:21 +00:00
Dimitry Andric	be39349169	Fix a page fault in ip6_setpktopt(), occurring when the pflog module is loaded, and syncthing is started, which uses setsockopt(IPV6_PKGINFO). This is because pflog interfaces do not normally have an IPv6 address, causing the ND_IFINFO() macro to dereference a NULL pointer. Reviewed by: ae PR: 210943 MFC after: 3 days	2016-07-13 19:41:19 +00:00
Andrey V. Elsukov	4c10540274	Cleanup unneded include "opt_ipfw.h". It was used for conditional build IPFIREWALL_FORWARD support. But IPFIREWALL_FORWARD option was removed a long time ago.	2016-06-09 05:48:34 +00:00
George V. Neville-Neil	6d76822688	This change re-adds L2 caching for TCP and UDP, as originally added in D4306 but removed due to other changes in the system. Restore the llentry pointer to the "struct route", and use it to cache the L2 lookup (ARP or ND6) as appropriate. Submitted by: Mike Karels Differential Revision: https://reviews.freebsd.org/D6262	2016-06-02 17:51:29 +00:00
Gleb Smirnoff	6351b3857b	Plug route reference underleak that happens with FLOWTABLE after r297225. Submitted by: Mike Karels <mike karels.net>	2016-05-27 17:31:02 +00:00

1 2 3 4 5 ...

264 Commits